I don't know what your definition of a "media object" is, but I'll assume:
* you have audio, video, and/or image files
* each media file has a name/media_UID
* your "related objects" are more or less fixed-format small records with elements of type 'string', 'uint', 'int', etc.
* your "related objects" might be metadata attached to the media, info resulting from processing the media, related info from the context in which the media was found, or dynamic info about how the media was used or referenced
If money were available I would do the following:
* buy a 1-big-table log aggregator like SenSage (http://www.sensage.com) (distributed Linux-based redundant large data storage/query engine)
* define a single DB table with the media_UID as key and with all columns defined for all related objects (note: I'm assuming a fixed column set for each "related object") ... with the understanding that any given row may or may not contain a "related object" of a given type
* I'd take the (relatively) static data for each media file (e.g., media_UID, file size, file name, media type, ..., # of unique faces recognized in the media, make-up-your-own-field-here) and insert it once, with NULLs for the other dynamic "related object" fields
* for dynamic info I'd insert the media_UID, the relevant fields, and a timestamp for the dynamic event
* ... and after this you'd have a queryable data set that's constantly evolving
* you could dynamically update the schema as you need more "related objects" or need to extend their fields
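The single-wide-table layout above can be sketched with SQLite standing in for the real engine (all table and column names here are my own illustrations, not from SenSage or any actual schema):

```python
import sqlite3

# In-memory DB for illustration; SenSage itself is a distributed engine,
# but the single-wide-table idea is the same.
conn = sqlite3.connect(":memory:")

# One wide table keyed on media_uid, with columns for every "related
# object". Any given row fills only the columns relevant to it; the
# rest stay NULL.
conn.execute("""
    CREATE TABLE media_events (
        media_uid   TEXT NOT NULL,
        event_time  INTEGER,            -- NULL for the static row
        file_name   TEXT,               -- static metadata columns
        file_size   INTEGER,
        media_type  TEXT,
        faces_found INTEGER,
        event_type  TEXT,               -- dynamic-event columns
        client_ip   TEXT
    )
""")

# Insert the static row once, with the dynamic fields left NULL.
conn.execute(
    "INSERT INTO media_events (media_uid, file_name, file_size, media_type, faces_found) "
    "VALUES (?, ?, ?, ?, ?)",
    ("vid-001", "cat.mp4", 1048576, "video/mp4", 2),
)

# Each dynamic event is its own timestamped row for the same media_uid.
conn.execute(
    "INSERT INTO media_events (media_uid, event_time, event_type, client_ip) "
    "VALUES (?, ?, ?, ?)",
    ("vid-001", 1700000000, "download", "203.0.113.7"),
)
conn.commit()
```

Extending the schema for a new "related object" is then just an ALTER TABLE adding more nullable columns; existing rows are untouched.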
I would buy an EMC Centera array (integrated with SenSage for archival) and use it also to store the actual media, keyed by media_UID.
After you've done this you can periodically run full-table SQL/Perl scans to aggregate the info you need -- that's what the SenSage tool is built for, and it does them blindingly fast. You could expose the aggregations as full data sets in a Postgres DB if they're needed multiple times ... or as throwaway dynamic results if required just once.
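A full-table aggregation scan, again sketched in SQLite rather than the real engine (names and sample data are my own): scan every row, group by media_UID, and optionally materialize the result as a table, which is the rough equivalent of pushing the aggregation into a Postgres DB for repeated use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media_events (media_uid TEXT, event_time INTEGER, event_type TEXT)"
)
conn.executemany(
    "INSERT INTO media_events VALUES (?, ?, ?)",
    [
        ("vid-001", 1700000000, "download"),
        ("vid-001", 1700000300, "download"),
        ("img-042", 1700000600, "view"),
    ],
)

# Full-scan aggregation: download count per media_UID.
downloads = conn.execute("""
    SELECT media_uid, COUNT(*) AS n
    FROM media_events
    WHERE event_type = 'download'
    GROUP BY media_uid
""").fetchall()

# Materialize the result if it will be queried repeatedly
# (the equivalent of exposing it as a data set in Postgres).
conn.execute("""
    CREATE TABLE download_counts AS
    SELECT media_uid, COUNT(*) AS n
    FROM media_events
    WHERE event_type = 'download'
    GROUP BY media_uid
""")
```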
If there's less money I'd try to replicate the data store and aggregation in Hadoop or something similar.
As for SenSage speed ... European/US/international telcos/ISPs use it to store call-data-records and IP-records, scanning billions of records in minutes when law enforcement demands the info. http://news.prnewswire.com/DisplayReleaseContent.aspx?ACCT=1...
I forgot to mention that SenSage is usually treated as a "write-only" data store: you never delete anything except when it reaches its "age-out" date (on the order of 2 to 10 years, usually).
This is very important for its native purpose, log data storage. For legal reasons, companies that store logs governed by HIPAA (health care records), SOX (public company financial records), CDR/IP-R (call-data/IP), or PCI (payment/credit/debit card data) need to keep records of what went on with their routers/switches, computers, servers, and apps, so they can reconstruct a cross-device slice of any activities of interest for legal purposes. (It's also useful for maintaining operational integrity and proving SLAs.)
In the media-storage context here, not throwing data away has a nice advantage: you can always track the evolution of all interaction with the media over time (e.g., capture media download trends). The engine can store it all; many users load 200GB+ of log data into the data engine daily.
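Because every event row carries a timestamp, a trend like "downloads per day" is just a time-bucketed scan over the kept history (again a SQLite sketch with made-up data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media_events (media_uid TEXT, event_time INTEGER, event_type TEXT)"
)
conn.executemany("INSERT INTO media_events VALUES (?, ?, ?)", [
    ("vid-001", 1700000000, "download"),
    ("vid-001", 1700003600, "download"),   # same calendar day as above
    ("vid-001", 1700100000, "download"),   # a later day
])

# Bucket download events by calendar day to see the trend over time.
trend = conn.execute("""
    SELECT date(event_time, 'unixepoch') AS day, COUNT(*) AS downloads
    FROM media_events
    WHERE event_type = 'download'
    GROUP BY day
    ORDER BY day
""").fetchall()
```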