
I don't know what your definition of a "media object" is, but I'll assume:

* you have audio, video, and/or image files
* each media file has a name/media_UID
* your "related objects" are more or less fixed-format small records with elements of type 'string', 'uint', 'int', etc.
* your "related objects" might be metadata attached to the media, info resulting from processing the media, related info from the context in which the media was found, or dynamic info about how the media was used or referenced
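Under those assumptions, a "related object" might be a small fixed-format record like this sketch (the type name and fields are invented for illustration):

```python
from dataclasses import dataclass

# Hypothetical fixed-format "related object": a small record of
# string/uint/int fields attached to a media file by its media_UID.
@dataclass
class FaceDetectionResult:
    media_uid: str     # key of the media file this record describes
    faces_found: int   # uint-style count from a processing pass
    detector: str      # which processing step produced this record
    offset_ms: int     # signed int, e.g. position within a video

rec = FaceDetectionResult("m-001", 3, "haar-cascade-v2", 0)
```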

If money were available I would do the following:

  * buy a 1-big-table log aggregator like SenSage
    (http://www.sensage.com) (distributed Linux-based
    redundant large data storage/query engine)
  * define a single DB table with the media_UID as
    key and with all columns defined for all related
    objects (note:  I'm assuming fixed-column-set
    for each "related object") ... with the understanding
    that any given row may or may not contain a
    "related object" of a given type
  * I'd take the (relatively) static data for each
    media file (e.g., media_UID, file size, file name,
    media type, ..., # of unique faces recognized in
    the media, make-up-your-own-field-here) and
    insert it once, with NULLs for other dynamic
    "related object" fields
  * For dynamic info I'd insert media_UID and relevant
    fields and a timestamp for the dynamic event
  * ... and after this you'd have a queryable data set
    that's constantly evolving
  * you could dynamically update the schema as you need
    more "related objects" or need to extend the fields
    of existing ones
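The steps above can be sketched with SQLite as a stand-in for the SenSage store; all table and column names here are invented for illustration:

```python
import sqlite3

# One big table keyed by media_UID, with columns for every
# "related object" type; any row may leave most of them NULL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE media_events (
        media_uid   TEXT NOT NULL,  -- key shared by all rows
        event_ts    INTEGER,        -- NULL for the one-time static row
        file_name   TEXT,           -- static fields for the media file...
        file_size   INTEGER,
        media_type  TEXT,
        faces_found INTEGER,
        download_ip TEXT            -- ...dynamic event fields
    )
""")

# Static row, inserted once per media file; dynamic fields stay NULL.
conn.execute(
    "INSERT INTO media_events (media_uid, file_name, file_size, media_type, faces_found) "
    "VALUES (?, ?, ?, ?, ?)",
    ("m-001", "beach.jpg", 482133, "image/jpeg", 3),
)

# Dynamic row: media_UID, the relevant fields, and an event timestamp.
conn.execute(
    "INSERT INTO media_events (media_uid, event_ts, download_ip) VALUES (?, ?, ?)",
    ("m-001", 1234567890, "10.0.0.7"),
)
conn.commit()
```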
I would buy an EMC Centera array (integrated with SenSage for archival) and use it also to store the actual media, keyed by media_UID.

After you've done this you can periodically run full-table SQL/Perl scans to aggregate the info you need -- that's what the SenSage tool is built for, and it does it blindingly fast. You could expose the aggregations as full data sets in a Postgres DB if they're needed multiple times ... or as throwaway dynamic results if required just once.
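One such periodic aggregation pass, against a toy version of the one-big-table layout (names again invented), might look like:

```python
import sqlite3

# Toy table: one static row per media file (event_ts NULL) plus
# one timestamped row per dynamic event, e.g. a download.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media_events (media_uid TEXT, event_ts INTEGER, download_ip TEXT)"
)
conn.executemany("INSERT INTO media_events VALUES (?, ?, ?)", [
    ("m-001", None, None),             # static row: no event timestamp
    ("m-001", 1234567890, "10.0.0.7"),
    ("m-001", 1234567891, "10.0.0.8"),
    ("m-002", 1234567892, "10.0.0.9"),
])

# Full-table scan: downloads per media file, skipping static rows.
downloads = dict(conn.execute(
    "SELECT media_uid, COUNT(*) FROM media_events "
    "WHERE event_ts IS NOT NULL GROUP BY media_uid"
).fetchall())
```

A result set like `downloads` is the kind of thing you'd either materialize into Postgres or throw away after one use.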

If there's less money I'd try to replicate the data store and aggregation in Hadoop or something similar.
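The Hadoop version of the same scan would be a map/reduce over the event log; here's a minimal in-process sketch in the Hadoop Streaming spirit (tab-separated field layout assumed):

```python
from collections import defaultdict

# Mapper: emit (media_uid, 1) for each dynamic-event line;
# static rows carry an empty timestamp field and are skipped.
def mapper(lines):
    for line in lines:
        media_uid, event_ts, *_ = line.split("\t")
        if event_ts:
            yield media_uid, 1

# Reducer: sum the counts per media_uid.
def reducer(pairs):
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

log = [
    "m-001\t1234567890\t10.0.0.7",
    "m-001\t1234567891\t10.0.0.8",
    "m-002\t\t",                     # static row, no timestamp
]
result = reducer(mapper(log))
```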

As for SenSage speed ... European/US/international telcos/ISPs use it to store call-data-records and IP-records, scanning billions of records in minutes when law enforcement demands the info. http://news.prnewswire.com/DisplayReleaseContent.aspx?ACCT=1...



I forgot to mention that SenSage is usually treated as a "write-only" data store: you never delete anything until it reaches its "age-out" date (on the order of 2 to 10 years, usually).
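Under that model the only delete ever issued is the age-out purge; a minimal sketch against the toy table (retention window and schema are assumptions, not SenSage's actual mechanism):

```python
import sqlite3
import time

RETENTION_SECONDS = 2 * 365 * 24 * 3600   # e.g. a 2-year age-out window

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media_events (media_uid TEXT, event_ts INTEGER)")
now = int(time.time())
conn.executemany("INSERT INTO media_events VALUES (?, ?)", [
    ("m-001", now - 3 * 365 * 24 * 3600),  # past its age-out date
    ("m-001", now - 24 * 3600),            # recent, kept
])

# The one permitted delete: purge rows past their age-out date.
conn.execute(
    "DELETE FROM media_events WHERE event_ts < ?",
    (now - RETENTION_SECONDS,),
)
remaining = conn.execute("SELECT COUNT(*) FROM media_events").fetchone()[0]
```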

This is very important for its native purpose, log data storage. For legal reasons, companies subject to HIPAA (health care records), SOX (public company financial records), CDR/IP-R (call-data/IP records), or PCI (payment/credit/debit card data) need to keep records of what went on with their routers, switches, computers, servers, and apps, so they can reconstruct a cross-device slice of any activity of interest for legal purposes. (It's also useful for maintaining operational integrity and proving SLAs.)

In this media-storage context, not throwing data away has a nice advantage: you can always track the evolution of all interaction with the media over time (e.g., capture media download trends). The engine can store it all; many users load 200GB+ of log data into it daily.



