
In your use case, is a feature store roughly a collection of incrementally updated feature tables plus time/version metadata?

We've been storing our features in a normalized RDBMS, with each feature table having a run_id column. For batch processing, run_id is unique to the pipeline version plus the execution timestamp; for streaming, there is one run_id per pipeline version and the date comes from the interaction date, which is already part of the raw data. I'm curious what we're potentially missing.

This way you can query the features for given users generated by a particular version and/or date, but it can involve a lot of joins to assemble a collection of features.
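
To make the layout concrete, here is a minimal sketch of the batch case using Python's built-in sqlite3 as a stand-in for the real RDBMS. The table names (runs, feat_purchase_count, feat_avg_basket), the columns, and the sample rows are all made up for illustration; only the run_id idea comes from the description above.

```python
import sqlite3

# In-memory database standing in for the real RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (
    run_id INTEGER PRIMARY KEY,
    pipeline_version TEXT,
    executed_at TEXT
);
-- One table per feature, each keyed by user and run (hypothetical names).
CREATE TABLE feat_purchase_count (user_id INTEGER, run_id INTEGER, value INTEGER);
CREATE TABLE feat_avg_basket (user_id INTEGER, run_id INTEGER, value REAL);
""")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [(1, "v1", "2021-01-01"), (2, "v2", "2021-01-02")],
)
conn.executemany(
    "INSERT INTO feat_purchase_count VALUES (?, ?, ?)",
    [(10, 1, 3), (10, 2, 4), (11, 2, 7)],
)
conn.executemany(
    "INSERT INTO feat_avg_basket VALUES (?, ?, ?)",
    [(10, 1, 19.5), (10, 2, 21.0), (11, 2, 8.25)],
)


def features_for(conn, user_ids, pipeline_version):
    """Collect both features for the given users from one pipeline version.

    Every extra feature table adds one more JOIN, which is where the
    join count grows.
    """
    placeholders = ",".join("?" for _ in user_ids)
    sql = f"""
        SELECT p.user_id, p.value, b.value
        FROM runs r
        JOIN feat_purchase_count p ON p.run_id = r.run_id
        JOIN feat_avg_basket b
          ON b.run_id = r.run_id AND b.user_id = p.user_id
        WHERE r.pipeline_version = ? AND p.user_id IN ({placeholders})
        ORDER BY p.user_id
    """
    return conn.execute(sql, [pipeline_version, *user_ids]).fetchall()


rows = features_for(conn, [10, 11], "v2")
```

With two feature tables this is already two joins per lookup, so a wide feature vector means one join per table unless related features are grouped into the same table.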



Conceptually I think that's about right (meaning, similar concepts to ours, but who's to say whether this is objectively the best approach...). Practically, we're heavy on AWS, so we've found that for our size/scope/breadth, it was more performant/efficient to store the data as parquet files in S3, cataloged in AWS Glue. And yes, there are a lotttttt of joins needed, so it's worth investing some deep thinking in how best to partition your data and how best to optimize the type/variety/complexity/number of joins you'll have to do.

I can see an RDBMS being reasonable for small-to-medium data sizes, but beyond a certain threshold I think it starts to break down (at the few 10s/100s of TB level, maybe?).
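
As a rough illustration of the partitioning point, here is a small stdlib-only sketch of Hive-style key=value object paths, a common convention for parquet data in S3. The partition columns (pipeline_version, dt), the table name, and both helper functions are assumptions made for the example, not the actual layout described above.

```python
from datetime import date


def partition_key(table, pipeline_version, day, part=0):
    """Build a Hive-style object key: table/col=value/.../part-N.parquet."""
    return (
        f"{table}/pipeline_version={pipeline_version}"
        f"/dt={day.isoformat()}/part-{part:05d}.parquet"
    )


def prune(keys, **filters):
    """Keep only keys whose path contains every requested col=value segment.

    This mimics partition pruning: files in other partitions are skipped
    without ever being opened.
    """
    wanted = {f"{col}={val}" for col, val in filters.items()}
    return [key for key in keys if wanted <= set(key.split("/"))]


keys = [
    partition_key("user_features", "v1", date(2021, 1, 1)),
    partition_key("user_features", "v2", date(2021, 1, 1)),
    partition_key("user_features", "v2", date(2021, 1, 2)),
]

# Only the v2 partition for 2021-01-02 survives the filter.
selected = prune(keys, pipeline_version="v2", dt="2021-01-02")
```

Choosing partition columns that match the most common filters (here version and date) lets a query touch only the relevant files; the joins then run over much less data.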


Yeah, we are in the 1-10 TB range. We are bound to on-prem Oracle Exadata and so far it's ok.



