In your use case is a feature store roughly a collection of incrementally update...

strgcmc · on May 27, 2020

Conceptually I think that's about right (meaning our concepts are similar, though who's to say whether this is objectively the best approach). Practically, we're heavy on AWS, and at our size, scope, and breadth we've found it more performant and efficient to store the data as Parquet files in S3, cataloged in AWS Glue. And yes, there are a lot of joins needed, so it's worth investing some deep thought in how best to partition your data and how to optimize the type, variety, complexity, and number of joins you'll have to do.

I can see an RDBMS being reasonable for small-to-medium data sizes, but beyond a certain threshold I think it starts to break down (at the few tens or hundreds of TB level, maybe?).
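
To make the partitioning point a bit more concrete, here's a rough sketch of the general idea, not our actual setup. It assumes a Hive-style `dt=YYYY-MM-DD` key layout in S3 (the kind of layout a Glue catalog can treat as partitions), and the bucket name, table names, and helper functions are all made up for illustration. The point is that a date-bounded query only has to touch a handful of prefixes, and the join then runs over that much smaller slice.

```python
from datetime import date, timedelta

# Hypothetical bucket name, for illustration only.
BUCKET = "s3://example-feature-store"


def partition_prefix(table, day):
    """Return the Hive-style S3 prefix holding one day of one feature table."""
    return f"{BUCKET}/{table}/dt={day.isoformat()}/"


def prefixes_for_range(table, start, end):
    """List the prefixes a query must read for start <= dt <= end (inclusive)."""
    n_days = (end - start).days
    return [partition_prefix(table, start + timedelta(days=i))
            for i in range(n_days + 1)]


def join_on_key(left, right, key):
    """Inner hash join of two lists of dict rows on a single key column."""
    # Build a lookup from key value to matching right-hand rows.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Emit one merged row per matching (left, right) pair.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


if __name__ == "__main__":
    # Only three prefixes are read, regardless of how much history exists.
    for prefix in prefixes_for_range("user_clicks", date(2020, 5, 25), date(2020, 5, 27)):
        print(prefix)

    clicks = [{"user_id": 1, "clicks_7d": 12}, {"user_id": 2, "clicks_7d": 3}]
    spend = [{"user_id": 1, "spend_30d": 40.0}]
    print(join_on_key(clicks, spend, "user_id"))
```

In practice the join would be done by the query engine rather than in Python; the toy `join_on_key` is only there to show that the partition filter is applied before the join, which is where most of the savings come from.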

ivalm · on May 27, 2020

Yeah, we are in the 1-10 TB range. We are bound to on-prem Oracle Exadata and so far it's ok.