How to build your own feature store for ML (logicalclocks.com)
114 points by LexSiga on May 27, 2020 | 23 comments


This is a good idea - every ML operation should have something like this, to store, organize, version data, check for drift, do time-travel, backups/replication et cetera.

But to borrow from Steve Jobs, I think this is a feature, not a product. If you've already done the hard work of setting up a data lake or data warehouse in a cloud provider, the cloud provider can give you backups and replication, and even some time-travel. Using something like Delta Lake or even just the standard Kimball DW audit columns will get point-in-time queries. Feature versioning is just query versioning in source control, and if you have schema, you can schema version with views if you need to. If you don't have a data lake, data warehouse ... well, you'll still need to gather and clean all your data before you put it into a feature store, and that's where 90% of the work is.
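To make the audit-column point concrete, here is a minimal sketch of a point-in-time lookup. The `HISTORY` rows and the `value_as_of` helper are made up for illustration (they are not Delta Lake or any warehouse API), and the rows are assumed to be sorted by their valid-from date.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical slowly-changing rows for one customer: (valid_from, tier).
# Each row is assumed valid from its date until the next row's date.
HISTORY = [
    (date(2020, 1, 1), "bronze"),
    (date(2020, 3, 15), "silver"),
    (date(2020, 5, 1), "gold"),
]

def value_as_of(history, as_of):
    """Return the value in effect on `as_of`, or None if before the first row."""
    starts = [valid_from for valid_from, _ in history]
    # Index of the last row whose valid_from is <= as_of.
    idx = bisect_right(starts, as_of) - 1
    return history[idx][1] if idx >= 0 else None
```

Calling `value_as_of(HISTORY, date(2020, 4, 1))` picks the row that started on March 15, which is the same question a `valid_from`/`valid_to` audit-column query answers in SQL.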

I'd love to learn more, I'm sure I'm missing something, but it seems that they're re-solving the solved part - data storage and versioning. Checking for drift and data integrity is a nice bonus, but again, lots of libraries for that. I guess I could see it being beneficial for ML shops that don't have modern development practices, but if you don't have that, you have bigger problems anyways.


All your points are valid. However, operational models (models used by online applications, for example) typically need access to lots of historical features that are not available in the application. In that case, you need to go to a low-latency database/store to get your feature values (build your feature vectors). If you want to reuse those features in different models, you will need join support for building the feature vectors, so a key-value DB won't help there. Now your features are duplicated between this online/serving layer and the data warehouse. How do you sync them up? The other thing you're missing is time-travel queries (temporal logic for SQL in data warehouse speak). Yes, Delta Lake gives you this, but you will need to wrap that data in APIs so that your data scientists will be able to use it. For data drift, a library alone won't cut it. You need to compare descriptive statistics/distributions of the data used to train the model with those of the live data coming in. Where do you get those statistics from? The feature store, in our case (with the help of versioning+metadata). Then there is end-to-end governance of ML models: what training dataset was used to train this model, and can I reproduce that training dataset if it hasn't been archived? You need metadata to manage all that. So yes, you can do it, but you have to build something (as the article describes) or buy it.
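As a rough illustration of the drift comparison described above, the sketch below stores descriptive statistics at training time and later scores live values against them. The function names and the threshold of 2.0 are assumptions for the example, not the Hopsworks API, and a real check would likely compare full distributions rather than just the mean.

```python
from statistics import mean, stdev

def describe(values):
    """Statistics that would be saved as metadata alongside the training dataset."""
    return {"mean": mean(values), "std": stdev(values)}

def drift_score(train_stats, live_values):
    """Absolute shift of the live mean, in units of the training std deviation."""
    shift = abs(mean(live_values) - train_stats["mean"])
    if train_stats["std"] == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / train_stats["std"]

def has_drifted(train_stats, live_values, threshold=2.0):
    """Flag drift when the score exceeds a (hypothetical) threshold."""
    return drift_score(train_stats, live_values) > threshold
```

The point of the split is that `describe` runs once when the training dataset is created, while `has_drifted` only needs the stored statistics and the incoming values.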


As this topic will inevitably become more trendy, here are some additional interesting resources on the subject:

- https://www.quora.com/What-are-the-implementation-challenges...

- http://featurestore.org/ (a list of -some of- the available feature stores)


A great collection of real-world case studies and various implementations can be found here: http://featurestore.org/


Similar to generating feature vectors for dataset augmentation here https://vectorspace.ai/covid19.html


I'm the author. Let me know if you have any questions.


Practitioner; not really convinced this is something I or my team needs.

Maybe I just need a really dumbed down explanation of what a “feature store-as-a-Service” is.

If we were talking about a super flexible/easy to use data catalog-as-a-Service that made it dead simple to store, version, manage & pull data from datasets, then I’d be super interested.

But a “feature store” by itself? I just don’t get it - what am I missing?


I think for small teams with a small number of models, a feature store is probably overkill, just like you wouldn't need a data warehouse if you only have one database. However, with lots of sources of features (OLTP database(s), data lake, Kafka), it becomes very hard for data scientists to find/use the data they need to train models. The feature store acts like a data warehouse for features for data scientists - and the more features are reused within the organization, the more value you will derive from the feature store. You wouldn't ask a Tableau user to go to S3 or Athena or BigQuery to get their data for reporting, and at organizations with feature stores, data scientists can find most of the features they need in the feature store (Uber has >20k features in their feature store, last I heard). Then, there are problems related to making those features available to online applications, not duplicating feature engineering pipelines, and monitoring of models; these are covered elsewhere on this page and in the literature (http://www.featurstore.org).


What exactly is a "feature store"? Do you store "features" by themselves? Do you store the whole data?

Can you give me a bit of insight on this?


You can see the Hopsworks feature store as a repository of curated features ready to be used in ML models. Or, a middle layer between data engineers and data scientists:

- Data engineers write the data pipelines with the transformations and publish the features on the feature store.

- Data scientists browse the feature store, pick which features they need and build the model.

In the Hopsworks Feature Store we group features together in feature groups. Feature groups can then be joined to create training datasets. (You can also select a subset of features from a feature group.) Training datasets are stored in an ML-framework-friendly format (e.g. TFRecords if you are using TensorFlow) and you can feed them directly to your model.
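The idea of joining feature groups and selecting a subset of features can be sketched in plain Python. This is only an illustration of the concept: the dictionaries, the `customer_id` key, and `build_training_dataset` are hypothetical and are not the Hopsworks client API.

```python
# Hypothetical feature groups, each keyed by customer_id.
orders_fg = {
    1: {"orders_7d": 3, "avg_spend_30d": 42.0},
    2: {"orders_7d": 0, "avg_spend_30d": 5.5},
}
profile_fg = {
    1: {"age": 34, "country": "SE"},
    2: {"age": 51, "country": "DE"},
    3: {"age": 27, "country": "US"},
}

def build_training_dataset(feature_groups, selected):
    """Inner-join feature groups on their key and keep only `selected` features."""
    keys = set.intersection(*(set(fg) for fg in feature_groups))
    rows = []
    for key in sorted(keys):
        merged = {}
        for fg in feature_groups:
            merged.update(fg[key])
        rows.append({"customer_id": key, **{name: merged[name] for name in selected}})
    return rows
```

For example, `build_training_dataset([orders_fg, profile_fg], ["orders_7d", "age"])` yields one row per customer present in both groups, so customer 3 is dropped by the inner join.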

If you are interested, we have a longer blog post explaining the core concepts of the Hopsworks feature store: https://www.logicalclocks.com/blog/feature-store-the-missing...


Thanks!

Also, the first 10-15 minutes of this recording explain the Feature Store concept and how it can integrate with other ML tools (in this case Sagemaker), incl. slides, examples and a demo, if you are more of a watch & listen type. https://www.youtube.com/watch?v=3DaTA7o0FHY&list=PLgN6fhzkSu...


Is this sort of lambda architecture for ML applications? Not referring to the "library" part of the flowchart obviously.

Only major difference I can spot is the streaming layer is more about read speed latency instead of serving real time data.


There is a similarity between feature engineering pipelines that feed the feature store and the lambda architecture, in that you have two sinks for your data - one for batch applications (and creating train/test data) and one for real-time serving of features. However, there is typically only one feature engineering pipeline (to make sure features are consistent between training and serving), whereas lambda has two (one batch, one streaming). So, you could come back and say it is more like the kappa architecture, but it could be either a batch or streaming application computing the features and saving them to the feature store.
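A minimal sketch of that "one pipeline, two sinks" shape, with in-memory stand-ins for the stores. `compute_features`, `FeatureSinks`, and the bucketing logic are invented for the example, and events are assumed to arrive in time order.

```python
from collections import defaultdict

def compute_features(event):
    """Single feature function shared by training and serving (hypothetical logic)."""
    return {"amount_bucket": min(int(event["amount"] // 50), 3)}

class FeatureSinks:
    def __init__(self):
        self.offline = defaultdict(list)  # full history per entity, for train/test data
        self.online = {}                  # latest value per entity, for low-latency reads

    def write(self, entity_id, ts, features):
        # The same computed features go to both sinks, so they cannot diverge.
        self.offline[entity_id].append((ts, features))
        self.online[entity_id] = features

def run_pipeline(events, sinks):
    """Works the same whether `events` is a batch or a stream consumed in order."""
    for event in events:
        sinks.write(event["customer_id"], event["ts"], compute_features(event))
```

Because `run_pipeline` only iterates over events, the same code could be driven by a batch job or a streaming consumer, which is the point about it being neither strictly lambda nor strictly kappa.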


Just a heads up: There is a typo in the flow chart — “logicaclocks.com”.


Some sad soul just lost their job. Also thanks; I fixed it.


Just to add a counter-anecdote, as I see lots of (good/valid) questions about "why do I need this?", here's an anecdote about "yes we definitely benefited from this":

- Years and years ago, we already had a data warehouse (DWH)

- In the data warehouse, you would store data like, each and every order that all customers have made (i.e. up to and including full-fledged facts and dimensions about each)

- Now, let's say axiomatically/hypothetically, a very useful and highly predictive feature for ML is "# of orders made in past 7 days" for each customer

- Can this be computed from the data already in the DWH? Yes, absolutely, but it's a new computation and not an existing column/attribute in the dimensional model.

- What if you need to recompute this feature daily, for millions of customers and orders? Well, we could always just add it to the dimensional model, compute it once, and let people just use it/share it... but why? Most internal users of the DWH probably don't care about something like "# of orders past 7 days" as something to be added to a customer dimension or per-customer grain (too specific or whatever), and moreover, the DS/ML folks want the same feature but for "every 1/3/7/30/90/180/365/730/etc." days breakdown, as well as a bunch of variations about orders and things other than orders (e.g. "average time between new orders, over past 7/30/90 days" or "average $ spent over past 7/30/90 days", as features that serve as a proxy for frequency of activity and level of engagement)

- Hence, it makes sense to keep the "golden copy" of data in a canonical form in the classic/standard DWH as a baseline, and to separately/independently compute features out of that data and to store them in a different system (which can also be optimized for the different query/access patterns that DS/ML have, vs traditional BI). Over time, it also made sense in certain cases to go upstream of the DWH to source data from and process it more directly (for performance/efficiency reasons), though generally deriving features out of the transformed dimensional models was still very useful.

- It took our teams ~1-2 years to really go through this evolution and reach a mature-ish state, but for the past 2-3 years, we've benefited tremendously from having an independent feature repository/store, that is separate from the classic DWH. Benefits came in all the obvious and some non-obvious ways, i.e. in faster iteration/cycle time, in better quality/repeatability, and in being able to automatically discover interesting relationships that no human could have anticipated - simply by having a very broad/large repository of features and running automatic feature selection over it.
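The trailing-window features described in this list (order counts and average spend over 7/30/90 days) could be derived from order facts roughly as follows. The `ORDERS` sample data and the feature names are made up for illustration, and a real job would run this as a query or a Spark job rather than in pure Python.

```python
from datetime import date, timedelta

# Hypothetical order facts as they might come out of the DWH:
# (customer_id, order_date, amount).
ORDERS = [
    (1, date(2020, 5, 20), 30.0),
    (1, date(2020, 5, 25), 10.0),
    (1, date(2020, 3, 1), 50.0),
    (2, date(2020, 5, 26), 99.0),
]

def window_features(orders, as_of, windows=(7, 30, 90)):
    """Per customer: order count and average spend over each trailing window."""
    features = {}
    for customer_id in {cid for cid, _, _ in orders}:
        row = {}
        for days in windows:
            start = as_of - timedelta(days=days)
            # Window is (start, as_of]: exclusive at the start, inclusive at as_of.
            amounts = [amt for cid, d, amt in orders
                       if cid == customer_id and start < d <= as_of]
            row[f"orders_{days}d"] = len(amounts)
            row[f"avg_spend_{days}d"] = sum(amounts) / len(amounts) if amounts else 0.0
        features[customer_id] = row
    return features
```

Adding another breakdown is just another entry in `windows`, which is why these variations multiply so quickly and fit poorly as columns in a customer dimension.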


In your use case is a feature store roughly a collection of incrementally updated feature tables + time/version meta?

We've been storing our features in RDBMS/normalized with each feature table having a run_id column (run_id is unique to pipeline version + timestamp of execution if batch processed; if streaming, there is a run_id for the pipeline version but the date comes from the interaction date, which is part of the raw data already) and I'm curious what we're potentially missing.

In this sense you can query features for given users generated by a particular version and/or date, but it does involve potentially lots of joins (to get a collection of features).
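For readers trying to picture this layout, here is a small sketch using SQLite in place of the actual RDBMS. The table names (`runs`, `f_orders`, `f_spend`), columns, and sample rows are all assumptions for the example; only the run_id-per-feature-table idea comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runs (run_id INTEGER PRIMARY KEY, pipeline_version TEXT, executed_at TEXT);
    CREATE TABLE f_orders (run_id INTEGER, user_id INTEGER, orders_7d INTEGER);
    CREATE TABLE f_spend  (run_id INTEGER, user_id INTEGER, avg_spend_30d REAL);
    INSERT INTO runs VALUES (1, 'v1', '2020-05-26'), (2, 'v1', '2020-05-27');
    INSERT INTO f_orders VALUES (1, 10, 2), (2, 10, 3);
    INSERT INTO f_spend  VALUES (1, 10, 40.0), (2, 10, 42.5);
""")

def feature_vector(conn, user_id, pipeline_version, executed_at):
    """Join the per-feature tables for one user and one pipeline run."""
    return conn.execute(
        """
        SELECT o.orders_7d, s.avg_spend_30d
        FROM runs r
        JOIN f_orders o ON o.run_id = r.run_id
        JOIN f_spend  s ON s.run_id = r.run_id AND s.user_id = o.user_id
        WHERE o.user_id = ? AND r.pipeline_version = ? AND r.executed_at = ?
        """,
        (user_id, pipeline_version, executed_at),
    ).fetchone()
```

Each additional feature table adds one more join to this query, which is where the "lots of joins" cost shows up as the number of features grows.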


Conceptually I think that's about right (meaning, similar concepts as us, but who's to say whether this is objectively the best approach...). Practically, we're heavy on AWS, so we've found that for our size/scope/breadth, it was more performant/efficient to store the data as parquet files in S3 and cataloged in AWS Glue, and yes there are a lotttttt of joins needed, so it's worth taking time to invest in some deep thinking about how best to partition your data and how best to optimize the type/variety/complexity/number of joins you'll have to do.

I can see RDBMS being reasonable for small-to-medium size of data, but beyond a certain threshold, I think it starts to break down (at the few 10s/100s of TB level, maybe?).


Yeah, we are in the 1-10 TB range. We are bound to on-prem Oracle Exadata and so far it's ok.


> What if you need to recompute this feature daily, for millions of customers and orders?...moreover, the DS/ML folks want the same feature but for "every 1/3/7/30/90/180/365/730/etc." days breakdown, as well as a bunch of variations about orders and things other than orders (e.g. "average time between new orders, over past 7/30/90 days" or "average $ spent over past 7/30/90 days", as features that serve as a proxy for frequency of activity and level of engagement)

...what data warehouse/database are you using that can't support those queries on-demand at low-latency and granularity?

It seems to me that these 'feature stores' are just tables of pre-computed common aggregates, optionally with some kind of versioning id/timestamp. In which case, why do I need some special service to do this? Can't I just make a second database in my warehouse and write something that just computes all desired aggregations periodically and dumps them into the second db? Even easier, just use stored procedures and distribute them to data scientists, and use timestamp parameters to control versioning.


You can certainly get pretty far with that approach, and I wouldn't suggest doing anything else until you've outgrown it.

Semi-/un-related anecdote, but looking back at what my team was like 6 years ago, we used to be a classic enterprise data warehousing team, and our worldview at the time was basically that, "everything is just a bunch of databases or file systems, and our job is to copy (literally scp) files from source A to destination B." All problems (literally, all) would be solved as one of, "how can I export a file from this source, and what kind of table do we need to create in the DWH?" By that logic, we would make decisions like, "why do you need Splunk/Prometheus/Datadog/etc. for operational monitoring, when you could just write those metrics to tables in the DWH?".

In retrospect, what is surprising about such a worldview, is not that it was so limiting or that it hurt us in the long run (it did, but less than one might expect), but how far you could actually go and how much you could actually achieve, with such simple primitives.


Thanks for sharing; I found this to be a much more compelling proposition than the list-of-features in the posted article. I don't have all the problems Logical Clocks solves, but I can see my team in this story.

Do you have any particular pitfalls you could share that others might benefit from, mis-steps you made in this journey we might avoid? What really sucked about this process?


It's hard to give really generic useful guidance, so I would suggest that first you should make sure to self-evaluate what kind of data you've got, how broad/comprehensive is it, what kind of query patterns you'll commonly use, and to be ready to profile those query patterns to see/understand/fix bottlenecks as and when they emerge. If you don't have enough data (and this is subjective and depends on the domain), it's probably not worth the effort to architect a feature repository system that only has 1000 customers and 10 features. OTOH, at small/tiny scale, you really don't have to worry much about performance or complexity, so maybe it's nice to try something simple out quickly, even if it doesn't provide much business value right away.

One milestone for success, I think, is when your feature repository can be used to regularly glean new insights that are surprising (even to your experts). So for example, imagine you are a typical retailer selling widgets, and you have 10000 customers who bought widget X, and you want to identify, out of your remaining 10 million customers, which ones you should target as likely to also want to buy X. Now, of course, this is a classic problem and doesn't require a fancy feature repository; just the basic order information of who-bought-what is enough to create the classic collaborative filtering solution from.

But with a rich/useful feature repository, you can throw this problem in there, looking for the most similar other customers compared to the 10000 you know, and to rank which features are most powerful (in terms of, why it makes them similar), and you should see some interesting insights that you never would've guessed on your own.
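One very simple way to sketch this "find similar customers, then ask which features explain it" idea is shown below. It uses cosine similarity to the known buyers' centroid and ranks features by the gap between group means. Both choices are stand-ins picked for brevity, not what was actually used, and feature values are assumed to be pre-scaled to a common range.

```python
from math import sqrt

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_candidates(buyers, candidates):
    """Sort candidate ids by cosine similarity to the buyers' centroid, best first."""
    center = centroid(list(buyers.values()))
    return sorted(candidates, key=lambda cid: cosine(candidates[cid], center), reverse=True)

def rank_features(names, buyers, candidates):
    """Sort feature names by absolute gap between buyer and candidate means."""
    b = centroid(list(buyers.values()))
    c = centroid(list(candidates.values()))
    gaps = {name: abs(b[i] - c[i]) for i, name in enumerate(names)}
    return sorted(names, key=gaps.get, reverse=True)
```

Here `buyers` and `candidates` are dicts mapping a customer id to a feature vector in the same order as `names`. With thousands of features, the ranked list from `rank_features` is where the surprising ones would tend to show up.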

Just to answer the "what really sucked" question a little bit, even if it's implementation-specific... man it really sucked to start out with 2-out-of-5 level of knowledge about Apache Spark, and to stumble through learning both the obvious lessons and the non-obvious/esoteric bugs/quirks, until after 1-2 years the team finally felt comfortable with like a 4-4.5/5 level of knowledge about Spark. This was also specific to the state of Spark back in ~2017 or so, and it's come a long way since then.



