The data side of machine learning pipelines
This post is a translation of a Medium article by Sarah Wooders, Peter Schafhalter, and Joey Gonzalez. The translation was prepared with the support of the DataLearn analytics course community and the Data Engineering Telegram channel.
We need a meaningful way to manage state in real-time ML pipelines.
RISE* Feature Stores
As more and more models are deployed in production pipelines, it has become clear that data and its featurization** matter more than anything else. While the latest generation of big data systems has scaled ML to real-world datasets, feature stores are quickly becoming the new frontier for connecting models to real-time data.
Keeping features up to date is critical to model accuracy, but doing so is expensive and difficult to scale.
Feature stores, as the name suggests, store features derived from raw data and serve them to downstream models for training and inference***. For example, a feature store can hold the last few pages a user viewed (i.e., a sliding window over their navigation path through the site), as well as the user's most recently predicted demographics (i.e., the prediction of another model); both can be valuable features for an ad-targeting model.
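As a toy sketch of the sliding-window example above (the class and key names are hypothetical, not from the original post), an in-memory store might keep each user's last few page views like this:

```python
from collections import defaultdict, deque

class ToyFeatureStore:
    """Minimal in-memory sketch: keep the last `window` page views per user."""

    def __init__(self, window=3):
        # deque(maxlen=...) evicts the oldest view automatically.
        self.page_views = defaultdict(lambda: deque(maxlen=window))

    def record_view(self, user_id, page):
        self.page_views[user_id].append(page)

    def get_features(self, user_id):
        # Serve the current sliding-window feature to a downstream model.
        return list(self.page_views[user_id])

store = ToyFeatureStore(window=3)
for page in ["home", "shoes", "cart", "checkout"]:
    store.record_view("u1", page)
print(store.get_features("u1"))  # ['shoes', 'cart', 'checkout']
```

A real feature store would persist these windows durably and serve them at low latency; the deque only illustrates the kind of state being managed.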
Unfortunately, many of the feature stores being built today are Frankensteins stitched together from batch systems, streaming systems, caching systems, and storage systems.
In this post, we will (1) define what feature stores are and how they are used today, (2) highlight some of the design constraints of the current generation of feature stores, and (3) describe how innovations in feature store design can transform industrial machine learning by managing state in training and inference pipelines in a more meaningful way.
* The original uses the abbreviation RISE (Real-time Intelligent Secure Explainable), the name of the laboratory on whose blog the original article was published.
** Featurization refers to the process of converting raw, not necessarily numerical, data into a vector of numbers on which a model can learn more effectively. Example: vectorizing the words of a text with TF-IDF.
*** Inference is the stage at which a trained machine learning model transforms input data into predictions. Predictions can take different forms, from a scalar value characterizing the probability of an event to a multidimensional tensor representing an image of a person's face.
Why feature stores?
A simple ML pipeline trains a model based on a static dataset and then provides a model to respond to user inference queries.
However, to adapt to an ever-changing world, modern ML pipelines must make decisions that depend on real-time data. For example, a model predicting arrival time might use features such as the lead time of a recent order at a restaurant, and a content recommendation model can take into account the user's recent clicks. Training and inference therefore rely on features obtained in real time by joining, transforming, and aggregating incoming data streams. Since the featurization step can be expensive, features must be precomputed and cached to avoid redundant computation and to meet stringent prediction latency requirements.
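To illustrate why caching matters, here is a simplified sketch in which `featurize` is a hypothetical stand-in for an expensive transformation (such as TF-IDF or an embedding lookup); memoization avoids recomputing the feature for a repeated input:

```python
import functools

@functools.lru_cache(maxsize=None)
def featurize(text):
    # Stand-in for an expensive featurization step (e.g., embedding, TF-IDF).
    return tuple(len(word) for word in text.split())

featurize("recent order at restaurant")   # computed (cache miss)
featurize("recent order at restaurant")   # served from cache (cache hit)
print(featurize.cache_info().hits)        # 1
```

A feature store plays this caching role across the whole pipeline, with the added difficulty that cached values must also be kept fresh as new events arrive.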
What is a feature store?
Feature stores are used to store and serve features in multiple parts of the pipeline, allowing computation and optimization to be shared. Although different feature stores vary in functionality, they typically manage the following:
Serving features to meet diverse query latency requirements. Features are usually kept both in a fast "online store" (for queries during inference) and in a durable "offline store" (for queries during training).
Making features composable and reusable. Once a feature is defined, it should be easy to connect it to downstream models, derive additional features from it, or redefine its schema or featurization function.
Maintaining features derived from real-time data. Maintaining such features is resource-intensive, and stale features can degrade prediction quality.
Some features (for example, aggregates over a 1-minute time window) are very sensitive to staleness and must be kept extremely fresh, while others (for example, 30-day windows) may only need periodic batch updates. As the system that sees both feature updates and feature queries, a feature store is well positioned to balance freshness, latency, and cost.
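The online/offline split and per-feature freshness bounds described above can be sketched as follows (the API and staleness policy are illustrative, not a real feature store interface):

```python
import time

class DualStore:
    """Sketch: a fast online view for inference plus an append-only
    offline log for training, with a per-query staleness bound."""

    def __init__(self):
        self.online = {}    # feature key -> (latest value, timestamp)
        self.offline = []   # full history of (key, value, timestamp)

    def write(self, key, value, ts=None):
        ts = time.time() if ts is None else ts
        self.online[key] = (value, ts)
        self.offline.append((key, value, ts))

    def read_online(self, key, max_age_s, now=None):
        # Serve the cached value only if it is fresh enough for this feature.
        now = time.time() if now is None else now
        value, ts = self.online[key]
        if now - ts > max_age_s:
            raise LookupError(f"{key} is stale by {now - ts - max_age_s:.0f}s")
        return value

store = DualStore()
store.write("user:1:clicks_1min", 7, ts=100.0)
# Fresh enough for a 60-second bound at t=130:
print(store.read_online("user:1:clicks_1min", max_age_s=60, now=130.0))  # 7
```

A 1-minute aggregate would be read with a tight `max_age_s`, while a 30-day window could tolerate a much looser bound, which is exactly the freshness/cost trade-off the text describes.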
Feature Stores Today: Challenges and Limitations
Today, many companies have already built in-house feature stores to make features available to models deployed in production.
The table above shows examples of feature stores in use today.
Feature stores today build on existing streaming, batch processing, caching, and storage systems. Although each of these systems individually solves complex problems, their limitations create problems for feature stores.
Batch processing systems, such as Spark, allow complex queries over static datasets, but introduce excessive delays into feature delivery and trigger full recomputation when data is updated.
Streaming systems, such as Flink and Spark Streaming, provide low-latency pipelined computation, but are inefficient when a large amount of state must be maintained. Lambda architectures combine batch and streaming systems, but incur costly redundant computation and the burden of maintaining both streaming and batch codebases.
Streaming databases with materialized views offer both fast computation and storage, but are difficult to adapt to arbitrary feature operations, and their latencies can be too high for serving predictions.
Key/value in-memory stores, such as Redis, provide fast access to features, but they are typically hard to update consistently and expensive to scale.
Many feature store requirements can be met by combining these systems. However, the resulting pipeline is rigid and hard to optimize as a whole. For example, prioritizing featurization tasks by their impact on overall prediction accuracy would require coordination between the store receiving queries, the streaming system delivering real-time updates, and the historical batch processing system. Rather than clumsily gluing together multiple compute engines and multiple databases to hit different latency targets, feature stores should exploit their access to incoming events and query patterns to optimize latency, compute cost, and prediction accuracy in a centralized manner.
The future of feature stores
We believe that feature stores can offer centralized state management for ML pipelines and have great potential to do so:
Lineage management. Feature stores open the door to new data-centric abstractions for designing and maintaining machine learning pipelines. The complexity of existing ML pipelines often makes it difficult to achieve basic reproducibility, apply pipeline changes, or optimize the pipeline as a whole. While careful versioning and synchronization can address these problems to some extent, it is hard to imagine applying those techniques to ever-changing datasets. A data-centric view of pipelines (e.g., treating data pipelines as materialized views) could introduce new abstractions that simplify propagating changes to data and operators.
End-to-end optimization. Feature stores are well positioned to enable new end-to-end optimizations in ML data pipelines. Existing systems compute either on events or on queries, making it difficult to schedule tasks in a way that optimizes metrics such as prediction performance and cost. Practitioners should be able to tune their pipelines to optimize cost (lazy evaluation/updates, approximate results), inference latency (eager computation), or overall prediction performance (updating the highest-impact features first).
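The eager-versus-lazy trade-off above can be made concrete with a toy counter feature (both classes are hypothetical illustrations, not an API from the post):

```python
class EagerCounter:
    """Eager: pay compute on every event; queries are cheap and fresh."""
    def __init__(self):
        self.count = 0
    def on_event(self, event):
        self.count += 1            # update the feature immediately
    def query(self):
        return self.count

class LazyCounter:
    """Lazy: buffer events; pay compute only when a query arrives."""
    def __init__(self):
        self.buffer = []
    def on_event(self, event):
        self.buffer.append(event)  # defer the computation
    def query(self):
        return len(self.buffer)    # compute on demand

eager, lazy = EagerCounter(), LazyCounter()
for event in ["click"] * 5:
    eager.on_event(event)
    lazy.on_event(event)
print(eager.query(), lazy.query())  # 5 5
```

Both return the same answer; they differ in where the cost lands: eager updates keep inference latency low, while lazy ones avoid work for features nobody queries.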
Scalable state management. Feature stores require scalable state maintenance and persistence in ML pipelines. Industrial real-time ML pipelines often need to maintain tens of millions of features derived from multiple dense incoming data streams. Feature sets can be too large to fit in memory or to update on every incoming event, as streaming systems do by default, yet they must be updated more often than batch systems allow.
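One common middle ground between per-event and batch updates is micro-batching; the sketch below (hypothetical, not from the original post) buffers increments and flushes them in groups, trading per-event write cost for bounded staleness:

```python
from collections import Counter

class MicroBatchUpdater:
    """Buffer per-key increments and flush every `batch_size` events."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.pending = Counter()   # not-yet-applied increments
        self.state = Counter()     # the materialized feature values
        self.flushes = 0

    def on_event(self, key):
        self.pending[key] += 1
        if sum(self.pending.values()) >= self.batch_size:
            self.flush()

    def flush(self):
        # One bulk write instead of `batch_size` individual writes.
        self.state.update(self.pending)
        self.pending.clear()
        self.flushes += 1

updater = MicroBatchUpdater(batch_size=3)
for key in ["a", "a", "b", "a", "b", "b"]:
    updater.on_event(key)
print(dict(updater.state), updater.flushes)  # {'a': 3, 'b': 3} 2
```

Choosing `batch_size` (or an equivalent time bound) is exactly the freshness-versus-cost knob the paragraph above describes.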
We're actively exploring feature store design, so let us know if you're interested in following our latest work or collaborating!
If you would like to participate in our research, do not hesitate to contact email@example.com.
By "real-time data" we mean data that must be processed promptly, both to deliver predictions online and to keep features fresh.
Updates to "real-time" data should typically land within a few seconds, though this may vary by workload.
Thanks to Manmeet Gujral, Gaetan Castelein, and Kevin Stumpf from Tecton, as well as Joe Hellerstein, Natacha Crooks, Simon Mo, Richard Liaw, and other RISELab members for providing feedback on this post.