Why integrate LinkedIn Feathr into your stack? This tool can help you automate and improve your Machine Learning (ML) feature management.
As explained in a previous article, Feature Engineering is “the preparation of raw data that is to be used in the machine learning model.” The success of a model depends on the quality of the data that feeds it, and data scientists spend most of their working time on data preparation. Doing this efficiently poses several challenges: feeding quality data, cleaning and organizing files, picking the right algorithm and model, creating the best features, generating outputs, and more.
There are various Feature Engineering techniques to help with this. Combined with the right tools, they can make a huge difference in any ML project. One such tool is Feathr, which LinkedIn open sourced and describes as “the feature store we built to simplify machine learning (ML) feature management and improve developer productivity.”
What is Feathr?
For LinkedIn data scientists, preparing and managing features was one of the most time-consuming tasks, and it made ML applications difficult to scale. Without common processes, frameworks, and ways of developing projects, it was hard to share or reuse work.
With these limitations in mind, the team created Feathr in 2017. It’s “an abstraction layer that provides a common feature namespace for defining features and a common platform for computing, serving, and accessing them ‘by name’ from within ML workflows,” according to LinkedIn. Feathr is also categorized as a “feature store”, a recent term “to describe systems that manage and serve ML feature data.”
A “producer” can define and register features in Feathr, and “consumers” can access those features and incorporate them into their ML model workflows. This reduces the time and resources needed and also standardizes the way features are created. “For model training, features are computed and joined to input labels in a point-in-time correct way, and for model inferencing, features are pre-materialized and deployed to online data stores for low-latency online serving. Features defined by different teams and projects can easily be used together, enabling collaboration and reuse,” the company explains.
And this Datanami blog post adds: “Instead of manually working with features as part of an individual data pipeline, Feathr automates and standardizes the interaction with the data type, which is used in both the training and inference stages of machine learning.”
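The producer/consumer idea described above can be sketched in a few lines of plain Python. This is only an illustration of a shared feature namespace, not the Feathr API: `FeatureDef`, `FeatureRegistry`, the feature names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping

# Hypothetical names throughout: this is NOT the Feathr API, only an
# illustration of registering features once and reading them by name.
Row = Mapping[str, float]


@dataclass(frozen=True)
class FeatureDef:
    name: str
    transform: Callable[[Row], float]  # computes the feature from a raw row
    owner: str


class FeatureRegistry:
    """In-memory stand-in for a common feature namespace."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDef] = {}

    def register(self, feature: FeatureDef) -> None:
        # Producers publish a feature once under a unique name.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._features[feature.name] = feature

    def compute(self, names: List[str], row: Row) -> Dict[str, float]:
        # Consumers ask for features by name and never touch the transform.
        return {name: self._features[name].transform(row) for name in names}


registry = FeatureRegistry()

# Producer side: two teams define features over the same raw data.
registry.register(FeatureDef(
    name="trip_cost_per_mile",
    transform=lambda row: row["fare"] / row["distance_miles"],
    owner="pricing-team",
))
registry.register(FeatureDef(
    name="is_long_trip",
    transform=lambda row: float(row["distance_miles"] > 10.0),
    owner="routing-team",
))

# Consumer side: a model workflow requests both features by name.
raw_row = {"fare": 30.0, "distance_miles": 12.0}
features = registry.compute(["trip_cost_per_mile", "is_long_trip"], raw_row)
print(features)  # {'trip_cost_per_mile': 2.5, 'is_long_trip': 1.0}
```

Because consumers only reference names, a producer can change how a feature is computed without every downstream pipeline having to be rewritten, which is the reuse benefit the quotes above describe.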
Today, LinkedIn uses it for large ML projects and has deployed dozens of applications with it, covering core functionality such as Search, Feed, and Ads.
With Feathr, you can:
- Define features based on raw data sources.
- Create features for different scenarios, such as model training or model serving.
- Use APIs based on Python.
- Share the features with your entire team to reuse them.
- Use features created by other team members in your pipelines and projects.
- Connect offline data sources to transform them into features.
- Integrate natively with other tools, such as Databricks and Azure Synapse.
This article on GitHub highlights some of the main capabilities of Feathr:
- Feathr UI “provides an intuitive UI so you can search and explore all the available features and their corresponding lineages.”
- Rich UDF Support: “Feathr has highly customizable UDFs with native PySpark and Spark SQL integration to lower learning curve for data scientists.”
- Define Window Aggregation Features with Point-in-time correctness.
- Define features on top of other features – Derived Features.
- Define Streaming Features.
- Point in Time Joins.
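To make "window aggregation with point-in-time correctness" concrete, here is a minimal sketch in plain Python. It does not use Feathr itself. The `Event` and `Label` types, the function names, and the data are hypothetical, and excluding events at exactly the label's timestamp is an assumption made for this example.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    user_id: str
    timestamp: datetime
    amount: float


class Label(NamedTuple):
    user_id: str
    timestamp: datetime
    target: int


def windowed_sum(events: List[Event], user_id: str,
                 as_of: datetime, window: timedelta) -> float:
    """Sum amounts for user_id in the half-open range [as_of - window, as_of).

    Events at or after as_of are excluded, so the feature never uses
    information from the label's future.
    """
    start = as_of - window
    return sum(
        e.amount for e in events
        if e.user_id == user_id and start <= e.timestamp < as_of
    )


def point_in_time_join(labels: List[Label], events: List[Event],
                       window: timedelta) -> List[Tuple[Label, float]]:
    # Each label row gets the feature value as it stood at the label's time.
    return [
        (label, windowed_sum(events, label.user_id, label.timestamp, window))
        for label in labels
    ]


events = [
    Event("u1", datetime(2022, 1, 1), 10.0),
    Event("u1", datetime(2022, 1, 4), 20.0),
    Event("u1", datetime(2022, 1, 9), 40.0),
]
labels = [
    Label("u1", datetime(2022, 1, 5), 0),
    Label("u1", datetime(2022, 1, 10), 1),
]

joined = point_in_time_join(labels, events, window=timedelta(days=7))
values = [value for _, value in joined]
print(values)  # [30.0, 60.0]
```

The first label only sees the events from January 1 and 4. The January 9 event is in its future and is ignored, which is what prevents leakage between training and serving. A feature store runs this kind of join at scale on a compute engine rather than in a Python loop.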
The GitHub article also shows how Feathr's architecture and cloud integrations fit together. As explained before, Feathr can be integrated with other tools, including Azure catalog. Each Feathr component works with one or more of those tools across the different tasks and phases of a project: object store, streaming, governance, compute engine, credentials, and more.
An Open Source Tool
One of the main advantages of Feathr is that it’s now Open Source. Since April 2022, Feathr has been available to developers, data scientists, and any ML professional. The project is also hosted under the LF AI & Data Foundation. “We aim to support Feathr to expand its user base, grow its community of developers, become a leader within its own category, and enable collaboration and integration opportunities with other projects. We look forward to the project’s continued growth and success as part of LF AI & Data,” said Dr. Ibrahim Haddad, the Foundation’s General Manager.
How to install Feathr
To install the Feathr client in a Python environment, run:
pip install feathr
Or use the latest code from GitHub: