What is LakeFS?
Let’s start by stating that a data lake is centralized storage that lets you store data as it arrives, at very large volumes. A single repository can hold many kinds of structured and unstructured data, with no file-type restrictions and no fixed size limits.
LakeFS was designed to “transform object storage buckets into data lake repositories that expose a Git-like interface”. “The Git-like interface means users of LakeFS can use the same development workflows for code and data. Git workflows greatly improved software development practices; we designed LakeFS to bring the same benefits to data.” These direct quotes from the official LakeFS website sum up how the product works.
LakeFS was developed by Treeverse, an Israeli startup whose goal is to simplify the lives of engineers, data scientists and analysts by solving big data problems and contributing to the open source community.
Following this idea, LakeFS brings better manageability to data lakes without compromising flexibility: it can be used in projects running on AWS S3, Google Cloud Storage or Azure Blob Storage. LakeFS also works alongside the major data frameworks, such as Kafka, Apache Spark, Delta Lake, Amazon Athena, Databricks and Hadoop. “LakeFS enables simplified workflows when developing data lake pipelines”, they explain.
Where Does LakeFS Fit Into a Modern Architecture?
Modern enterprises have to process and manage vast amounts of data, typically drawn from many sources and loaded into cloud-based data lakes. In this context, LakeFS sits between the ETL process (extract, transform and load) and the data lake itself.
“Integrating ETL technologies with LakeFS enables writing new data to a designated branch, and testing it to ensure quality before exposing to consumers”, Einat Orr, Treeverse cofounder, explained to venturebeat.com. “This workflow provides important guarantees about production data to consumers of the data.”
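As a sketch of the workflow Orr describes, the write-to-branch, test, then merge pattern might look like this with lakeFS’s `lakectl` CLI (the repository name `analytics` and branch names here are made up for illustration; exact flags may vary by lakeFS version):

```shell
# Create an isolated branch off main for the new ETL output.
# Copy-on-write means no objects are actually duplicated.
lakectl branch create lakefs://analytics/etl-daily-load \
  --source lakefs://analytics/main

# ... the ETL job writes its output to the etl-daily-load branch ...

# Commit the new data as an atomic snapshot on the branch.
lakectl commit lakefs://analytics/etl-daily-load \
  -m "Load daily orders extract"

# Once quality checks pass, expose the data to consumers by
# merging the branch into main.
lakectl merge lakefs://analytics/etl-daily-load lakefs://analytics/main
```

If the quality checks fail, the branch can simply be discarded, so consumers reading from `main` never see the bad data.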
Other products in the same line of work include Iterative.ai’s DVC, which is aimed directly at data scientists working with machine learning models. Delta Lake can also bring versioning to data lakes, but it operates at the level of individual tables, so you can’t version all of your datasets together at the same time.
LakeFS, on the other hand, was designed for a broader range of use cases: essentially anyone working with data can benefit from its functionality. LakeFS enhances data visibility and increases efficiency across an organization, and because it is open source, data scientists and engineers can help design solutions that meet their own or their peers’ needs.
What Can You Do with Git-like Version Control in Data Lakes?
- Creating a branch that is isolated from the rest and acts as a copy of the original repository. Thanks to the copy-on-write mechanism, no objects are duplicated, so branching is cost effective. A new branch can be used to reprocess data in isolation.
- Using the commit operation to create checkpoints, each containing a complete snapshot of the repository.
- With the aforementioned checkpoints, you can revert your whole repository to a previous state of committed data. This is especially useful to recover from data errors.
- Merging two branches, updating one with the changes made to the other. This enables synchronous updates for two or more data assets.
- Creating tags that point to a single commit, giving it a more usable and meaningful name than a raw commit ID.
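The operations above map onto short `lakectl` commands. This is a hedged sketch (the repository name `analytics`, the tag name and the `<commit-id>` placeholder are illustrative, not taken from a real deployment):

```shell
# Tag a commit so it has a human-friendly, meaningful name
# instead of a raw commit ID.
lakectl tag create lakefs://analytics/v1.0 lakefs://analytics/main

# Revert the main branch's changes from a bad commit, recovering
# from a data error by restoring previously committed data.
lakectl branch revert lakefs://analytics/main <commit-id>
```

Because every commit is a complete snapshot of the repository, reverting restores all affected datasets consistently rather than one file at a time.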
LakeFS is one more tool to consider for your data projects. It is production ready and open source, gives you more control over the tasks performed on your data, is easy to attach to a project already in production, and adds new value through its branching, merging and commit features.