The Delta Architecture is gaining popularity and advocates across the Big Data world because it offers greater simplicity, quality, and reliability, backed by ACID transactions, compared to alternatives such as the Lambda or Kappa Architectures.
As pointed out by Denny Lee, Developer Advocate at Databricks, a data engineer’s dream is to “process data continuously and incrementally as new data arrive in a cost efficient way without having to choose between batch or streaming”. And the Delta Architecture promises to move engineers one step closer to that dream.
Previously, we discussed the differences between the Lambda and Kappa Architectures. Now it's time to focus on the Delta Architecture and understand how it can be an evolution in data management.
The Lambda Architecture, an Old Friend
In the early 2010s, processing data in real time, especially huge datasets, was still a problem. Latency, complexity, and the lack of a single tool for building a Big Data system were some of the issues identified by Nathan Marz at the time. In this context, Marz proposed the Lambda Architecture, which tried to solve the problem with a hybrid approach, "by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer."
In this architecture, the Batch Layer can take its time processing large volumes of data that require heavy computation (the cold path), while the Speed Layer computes in real time and performs incremental updates to the batch layer's results (the hot path). Finally, the Serving Layer takes the outputs of both and uses them to answer queries. Additionally, "it features an append-only immutable data source that serves as a system of record. Timestamped events are appended to existing events and nothing ever gets overwritten", as this blog post notes.
However, complexity has always been a downside. “While a Lambda architecture can handle large volumes of batch and streaming data, it increases complexity by requiring different code bases for batch and streaming, along with its tendency to cause data loss and corruption. In response to these data reliability issues, the traditional data pipeline architecture adds even more complexity by adding steps like validation, reprocessing for job failures, and manual update and merge”, says Hector Leno in this article.
The Kappa Architecture, an Improvement
Later, the Kappa Architecture appeared as an alternative. It's event-based and doesn't separate batch and streaming into different layers. The Kappa Architecture only has a Streaming Layer and a Serving Layer, so all data that needs to be processed is handled by a single technology stack.
The Kappa proposal represented an evolution in data processing and data analysis. Even so, it is still complex to implement, consumes extensive compute resources, and is hard to scale.
Delta Architecture, a New Approach
Currently, the Delta Architecture seems to be the next step in data architecture. But first, it's better to be familiar with the Delta Lake concept, since the Delta Architecture relies on it. Delta Lake, as we explained previously, is an open-source storage framework that brings ACID transaction support and schema enforcement to Apache Spark-driven data lakes. It allows users to build a "Data Lakehouse" architecture that works with structured, semi-structured, and unstructured data.
Delta Lake “extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.” It’s also compatible with Apache Spark APIs and integrated with Structured Streaming. Additionally, the separation between Layers in Delta Architecture is minimal compared to the Lambda Architecture, so there’s no need to treat data differently based on its source.
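To make the idea of a file-based transaction log concrete, here is a minimal, simplified sketch in plain Python. The file naming and action format loosely mirror Delta Lake's `_delta_log` JSON commits, but this is an illustration of the mechanism, not the actual implementation:

```python
import json
import os
import tempfile

def commit(log_dir, actions, version):
    """Append a new commit file; commits are named with a zero-padded version."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Writing a brand-new file per commit keeps the log append-only:
    # earlier commits are never modified, so readers always see a
    # consistent snapshot up to some version.
    with open(path, "x") as f:  # "x" fails if this version already exists (write conflict)
        for action in actions:
            f.write(json.dumps(action) + "\n")

def current_snapshot(log_dir):
    """Replay all commits in order to reconstruct the set of live data files."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files

log_dir = tempfile.mkdtemp()
commit(log_dir, [{"add": {"path": "part-0000.parquet"}}], version=0)
commit(log_dir, [{"add": {"path": "part-0001.parquet"}},
                 {"remove": {"path": "part-0000.parquet"}}], version=1)
print(sorted(current_snapshot(log_dir)))  # ['part-0001.parquet']
```

Because every commit is a new, immutable file, a reader that replays versions 0..N always sees a consistent table state, which is the essence of how the log provides ACID snapshots over plain Parquet files.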
Considering this context, Databricks presents it as “a completely different approach to ingesting, processing, storing, and managing data focused on simplicity. All the processing and enrichment of data from Bronze (raw data) to Silver (filtered) to Gold (fully ready to be used by analytics, reporting, and data science) happens within Delta Lake, requiring less data hops”.
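The Bronze → Silver → Gold progression can be sketched as a toy pipeline. Plain Python is used here purely to show the idea; in practice each stage would be a Delta table, and the field names below are invented for illustration:

```python
# Each stage reads the previous stage's output, so all enrichment happens
# within one storage layer rather than hopping between separate systems.

raw_events = [
    {"user": "alice", "amount": "42.5", "valid": True},
    {"user": "bob",   "amount": "oops", "valid": False},  # bad record
    {"user": "alice", "amount": "7.5",  "valid": True},
]

def to_bronze(events):
    """Bronze: land the raw data as-is, nothing dropped."""
    return list(events)

def to_silver(bronze):
    """Silver: filter out invalid records and normalise types."""
    return [{"user": e["user"], "amount": float(e["amount"])}
            for e in bronze if e["valid"]]

def to_gold(silver):
    """Gold: aggregate into an analytics-ready view (total spend per user)."""
    totals = {}
    for e in silver:
        totals[e["user"]] = totals.get(e["user"], 0.0) + e["amount"]
    return totals

gold = to_gold(to_silver(to_bronze(raw_events)))
print(gold)  # {'alice': 50.0}
```

The point is the shape of the flow: raw data is preserved at Bronze, cleaned at Silver, and served analytics-ready at Gold, all within the same storage layer.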
Delta Architecture’s Promises
- Lower your costs: its simplicity helps you reduce costs significantly by reducing the amount of data that needs to be sent and received, the time needed to process data, and the number of times you need to re-run jobs because of failures.
- Delta = Less code: as we already said, Lambda Architectures need different code bases for each part of the architecture. With Delta, because transactions are ACID compliant, your code becomes less complex: logic that previously had to be written by hand (to guarantee data consistency, for example) is no longer needed.
- Improved Indexing: when you use Delta Lake as the storage for your architecture, you gain the ability to use Bloom Filter Indexes, which can improve query execution performance by over 50%, according to MSSQLTips.com.
- One source of data: when using other architectures and trying to simplify processes, data will often be copied from a data lake to other smaller data warehouses. This creates consistency and versioning issues that are solved by using the Delta Architecture.
- Adding more data sources? No problem: usually, after a data architecture is designed and deployed for a specific use case, it’s hard for new data sources to be added. But when using Delta Lake as your engine, this no longer presents an enormous challenge as schema evolution makes adding new data sources (or changing the formats of existing data sources) a simpler task.
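The schema-evolution idea in the last point can be sketched in a few lines. This is a simplified, hypothetical model of how a schema merge behaves (similar in spirit to Delta Lake's `mergeSchema` option, but not its actual code):

```python
def merge_schemas(existing, incoming):
    """Evolve a table schema: brand-new columns are added automatically;
    a type conflict on an existing column rejects the write, which is
    how schema *enforcement* coexists with schema *evolution*."""
    merged = dict(existing)
    for column, dtype in incoming.items():
        if column in merged and merged[column] != dtype:
            raise TypeError(f"type conflict on column {column!r}")
        merged[column] = dtype
    return merged

# Existing table schema, and a new source that adds a 'country' column.
table_schema = {"user_id": "long", "amount": "double"}
new_source   = {"user_id": "long", "amount": "double", "country": "string"}

table_schema = merge_schemas(table_schema, new_source)
print(table_schema)  # the 'country' column was added without rewriting the table
```

This is why onboarding a new data source stops being a redesign exercise: compatible additions flow in automatically, while incompatible changes are surfaced as explicit errors instead of silently corrupting the table.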
In a world seeking to be data-driven, developing a robust solution that can scale and handle any amount or type of data has been the biggest challenge of the last few years. In this time, proposals such as the Lambda and Kappa Architectures have emerged as a response to this need. However, they're still far from ideal.
“There have been attempts to unify batch and streaming into a single system in the past. Organizations have not been that successful though in those attempts. But, with the advent of Delta Lake, we are seeing a lot of our customers adopting a simple continuous data flow model to process data as it arrives. We call this architecture, the Delta Architecture”, explains Databricks, the company behind it.
“Using this approach, we can improve our data through a connected pipeline that allows us to combine streaming and batch workflows through a shared file store with ACID-compliant transactions and provides the best of both worlds”, complements this analysis.