Data processing and data analysis play a key role in the decision-making process of businesses. However, these tasks usually happen behind the scenes, and deciding how to start building a data infrastructure that meets your needs can be hard.
ETL is the process of extracting, transforming, and loading data from various sources into data warehouse systems. It used to be done in batches, but modern architectures are evolving, and ETL now happens in near real-time whenever possible. Besides near real-time capabilities, a modern data infrastructure should be flexible, scalable, automated, elastic, and extensible, and it should support updates and new applications.
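To make the three stages concrete, here is a minimal ETL sketch in Python. The source records, field names, and in-memory "warehouse" are illustrative only; a real pipeline would read from actual source systems and write to a warehouse.

```python
def extract(source_rows):
    """Extract: read raw records from a source system (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Transform: clean and reshape records to fit the warehouse schema."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
        if r.get("user") and r.get("amount") is not None  # drop incomplete records
    ]

def load(rows, warehouse):
    """Load: append transformed records into the target store."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
raw = [{"user": " Alice ", "amount": "9.99"}, {"user": "", "amount": "1.00"}]
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'user': 'alice', 'amount': 9.99}]
```

In a batch setup this pipeline runs on a schedule over accumulated data; in a near real-time setup the same three stages run continuously over incoming events.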
In this context, Kappa and Lambda architectures can handle both real-time (streaming data) and static (batch) datasets, but they do so in very different ways.
Nathan Marz, who described the Lambda paradigm first, explained that “computing arbitrary functions on an arbitrary dataset in real-time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system”. To approach this problem, Lambda architectures attempt to solve the issue of processing both kinds of data streams in a hybrid fashion, “by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer”.
When data enters the system, the Batch Layer and the Speed Layer receive it simultaneously. Both start processing, and the batch layer's results eventually correct and supersede the approximate results produced from the stream. Each layer has a specific task to complete.
This means that Lambda architecture separates the processing of real-time data from batch data. The Batch Layer can take its time to process tons of data that take a lot of computation time while the Speed Layer computes in real-time and performs incremental updates to the batch layer results. The Serving Layer takes the outputs of both and uses this data to solve pending queries.
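The three layers can be sketched as follows. This is a toy illustration of the pattern, not a production implementation: the batch layer recomputes a view over the full master dataset, the speed layer keeps incremental counts for events the batch run has not yet absorbed, and the serving layer merges both views to answer queries.

```python
from collections import Counter

def batch_layer(master_dataset):
    """Recompute the batch view from the complete master dataset (slow, thorough)."""
    return Counter(event["key"] for event in master_dataset)

def speed_layer(recent_events):
    """Incrementally count events that arrived since the last batch run (fast, approximate)."""
    return Counter(event["key"] for event in recent_events)

def serving_layer(batch_view, speed_view, key):
    """Merge both views to answer a query."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

master = [{"key": "clicks"}] * 1000               # processed by the last batch run
recent = [{"key": "clicks"}, {"key": "views"}]    # arrived since then

batch_view = batch_layer(master)
speed_view = speed_layer(recent)
print(serving_layer(batch_view, speed_view, "clicks"))  # 1001
```

When the next batch run absorbs the recent events, its recomputed view replaces the speed layer's increments, which is how the batch layer "corrects" the stream results over time.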
The Kappa Architecture is event-based and only has the Streaming Layer and the Serving Layer, so every kind of data that needs to be processed will be handled by a single technology stack. This paradigm solves the redundancy in Lambda architectures, avoiding the maintenance of two different code bases to feed the layers, as well as the multi-layered structure.
The idea behind doing everything in the Streaming Layer is replaying data: both continuous reprocessing and real-time data are handled by a single processing engine, which avoids maintaining different code for different layers. Kappa may appear simpler, but what happens when data arrives out of order and needs to be rearranged, a field needs to be added, or items need to be cross-referenced? These challenges are solved more easily by batch processing, and Kappa architectures were not designed to handle this kind of operation, although solutions to those problems have emerged over time.
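The replay idea can be sketched with a toy append-only log (standing in for a durable stream such as a Kafka topic; the class and function names here are illustrative). The same processing job serves both the live view and full reprocessing, simply by re-reading the log from the beginning:

```python
class EventLog:
    """Append-only log standing in for a durable stream (e.g. a Kafka topic)."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def replay(self, from_offset=0):
        """Re-read the log from an offset -- the basis of reprocessing in Kappa."""
        return iter(self.events[from_offset:])

def process(events):
    """The single stream-processing job: count events per key."""
    counts = {}
    for event in events:
        counts[event["key"]] = counts.get(event["key"], 0) + 1
    return counts

log = EventLog()
for key in ["clicks", "clicks", "views"]:
    log.append({"key": key})

live_view = process(log.replay())    # current state from the stream
reprocessed = process(log.replay(0)) # after a code change: replay the whole log
print(live_view)  # {'clicks': 2, 'views': 1}
```

There is only one `process` function to maintain; changing the logic means redeploying the job and replaying the log through it, rather than keeping separate batch and streaming codebases in sync.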
Pros and Cons of Lambda and Kappa Architectures
**Pros**

| Lambda | Kappa |
| --- | --- |
| Out-of-order data problems are solved by batch processing. | Solves the redundancy problem of Lambda architecture with a single set of infrastructure. |
| Balances speed, reliability, and scalability. | Only uses one codebase. |
| Processing complete data sets opens a window for code optimizations. | Uses fewer resources. |
| Large data sets (petabyte range) can be processed more efficiently. | Having every data point as a streaming event allows you to see the complete state of the organization at any point in time. |
| Covers many data analysis scenarios. | Queries only need to search a single location. |
| | No need to modify the architecture for new use cases. |

**Cons**

| Lambda | Kappa |
| --- | --- |
| Double coding for batch and streaming layers could mean two dev teams. | Difficult to implement, especially the data replay part. |
| Changes in one place need to be propagated to the other. | Handling duplicate events or maintaining the order of data is difficult when all data is processed in a stream. |
| Double infrastructure means double investment, double monitoring, and double logs. | Harder to scale. |
| Duplicate modules and lots of coding overhead. | The use of extensive compute resources is more expensive than cold storage. |
| Reprocesses every batch cycle. | |
| Difficult to migrate when a data set has been modeled for a Lambda architecture. | |
Use Cases for Lambda Architectures
- When stored data is permanent and you need to be able to update and incorporate new data into the database.
- When immutable storage is used and queries must reflect events as they happened.
- When quick responses are required and the system needs to handle several updates across data streams.
Use Cases for Kappa Architectures
- When algorithms applied to real-time data and historical datasets are the same, using a single data processing layer is beneficial.
- When developing data systems that are online learners, meaning that there is no need for a batch layer.
- In general, a Kappa Architecture can elegantly solve any use case that does not specifically require Lambda's batch layer.
Lambda vs Kappa Architecture: How to Choose One for My Business?
When starting a big data project, a comprehensive analysis helps you decide which architecture to use. Migrating from one to the other is not simple, and sometimes not feasible at all, so choosing the right architecture from the beginning is key.
For now, Kappa and Lambda architectures remain complementary. Some teams may prefer one or the other, but the decision ultimately depends on the needs of the project. We will continue to see data analysis projects built with both, as there is no single solution for every kind of problem in real-time data processing.