Delta Lake Enables ACID Compliance in Apache Spark

The Basics: What is ACID Compliance?

When performing database operations, such as adding new records or fields or updating existing ones, data must be handled carefully. One way to guarantee this is ACID compliance. ACID is an acronym for Atomicity, Consistency, Isolation, and Durability, coined by Andreas Reuter and Theo Härder and based on earlier work by Jim Gray.

ACID, then, is a set of guiding principles that help ensure database transactions are performed reliably, so the database keeps its integrity.

  • Atomicity, in this context, means that there are only two possible outcomes for a database transaction: either it completes successfully or it has no effect at all. Data is either processed correctly or the database is rolled back to its previous state, with no partial writes or corruption.
  • Consistency means that a transaction always leaves the data in a valid state: the constraints placed on the dataset are never violated. If consistency cannot be maintained, the operation fails and the database is rolled back to its previous, valid state.
  • Isolation means that concurrent transactions on the same data do not interfere with one another; each transaction behaves as if it were running alone.
  • Durability means that once an operation is successfully committed, it is permanent: the data survives subsequent system failures.

Delta Lake enhances an already powerful engine: Apache Spark

Apache Spark defines itself as “a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters”. 

By itself, Apache Spark enables enterprises to do large-scale data processing, but it is not ACID compliant. This means every one of the principles discussed above can be broken if you are not careful when programming. Even if you take every precaution you can think of, problems may still arise and corrupt data can pollute your data lake. In certain scenarios, a faulty processing job could even permanently delete part of your data. Data integrity rests entirely in the hands of the programming team, and we are all human.

Because of these problems, Delta Lake emerged as a new paradigm. It is an open-source storage framework that brings ACID transaction support and schema enforcement to Apache Spark-driven data lakes. It allows users to build a “Data Lakehouse” architecture that works with structured, semi-structured, and unstructured data.
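As a minimal sketch of how this integration looks in practice, assuming the delta-spark Python package is installed and that /tmp/delta/events is a writable local path, a Delta table is just another Spark data source backed by a transaction log:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build a SparkSession with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame as a Delta table; the transaction log is created
# alongside the Parquet data files under _delta_log/.
df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read it back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/events").show()
```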

This Data Lakehouse scales efficiently, and it is also reliable because it ensures data integrity. Delta Lake lets you revert to earlier versions of your data for audits or rollbacks, and it logs every transaction’s details, providing an audit trail.
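Below is a minimal time-travel sketch, assuming the Delta-enabled SparkSession and the /tmp/delta/events table from the sketch above, that the table has accumulated more than one committed version, and a reasonably recent Delta Lake release (the restore API was added in later versions):

```python
from delta.tables import DeltaTable

# Query the table as it looked at an earlier version (time travel).
old = (
    spark.read.format("delta")
    .option("versionAsOf", 0)   # or .option("timestampAsOf", "2024-01-01")
    .load("/tmp/delta/events")
)
old.show()

# Roll the live table back to that version, e.g. to undo a bad job.
DeltaTable.forPath(spark, "/tmp/delta/events").restoreToVersion(0)
```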

Delta Lake helps your organization refine data

At each stage shown in the figure above, data is incrementally improved through a pipeline that supports both streaming and batch processing over a shared file store. Data is organized into three layers, or tables, as the image shows, commonly referred to as Bronze (table 1), Silver (table 2), and Gold (table 3). A minimal pipeline sketch follows the list below.

  • Bronze tables contain raw data, unmodified from its source (IoT data, JSON files, Parquet files, etc.). The original data can be reprocessed later or consumed by other processing engines.
  • Silver tables hold a more refined view of your organization’s data. They can be queried directly and the data can be considered clean; you can begin to take action based on it.
  • Gold tables contain business-level aggregates from which analytics are built. Dashboards and reports are created from this type of data.
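Here is a minimal sketch of such a pipeline in batch form, assuming a Delta-enabled SparkSession named spark; the paths, column names (order_id, amount, order_ts), and raw JSON drop zone are hypothetical:

```python
from pyspark.sql import functions as F

bronze_path = "/tmp/delta/bronze/orders"    # raw ingested data
silver_path = "/tmp/delta/silver/orders"    # cleaned, deduplicated data
gold_path = "/tmp/delta/gold/daily_sales"   # business-level aggregates

# Bronze: land the raw records as-is.
raw = spark.read.json("/tmp/raw/orders")    # hypothetical raw JSON drop zone
raw.write.format("delta").mode("append").save(bronze_path)

# Silver: clean and deduplicate the raw data.
silver = (
    spark.read.format("delta").load(bronze_path)
    .dropDuplicates(["order_id"])
    .filter(F.col("amount") > 0)
)
silver.write.format("delta").mode("overwrite").save(silver_path)

# Gold: aggregate to the level dashboards and reports consume.
gold = (
    spark.read.format("delta").load(silver_path)
    .groupBy(F.to_date("order_ts").alias("day"))
    .agg(F.sum("amount").alias("total_sales"))
)
gold.write.format("delta").mode("overwrite").save(gold_path)
```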

How does Delta Lake enable ACID Compliance in Apache Spark?

“The transaction log is the mechanism through which Delta Lake can offer the guarantee of atomicity. If it’s not recorded in the transaction log, it never happened.” That is how a Databricks post describes Delta Lake’s transaction log, and it captures the idea well.
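As a small illustration of that guarantee, here is a hedged sketch assuming the Delta-enabled SparkSession and /tmp/delta/events table from earlier: if a job fails before its commit file lands in the transaction log, readers keep seeing the last committed version.

```python
path = "/tmp/delta/events"

try:
    new_df = spark.createDataFrame([(3, "purchase")], ["id", "event"])
    new_df.write.format("delta").mode("overwrite").save(path)
except Exception as err:
    # The failed attempt never appears in the transaction log,
    # so the table is unchanged.
    print(f"Write failed, previous version still intact: {err}")

# Readers see either the new snapshot or the last committed one -- never a mix.
spark.read.format("delta").load(path).show()
```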

When processing data with Delta Lake, one of the steps it takes is to validate the schema of every write against the target table. A mismatch could occur in any transaction, so checking each one keeps the data consistent and prevents corrupt records from landing in your tables. It also makes debugging easier.
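Here is a hedged sketch of that schema check, assuming the /tmp/delta/events table from earlier with columns id and event; the extra amount column is a hypothetical mismatch.

```python
# A DataFrame whose schema does not match the target Delta table.
bad = spark.createDataFrame([(4, "refund", 9.99)], ["id", "event", "amount"])

try:
    # Schema enforcement rejects the append before any corrupt data is written.
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:
    print(f"Rejected by schema enforcement: {err}")

# If the new column is intentional, schema evolution can be opted into explicitly.
bad.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save("/tmp/delta/events")
```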

As mentioned earlier, every job run through Delta Lake is logged and each valid version of the data is preserved. In this way, Delta Lake keeps data durable while maintaining a log trail: older versions remain accessible, so you can query data in the state it was in before other processes transformed it.
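As a sketch of what that log trail looks like from PySpark, assuming the same SparkSession and table as above, the DeltaTable API exposes the commit history directly:

```python
from delta.tables import DeltaTable

# Every committed operation (writes, schema changes, deletes, ...) is recorded
# in the transaction log and can be inspected as a regular DataFrame.
events = DeltaTable.forPath(spark, "/tmp/delta/events")
events.history() \
    .select("version", "timestamp", "operation", "operationParameters") \
    .show(truncate=False)
```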

That log of committed transactions is also what makes the isolation principle hold in Delta Lake. A read or write job works on the most recent committed snapshot of the data, ignoring concurrent operations that have not yet finished or been recorded in the commit log.
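Here is a sketch of what that looks like from a writer's side, assuming the same SparkSession and table as above; the conflict exception class name follows the delta-spark Python API docs, so verify it for your Delta Lake version.

```python
# Assumption: delta.exceptions exposes DeltaConcurrentModificationException
# as the base class for commit conflicts in your delta-spark version.
from delta.exceptions import DeltaConcurrentModificationException
from delta.tables import DeltaTable

def delete_cancelled_with_retry(path, attempts=3):
    """Delete cancelled events, retrying if a concurrent writer commits first."""
    for _ in range(attempts):
        try:
            DeltaTable.forPath(spark, path).delete("event = 'cancelled'")
            return
        except DeltaConcurrentModificationException:
            # Another transaction won the race; re-read the new snapshot and retry.
            continue
    raise RuntimeError("Could not commit after retries")

delete_cancelled_with_retry("/tmp/delta/events")
```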

Delta Lake is a great addition to your data processing projects, especially when you need your data stores to be ACID-compliant. If you’re already planning to use Apache Spark, it’s a no-brainer: Delta Lake is fully integrated to work hand in hand with Spark.
