Diving Into Delta Lake: Unpacking The Transaction Log

The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this article, we’ll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.

What Is the Delta Lake Transaction Log?

The Delta Lake transaction log (also known as the DeltaLog) is an ordered record of every transaction that has ever been performed on a Delta Lake table since its inception.

What Is the Transaction Log Used For?

Single Source of Truth

Delta Lake is built on top of Apache Spark™ in order to allow multiple readers and writers of a given table to all work on the table at the same time. In order to show users correct views of the data at all times, the Delta Lake transaction log serves as a single source of truth - the central repository that tracks all changes that users make to the table.

When a user reads a Delta Lake table for the first time or runs a new query on an open table that has been modified since the last time it was read, Spark checks the transaction log to see what new transactions have posted to the table, and then updates the end user’s table with those new changes. This ensures that a user’s version of a table is always synchronized with the master record as of the most recent query, and that users cannot make divergent, conflicting changes to a table.

The Implementation of Atomicity on Delta Lake

One of the four properties of ACID transactions, atomicity, guarantees that operations (like an INSERT or UPDATE) performed on your data lake either complete fully, or don’t complete at all. Without this property, it’s far too easy for a hardware failure or a software bug to cause data to be only partially written to a table, resulting in messy or corrupted data.

The transaction log is the mechanism through which Delta Lake is able to offer the guarantee of atomicity. For all intents and purposes, if it’s not recorded in the transaction log, it never happened. By only recording transactions that execute fully and completely, and using that record as the single source of truth, the Delta Lake transaction log allows users to reason about their data, and have peace of mind about its fundamental trustworthiness, at petabyte scale.

How Does the Transaction Log Work?

Breaking Down Transactions Into Atomic Commits

Whenever a user performs an operation to modify a table (such as an INSERT, UPDATE or DELETE), Delta Lake breaks that operation down into a series of discrete steps composed of one or more of the actions below.

  • Add file - adds a data file.
  • Remove file - removes a data file.
  • Update metadata - updates the table’s metadata (e.g., changing the table’s name, schema, or partitioning).
  • Set transaction - records that a structured streaming job has committed a micro-batch with the given ID.
  • Change protocol - enables new features by switching the Delta Lake transaction log to the newest software protocol.
  • Commit info - contains information about the commit: which operation was made, from where, and at what time.

Those actions are then recorded in the transaction log as ordered, atomic units known as commits.

For example, suppose a user creates a transaction to add a new column to a table plus add some more data to it. Delta Lake would break that transaction down into its component parts, and once the transaction completes, add them to the transaction log as the following commits (a sketch of the corresponding on-disk actions follows the list):

  • Update metadata - change the schema to include the new column
  • Add file - for each new file added
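On disk, those actions would be written as lines of JSON inside the commit file. Here is a hedged, abridged sketch of what that might look like - the table ID, schema string, file name, and sizes are hypothetical, and many fields are elided:

    {"metaData":{"id":"hypothetical-table-id","schemaString":"{\"type\":\"struct\",\"fields\":[...]}", ...}}
    {"add":{"path":"part-00000-hypothetical.snappy.parquet","size":1024,"dataChange":true, ...}}
    {"commitInfo":{"operation":"WRITE","timestamp":1701475200000, ...}}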

The Delta Lake Transaction Log at the File Level

When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As he or she makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json. Additional changes to the table generate subsequent JSON files in ascending numerical order so that the next commit is written out as 000001.json, the following as 000002.json, and so on.

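To see this for yourself, you can simply list the log directory. A minimal sketch, assuming a Delta table has been written to the hypothetical local path /tmp/mytable (note that real commit file names are zero-padded further, e.g. 00000000000000000000.json; the shorter names in this article are an abbreviation):

    import os

    # Each *.json file in _delta_log is one atomic commit, in ascending order.
    log_dir = "/tmp/mytable/_delta_log"
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(".json"):
            print(name)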

[Diagram: the transaction log at the file level - data files 1.parquet and 2.parquet are added to the table in one commit and then removed by a later commit.]


Even though 1.parquet and 2.parquet are no longer part of our Delta Lake table, their addition and removal are still recorded in the transaction log because those operations were performed on our table - despite the fact that they ultimately canceled each other out. Delta Lake still retains atomic commits like these to ensure that in the event we need to audit our table or use “time travel” to see what our table looked like at a given point in time, we could do so accurately. Also, Spark does not eagerly remove the files from disk, even though we removed the underlying data files from our table. Users can delete the files that are no longer needed by using VACUUM.

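For example, a minimal sketch of that cleanup, assuming an active SparkSession (spark) with Delta Lake configured and the hypothetical table path from above:

    # Remove data files no longer referenced by the table, subject to the
    # retention window (7 days by default), via Spark SQL...
    spark.sql("VACUUM delta.`/tmp/mytable`")

    # ...or via the DeltaTable API.
    from delta.tables import DeltaTable
    DeltaTable.forPath(spark, "/tmp/mytable").vacuum()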

Quickly Recomputing State With Checkpoint Files

Once we’ve made several commits to the transaction log, Delta Lake saves a checkpoint file in Parquet format in the same _delta_log subdirectory. Delta Lake automatically generates checkpoints as needed to maintain good read performance.

[Diagram: checkpoint files saved in Parquet format alongside the JSON commits in the _delta_log subdirectory.]


These checkpoint files save the entire state of the table at a point in time - in native Parquet format that is quick and easy for Spark to read. In other words, they offer the Spark reader a sort of “shortcut” to fully reproducing a table’s state that allows Spark to avoid reprocessing what could be thousands of tiny, inefficient JSON files.

To get up to speed, Spark can run a listFrom operation to view all the files in the transaction log, quickly skip to the newest checkpoint file, and only process those JSON commits made since the most recent checkpoint file was saved.

To demonstrate how this works, imagine that we’ve created commits all the way through 000007.json as shown in the diagram below. Spark is up to speed through this commit, having automatically cached the most recent version of the table in memory. In the meantime, though, several other writers (perhaps your overly eager teammates) have written new data to the table, adding commits all the way through 000012.json.

To incorporate these new transactions and update the state of our table, Spark will then run a listFrom version 7 operation to see the new changes to the table.

[Diagram: Spark runs listFrom version 7, skips ahead to the checkpoint at commit 10, and processes only the commits made after it.]


Rather than processing all of the intermediate JSON files, Spark can skip ahead to the most recent checkpoint file, since it contains the entire state of the table at commit #10. Now, Spark only has to perform incremental processing of 000011.json and 000012.json to have the current state of the table. Spark then caches version 12 of the table in memory. By following this workflow, Delta Lake is able to use Spark to keep the state of a table updated at all times in an efficient manner.

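A minimal Python sketch of that catch-up logic, using the simplified six-digit file names from this example (a real reader would also consult the _last_checkpoint file in _delta_log rather than always listing from scratch):

    # Files a reader might see when listing _delta_log from version 7 onward.
    files = [
        "000007.json", "000008.json", "000009.json", "000010.json",
        "000010.checkpoint.parquet", "000011.json", "000012.json",
    ]

    # Skip ahead to the newest checkpoint...
    latest_ckpt = max(int(f.split(".")[0])
                      for f in files if f.endswith(".checkpoint.parquet"))

    # ...then incrementally process only the JSON commits made after it.
    tail = [f for f in files
            if f.endswith(".json") and int(f.split(".")[0]) > latest_ckpt]
    print(latest_ckpt, tail)  # 10 ['000011.json', '000012.json']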

Dealing With Multiple Concurrent Reads and Writes

Now that we understand how the Delta Lake transaction log works at a high level, let’s talk about concurrency. So far, our examples have mostly covered scenarios in which users commit transactions linearly, or at least without conflict. But what happens when Delta Lake is dealing with multiple concurrent reads and writes?

The answer is simple. Since Delta Lake is powered by Apache Spark, it’s not only possible for multiple users to modify a table at once - it’s expected. To handle these situations, Delta Lake employs optimistic concurrency control.

What Is Optimistic Concurrency Control?

Optimistic concurrency control is a method of dealing with concurrent transactions that assumes that transactions (changes) made to a table by different users can complete without conflicting with one another. It is incredibly fast because when dealing with petabytes of data, there’s a high likelihood that users will be working on different parts of the data altogether, allowing them to complete non-conflicting transactions simultaneously.

For example, imagine that you and I are working on a jigsaw puzzle together. As long as we’re both working on different parts of it - you on the corners, and me on the edges, for example - there’s no reason why we can’t each work on our part of the bigger puzzle at the same time, and finish the puzzle twice as fast. It’s only when we need the same pieces, at the same time, that there’s a conflict. That’s optimistic concurrency control.

Of course, even with optimistic concurrency control, sometimes users do try to modify the same parts of the data at the same time. Luckily, Delta Lake has a protocol for that.

Solving Conflicts Optimistically

In order to offer ACID transactions, Delta Lake has a protocol for figuring out how commits should be ordered (known as the concept of serializability in databases), and determining what to do in the event that two or more commits are made at the same time. Delta Lake handles these cases by implementing a rule of mutual exclusion, then attempting to solve any conflict optimistically. This protocol allows Delta Lake to deliver on the ACID principle of isolation, which ensures that the resulting state of the table after multiple, concurrent writes is the same as if those writes had occurred serially, in isolation from one another.

In general, the process proceeds like this (a minimal sketch in code follows the list):

  • Record the starting table version.
  • Record reads/writes.
  • Attempt a commit.
  • If someone else wins, check whether anything you read has changed.
  • Repeat.
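Here is that loop sketched in Python, using an atomic create-if-absent on the next commit file to stand in for Delta Lake’s mutual exclusion (the six-digit file names, the log path, and the caller-supplied conflict check are all simplified for illustration):

    import os

    def next_version(log_dir: str) -> int:
        # The starting table version is one past the highest commit on disk.
        commits = [int(f.split(".")[0])
                   for f in os.listdir(log_dir) if f.endswith(".json")]
        return max(commits, default=-1) + 1

    def try_commit(log_dir: str, version: int, actions: str) -> bool:
        # O_CREAT | O_EXCL atomically creates the file, failing if it already
        # exists - so only one writer can ever claim a given version number.
        path = os.path.join(log_dir, f"{version:06d}.json")
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False  # someone else won this version
        with os.fdopen(fd, "w") as f:
            f.write(actions)
        return True

    def commit(log_dir: str, actions: str, reads_still_valid) -> int:
        version = next_version(log_dir)                    # record starting version
        while not try_commit(log_dir, version, actions):   # attempt a commit
            if not reads_still_valid(version):             # loser checks its reads
                raise RuntimeError("conflicting concurrent update")
            version += 1                                   # repeat on the new version
        return version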

To see how this all plays out in real time, let’s take a look at the diagram below to see how Delta Lake manages conflicts when they do crop up. Imagine that two users read from the same table, then each go about attempting to add some data to it.

[Diagram: two users both read table version 0, then each attempt to commit new data to the table at the same time.]

  • Delta Lake records the starting table version of the table (version 0) that is read prior to making any changes.
  • Users 1 and 2 both attempt to append some data to the table at the same time. Here, we’ve run into a conflict because only one commit can come next and be recorded as 000001.json.
  • Delta Lake handles this conflict with the concept of “mutual exclusion,” which means that only one user can successfully make commit 000001.json. User 1’s commit is accepted, while User 2’s is rejected.
  • Rather than throw an error for User 2, Delta Lake prefers to handle this conflict optimistically. It checks to see whether any new commits have been made to the table, and updates the table silently to reflect those changes, then simply retries User 2’s commit on the newly updated table (without any data processing), successfully committing 000002.json.

Other Use Cases

Time Travel

Every table is the result of the sum total of all of the commits recorded in the Delta Lake transaction log - no more and no less. The transaction log provides a step-by-step instruction guide, detailing exactly how to get from the table’s original state to its current state.

Therefore, we can recreate the state of a table at any point in time by starting with an original table, and processing only commits made prior to that point. This powerful ability is known as “time travel,” or data versioning, and can be a lifesaver in any number of situations. For more information, read the blog post Introducing Delta Time Travel for Large Scale Data Lakes, or refer to the Delta Lake time travel documentation.
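For example, a minimal PySpark sketch of time travel, assuming an active SparkSession (spark) with Delta Lake configured and the hypothetical table path used earlier:

    # Read the table as it existed at an earlier version number...
    df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/mytable")

    # ...or as of a timestamp.
    df_dec1 = (spark.read.format("delta")
               .option("timestampAsOf", "2023-12-01")
               .load("/tmp/mytable"))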

Data Lineage and Debugging

As the definitive record of every change ever made to a table, the Delta Lake transaction log offers users a verifiable data lineage that is useful for governance, audit and compliance purposes. It can also be used to trace the origin of an inadvertent change or a bug in a pipeline back to the exact action that caused it. Users can run DESCRIBE HISTORY to see metadata around the changes that were made.
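A short sketch of inspecting that lineage, under the same assumptions as the earlier examples:

    from delta.tables import DeltaTable

    # One row per commit: version, timestamp, operation, parameters, and more.
    DeltaTable.forPath(spark, "/tmp/mytable").history().show(truncate=False)

    # The SQL equivalent:
    spark.sql("DESCRIBE HISTORY delta.`/tmp/mytable`").show(truncate=False)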

Delta Lake Transaction Log Summary

In this blog, we dove into the details of how the Delta Lake transaction log works, including:

  • What the transaction log is, how it’s structured, and how commits are stored as files on disk.
  • How the transaction log serves as a single source of truth, allowing Delta Lake to implement the principle of atomicity.
  • How Delta Lake computes the state of each table - including how it uses the transaction log to catch up from the most recent checkpoint.
  • How Delta Lake uses optimistic concurrency control to allow multiple concurrent reads and writes even as tables change.
  • How Delta Lake uses mutual exclusion to ensure that commits are serialized properly, and how they are retried silently in the event of a conflict.

https://delta.io/?utm_source=delta-blog

