A TokuDB Stall Caused by a Big Transaction and How It Was Fixed

One of our customers occasionally saw large numbers of simple insertions take far longer than expected. These insertions usually completed in milliseconds, but sometimes they took hundreds of seconds. Stalls like these pointed to a serialization bug in the Fractal Tree index software, so the hunt was on. We found that the stalls occurred when a big transaction was committing while the Fractal Tree index software was taking a checkpoint. The problem is fixed in both TokuDB 7.0.3 and TokuMX 1.0.3. Read on for some details about the bug and how we fixed it. We start with the relevant Fractal Tree index algorithms.

What is a Big Transaction?

Each transaction builds a rollback log as it performs Fractal Tree index operations. The rollback log is maintained in memory until it gets too big and is spilled to the TokuDB rollback file. We define a small transaction as one whose rollback log resides in memory. We define a big transaction as one whose rollback log is spilled to the rollback file.

When the transaction commits or aborts, the entire rollback log for the transaction is traversed, and the commit or abort action for each rollback entry is executed. For example, if the transaction is committing and the rollback entry is an insertion, then the Fractal Tree index software does nothing, since we implicitly commit the insertion. If the transaction is aborting an insertion, then the Fractal Tree index software sends an abort message into the appropriate Fractal Tree index. The time to traverse the rollback log depends on the number of operations that the transaction logged. For a big transaction, this time can be huge.
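The small versus big distinction and the commit or abort traversal can be sketched in a few lines of Python. This is only an illustration, not the Fractal Tree implementation: the class name RollbackLog, the max_in_memory threshold, the ("insert", key) entry format, and the "abort_insert" message are all assumptions made for the example, and the traversal order is simplified.

```python
import pickle
import tempfile


class RollbackLog:
    """Toy rollback log: kept in memory until it grows past a threshold,
    then spilled to a temporary file (a stand-in for the rollback file)."""

    def __init__(self, max_in_memory=4):   # threshold is hypothetical
        self.max_in_memory = max_in_memory
        self.entries = []                  # in-memory tail of the log
        self.spill_file = None             # created on first spill
        self.spilled_count = 0

    def append(self, entry):
        self.entries.append(entry)
        if len(self.entries) > self.max_in_memory:
            self._spill()

    def _spill(self):
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        self.spill_file.seek(0, 2)         # always append at the end
        for entry in self.entries:
            pickle.dump(entry, self.spill_file)
        self.spilled_count += len(self.entries)
        self.entries = []

    @property
    def is_big(self):
        """A transaction is 'big' once its rollback log has been spilled."""
        return self.spill_file is not None

    def __iter__(self):
        # Spilled entries first, then whatever is still in memory.
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled_count):
                yield pickle.load(self.spill_file)
        yield from self.entries


def resolve(log, committing):
    """Walk the whole log and return the messages that would be sent."""
    messages = []
    for op, key in log:
        if op == "insert" and not committing:
            messages.append(("abort_insert", key))
        # Committing an insert needs no message: it is implicitly committed.
    return messages
```

Even when committing sends no messages at all, resolve still visits every entry, which is why the cost grows with the number of logged operations.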

What is a Checkpoint?

Each transaction also appends recovery entries to the recovery log as it performs Fractal Tree index operations. When recovering from a crash, the recovery log is traversed and its log entries are executed. Since we don’t want to process the recovery log from the beginning of time, we periodically checkpoint the open Fractal Tree indexes and write checkpoint log entries to the recovery log. A checkpoint creates a new version of the Fractal Tree index and persists it to storage. The checkpoint log entries allow crash recovery to only have to replay operations that occurred since the last checkpoint, not since the beginning of time.
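As a rough illustration of why checkpoint entries shorten recovery, the sketch below picks out the part of a recovery log that would need to be replayed. The tuple-based log format and the function name entries_to_replay are assumptions for the example. Real recovery also has to account for transactions that were live when the checkpoint began, which this sketch ignores.

```python
def entries_to_replay(recovery_log):
    """Return the log entries that follow the most recent checkpoint entry.

    Each entry is a tuple whose first element names the operation, for
    example ("insert", key) or ("checkpoint",). This format is hypothetical.
    """
    last_checkpoint = None
    for position, entry in enumerate(recovery_log):
        if entry[0] == "checkpoint":
            last_checkpoint = position
    # With no checkpoint at all, recovery must start from the beginning.
    start = 0 if last_checkpoint is None else last_checkpoint + 1
    return recovery_log[start:]
```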

A checkpoint is split into two phases: the begin phase and the end phase. The begin checkpoint phase logs the state of the open Fractal Tree indexes and the live transactions. It also marks the blocks that need to be written by the end checkpoint phase. We will not cover the details of checkpoints here, since they are well beyond the scope of this post.

A reader writer lock, which we call the checkpoint lock, is used to serialize transaction commits and aborts against the checkpoint begin phase. Transaction commits and aborts take a read lock on this lock. The checkpoint begin phase takes a write lock on this lock. We use a fair reader writer lock, meaning that once a writer is waiting, new readers queue up behind it, so that checkpoints are not starved forever in the face of lots of transaction commits or aborts.
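To make the fairness rule concrete, here is a minimal reader writer lock in Python in which a waiting writer holds back any reader that arrives after it. It is a sketch under simplifying assumptions, not the lock used in the Fractal Tree index software: the class FairRWLock and its method names are made up for the example, and it gives waiting writers preference rather than keeping a strict first-come, first-served queue.

```python
import threading
import time


class FairRWLock:
    """Readers share the lock; a waiting writer blocks readers that arrive later."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._waiting_writers = 0

    def read_lock(self):
        with self._cond:
            # New readers wait while a writer holds or is waiting for the lock.
            while self._writer or self._waiting_writers > 0:
                self._cond.wait()
            self._readers += 1

    def read_unlock(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def write_lock(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def write_unlock(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def demo():
    """A reader holds the lock, a writer queues up, then a second reader arrives."""
    lock = FairRWLock()
    order = []
    lock.read_lock()                      # a transaction is committing

    def checkpoint_begin():
        lock.write_lock()
        order.append("checkpoint begin")
        lock.write_unlock()

    def later_commit():
        lock.read_lock()
        order.append("later commit")
        lock.read_unlock()

    writer = threading.Thread(target=checkpoint_begin)
    writer.start()
    while lock._waiting_writers == 0:     # wait until the writer is queued
        time.sleep(0.001)
    reader = threading.Thread(target=later_commit)
    reader.start()
    time.sleep(0.05)
    blocked_so_far = list(order)          # neither thread has made progress
    lock.read_unlock()                    # the first commit finishes
    writer.join()
    reader.join()
    return blocked_so_far, order
```

Calling demo() shows that the later reader cannot slip in ahead of the waiting writer. Both stay blocked until the first reader releases the lock, and then the writer runs first.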

The Bug

Here is how the stalls occur.

1. A big transaction grabs a read lock on the checkpoint lock when it wants to commit or abort.
2. Other transactions can still grab a read lock on the checkpoint lock and commit. So far, so good.
3. A checkpoint attempts to grab a write lock on the checkpoint lock. Since there is a read lock active, the write lock attempt is blocked.
4. When other transactions try to grab a read lock on the checkpoint lock, they are blocked because there is a write lock pending.
5. Now, all transaction commits and aborts are stalled behind the pending checkpoint until the big transaction completes.

The Bug Fix

We split the checkpoint lock into a low priority reader writer lock and a high priority reader writer lock. A big transaction grabs a read lock on the low priority reader writer lock. A small transaction grabs a read lock on the high priority reader writer lock. The checkpoint first grabs a write lock on the low priority lock. It then grabs a write lock on the high priority lock.

Zardosht and Leif tell me that this design pattern is a special case of a graded lock that supports multiple priority levels. We use two levels here.

Here is how the stalls are avoided.

1. A big transaction grabs a read lock on the low priority checkpoint lock.
2. Other big transactions can still grab a read lock on the low priority checkpoint lock.
3. Small transactions can grab a read lock on the high priority checkpoint lock as well.
4. A checkpoint attempts to grab a write lock on the low priority checkpoint lock. Since there is a read lock active, the write lock attempt is blocked.
5. Small transactions can continue to grab a read lock on the high priority checkpoint lock, since there is no write lock held or pending on it.
6. Eventually, the big transaction releases the low priority lock, the checkpoint thread grabs it, and then the checkpoint thread tries to grab the high priority lock.
7. This blocks the small transactions while checkpoint begin runs, but we expect checkpoint begin to be relatively fast.
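The sequence above can be traced with a small, single-threaded model of the two-level lock. This is a sketch for illustration only. The classes ModelRWLock and TwoLevelCheckpointLock and their method names are assumptions, and each call simply reports whether the request would be granted right away instead of blocking a thread as a real lock would.

```python
class ModelRWLock:
    """Non-blocking model of a fair reader writer lock."""

    def __init__(self):
        self.readers = 0
        self.writer_pending = False
        self.writer_active = False

    def try_read(self):
        # Fairness: a pending or active writer holds back new readers.
        if self.writer_pending or self.writer_active:
            return False
        self.readers += 1
        return True

    def release_read(self):
        self.readers -= 1

    def try_write(self):
        if self.readers == 0 and not self.writer_active:
            self.writer_pending = False
            self.writer_active = True
            return True
        self.writer_pending = True       # remember that a writer is waiting
        return False

    def release_write(self):
        self.writer_active = False


class TwoLevelCheckpointLock:
    """Big transactions read-lock 'low', small transactions read-lock 'high'."""

    def __init__(self):
        self.low = ModelRWLock()
        self.high = ModelRWLock()

    def try_commit(self, big):
        return (self.low if big else self.high).try_read()

    def finish_commit(self, big):
        (self.low if big else self.high).release_read()

    def try_checkpoint_begin(self):
        # The checkpoint takes the low priority lock first, then the high one.
        if not self.low.writer_active and not self.low.try_write():
            return False
        return self.high.try_write()

    def finish_checkpoint_begin(self):
        self.high.release_write()
        self.low.release_write()


def walkthrough():
    lock = TwoLevelCheckpointLock()
    results = []
    results.append(lock.try_commit(big=True))       # big commit starts
    results.append(lock.try_checkpoint_begin())     # blocked on the low lock
    results.append(lock.try_commit(big=False))      # small commit still runs
    lock.finish_commit(big=False)
    lock.finish_commit(big=True)                    # big commit finally ends
    results.append(lock.try_checkpoint_begin())     # now gets both locks
    results.append(lock.try_commit(big=False))      # held back during begin
    lock.finish_checkpoint_begin()
    results.append(lock.try_commit(big=False))      # runs again afterwards
    return results
```

In this model, the checkpoint never marks the high priority lock as having a pending writer until it already holds the low priority lock, which is what keeps small transactions moving while a big transaction works through its rollback log.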

Consequences

We give priority to small transactions over big transactions in this design. While big transactions are running through their rollback logs, small transactions continue to commit or abort.
