Crash recovery in a from-scratch database

I wrote a relational database engine in C++20 from the SQL text down to the page on disk, and the part that taught me the most was the part you only see when the process dies mid-write: crash recovery. This is how the write-ahead log and the ARIES-style three-phase replay keep ~17.3k lines of engine honest after a power cut.

Why recovery is the real test

Most database side projects are a SQL string parsed into a query over an in-memory map. They never touch the hard problem, which shows up the moment you persist: a transaction can be half-written when the machine loses power, and the next startup has to decide what survived. Entropy commits a transaction by flushing its log, not its data pages, so a crash always leaves the on-disk state somewhere between two consistent points. Recovery is the code that walks it back to one.

The contract I held myself to is the standard one, durability and atomicity under failure: if COMMIT returned, the change is on disk after recovery; if it did not, no trace of the change is. I get there with write-ahead logging and the three passes ARIES defines.

Write-ahead logging: the rule that makes it work

The whole scheme rests on one ordering rule. Before any dirty data page reaches disk, the log record that describes the change must already be there. Stated as an inequality over log sequence numbers, for any page $P$ written to disk:

\text{LSN}_{\text{flushed log}} \;\ge\; \text{pageLSN}(P)

Each log record carries a monotonically increasing LSN, and each page stamps the LSN of the last change applied to it. Keep the flushed-log frontier ahead of every dirty page and recovery always has a written description of anything it might find half-applied. Commit then costs one log flush, not a scatter of random page writes, which is also why the batch-insert path is competitive: I pay one sequential write on commit instead of many.

Three passes: analysis, redo, undo

On startup, recovery reads the log forward once to learn what was happening, replays history forward to rebuild the exact pre-crash page state, then walks backward to roll back whatever never committed. The order is not negotiable: you cannot know what to undo until you have first redone everything, including the work of the losers.

1 · analysis: Scan forward from the last checkpoint, rebuilding the transaction table (who was live) and the dirty-page table (what might be stale on disk). The output is a worklist, not a change.
2 · redo: Replay every logged change from the earliest dirty-page recLSN forward, including changes by transactions that never committed. Redo is idempotent: it reapplies a record only when $\text{pageLSN}(P) < \text{LSN}_{\text{record}}$ , so replaying the same log twice is safe.
3 · undo: Roll back every loser transaction in reverse LSN order, writing a compensation log record (CLR) for each undone action so that a crash during recovery never re-undoes work already undone.

The compensation records are the subtle part and the reason ARIES is worth copying instead of inventing. A CLR points past the action it undoes, so a second crash mid-undo resumes where it left off rather than starting over. Recovery is itself crash-safe. My implementation logs each phase transition and keeps replay and undo counters, so a recovered startup prints exactly what it did rather than recovering silently and hoping.

void RecoveryManager::recover() {
  analysis();            // rebuild txn table + dirty page table
  std::size_t redone  = redo();   // replay forward, idempotent on pageLSN
  std::size_t undone  = undo();   // roll back losers, write CLRs
  log_phase_summary(redone, undone);
}

src/transaction/recovery.cpp: the three phases, logged, with counters (not a stub)

What recovery runs on top of

Recovery only works because the layers under it are honest about where bytes live. Tuples sit in slotted pages; pages live in a B+ tree and an extendible hash index; everything reaches disk through an LRU buffer pool over a single disk manager. The pageLSN that the WAL rule depends on is a field in the slotted page header, so the storage format and the recovery algorithm are the same design seen from two sides.

Entropy architecture: SQL text flows through parser and binder into the cost-based optimizer, then Volcano executors, then the transaction layer (MVCC, lock manager, WAL, recovery) and the storage engine (slotted pages, B+ tree, buffer pool, disk manager). — The layered engine. Recovery lives in the transaction tier (WAL, log records, recovery) and reads/writes through the same buffer pool and slotted pages every query uses. Source: entropy/docs/diagrams/architecture-dark.svg, recolored to the site palette.

What it costs, measured against SQLite

Logging on commit is not free, and a portfolio benchmark that only shows wins is not a benchmark. I ran entropy and system SQLite through the identical Google Benchmark harness (ENTROPY_BENCH_COMPARE_SQLITE=ON) on the same machine. The commit-once design pays off on batch insert and loses on point select, and both are in the chart.

workloadentropy / SQLite

Insert batch (1,000 rows, single txn)940,970 ns vs 1,048,270 ns
0.90x
Insert batch (10,000 rows, single txn)9,693,950 ns vs 6,992,260 ns
1.39x
Point select (1,000 rows)46,041 ns vs 23,020 ns
2.00x
Point select (10,000 rows)459,723 ns vs 179,783 ns
2.56x

ratio = entropy / SQLite, lower is better; the amber rule is SQLite at 1.0x. Single repetition, system libsqlite3, same harness. Source: public/benchmarks/entropy.json (SHA 901ddb7).

Entropy comes in 11% faster than SQLite on the 1,000-row batch insert (0.90×, lower is better), then falls behind on point selects at 2.0-2.6×. That loss is a direct consequence of an architecture decision, not a bug: the Volcano iterator model pulls one tuple at a time up a tree of operators, which is wonderful for composing executors and poor for a single keyed lookup where SQLite's tighter path wins. I chose composability and the benchmark shows the bill.

Reading the loss honestly

A point select touches few rows, so the per-tuple overhead of the iterator interface dominates and there is no batch to amortize it against. A batch insert is the opposite: thousands of rows under one commit, so the single log flush is shared across all of them and the sequential-write design pays for itself. The same engine is faster and slower than the same baseline depending on the workload, and saying so is the point.

Verifying it yourself

The recovery path is exercised by the largest test file in the repo (the transaction suite alone is 84 of the 355 GoogleTest cases), and the benchmark numbers above regenerate from a committed summarizer that recomputes the entropy/SQLite ratio, so the README table is a derived artifact rather than a typed-in claim.

Build with the SQLite comparison on: cmake -DENTROPY_BENCH_COMPARE_SQLITE=ON.
Run the benchmark target; it emits Google Benchmark JSON per workload.
Run scripts/bench/summarize.py to fold the JSON into the CSV the chart above reads from.

The 12 ADRs in DESIGN.md record the roads I did not take: the hand-written recursive-descent parser over a library (ADR-004), MVCC snapshot isolation over strict two-phase locking (ADR-003), and the Volcano model over vectorized execution with its costs stated up front (ADR-005). Read the source at github.com/ShreeChaturvedi/entropy.

Why recovery is the real test

Write-ahead logging: the rule that makes it work

\text{LSN}_{\text{flushed log}} \;\ge\; \text{pageLSN}(P)

Three passes: analysis, redo, undo

1 · analysis: Scan forward from the last checkpoint, rebuilding the transaction table (who was live) and the dirty-page table (what might be stale on disk). The output is a worklist, not a change.
2 · redo: Replay every logged change from the earliest dirty-page recLSN forward, including changes by transactions that never committed. Redo is idempotent: it reapplies a record only when $\text{pageLSN}(P) < \text{LSN}_{\text{record}}$ , so replaying the same log twice is safe.
3 · undo: Roll back every loser transaction in reverse LSN order, writing a compensation log record (CLR) for each undone action so that a crash during recovery never re-undoes work already undone.

void RecoveryManager::recover() {
  analysis();            // rebuild txn table + dirty page table
  std::size_t redone  = redo();   // replay forward, idempotent on pageLSN
  std::size_t undone  = undo();   // roll back losers, write CLRs
  log_phase_summary(redone, undone);
}

src/transaction/recovery.cpp: the three phases, logged, with counters (not a stub)

What recovery runs on top of

What it costs, measured against SQLite

workloadentropy / SQLite

Insert batch (1,000 rows, single txn)940,970 ns vs 1,048,270 ns
0.90x
Insert batch (10,000 rows, single txn)9,693,950 ns vs 6,992,260 ns
1.39x
Point select (1,000 rows)46,041 ns vs 23,020 ns
2.00x
Point select (10,000 rows)459,723 ns vs 179,783 ns
2.56x

ratio = entropy / SQLite, lower is better; the amber rule is SQLite at 1.0x. Single repetition, system libsqlite3, same harness. Source: public/benchmarks/entropy.json (SHA 901ddb7).

Reading the loss honestly

Verifying it yourself

Build with the SQLite comparison on: cmake -DENTROPY_BENCH_COMPARE_SQLITE=ON.
Run the benchmark target; it emits Google Benchmark JSON per workload.
Run scripts/bench/summarize.py to fold the JSON into the CSV the chart above reads from.

Why recovery is the real test

Write-ahead logging: the rule that makes it work

Three passes: analysis, redo, undo

What recovery runs on top of

What it costs, measured against SQLite

Reading the loss honestly

Verifying it yourself

Onemoment

Why recovery is the real test

Write-ahead logging: the rule that makes it work

Three passes: analysis, redo, undo

What recovery runs on top of

What it costs, measured against SQLite

Reading the loss honestly

Verifying it yourself