C++20 · MIT · ~17k LOC

[ sys ] entropy · sha 901ddb7 · measured-in-repo

A SQL database engine, built from scratch down to the disk page.

I wrote the real internals of a disk-backed relational engine in C++20: a recursive-descent parser, a cost-based optimizer, Volcano executors, MVCC and two-phase-locking transactions, a WAL with ARIES recovery, and a B+ tree over an LRU buffer pool. No SQL-parsing or storage libraries.

[ entropy ], noun

1. a from-scratch C++20 relational database engine

2. parse, plan, execute, persist, recover, end to end

3. benchmarked against SQLite on the same harness


insert throughput: 1M+ rows/s · 11% faster than SQLite on batch 1k inserts
engine size: ~17.3k lines of C++20 across the engine (src + public include)
test coverage: 355 GoogleTest cases across storage, txn, parser, exec

lang: C++20
license: MIT
size: ~17.3k LOC / 6 src libs
baseline: SQLite (libsqlite3)
harness: Google Benchmark
verdict: measured-in-repo

the shape of the engine

One SELECT, all the way down: every box is a compiled library, and every arrow is a step on the path a tuple takes from SQL text to a B+ tree leaf.

trace: one point-select, top to disk
path: 7 stages · 6 hops

01 · SQL text · api/
shell · C++ API
SELECT * FROM users WHERE id = 42

02 · Parser · parser/
lexer · recursive descent · precedence climb
tokens -> AST

03 · Binder · parser/binder
name resolution · type check
bound AST (resolved cols + types)

04 · Optimizer · optimizer/
statistics · cost model · index selector
cost-chosen plan: index scan

05 · Executor · execution/
index scan · filter · project
Volcano iterator: next() -> tuple

06 · Buffer pool · storage/buffer_pool
page table · LRU replacer · pin / dirty
pin(page) -> frame (LRU on miss)

07 · B+ tree · storage/b_plus_tree
root · internal · leaf
leaf slot -> tuple #42

One SELECT descending the real engine. Each box is a compiled library under src/. Each arrow is the path one tuple takes from SQL text to a B+ tree leaf.
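Stage 02 names a precedence climb inside the recursive-descent parser. The sketch below illustrates that technique in general terms. It is not Entropy's parser: `ExprParser`, `parenthesize`, and the tiny operator table are hypothetical names chosen for this example, and the input is limited to alphanumeric operands and four arithmetic operators plus `=`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical mini-parser (illustration only, not Entropy's code).
// It returns a fully parenthesized string so the chosen precedence is visible.
struct ExprParser {
    std::string src;
    std::size_t pos = 0;

    void skip_ws() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
    }

    // Binding power of a binary operator; 0 means "not an operator".
    static int precedence(char op) {
        switch (op) {
            case '=': return 1;
            case '+': case '-': return 2;
            case '*': case '/': return 3;
            default: return 0;
        }
    }

    // A primary is a run of letters and digits (identifier or integer literal).
    std::string parse_primary() {
        skip_ws();
        std::size_t start = pos;
        while (pos < src.size() && std::isalnum(static_cast<unsigned char>(src[pos]))) ++pos;
        return src.substr(start, pos - start);
    }

    // Precedence climbing: consume operators whose precedence is >= min_prec.
    // Parsing the right operand with prec + 1 makes every operator left-associative.
    std::string parse_expr(int min_prec = 1) {
        std::string lhs = parse_primary();
        for (;;) {
            skip_ws();
            if (pos >= src.size()) break;
            char op = src[pos];
            int prec = precedence(op);
            if (prec == 0 || prec < min_prec) break;
            ++pos;  // consume the operator
            std::string rhs = parse_expr(prec + 1);
            lhs = "(" + lhs + " " + op + " " + rhs + ")";
        }
        return lhs;
    }
};

std::string parenthesize(const std::string& text) {
    ExprParser parser{text};
    return parser.parse_expr();
}
```

Calling `parenthesize("id = 40 + 2 * 1")` groups the multiplication first, then the addition, then the comparison. A real SQL grammar would add unary operators, parentheses, AND/OR, and error reporting on top of the same loop.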

problem

Most database projects are a map with SQL bolted on

Most database portfolio projects are thin SQL wrappers over an in-memory hash map. I wanted the real internals of a disk-backed relational engine, end to end: parse SQL, plan and cost it, execute it through composable operators, and persist it through a buffer pool with ACID transactions and crash recovery.

The constraint that made it honest: modern C++20, and no SQL-parsing or storage libraries pulled in. If the parser, the B+ tree, and the recovery log were going to exist, I had to write them.

approach

Mirror the database pipeline as libraries with explicit boundaries

Entropy mirrors the logical pipeline as separately compiled C++ libraries, each with an explicit boundary. A hand-written recursive-descent parser and binder feed a cost-based optimizer (statistics, cost model, index selector), which produces Volcano-style iterators for scans, joins, sort, aggregate, filter, and DML.
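To make "Volcano-style iterators" concrete, here is a minimal sketch of the pull model: every operator exposes `next()`, and a parent asks its child for one tuple at a time. The class names (`Operator`, `SeqScan`, `Filter`, `Project`), the `Tuple` alias, and the three-row table are assumptions made for this example, not the interfaces in Entropy's execution/ library.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

using Tuple = std::vector<int>;  // hypothetical row: a list of int columns

// Common pull interface: one tuple per call, nullopt when exhausted.
struct Operator {
    virtual ~Operator() = default;
    virtual std::optional<Tuple> next() = 0;
};

struct SeqScan : Operator {
    std::vector<Tuple> rows;
    std::size_t cursor = 0;
    explicit SeqScan(std::vector<Tuple> r) : rows(std::move(r)) {}
    std::optional<Tuple> next() override {
        if (cursor == rows.size()) return std::nullopt;
        return rows[cursor++];
    }
};

struct Filter : Operator {
    std::unique_ptr<Operator> child;
    std::function<bool(const Tuple&)> pred;
    Filter(std::unique_ptr<Operator> c, std::function<bool(const Tuple&)> p)
        : child(std::move(c)), pred(std::move(p)) {}
    std::optional<Tuple> next() override {
        // Keep pulling from the child until a tuple passes the predicate.
        while (auto t = child->next()) {
            if (pred(*t)) return t;
        }
        return std::nullopt;
    }
};

struct Project : Operator {
    std::unique_ptr<Operator> child;
    std::vector<std::size_t> keep;  // column positions to emit
    Project(std::unique_ptr<Operator> c, std::vector<std::size_t> k)
        : child(std::move(c)), keep(std::move(k)) {}
    std::optional<Tuple> next() override {
        auto t = child->next();
        if (!t) return std::nullopt;
        Tuple out;
        for (std::size_t col : keep) out.push_back((*t)[col]);
        return out;
    }
};

// Roughly "SELECT col1 FROM t WHERE col0 = id" over a made-up three-row table.
std::vector<Tuple> run_point_select(int id) {
    auto scan = std::make_unique<SeqScan>(std::vector<Tuple>{{41, 7}, {42, 9}, {43, 5}});
    auto filter = std::make_unique<Filter>(
        std::move(scan), [id](const Tuple& t) { return t[0] == id; });
    Project root(std::move(filter), {1});
    std::vector<Tuple> result;
    while (auto t = root.next()) result.push_back(*t);
    return result;
}
```

Only one tuple is in flight between operators at any moment, which is the memory property the Volcano model trades analytics throughput for.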

Underneath sit MVCC and a two-phase-locking lock manager with deadlock detection, a write-ahead log with ARIES three-phase recovery, and a storage engine of slotted pages, a B+ tree, an extendible hash index, and an LRU buffer pool over a disk manager. The benchmark harness uses Google Benchmark and drives the same queries against SQLite when ENTROPY_BENCH_COMPARE_SQLITE=ON.
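The LRU buffer pool relies on a replacer that decides which frame to evict on a miss. Below is a hedged sketch of one common way to write such a replacer, using a list plus a hash map for O(1) operations. `LruReplacer`, `FrameId`, and the `pin` / `unpin` / `victim` method names are assumptions for illustration. Entropy's storage/buffer_pool may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <unordered_map>
#include <vector>

using FrameId = std::size_t;  // hypothetical frame handle

class LruReplacer {
public:
    // The frame's pin count dropped to zero: it becomes an eviction candidate.
    void unpin(FrameId f) {
        if (where_.count(f)) return;  // already a candidate; keep its position
        order_.push_front(f);
        where_[f] = order_.begin();
    }

    // The frame is in use again: remove it from the candidates.
    void pin(FrameId f) {
        auto it = where_.find(f);
        if (it == where_.end()) return;
        order_.erase(it->second);
        where_.erase(it);
    }

    // Evict the candidate that was unpinned longest ago, if any.
    std::optional<FrameId> victim() {
        if (order_.empty()) return std::nullopt;
        FrameId f = order_.back();
        order_.pop_back();
        where_.erase(f);
        return f;
    }

private:
    std::list<FrameId> order_;  // front = most recently unpinned
    std::unordered_map<FrameId, std::list<FrameId>::iterator> where_;
};

// Unpin frames 1, 2, 3, re-pin frame 2, then evict until nothing is left.
std::vector<FrameId> eviction_order_demo() {
    LruReplacer replacer;
    replacer.unpin(1);
    replacer.unpin(2);
    replacer.unpin(3);
    replacer.pin(2);
    std::vector<FrameId> evicted;
    while (auto f = replacer.victim()) evicted.push_back(*f);
    return evicted;
}
```

In the demo, frame 2 is never evicted because it is pinned, and frames 1 and 3 leave in the order they were unpinned.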

architecture

Separately compiled libraries under src/, one data path to disk

Layered under src/: parser/, optimizer/, execution/ with about twelve operators, transaction/ (MVCC, lock manager, WAL, recovery), storage/ (pages, B+ tree, hash index, buffer pool, disk manager), catalog/, and a public API in include/entropy/.

The committed architecture diagram from the repo, redrawn on the page's own type. Each box is a separately compiled library under src/, with an explicit boundary. The arrows are the data path a statement follows down to disk.

tradeoffs · road not taken

Twelve ADRs, including the ones I would defend in review

DESIGN.md carries twelve architecture decision records, each documenting a road not taken. Here are the four that shaped the engine most. Each names what I chose, what I gave up, and where the reasoning lives.

ADR-004 · road not taken
Hand-written recursive-descent parser
over hsql / libpg_query
I owned the whole grammar and every edge case, for zero deps and full control.

ADR-003 · road not taken
MVCC + snapshot isolation
over strict 2PL
More version bookkeeping, in exchange for reads that never block writers.

ADR-005 · road not taken
Volcano iterator execution
over vectorized / push-based
I traded analytics throughput for composable operators and one-tuple-at-a-time memory.

ADR-007/011 · road not taken
Nested-loop join first, hash join after
over hash join only
Generality before speed: nested-loop handles any predicate, hash join handles equi-joins fast.
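The ADR-007/011 tradeoff can be sketched in a few lines. The nested-loop join accepts an arbitrary predicate at O(n·m) cost, while the hash join builds a table on one side and only works for key equality. `Row`, `nested_loop_join`, `hash_join`, and the demo data are hypothetical and operate on in-memory vectors, not on Entropy's executors or pages.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

struct Row { int key; int payload; };  // hypothetical two-column row
using Joined = std::pair<Row, Row>;

// Works for any predicate, but compares every left row with every right row.
std::vector<Joined> nested_loop_join(const std::vector<Row>& left,
                                     const std::vector<Row>& right,
                                     const std::function<bool(const Row&, const Row&)>& pred) {
    std::vector<Joined> out;
    for (const Row& l : left)
        for (const Row& r : right)
            if (pred(l, r)) out.emplace_back(l, r);
    return out;
}

// Equi-join only: build a hash table on the left side, probe with the right.
std::vector<Joined> hash_join(const std::vector<Row>& left, const std::vector<Row>& right) {
    std::unordered_multimap<int, Row> built;
    for (const Row& l : left) built.emplace(l.key, l);
    std::vector<Joined> out;
    for (const Row& r : right) {
        auto [lo, hi] = built.equal_range(r.key);
        for (auto it = lo; it != hi; ++it) out.emplace_back(it->second, r);
    }
    return out;
}

// Example predicates and made-up data for trying both joins.
bool keys_equal(const Row& l, const Row& r) { return l.key == r.key; }
bool left_key_smaller(const Row& l, const Row& r) { return l.key < r.key; }

std::vector<Row> demo_left() { return {{1, 10}, {2, 20}, {3, 30}}; }
std::vector<Row> demo_right() { return {{2, 200}, {3, 300}, {3, 301}, {4, 400}}; }
```

With the demo data, both joins agree on the equality predicate, but only the nested-loop version can evaluate `left_key_smaller`, which is the generality the ADR chose to ship first.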

The benchmark below carries the same honesty: Entropy loses to SQLite on point selects, and I kept that result in rather than cherry-pick a win.

benchmark · vs sqlite

11% faster than SQLite on 1k-row batch inserts; slower on 10k-row batches and point selects

On a batch of 1,000 inserts in a single transaction, Entropy runs at 1M+ rows/s, about 11% faster than SQLite through the same Google Benchmark harness.

At 10,000 rows the batch insert is 1.39× slower than SQLite, and point selects are 2.0-2.6× slower. All of these numbers come from the same run, pinned to sha 901ddb7, against system libsqlite3. The chart anchors on SQLite at 1.0×. The losses sit at full size, not hidden.

benchmark · Entropy · SQLite · Entropy time relative to SQLite (lower is better)
Insert batch · 1k rows · single txn · 941 µs · 1.05 ms · 0.90×
Insert batch · 10k rows · single txn · 9.69 ms · 6.99 ms · 1.39×
Point select · 1k rows · 46 µs · 23 µs · 2.00×
Point select · 10k rows · 460 µs · 180 µs · 2.56×

These are single-repetition results. The harness takes --benchmark_repetitions for p50/p95/p99 when I want them. Source: docs/benchmarks/bench_summary.csv.
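The headline figures follow from the timings above by simple arithmetic. The sketch below recomputes them. The helper names `rows_per_second` and `relative_time` are made up for this example, and the inputs are the published timings converted to microseconds, not a re-run of the benchmark.

```cpp
#include <cassert>
#include <cmath>

// Throughput in rows per second for a batch that took `micros` microseconds.
constexpr double rows_per_second(double rows, double micros) {
    return rows / (micros * 1e-6);
}

// Entropy time divided by SQLite time: below 1.0 means Entropy was faster.
constexpr double relative_time(double entropy_us, double sqlite_us) {
    return entropy_us / sqlite_us;
}

// Published timings, in microseconds.
constexpr double kEntropyInsert1k = 941.0;
constexpr double kSqliteInsert1k = 1050.0;
constexpr double kEntropyInsert10k = 9690.0;
constexpr double kSqliteInsert10k = 6990.0;
constexpr double kEntropySelect1k = 46.0;
constexpr double kSqliteSelect1k = 23.0;
constexpr double kEntropySelect10k = 460.0;
constexpr double kSqliteSelect10k = 180.0;

// True when a recomputed ratio matches the published two-decimal figure.
bool matches(double ratio, double published) {
    return std::abs(ratio - published) < 0.005;
}
```

For the 1k-row insert, 1,000 rows in 941 µs is a little over one million rows per second, and 941 / 1050 rounds to the 0.90× shown in the chart.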

proof · the hard parts

The code that proves it is not a toy

Five places where the engine does the work a toy version skips. Each links into the source at the pinned SHA.

demo · playground

Watch one statement descend the engine

The playground steps a real SQL statement through this exact architecture and renders the Entropy-vs-SQLite chart from the committed CSV. Nothing there is fabricated. It runs the project's own material.

interactive · playground · Step one statement down the engine

Run a real SQL statement through the live walkthrough: watch it parse, plan, and resolve to a B+ tree leaf, beside the Entropy-vs-SQLite chart fed from the committed benchmark CSV.
