[ sys ]fast-mnist-nnsha f8efce9measured-in-repo

A digit net I taught the CPU to run fast, from scratch.

I wrote the dense linear algebra and backprop for an MNIST multilayer perceptron in C++17 with no BLAS, Eigen, or PyTorch, then hand-vectorized the hot paths and proved the optimizations pay off with a reproducible benchmark suite.

View the repo Jump to the benchmark

[ fast-mnist-nn ], noun

pillar: systems ◆
lang: C++17
license: MIT
size: ~3,387 LOC
baseline: own scalar build
harness: Google Benchmark

20×

training speedup

SIMD intrinsics · 64-byte aligned memory · blocked GEMM tiling

80K+ img/s

inference throughput

784-30-10 MLP, classify path (serial SIMD)

AVX-512 · AVX2 · NEON

compile-time SIMD dispatch

one kernel, four arch paths + scalar tail

Headline figures are from the resume. The reproducible repo numbers that back them appear in the benchmark below, each cited to sha f8efce9.

01Problem

Train and run a digit net fast on the CPU, with nothing borrowed.

Most from-scratch neural nets stop at correctness: a few nested loops over a matrix, a sigmoid, and a training curve that eventually goes down. The harder problem is making the same math fast on a real CPU without reaching for a library that already solved it. I set the constraint up front: no BLAS, no Eigen, no PyTorch. The dense linear algebra and the backprop are mine to write, and the speed is mine to earn.

That turns one project into two questions. Does the network learn MNIST correctly, and does the hand-written math actually run fast enough to matter? The second question only counts if I can prove it, so the benchmark harness is part of the project, not an afterthought.

02Approach

Three moves: align the memory, block the multiply, fuse the step.

Speed here comes from three decisions that compound, each one a deliberate piece of the hot path rather than a compiler flag I flipped on and hoped.

i
Align the memory.
Every Matrix rounds its row stride up to a 64-byte boundary and allocates through posix_memalign, so every row starts aligned and the SIMD inner loop uses aligned loads instead of the slower unaligned path.
src/Matrix.cpp:34
ii
Block the multiply.
The multiply walks the contraction dimension in 128-element tiles to keep hot data in cache, and transposes the right operand once so the inner dot reads contiguous memory, then dispatches at compile time to AVX-512, AVX2, NEON, or a scalar tail, with FMA when the target advertises it.
src/Matrix.cpp:333
iii
Fuse the step.
The forward step fuses matrix-vector multiply, bias add, and sigmoid into one pass per weight row. The whole learn step runs forward, backprop, and the SGD update over function-local static scratch buffers, so the training hot path does zero per-step allocation.
src/NeuralNet.cpp:185

03Architecture

One data path from aligned bytes to a trained step.

The figure follows a single matrix through the optimized path. Aligned rows feed the blocked multiply, the inner dot resolves to one SIMD kernel at compile time, and the scalar result lands in the fused network step. The accent marks only the traversed path; everything else is structure.

Scroll the figure sideways to read it. Boxes-and-pointers view of src/Matrix.cpp and src/NeuralNet.cpp at sha f8efce9. The highlighted SIMD lane (AVX-512) is illustrative; the real lane is chosen by the target at compile time.

The library is layered, not a single blob.

Each layer consumes the one beneath it. The CLI drives training and evaluation over a PGM parser. The network owns the fused kernels. The matrix owns aligned storage and the SIMD dispatch. The harness turns benchmark JSON into the committed CSV and charts. About 3,387 lines hold the whole thing.

src layout · top consumes the layer beneath it

apps/fast_mnist_cli.cpptraining + eval driver
P2 PGM parser · binary image cache
NeuralNet.{h,cpp}fused MLP kernels, static scratch
gemv+bias+sigmoid · output δ · Wᵀ·δ · SGD
Matrix.{h,cpp}dense linear algebra + SIMD dispatch
aligned alloc · blocked dot · transpose · axpy
benchmarks/ · tools/run_benchmarks.pyreproducible measurement harness
Google Benchmark JSON → CSV → SVG

04Tradeoffs · road not taken

Two honest costs the benchmarks made me admit.

The data forced two decisions I would rather not have needed, and both stay in the repo as written rather than papered over.

OpenMP only on big matrix ops, never on the network

bench_summary.csv

Threading the small operations made them slower, not faster. With OpenMP on, axpy/128 runs about 6.9× slower and transpose/128 about 4.3× slower, because the fork-join overhead dwarfs the work. So I gate the pragmas behind a size threshold and leave the network's learn and classify paths as serial SIMD. The near-flat learn throughput across all three configs confirms OpenMP never touches them.

Static scratch buffers buy speed, cost thread-safety

src/NeuralNet.cpp

The training and inference paths reuse function-local static buffers to kill per-step allocation. Those statics are shared across every NeuralNet instance, so the win comes at the price of multi-instance and multi-thread correctness. For a single trainer on one thread it is the right call, so I document the reentrancy caveat rather than pretend it isn't there.

05Benchmark · vs scalar baseline

Wins and losses, side by side.

The named baseline is the project's single-threaded scalar build (OpenMP off, no -march=native), run side by side with the native and openmp+native configs in the same Google Benchmark harness. There is no BLAS or Eigen comparison on purpose, since the whole point was to write the math myself. On the fully optimized matrix path that adds up to the resume's 20× training speedup; the committed CSV below is the reproducible evidence underneath it.

§ bench_summary.csv · openmp+native vs scalar · sha f8efce9

Scroll the chart sideways to read it. Every value is read from the committed docs/benchmarks/bench_summary.csv at sha f8efce9. Single-repetition runs; lower ns/op is better.

The large dot product carries the win: ~3.5× faster at 256×256, with transpose and axpy near 2× at size 1024. The two faint bars below the line are real, not rounding errors, and they are the most useful thing in the chart: parallelism has a crossover point, and below it the overhead wins. Recording that is the difference between a benchmark and a brochure.

[ sys ]playground

Draw a digit, watch the net decide.

The playground hosts a small interactive version of this network: sketch a digit on a 28×28 grid and the forward pass scores all ten classes live, with the confidence bars updating as you draw. It runs the same forward math this case study describes.

Open the playground

preview

§ source · MIT · C++17 · sha f8efce9

Read the source.

The SIMD dispatch, the blocked multiply, the fused kernels, and the benchmark harness are all in the repo, pinned at the commit every number on this page is measured from. Cross-platform CI runs the Catch2 suite on Linux, macOS, and Windows.

View repository ← All case studies

Wins and losses, side by side.

Train and run a digit net fast on the CPU, with nothing borrowed.

Three moves: align the memory, block the multiply, fuse the step.

Align the memory.

Block the multiply.

Fuse the step.

One data path from aligned bytes to a trained step.

The library is layered, not a single blob.

Two honest costs the benchmarks made me admit.

OpenMP only on big matrix ops, never on the network

Static scratch buffers buy speed, cost thread-safety

Wins and losses, side by side.

Draw a digit, watch the net decide.

Read the source.

Onemoment

Train and run a digit net fast on the CPU, with nothing borrowed.

Three moves: align the memory, block the multiply, fuse the step.

Align the memory.

Block the multiply.

Fuse the step.

One data path from aligned bytes to a trained step.

The library is layered, not a single blob.

Two honest costs the benchmarks made me admit.

OpenMP only on big matrix ops, never on the network

Static scratch buffers buy speed, cost thread-safety

Wins and losses, side by side.

Draw a digit, watch the net decide.

Read the source.