[ sys ]fast-mnist-nnsha f8efce9measured-in-repo
A digit net I taught the CPU to run fast, from scratch.
I wrote the dense linear algebra and backprop for an MNIST multilayer perceptron in C++17 with no BLAS, Eigen, or PyTorch, then hand-vectorized the hot paths and proved the optimizations pay off with a reproducible benchmark suite.
- [ fast-mnist-nn ], noun
- pillar
- systems ◆
- lang
- C++17
- license
- MIT
- size
- ~3,387 LOC
- baseline
- own scalar build
- harness
- Google Benchmark
training speedup
SIMD intrinsics · 64-byte aligned memory · blocked GEMM tiling
inference throughput
784-30-10 MLP, classify path (serial SIMD)
compile-time SIMD dispatch
one kernel, four arch paths + scalar tail
Headline figures are from the resume. The reproducible repo numbers that back them appear in the benchmark below, each cited to sha f8efce9.
01Problem
Train and run a digit net fast on the CPU, with nothing borrowed.
Most from-scratch neural nets stop at correctness: a few nested loops over a matrix, a sigmoid, and a training curve that eventually goes down. The harder problem is making the same math fast on a real CPU without reaching for a library that already solved it. I set the constraint up front: no BLAS, no Eigen, no PyTorch. The dense linear algebra and the backprop are mine to write, and the speed is mine to earn.
That turns one project into two questions. Does the network learn MNIST correctly, and does the hand-written math actually run fast enough to matter? The second question only counts if I can prove it, so the benchmark harness is part of the project, not an afterthought.
02Approach
Three moves: align the memory, block the multiply, fuse the step.
Speed here comes from three decisions that compound, each one a deliberate piece of the hot path rather than a compiler flag I flipped on and hoped.
- i
Align the memory.
Every Matrix rounds its row stride up to a 64-byte boundary and allocates through
posix_memalign, so every row starts aligned and the SIMD inner loop uses aligned loads instead of the slower unaligned path.src/Matrix.cpp:34 - ii
Block the multiply.
The multiply walks the contraction dimension in 128-element tiles to keep hot data in cache, and transposes the right operand once so the inner dot reads contiguous memory, then dispatches at compile time to AVX-512, AVX2, NEON, or a scalar tail, with FMA when the target advertises it.
src/Matrix.cpp:333 - iii
Fuse the step.
The forward step fuses matrix-vector multiply, bias add, and sigmoid into one pass per weight row. The whole learn step runs forward, backprop, and the SGD update over function-local static scratch buffers, so the training hot path does zero per-step allocation.
src/NeuralNet.cpp:185
03Architecture
One data path from aligned bytes to a trained step.
The figure follows a single matrix through the optimized path. Aligned rows feed the blocked multiply, the inner dot resolves to one SIMD kernel at compile time, and the scalar result lands in the fused network step. The accent marks only the traversed path; everything else is structure.
src/Matrix.cpp and src/NeuralNet.cpp at sha f8efce9. The highlighted SIMD lane (AVX-512) is illustrative; the real lane is chosen by the target at compile time.The library is layered, not a single blob.
Each layer consumes the one beneath it. The CLI drives training and evaluation over a PGM parser. The network owns the fused kernels. The matrix owns aligned storage and the SIMD dispatch. The harness turns benchmark JSON into the committed CSV and charts. About 3,387 lines hold the whole thing.
src layout · top consumes the layer beneath it
apps/fast_mnist_cli.cpptraining + eval driverP2 PGM parser · binary image cache
NeuralNet.{h,cpp}fused MLP kernels, static scratchgemv+bias+sigmoid · output δ · Wᵀ·δ · SGD
Matrix.{h,cpp}dense linear algebra + SIMD dispatchaligned alloc · blocked dot · transpose · axpy
benchmarks/ · tools/run_benchmarks.pyreproducible measurement harnessGoogle Benchmark JSON → CSV → SVG
04Tradeoffs · road not taken
Two honest costs the benchmarks made me admit.
The data forced two decisions I would rather not have needed, and both stay in the repo as written rather than papered over.
OpenMP only on big matrix ops, never on the network
bench_summary.csvThreading the small operations made them slower, not faster. With OpenMP on, axpy/128 runs about 6.9× slower and transpose/128 about 4.3× slower, because the fork-join overhead dwarfs the work. So I gate the pragmas behind a size threshold and leave the network's learn and classify paths as serial SIMD. The near-flat learn throughput across all three configs confirms OpenMP never touches them.
Static scratch buffers buy speed, cost thread-safety
src/NeuralNet.cppThe training and inference paths reuse function-local static buffers to kill per-step allocation. Those statics are shared across every NeuralNet instance, so the win comes at the price of multi-instance and multi-thread correctness. For a single trainer on one thread it is the right call, so I document the reentrancy caveat rather than pretend it isn't there.
05Benchmark · vs scalar baseline
Wins and losses, side by side.
The named baseline is the project's single-threaded scalar build (OpenMP off, no -march=native), run side by side with the native and openmp+native configs in the same Google Benchmark harness. There is no BLAS or Eigen comparison on purpose, since the whole point was to write the math myself. On the fully optimized matrix path that adds up to the resume's 20× training speedup; the committed CSV below is the reproducible evidence underneath it.
docs/benchmarks/bench_summary.csv at sha f8efce9. Single-repetition runs; lower ns/op is better.The large dot product carries the win: ~3.5× faster at 256×256, with transpose and axpy near 2× at size 1024. The two faint bars below the line are real, not rounding errors, and they are the most useful thing in the chart: parallelism has a crossover point, and below it the overhead wins. Recording that is the difference between a benchmark and a brochure.
[ sys ]playground
Draw a digit, watch the net decide.
The playground hosts a small interactive version of this network: sketch a digit on a 28×28 grid and the forward pass scores all ten classes live, with the confidence bars updating as you draw. It runs the same forward math this case study describes.
Read the source.
The SIMD dispatch, the blocked multiply, the fused kernels, and the benchmark harness are all in the repo, pinned at the commit every number on this page is measured from. Cross-platform CI runs the Catch2 suite on Linux, macOS, and Windows.