[ ai ] auto-ml · sha 1168651 · measured-in-repo

It drives the whole lifecycle, under guard.

Upload, EDA, natural-language SQL, preprocessing, training, experiments, deployment. Each phase is an agent that proposes Python, runs it under human approval, and repairs its own failures. It never executes model-generated code in-process.

~137k LOC TypeScript
~2,208 automated tests
23 SQL migrations
382 React components

every figure counted from the source tree at the pinned commit

[ ai ] auto-ml · sha 1168651 · measured-in-repo

The path from a messy CSV to a deployed model is hand-stitched.

Getting a raw, messy dataset to a model in production normally means wiring together pandas, scikit-learn, a pile of notebooks, an experiment tracker, and serving code by hand. Every handoff is manual, every step is a place to lose state, and nothing keeps a record of why a given transform happened.

The question this project asks is narrow and concrete. Can an LLM agent drive that entire lifecycle, from exploration through NL querying, preprocessing, training, experiments, and deployment, while a human stays in the loop at every commit and no model-generated Python ever runs in-process? That last constraint is the hard part. An agent that writes and runs arbitrary code is a remote-code-execution surface by construction, so the whole design has to treat the model’s output as hostile.

§ scope · backend 54k LOC · frontend 83k LOC · 246 *.test.ts(x) files · 13-tool MCP registry

[ ai ] auto-ml · sha 1168651 · measured-in-repo

One agent per phase. It proposes, you approve, the sandbox runs.

Every ML stage is an LLM agent that proposes an action, generates Python into a notebook cell, and validates the result through MCP tool calls, gated on operator approval. The agent never touches a Python process directly. It writes code into a cell, then a hardened Docker container with a persistent Jupyter kernel executes it. Kernel state survives across cells, so a scaler fitted early is still in memory many cells later, the same way a human’s notebook session works.

When a cell fails, the agent does not silently emit bad output. The failure feeds a bounded auto-repair loop: the error context goes back into code generation, the agent rewrites the cell, and it re-runs, up to a hard cap, after which control returns to the operator. Validation is its own state in the graph, not an afterthought, so “the code ran” and “the code did the right thing” get checked separately.
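The bounded repair loop described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `runWithRepair`, `RepairDeps`, the synchronous callbacks, and the default cap of 3 are all assumptions made for the example.

```typescript
// Hypothetical types; none of these names are taken from the repository.
type CellResult =
  | { ok: true; output: string }
  | { ok: false; error: string };

interface RepairDeps {
  // Executes a cell in the sandbox (assumed synchronous here for brevity).
  runCell: (code: string) => CellResult;
  // Asks the model for a rewritten cell, given the failing code and its error.
  regenerate: (code: string, error: string) => string;
}

type RepairOutcome =
  | { status: "succeeded"; code: string; output: string; attempts: number }
  | { status: "needs_operator"; code: string; lastError: string; attempts: number };

// Run a cell; on failure, feed the error back into generation.
// At most `maxRepairs` rewrites happen before control returns to the operator.
function runWithRepair(
  initialCode: string,
  deps: RepairDeps,
  maxRepairs = 3, // assumed cap; the real value is not stated on this page
): RepairOutcome {
  let code = initialCode;
  let attempts = 0;
  for (;;) {
    attempts += 1;
    const result = deps.runCell(code);
    if (result.ok) {
      return { status: "succeeded", code, output: result.output, attempts };
    }
    // attempts - 1 repairs have been used so far; stop once the cap is spent.
    if (attempts > maxRepairs) {
      return { status: "needs_operator", code, lastError: result.error, attempts };
    }
    code = deps.regenerate(code, result.error);
  }
}
```

The loop only answers "did the code run"; whether the code did the right thing would still be a separate validation step, as the paragraph above notes.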

The control surface is the Model Context Protocol. 13 tools (get_dataset_profile, run_cell, edit_cell, search_documents and the rest) are exposed to the model over the official MCP SDK with an InMemoryTransport, after schema sanitization strips internal fields the model has no business seeing.
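One way schema sanitization could work is a recursive pass over each tool's input schema before it is registered. The sketch below is an assumption-heavy illustration: the `x-internal` marker, the `kernelSessionId` field, and the `sanitizeSchema` helper are invented for the example and are not confirmed by the repository. Only the tool name `run_cell` comes from the text above.

```typescript
// A loose JSON-Schema-like shape, just enough for the example.
type JsonSchema = {
  type?: string;
  description?: string;
  properties?: Record<string, JsonSchema>;
  required?: string[];
  items?: JsonSchema;
  [key: string]: unknown;
};

// Return a copy of the schema with every property marked `x-internal: true`
// removed (hypothetical marker), and `required` trimmed to match.
function sanitizeSchema(schema: JsonSchema): JsonSchema {
  const out: JsonSchema = { ...schema };
  delete out["x-internal"];

  const props = schema.properties;
  if (props) {
    const kept: Record<string, JsonSchema> = {};
    for (const name of Object.keys(props)) {
      const prop = props[name];
      if (prop === undefined || prop["x-internal"] === true) continue;
      kept[name] = sanitizeSchema(prop);
    }
    out.properties = kept;
    if (schema.required) {
      out.required = schema.required.filter((name) =>
        Object.prototype.hasOwnProperty.call(kept, name),
      );
    }
  }
  if (schema.items) out.items = sanitizeSchema(schema.items);
  return out;
}

// Hypothetical input schema for the run_cell tool.
const runCellInput: JsonSchema = {
  type: "object",
  properties: {
    code: { type: "string", description: "Python source for the cell" },
    kernelSessionId: { type: "string", "x-internal": true },
  },
  required: ["code", "kernelSessionId"],
};

// The model-facing schema exposes `code` only.
const modelFacing = sanitizeSchema(runCellInput);
```

Stripping at registration time, rather than per call, would mean the model never sees the internal field names at all; the server would then fill those fields in itself before dispatching the tool.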

[ ai ] auto-ml · sha 1168651 · transcribed-from-source

The preprocessing engine is a guarded state machine.

The LangGraph runtime walks a guarded path: context, plan, generate, execute, validate, approve, commit. Failure routes back to code generation; success routes through human approval before anything commits. This is the verified transition function in preprocessingRuntime.ts, drawn.

State machine transcribed from backend/src/services/llm/langgraph/preprocessingRuntime.ts:91-161 (resolvePreprocessingTransition); sandbox flags from services/container/dockerBuilder.ts, both covered by tests. Solid green is the success path; amber dashed is the bounded auto-repair loop back to generate_code.
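For readers without the diagram, the guarded path can be approximated as a pure transition function. This is a sketch under stated assumptions, not a copy of `resolvePreprocessingTransition`: the stage and signal names, the `halted` terminal state, the cap of 3, and the choice to halt on rejection are all hypothetical.

```typescript
type Stage =
  | "context" | "plan" | "generate" | "execute"
  | "validate" | "approve" | "commit" | "halted";

type Signal = "ok" | "failed" | "approved" | "rejected";

interface RunState {
  stage: Stage;
  repairsUsed: number;
}

const MAX_REPAIRS = 3; // assumed cap, not taken from the source

// Success path: each stage hands off to the next one.
const SUCCESSOR: Record<"context" | "plan" | "generate" | "execute" | "validate", Stage> = {
  context: "plan",
  plan: "generate",
  generate: "execute",
  execute: "validate",
  validate: "approve",
};

function nextState(state: RunState, signal: Signal): RunState {
  const { stage, repairsUsed } = state;
  const go = (next: Stage): RunState => ({ stage: next, repairsUsed });

  // Terminal states absorb every signal.
  if (stage === "commit" || stage === "halted") return state;

  // Nothing commits without an explicit operator decision.
  if (stage === "approve") {
    if (signal === "approved") return go("commit");
    if (signal === "rejected") return go("halted"); // assumption: rejection ends the run
    throw new Error(`approve expects approved/rejected, got ${signal}`);
  }

  if (signal === "ok") return go(SUCCESSOR[stage]);

  // Failure in execution or validation routes back to code generation, bounded by the cap.
  if (signal === "failed" && (stage === "execute" || stage === "validate")) {
    return repairsUsed < MAX_REPAIRS
      ? { stage: "generate", repairsUsed: repairsUsed + 1 }
      : go("halted");
  }

  throw new Error(`no transition from ${stage} on ${signal}`);
}
```

Keeping the function pure, with state in and state out, is what makes a transition table like this straightforward to cover with unit tests.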

§ nl-to-sql · schema context -> planning -> generation (compact fallback) -> read-only validation · pipeline.ts:394-441
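The last stage of that pipeline, read-only validation, might look something like the check below. It is a deliberately coarse sketch and an assumption about the approach: the keyword denylist, the `validateReadOnly` name, and the `passengers` table are illustrative, and a production validator would more likely rely on a SQL parser plus a read-only database role.

```typescript
type Verdict = { ok: true } | { ok: false; reason: string };

// Keywords that write or change state. Coarse on purpose: a column literally
// named "update" would be rejected too, which errs on the safe side.
const FORBIDDEN =
  /\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|copy|merge|call|execute|into)\b/i;

// Blank out string literals and comments so their contents cannot hide or fake keywords.
function stripLiteralsAndComments(sql: string): string {
  return sql.replace(/'(?:[^']|'')*'|--[^\n]*|\/\*[\s\S]*?\*\//g, " ");
}

function validateReadOnly(sql: string): Verdict {
  // Allow a single trailing semicolon, nothing after it.
  const body = stripLiteralsAndComments(sql).trim().replace(/;\s*$/, "");
  if (body.length === 0) return { ok: false, reason: "empty statement" };
  if (body.includes(";")) return { ok: false, reason: "multiple statements" };
  if (!/^(select|with)\b/i.test(body)) {
    return { ok: false, reason: "must start with SELECT or WITH" };
  }
  const hit = FORBIDDEN.exec(body);
  if (hit) return { ok: false, reason: `forbidden keyword: ${hit[0].toLowerCase()}` };
  return { ok: true };
}
```

A check like this is best treated as a first filter; the stronger guarantee would come from executing the generated query under credentials that cannot write in the first place.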

[ ai ] auto-ml · sha 1168651 · design-notes

Roads taken, and the ones honestly not finished.

The interesting decisions are the boundaries this project drew on purpose, and the ones it admits it never crossed.

kept

Split persistence on purpose

File-backed storage for project JSON, dataset bytes, and model artifacts; Postgres for auth, embeddings, notebooks, and workflows. A documented pragmatic choice: big binary blobs do not belong in a relational store, and metadata does not belong in loose files.

CLAUDE.md · Architecture.md

kept

Sandbox is untrusted by default

Every execution container ships with --network none, a read-only root filesystem, a non-root user, memory and CPU caps, and an add-host rule that blackholes SSRF to the host even if networking is later turned on. The model’s code is treated as hostile from the first byte.

dockerBuilder.ts · verified by tests

honest loss

The LangGraph engine is still a scaffold

The preprocessing state machine names its own compiled graph preprocessing-langgraph-scaffold. Production preprocessing actually runs through services/workflows. That not-yet-migrated seam is left visible rather than papered over in the README.

preprocessingRuntime.ts:272 · honest boundary

honest loss

No unified benchmark runner exists yet

The repo defines benchmark suites against public datasets (Titanic, Ames Housing, Credit Card Fraud, Adult Income) but commits no measured results. The design notes pressure-test the benchmark concept, reject a naive Jaccard agreement metric, and conclude the runner is unbuilt. The honesty is the point.

docs/expo-benchmark-design-notes.md

[ ai ] auto-ml · sha 1168651 · counted-at-sha

What is actually measured in the repo.

No model-quality numbers are committed, so this case study does not claim any. The receipts below are all counted from the source tree at the pinned SHA.

~2,208 test cases
1,229 backend + 908 frontend + 71 landing, across ~246 *.test.ts(x) files

~137k lines of app code
~54k backend TS + ~83k frontend TS/TSX, excluding tests

382 React components
plus 39 Zustand stores and 39 custom hooks

23 SQL migrations
001_init through 021, sequential schema evolution

13 MCP tools
official SDK, InMemoryTransport, schema-sanitized for the model

NL-to-SQL phases
schema context, plan, generate, read-only validate

The sandbox, flag by flag

backend/src/services/container/dockerBuilder.ts · every flag in one auditable function, covered by dockerBuilder.test.ts

--network none · no egress by default; model code cannot phone home

--read-only · immutable root filesystem; writes only to scoped tmpfs

--user sandbox · non-root execution, dropped privileges

--memory / --cpus · hard resource caps per container

--tmpfs /tmp:nosuid · writable scratch that cannot escalate

-v datasets:/datasets:ro · data mounted read-only; code cannot corrupt the source

--add-host host.docker.internal:0.0.0.0 · SSRF to the host blackholed even if networking is later overridden
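Collected into one function, the flags above could be assembled along these lines. This is a sketch, not the contents of dockerBuilder.ts: the `buildSandboxArgs` name, the `SandboxLimits` shape, the image name, and the default memory and CPU values are assumptions. Only the flags themselves come from the list above.

```typescript
interface SandboxLimits {
  memory: string; // value for --memory, e.g. "2g" (assumed default)
  cpus: string;   // value for --cpus, e.g. "1" (assumed default)
}

// Build the argument vector for `docker run`. Returning an array rather than a
// shell string keeps each flag auditable and avoids quoting problems.
function buildSandboxArgs(
  image: string,
  limits: SandboxLimits = { memory: "2g", cpus: "1" },
): string[] {
  return [
    "run",
    "--network", "none",                          // no egress by default
    "--read-only",                                // immutable root filesystem
    "--user", "sandbox",                          // non-root execution
    "--memory", limits.memory,                    // hard memory cap
    "--cpus", limits.cpus,                        // hard CPU cap
    "--tmpfs", "/tmp:nosuid",                     // writable scratch, no setuid
    "-v", "datasets:/datasets:ro",                // datasets mounted read-only
    "--add-host", "host.docker.internal:0.0.0.0", // blackhole host access
    image,
  ];
}

// Hypothetical image name, for illustration only.
const sandboxArgs = buildSandboxArgs("automl-sandbox:latest");
```

Having every flag in a single pure function is what lets a test assert on the full argument list without starting a container.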

the number I will not invent

There is no committed “7x faster” benchmark.

The resume credits this project with deploying models 7x faster than manual Jupyter. I believe that from building it, but the repo does not prove it: the benchmark quality gate is literally set to 'tbd' in expo-public-p0.v1.json and no run artifacts are committed. So this page reports only what the tree actually holds: the test count, the LOC, the migrations, the MCP tools, the exact sandbox flags. The time-to-model claim stays out.

That is the deliberate cost of honesty here. The system is real and the engineering is auditable; the headline speed number is not yet earned, so I do not show it.

[ ai ] inline demo

Step through one agentic run.

The playground replays a real preprocessing run: the guarded transition function above, the bounded auto-repair loop firing on a failed cell, and the approval gate, all from committed repo material.

open the playground

[ ai ] real product captures

Six phases, one workspace.

Capture: Data upload + schema inference

Capture: Agentic EDA dashboard

Capture: English to SQL, validated read-only

Capture: Preprocessing with approval gates

Capture: Training in a sandboxed kernel

Capture: Experiment tracking + leaderboard

view the repository · architecture wiki · all work

auto-ml · agentic AutoML platform · pinned at sha 1168651 on main · GPL-3.0
all figures and receipts are sourced from the repository at that SHA; nothing on this page is invented.
