[ ai ] auto-ml · sha 1168651 · measured-in-repo
Upload, EDA, natural-language SQL, preprocessing, training, experiments, deployment. Each phase is an agent that proposes Python, runs it under human approval, and repairs its own failures. It never executes model-generated code in-process.
every figure counted from the source tree at the pinned commit
[ ai ] auto-ml · sha 1168651 · measured-in-repo
Getting a raw, messy dataset to a model in production normally means wiring together pandas, scikit-learn, a pile of notebooks, an experiment tracker, and serving code by hand. Every handoff is manual, every step is a place to lose state, and nothing keeps a record of why a given transform happened.
The question this project asks is narrow and concrete. Can an LLM agent drive that entire lifecycle, from exploration through NL querying, preprocessing, training, experiments, and deployment, while a human stays in the loop at every commit and no model-generated Python ever runs in-process? That last constraint is the hard part. An agent that writes and runs arbitrary code is a remote-code-execution surface by construction, so the whole design has to treat the model’s output as hostile.
[ ai ] auto-ml · sha 1168651 · measured-in-repo
Every ML stage is an LLM agent that proposes an action, generates Python into a notebook cell, and validates the result through MCP tool calls, gated on operator approval. The agent never touches a Python process directly. It writes code into a cell, and a hardened Docker container running a persistent Jupyter kernel executes it. Kernel state survives across cells, so a scaler fitted early is still in memory many cells later, the same way a human’s notebook session works.
When a cell fails, the agent does not silently emit bad output. The failure feeds a bounded auto-repair loop: the error context goes back into code generation, the agent rewrites the cell, and it re-runs, up to a hard cap, after which control returns to the operator. Validation is its own state in the graph, not an afterthought, so “the code ran” and “the code did the right thing” get checked separately.
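The bounded repair loop described above can be sketched as follows. This is a minimal illustration, not the repository's code: `runWithRepair`, `GenerateCode`, `RunCell`, and the cap of 3 are all hypothetical names and values, and the sketch is kept synchronous for brevity where a real runtime would presumably await the model and the kernel.

```typescript
// Hypothetical types; none of these names come from the repository.
interface CellResult {
  ok: boolean;
  error?: string;
}

type GenerateCode = (task: string, lastError?: string) => string;
type RunCell = (code: string) => CellResult;

type RepairOutcome =
  | { status: "succeeded"; code: string; attempts: number }
  | { status: "escalated"; lastError: string; attempts: number };

// Assumed value: the source says only that the loop has a hard cap.
const MAX_ATTEMPTS = 3;

function runWithRepair(
  task: string,
  generate: GenerateCode,
  runCell: RunCell,
  maxAttempts: number = MAX_ATTEMPTS,
): RepairOutcome {
  let lastError: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // The previous error, if any, is fed back into code generation.
    const code = generate(task, lastError);
    const result = runCell(code);
    if (result.ok) {
      return { status: "succeeded", code, attempts: attempt };
    }
    lastError = result.error ?? "unknown error";
  }
  // Cap reached: stop retrying and hand control back to the operator.
  return {
    status: "escalated",
    lastError: lastError ?? "unknown error",
    attempts: maxAttempts,
  };
}
```

The point of returning a tagged outcome rather than throwing is that "escalated" is an expected result the caller has to handle, not an exceptional one.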
The control surface is the Model Context Protocol. Thirteen tools (get_dataset_profile, run_cell, edit_cell, search_documents, and the rest) are exposed to the model over the official MCP SDK with an InMemoryTransport, after schema sanitization strips internal fields the model has no business seeing.
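To make the sanitization step concrete, here is one way it could look. Everything in this sketch is assumed for illustration: the `ToolSchema` shape, the `sanitizeToolSchema` helper, the internal field names, and the parameters given to `run_cell`. It does not use the MCP SDK, only a plain JSON-Schema-like object.

```typescript
// Simplified, hypothetical view of a tool's input schema.
interface ToolSchema {
  type: "object";
  properties: Record<string, Record<string, unknown>>;
  required?: string[];
}

// Assumed deny list; the real internal field names are not shown in the source.
const INTERNAL_FIELDS: ReadonlySet<string> = new Set(["projectId", "userId"]);

// Return a copy of the schema without internal fields, so the model never
// sees them and cannot be asked to supply them.
function sanitizeToolSchema(
  schema: ToolSchema,
  internal: ReadonlySet<string>,
): ToolSchema {
  const properties: ToolSchema["properties"] = {};
  for (const [name, spec] of Object.entries(schema.properties)) {
    if (!internal.has(name)) properties[name] = spec;
  }
  const required = (schema.required ?? []).filter((name) => !internal.has(name));
  return {
    type: "object",
    properties,
    ...(required.length > 0 ? { required } : {}),
  };
}

// Hypothetical parameters for the run_cell tool named in the text.
const runCellSchema: ToolSchema = {
  type: "object",
  properties: {
    projectId: { type: "string" },
    cellId: { type: "string" },
    code: { type: "string" },
  },
  required: ["projectId", "code"],
};

const modelFacingSchema = sanitizeToolSchema(runCellSchema, INTERNAL_FIELDS);
```

In a design like this the server would fill the stripped fields from its own session state when the tool is invoked, so the model cannot redirect a call to another project.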
[ ai ] auto-ml · sha 1168651 · transcribed-from-source
The LangGraph runtime walks a guarded path: context, plan, generate, execute, validate, approve, commit. Failure routes back to code generation; success routes through human approval before anything commits. This is the verified transition function in preprocessingRuntime.ts, drawn.
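The guarded path can be written as a pure transition function. The sketch below is an assumption-laden illustration and not a transcription of resolvePreprocessingTransition: only `generate_code` is confirmed as an identifier by the source, and the other stage spellings, the `handoff_to_operator` terminal, the repair cap, and the behaviour on a rejected approval are hypothetical.

```typescript
// Stage names follow the path named in the text; spellings other than
// generate_code are assumed.
type Stage =
  | "context"
  | "plan"
  | "generate_code"
  | "execute"
  | "validate"
  | "approve"
  | "commit";

type Next = Stage | "handoff_to_operator" | "done";

interface StepOutcome {
  ok: boolean; // stage succeeded (for approve: the operator said yes)
  repairsUsed: number; // auto-repair attempts already spent on this cell
}

const MAX_REPAIRS = 3; // assumed cap

function nextStage(
  stage: Stage,
  outcome: StepOutcome,
  maxRepairs: number = MAX_REPAIRS,
): Next {
  // A failed execution or validation re-enters code generation until the
  // cap is hit, then control returns to the operator.
  const repairOrHandoff: Next =
    outcome.repairsUsed < maxRepairs ? "generate_code" : "handoff_to_operator";

  switch (stage) {
    case "context":
      return "plan";
    case "plan":
      return "generate_code";
    case "generate_code":
      return "execute";
    case "execute":
      return outcome.ok ? "validate" : repairOrHandoff;
    case "validate":
      return outcome.ok ? "approve" : repairOrHandoff;
    case "approve":
      // Assumption: a rejection leaves the decision with the operator.
      return outcome.ok ? "commit" : "handoff_to_operator";
    case "commit":
      return "done";
  }
}
```

Keeping the function pure (stage and outcome in, next stage out) is what makes a transition table like this easy to cover with unit tests, independent of the graph runtime that calls it.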
Diagram: transitions from resolvePreprocessingTransition; sandbox flags from services/container/dockerBuilder.ts, both covered by tests. Solid green is the success path; amber dashed is the bounded auto-repair loop back to generate_code.
[ ai ] auto-ml · sha 1168651 · design-notes
The interesting decisions are the boundaries this project drew on purpose, and the ones it admits it never crossed.
File-backed storage for project JSON, dataset bytes, and model artifacts; Postgres for auth, embeddings, notebooks, and workflows. A documented pragmatic choice: big binary blobs do not belong in a relational store, and metadata does not belong in loose files.
CLAUDE.md · Architecture.md
Every execution container ships with --network none, a read-only root filesystem, a non-root user, memory and CPU caps, and an add-host rule that blackholes SSRF to the host even if networking is later turned on. The model’s code is treated as hostile from the first byte.
dockerBuilder.ts · verified by tests
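As a rough picture of what "every flag in one function" might look like, the sketch below assembles `docker run` arguments for the properties listed above. The function name, the `SandboxOptions` type, the limit values, and the choice of `host.docker.internal:127.0.0.1` as the blackhole mapping are all assumptions. The Docker CLI flags themselves (`--network`, `--read-only`, `--user`, `--memory`, `--cpus`, `--add-host`) are standard, but the repository's actual flag set and values are not reproduced here.

```typescript
// Hypothetical options type; values are supplied by the caller.
interface SandboxOptions {
  image: string;
  user: string; // non-root uid:gid, e.g. "1000:1000"
  memory: string; // e.g. "2g"
  cpus: string; // e.g. "1.5"
}

// Build the full argument vector in one place so it can be audited and
// asserted on in tests.
function buildSandboxArgs(opts: SandboxOptions): string[] {
  return [
    "run",
    "--network", "none", // no network interfaces besides loopback
    "--read-only", // read-only root filesystem
    "--user", opts.user, // never run as root
    "--memory", opts.memory, // hard memory cap
    "--cpus", opts.cpus, // CPU cap
    // Assumed blackhole: resolve the host alias to the container's own
    // loopback so it cannot reach the host even if networking is enabled.
    "--add-host", "host.docker.internal:127.0.0.1",
    opts.image,
  ];
}

const sandboxArgs = buildSandboxArgs({
  image: "sandbox-kernel:latest", // hypothetical image name
  user: "1000:1000",
  memory: "2g",
  cpus: "1.5",
});
```

Returning an array instead of a shell string avoids quoting issues when the arguments are passed to a process spawner, and it lets a test check each flag by position.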
The preprocessing state machine names its own compiled graph preprocessing-langgraph-scaffold. Production preprocessing actually runs through services/workflows. That not-yet-migrated seam is left visible rather than papered over in the README.
preprocessingRuntime.ts:272 · honest boundary
The repo defines benchmark suites against public datasets (Titanic, Ames Housing, Credit Card Fraud, Adult Income) but commits no measured results. The design notes pressure-test the benchmark concept, reject a naive Jaccard agreement metric, and conclude the runner is unbuilt. The honesty is the point.
docs/expo-benchmark-design-notes.md
[ ai ] auto-ml · sha 1168651 · counted-at-sha
No model-quality numbers are committed, so this case study does not claim any. The receipts below are all counted from the source tree at the pinned SHA.
~2,208 test cases · 1,229 backend + 908 frontend + 71 landing, across ~246 *.test.ts(x) files
~137k lines of app code · ~54k backend TS + ~83k frontend TS/TSX, excluding tests
382 React components · plus 39 Zustand stores and 39 custom hooks
23 SQL migrations · 001_init through 021, sequential schema evolution
13 MCP tools · official SDK, InMemoryTransport, schema-sanitized for the model
4 NL-to-SQL phases · schema context, plan, generate, read-only validate
backend/src/services/container/dockerBuilder.ts · every flag in one auditable function, covered by dockerBuilder.test.ts
the number I will not invent
The resume credits this project with deploying models 7x faster than manual Jupyter. I believe that from building it, but the repo does not prove it: the benchmark quality gate is literally set to 'tbd' in expo-public-p0.v1.json and no run artifacts are committed. So this page reports only what the tree actually holds: the test count, the LOC, the migrations, the MCP tools, the exact sandbox flags. The time-to-model claim stays out.
That is the deliberate cost of honesty here. The system is real and the engineering is auditable; the headline speed number is not yet earned, so I do not show it.
[ ai ] inline demo
The playground replays a real preprocessing run: the guarded transition function above, the bounded auto-repair loop firing on a failed cell, and the approval gate, all from committed repo material.
auto-ml · agentic AutoML platform · pinned at sha 1168651 on main · GPL-3.0
all figures and receipts are sourced from the repository at that SHA; nothing on this page is invented.