I Left an AI to Do ML Research. It Crashed Three Times. I Was Ready.
Building production infrastructure for autonomous experimentation
On March 9, 2026, Andrej Karpathy released autoresearch (a system where an AI coding agent autonomously modifies a training script, runs experiments, keeps improvements, discards failures, and repeats). Forever.
His overnight run: 126 experiments. 23 kept improvements. Baseline val_bpb of 0.9979 driven down to 0.9697 (on code he’d already hand-tuned for months).
Within days, the repo had 42,000 stars. Shopify’s CEO pointed it at their Liquid templating engine and got 53% faster rendering from 93 autonomous commits. Fortune magazine called it “The Karpathy Loop.”
I cloned the repo that night.
What Karpathy Built
The entire codebase is five files and ~600 lines of Python:
autoresearch/
├── prepare.py # Data, tokenizer, evaluation (read-only)
├── train.py # Model + training loop (agent modifies this)
├── program.md # Agent instructions (human modifies this)
├── pyproject.toml # Dependencies
└── results.tsv # Experiment log
The core insight is simple: the human programs the agent through a Markdown prompt, and the agent programs the model through code. Two layers of meta-programming.
The agent reads program.md, which tells it to:
Modify train.py with an experimental idea
Commit the change
Train for exactly 5 minutes
Check if validation loss (val_bpb) improved
If yes, keep the commit. If no, git reset.
Repeat forever.
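The steps above boil down to a keep-or-revert decision on a single number. Here is a minimal sketch in TypeScript, assuming lower val_bpb is better and that a crashed run counts as a discard. `runLoop` and the `Experiment` shape are hypothetical names for illustration; the real agent does this with git commits, not an in-memory list.

```typescript
// Hypothetical shape of one experiment outcome; not taken from the repo.
interface Experiment {
  description: string;
  valBpb: number | null; // null means the training run crashed
}

interface LoopResult {
  best: number;
  kept: string[];
  discarded: string[];
}

// Replays experiment outcomes against the keep-or-revert rule:
// keep a change only if it strictly lowers the best val_bpb seen so far.
function runLoop(baseline: number, experiments: Experiment[]): LoopResult {
  let best = baseline;
  const kept: string[] = [];
  const discarded: string[] = [];
  for (const exp of experiments) {
    if (exp.valBpb !== null && exp.valBpb < best) {
      best = exp.valBpb; // stands in for keeping the commit
      kept.push(exp.description);
    } else {
      discarded.push(exp.description); // stands in for `git reset`
    }
  }
  return { best, kept, discarded };
}

// Made-up outcomes, purely to exercise the rule.
const result = runLoop(1.0, [
  { description: "lower learning rate", valBpb: 0.99 },
  { description: "bigger batch", valBpb: null },
  { description: "new schedule", valBpb: 0.995 },
]);
console.log(result);
```

Note that the third experiment beats the baseline but not the running best, so it is discarded. That ratchet is what keeps the search moving in one direction.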
One metric. One file to modify. One clear objective. It works because the agent can’t wander (one file, one metric, no ambiguity).
This works. It’s not a toy demo. It’s autonomous ML research.
Then I Tried to Run It
The first evening I watched the agent create a branch, establish a baseline, and start iterating. It tried adjusting learning rates, experimenting with attention mechanisms, testing optimizer configurations. Some worked. Most didn’t. The agent reverted failures and pressed forward.
But as I thought about running this overnight (or for days at a time), I realized the research loop worked. Everything around it didn’t.
His five-file codebase was a proof of concept. Running it reliably (crash recovery, visibility, memory across restarts, hardware monitoring) was a whole different problem.
I didn’t start from scratch. I’d already built an orchestrator.
Symphony: The Orchestrator I Already Had
Before autoresearch existed, I’d been building on top of OpenAI’s Symphony framework. Symphony is a multi-agent orchestration system for coordinating AI agents across tasks. My version of it (this project) started as a way to point AI coding agents at Linear tickets: poll for issues, spin up isolated git workspaces, dispatch agents to work on them, handle retries when things fail, reconcile state when tickets move in Linear.
The architecture already had most of what autoresearch would need. The orchestrator had a tick-based dispatch loop with priority sorting, exponential backoff on failures, stall detection, workspace management with lifecycle hooks, and an abort/signal system for clean shutdown. The Linear integration used GraphQL to fetch candidate issues and track state transitions. Prompts were rendered through Liquid templates with issue context injected.
When I saw autoresearch, I realized I could reuse all of this. The orchestrator already knew how to spawn an agent process, monitor it, restart it on failure, and manage the workspace. I just needed a second mode.
So the codebase has two modes (configured via mode in WORKFLOW.md):
linear: The original mode. Polls Linear for issues in configured states, dispatches agents to work on them in isolated workspaces, handles multi-turn conversations, reconciles tracker state, respects concurrency limits.
autoresearch: The mode this post is about. Runs a single long-lived agent session against Karpathy’s experiment loop, with crash recovery, a dashboard, and optional knowledge persistence.
Both modes share the same orchestrator core, workspace manager, OpenCode client, config system, hardware monitor, and server infrastructure. The autoresearch-specific code (context building, results parsing, training log polling, instruction injection) is layered on top.
Having the orchestrator already built is why I could get Symphonic Autoresearch working quickly. The hard parts (process lifecycle, retry logic, config hot-reload, SSE streaming) were already solved. I just needed to wire them to a different kind of workload.
Running on the DGX Spark
Karpathy ran autoresearch on an H100 (80 GB VRAM, 989 TFLOPS bf16). I’m running it on an NVIDIA DGX Spark, which is a very different machine. The Spark is NVIDIA’s desktop AI system: a [not really] Grace Blackwell chip with 128 GB of unified CPU+GPU memory, of which 121.7 GB is available for CUDA. (I’ve actually heard a rumor that it’s failed Nvidia Shield gaming hardware.) It’s not a datacenter GPU. It sits on my desk.
The unified memory architecture means the CPU and GPU share the same physical memory pool. This is great for fitting larger models (121.7 GB is a lot), but it also means memory pressure from the OS and other processes can affect GPU allocations in ways you wouldn’t see on a dedicated H100. CUDA OOM errors happen at unexpected times, even with plenty of headroom on paper.
The Spark also has compute capability 12.0 (Blackwell), which means Flash Attention 3 works but through a different kernel path than on Hopper (H100). The training script detects this at runtime and routes to kernels-community/flash-attn3 instead of the H100-specific varunneal/flash-attention-3.
Running overnight on a desktop machine (not a managed cloud instance with auto-restart) is exactly the scenario where crash recovery, hardware monitoring, and remote visibility stop being nice-to-haves and become requirements. If the agent OOMs at 3 AM and I’m asleep, the original autoresearch just stops. On a cloud instance you might have a process supervisor or a Kubernetes restart policy. On a DGX Spark on your desk, you need to build that yourself.
First Overnight Run
A few nights later, I woke up at 3 AM and pulled out my phone. Opened http://spark-serial:8080.
The dashboard showed:
● Running 04:23:17
Steps: 342 | Experiments: 47 | Crashes: 3 (recovered)
Best val_bpb: 1.1814
Three crashes had happened while I slept. The system recovered from all of them. Experiments continued without intervention.
The crashes were CUDA out-of-memory errors (wild, again the Spark has 121.7 GB of VRAM). But that’s exactly the point: edge cases happen. Memory fragmentation accumulates. Sometimes the agent tries something ambitious and the GPU can’t handle it.
I went back to sleep.
In the morning, I opened my laptop and saw that 47 experiments had run overnight. The results showed a few promising techniques (the agent had searched for recent papers on attention mechanisms, tried approaches, and converged on improvements).
What I Needed to Add
Karpathy’s autoresearch is brilliant at the research loop. But running it as a production system requires solving problems that the five-file proof of concept doesn’t address:
When the agent process dies (OOM, network hiccup, model API timeout), you need more than manual restart. You need exponential backoff, max retry limits, and context injection so the agent knows what happened.
The only interface in Karpathy’s version is a terminal session. If you want to check progress from your phone during dinner, you’re out of luck.
After crashes and restarts, the agent shouldn’t repeat its mistakes. It needs memory across sessions (techniques it already tried, papers it already read, approaches that didn’t work).
Sometimes you want to steer the research direction without stopping everything. That calls for asynchronous instruction injection.
These are all infrastructure problems, not research problems. I built Symphonic Autoresearch to solve them.
The Infrastructure Layer
The core experiment loop is identical to Karpathy’s (same train.py, same prepare.py, same 5-minute budget, same val_bpb metric). My contribution is entirely in the orchestration layer around it. Some might call it “production.”
Crash Recovery With Context
When the agent process dies, the orchestrator catches it and restarts with exponential backoff:
// Retry logic with exponential backoff
const maxBackoffMs = 300000;
const delayMs = Math.min(
  10000 * Math.pow(2, attempt - 1),
  maxBackoffMs
);
await sleep(delayMs);
But a naive restart puts the agent back at square one. Before each launch, the orchestrator dynamically builds the prompt by reading the current state of the workspace:
const dynamicPrompt = buildAutoresearchPrompt({
  programMd: programPrompt,
  workspacePath,
  crashCount,
  lastCrashError,
  knowledgeHits,
});
The injected context includes:
The full results.tsv as a markdown table
A summary of what’s been tried
A list of discarded experiments to avoid repeating
The last crash error (truncated to prevent prompt bloat)
After implementing this, crashes went from “overnight experiment lost” to “10-second hiccup, agent picks up where it left off.”
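To make the context injection concrete, here is a rough sketch of how two of those pieces (the results table and the truncated crash error) could be assembled. This is not the project's actual `buildAutoresearchPrompt`. The function names, the 500-character cap, the section headings, and the sample columns are all assumptions for illustration. Keeping the tail of the error is also my assumption; it is where a Python traceback puts the exception.

```typescript
// Hypothetical cap on crash text; the real limit is not stated in this post.
const MAX_ERROR_CHARS = 500;

// Converts tab-separated text (first row = header) into a markdown table.
// Assumes cells contain no pipe characters.
function tsvToMarkdownTable(tsv: string): string {
  const rows = tsv
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => line.split("\t"));
  if (rows.length === 0) return "";
  const [header, ...body] = rows;
  const toRow = (cells: string[]) => `| ${cells.join(" | ")} |`;
  const divider = `|${header.map(() => "---").join("|")}|`;
  return [toRow(header), divider, ...body.map(toRow)].join("\n");
}

// Keeps only the last `maxChars` characters of an error message.
function truncateError(error: string, maxChars: number = MAX_ERROR_CHARS): string {
  return error.length <= maxChars ? error : "..." + error.slice(-maxChars);
}

// Assembles the crash-recovery section of the prompt (headings are made up).
function buildCrashContext(tsv: string, crashCount: number, lastError: string): string {
  return [
    "## Results so far",
    tsvToMarkdownTable(tsv),
    "",
    `## Crashes so far: ${crashCount}`,
    truncateError(lastError),
  ].join("\n");
}

// Sample columns are illustrative, not the real results.tsv layout.
console.log(
  buildCrashContext(
    "commit\tval_bpb\tstatus\nabc123\t1.3944\tkeep\n",
    1,
    "CUDA out of memory",
  ),
);
```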
The orchestrator respects a maximum retry limit:
autoresearch:
  restart_on_crash: true
  max_crash_restarts: 20
After 20 consecutive failures, something is broken and the system exits rather than looping forever.
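Putting the backoff formula and the retry cap together, the restart behavior can be sketched as below. This is an illustration under assumptions, not the orchestrator's code: `runWithRestarts`, `backoffSchedule`, and the option names are hypothetical, and I assume the agent session signals a crash by rejecting its promise. The numbers are the ones quoted in this post (10-second base, 5-minute cap).

```typescript
// Hypothetical option names; the values mirror the numbers in this post.
interface RestartOptions {
  maxRestarts: number; // e.g. 20
  baseDelayMs: number; // e.g. 10000
  maxDelayMs: number; // e.g. 300000
  sleep: (ms: number) => Promise<void>;
}

// Delay before restart number `attempt` (1-based), doubling up to a cap.
function backoffDelayMs(attempt: number, base: number, cap: number): number {
  return Math.min(base * Math.pow(2, attempt - 1), cap);
}

// Lists the delays for the first `restarts` restarts, for inspection.
function backoffSchedule(restarts: number, base: number, cap: number): number[] {
  return Array.from({ length: restarts }, (_, i) => backoffDelayMs(i + 1, base, cap));
}

// Runs `launch` until it resolves. On rejection, waits with backoff and
// relaunches with the crash count and last error, so the caller can inject
// them into the next prompt. Throws once `maxRestarts` crashes have occurred.
async function runWithRestarts(
  launch: (crashCount: number, lastError: string | null) => Promise<void>,
  opts: RestartOptions,
): Promise<number> {
  let crashCount = 0;
  let lastError: string | null = null;
  for (;;) {
    try {
      await launch(crashCount, lastError);
      return crashCount;
    } catch (err) {
      crashCount += 1;
      lastError = err instanceof Error ? err.message : String(err);
      if (crashCount >= opts.maxRestarts) {
        throw new Error(`giving up after ${crashCount} crashes: ${lastError}`);
      }
      await opts.sleep(backoffDelayMs(crashCount, opts.baseDelayMs, opts.maxDelayMs));
    }
  }
}

console.log(backoffSchedule(6, 10000, 300000));
```

With these numbers the delay hits the 5-minute cap on the sixth restart, so a persistently broken run costs a bounded amount of waiting before the retry limit ends it.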
The Dashboard You Can Check From Anywhere
The orchestrator runs an HTTP server with Server-Sent Events for real-time streaming. Open http://your-server:8080 from any browser (phone, tablet, laptop across the network) and see:
A canvas-based scatter plot showing every experiment (green dots for kept experiments, grey for discarded, red X for crashes, a golden line tracking running-best val_bpb, and a dashed goal line at your target metric).
Live training metrics polled from run.log every 2 seconds:
step 00953 (100.0%) | loss: 0.9979 | tok/s: 1,847K | MFU: 42.3%
VRAM: 44.2 GB | ETA: 12s
GPU utilization, temperature, power draw, and system RAM (gruvbox-dark-hard theme because I’m not a heathen).
Every tool call and text output from the agent, with expandable raw JSON for debugging.
A results table with commit hash, val_bpb, final training loss, VRAM usage, status, and description for every experiment.
After that, I stopped SSH-ing into the machine. I’d glance at my phone, see “Running 06:42:17, 71 experiments, best val_bpb: 0.9831”, and go back to whatever I was doing.
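For a sense of what sits between run.log and the browser, here is a small sketch that parses one progress line in the format shown above and wraps it in a Server-Sent Events frame. The regex, the `TrainingProgress` shape, and the `training` event name are my assumptions for illustration; only the `event:`/`data:` framing with a blank-line terminator comes from the SSE format itself.

```typescript
interface TrainingProgress {
  step: number;
  percent: number;
  loss: number;
}

// Matches lines shaped like the sample above; returns null for anything else.
function parseProgressLine(line: string): TrainingProgress | null {
  const m = /step\s+(\d+)\s+\(([\d.]+)%\)\s*\|\s*loss:\s*([\d.]+)/.exec(line);
  if (!m) return null;
  return {
    step: parseInt(m[1], 10),
    percent: parseFloat(m[2]),
    loss: parseFloat(m[3]),
  };
}

// Serializes one SSE frame. JSON.stringify without an indent argument never
// emits raw newlines, so a single `data:` line is enough.
function toSseFrame(event: string, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

const progress = parseProgressLine(
  "step 00953 (100.0%) | loss: 0.9979 | tok/s: 1,847K | MFU: 42.3%",
);
const frame = progress ? toSseFrame("training", progress) : "";
console.log(frame);
```

On the browser side, an `EventSource` listener for the same event name would receive the JSON payload in `event.data`.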
Knowledge That Survives Restarts
The agent has access to web search through SearXNG. When it searches for “latest advances in transformer pretraining” and finds useful techniques, a search interceptor captures the results and stores them in USearch (a lightweight vector database that runs locally).
Here is a simplified version of the concept (the real implementation handles embedding client injection, text truncation, and BigUint64Array key management):
class KnowledgeStore {
  // Assigned in initialize(), hence the definite-assignment assertion
  private index!: USearchIndex;
  private documents: Map<string, Document> = new Map();

  async initialize(dimensions: number = 1536): Promise<void> {
    this.index = new USearchIndex({
      metric: 'cos',
      dimensions,
      connectivity: 16,
    });
    // Load persisted state if it exists
    await this.load();
  }

  // Embed the text, index the vector, and keep the raw document alongside it
  async addDocument(text: string, source: string): Promise<string> {
    const id = generateId();
    const embedding = await this.getEmbedding(text);
    this.index.add(id, embedding);
    this.documents.set(id, { id, text, source, timestamp: Date.now() });
    return id;
  }

  // Embed the query and return the topK nearest documents with their scores
  async search(query: string, topK: number = 5): Promise<SearchResult[]> {
    const queryEmbedding = await this.getEmbedding(query);
    const results = this.index.search(queryEmbedding, topK);
    return results.map(([id, score]) => ({
      document: this.documents.get(id),
      score,
    }));
  }
}
On restart, the orchestrator queries this knowledge store and injects relevant research notes into the prompt:
## Research Notes from Previous Sessions
- Differential attention (Microsoft, 2024): Compute attention as difference
  of two softmax attention maps, reducing noise in attention heads...
- Schedule-Free optimizer (Meta): Eliminates need for learning rate
  schedule by maintaining running averages...
This feature is disabled by default. Enabling it requires:
autoresearch:
  knowledge_enabled: true
  embedding_endpoint: http://your-embedding-server/v1/embeddings
  embedding_model: your-model-name
You need to run an embedding server (not plug-and-play). But if you have the infrastructure, the agent builds institutional memory across sessions. It doesn’t re-read papers it already processed.
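If the vector index feels like a black box, the lookup it performs on restart is conceptually just "find the stored notes whose embeddings point in the same direction as the query." Below is a brute-force sketch of that idea with cosine similarity. It is an in-memory stand-in for illustration, not the USearch API (USearch does an approximate search over an HNSW graph instead of scanning everything), and `Note`, `topK`, and `renderNotes` are hypothetical names.

```typescript
// Hypothetical in-memory stand-in for the vector index.
interface Note {
  text: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact top-k search by scoring every note, highest similarity first.
function topK(query: number[], notes: Note[], k: number): Note[] {
  return notes
    .map((note) => ({ note, score: cosineSimilarity(query, note.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((hit) => hit.note);
}

// Renders hits under the same heading the injected prompt uses.
function renderNotes(hits: Note[]): string {
  return [
    "## Research Notes from Previous Sessions",
    ...hits.map((h) => `- ${h.text}`),
  ].join("\n");
}

// Toy 2-dimensional embeddings; real ones come from the embedding server.
const sampleNotes: Note[] = [
  { text: "note about attention", embedding: [1, 0] },
  { text: "note about tokenizers", embedding: [0, 1] },
  { text: "note about attention variants", embedding: [0.9, 0.1] },
];
console.log(renderNotes(topK([1, 0], sampleNotes, 2)));
```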
Human-in-the-Loop Without Breaking the Loop
The dashboard has a text input at the bottom. Type an instruction (“try cosine annealing for the learning rate schedule”, “focus on attention mechanism changes”), hit send, and the instruction gets queued.
The orchestrator waits for a natural pause (when the agent finishes a reasoning step or an experiment completes), then writes the instruction to a file in the workspace. The agent’s program.md tells it to check for this file between experiments. It reads the instruction, incorporates it into its plan, deletes the file, and continues the loop.
You can guide research direction without ever stopping the agent. Check the dashboard from your phone at dinner, notice the agent is stuck on optimizer tweaks, type “try replacing the attention mechanism with linear attention”, and go back to your conversation.
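The handoff described above is a small file-based mailbox. Here is a minimal sketch of both sides, assuming Node's built-in `fs` module. The file name `HUMAN_INSTRUCTION.md`, the `InstructionQueue` class, and `consumeInstruction` are hypothetical; in the real system the "consume" side is the agent itself following program.md, not a function call.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical file name; this post does not say what the real one is called.
const INSTRUCTION_FILE = "HUMAN_INSTRUCTION.md";

class InstructionQueue {
  private pending: string[] = [];

  // Called when the dashboard form is submitted.
  enqueue(text: string): void {
    this.pending.push(text.trim());
  }

  // Called at a natural pause. Appends all queued instructions to the
  // workspace file and clears the queue. Returns false if nothing was queued.
  flush(workspacePath: string): boolean {
    if (this.pending.length === 0) return false;
    const target = path.join(workspacePath, INSTRUCTION_FILE);
    fs.appendFileSync(target, this.pending.join("\n") + "\n");
    this.pending = [];
    return true;
  }
}

// Agent side of the handshake: read the file if present, then delete it.
function consumeInstruction(workspacePath: string): string | null {
  const target = path.join(workspacePath, INSTRUCTION_FILE);
  if (!fs.existsSync(target)) return null;
  const text = fs.readFileSync(target, "utf8").trim();
  fs.unlinkSync(target);
  return text;
}

// Round trip in a throwaway directory.
const workspace = fs.mkdtempSync(path.join(os.tmpdir(), "autoresearch-"));
const queue = new InstructionQueue();
queue.enqueue("try cosine annealing for the learning rate schedule");
queue.flush(workspace);
console.log(consumeInstruction(workspace));
```

Appending instead of overwriting means a second instruction sent before the agent reads the first does not clobber it. Deleting the file on read is what lets the orchestrator tell "delivered" from "still waiting."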
Architecture
Here’s how it’s organized:
symphonic-autoresearch/
├── src/
│ ├── orchestrator/
│ ├── agent/
│ ├── server/
│ ├── workspace/
│ ├── monitor/
│ ├── knowledge/
│ ├── config/
│ ├── prompt/ # Liquid template rendering
│ ├── logging/
│ ├── tracker/
│ ├── utils/
│ └── types/
├── autoresearch/
│ ├── prepare.py # Same as Karpathy’s (preserved)
│ ├── train.py # Agent modifies this
│ └── program.md # Enhanced agent instructions
├── WORKFLOW.md # YAML configuration + mode selection
└── docker-compose.yml # One-command startup
Wild how the LLM ASCII drawings break on Substack.
Configuration via WORKFLOW.md
All configuration lives in one file with YAML frontmatter:
---
mode: autoresearch
workspace:
  root: ~/workspaces
hooks:
  after_create: |
    git init .
    uv sync
  timeout_ms: 120000
agent:
  max_concurrent_agents: 1
  max_turns: 100
  stall_timeout_ms: 0
opencode:
  command: opencode
  model: your-model-here
  run_timeout_ms: 0
autoresearch:
  program_md: ./autoresearch/program.md
  prepare_py: ./autoresearch/prepare.py
  train_py: ./autoresearch/train.py
  restart_on_crash: true
  max_crash_restarts: 20
  knowledge_enabled: false
  searxng_endpoint: http://your-searxng-instance
server:
  port: 8080
---
Change opencode.model and rebuild to use different LLMs. The orchestrator doesn’t care whether the agent is Claude, GPT, or a local model served by LM Studio (it talks to any model through OpenCode).
The hooks system lets you run shell scripts at key lifecycle points: after workspace creation, before/after experiments, before cleanup.
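Reading a file like this starts with separating the YAML front matter from whatever follows it. A minimal sketch of that first step is below; `splitFrontMatter` and `WorkflowFile` are hypothetical names, and the raw YAML string would still need to go through a real YAML parser, which I have left out.

```typescript
interface WorkflowFile {
  frontMatter: string; // raw YAML, to be handed to a YAML parser
  body: string; // everything after the closing `---`
}

// Splits a document that starts with a `---` delimited YAML block.
// Returns null if the file does not begin with front matter or the
// closing delimiter is missing.
function splitFrontMatter(source: string): WorkflowFile | null {
  const lines = source.split(/\r?\n/);
  if (lines[0].trim() !== "---") return null;
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) return null;
  return {
    frontMatter: lines.slice(1, end).join("\n"),
    body: lines.slice(end + 1).join("\n").trim(),
  };
}

const example = splitFrontMatter(
  "---\nmode: autoresearch\nserver:\n  port: 8080\n---\nAnything after the config\n",
);
console.log(example);
```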
What I Kept From Karpathy’s Original
The training code (Karpathy’s GPT implementation, Muon+AdamW optimizer, RoPE embeddings, flash attention). All his work.
The experiment loop (modify, train 5 minutes, evaluate, keep or revert). His design.
The metric (val_bpb, validation bits per byte). His choice, and it’s the right one.
The single-file constraint (the agent only modifies train.py).
The immutable evaluator (prepare.py can’t be modified). This prevents the agent from gaming the metric.
The “never stop” philosophy. No asking permission. No pausing for human approval.
These design decisions are what make autoresearch work. I kept all of them.
The Deployment Story
Karpathy’s setup: uv sync && uv run prepare.py, then start your AI agent manually.
Symphonic Autoresearch:
cp example.WORKFLOW.md WORKFLOW.md
# Edit WORKFLOW.md with your model and preferences
docker compose up --build
The Docker container includes the NVIDIA PyTorch runtime, Node.js, all Python dependencies, and the compiled orchestrator. It mounts your GPU, your model configuration, and a persistent volume for workspaces. Health checks keep it alive. Everything survives a docker restart.
Live Results
The system is running right now on my DGX Spark. Here’s what the current results.tsv looks like (as of this writing):
| | Karpathy’s Run (H100) | My Run (DGX Spark) |
|---|---|---|
| Experiments | 126 | 52 |
| Kept improvements | 23 | 22 |
| Crashes | not reported | 1 (recovered) |
| Baseline val_bpb | 0.9979 | 1.3944 |
| Best val_bpb | 0.9697 | 1.1818 |
| Improvement | 2.8% | 15.3% |
The baselines aren’t directly comparable. Karpathy started from code he’d already hand-tuned for months on an H100 with 989 TFLOPS bf16. My baseline was a cold start on the Spark (with SDPA attention, before the agent switched to Flash Attention 3), and the Spark’s lower FLOPS means training covers less ground in the same 5-minute window. A higher starting val_bpb leaves more room for improvement, which is why the percentage drop looks bigger.
What’s more interesting than the raw numbers is what the agent found. Some highlights from the results log:
Halving TOTAL_BATCH_SIZE from 524k to 262k dropped val_bpb from 1.3163 to 1.2040 (an 8.5% improvement in a single experiment). The smaller batch size lets the Spark’s memory system work more efficiently and fits more training steps into 5 minutes. This also appeared in the original.
Tuning WARMDOWN_RATIO from the default 0.5 down to 0.1 (across three experiments) pulled val_bpb from 1.3542 to 1.3206.
The one crash was the agent trying to double DEVICE_BATCH_SIZE from 128 to 256, which OOMed. The orchestrator caught it, restarted with context about the crash, and the agent moved on to other experiments.
The agent is still running. These numbers will be different by the time you read this.
What Still Needs Work
Multi-GPU Support
Both projects are single-GPU only. Scaling to multi-GPU would require distributed data parallel or tensor parallel training, gradient synchronization across nodes, and more complex experiment tracking. This is on my roadmap but not trivial.
Checkpointing
Neither system saves model checkpoints during experiments. If the best run was 50 experiments ago, you’d need to replay git history. I’m planning automatic checkpointing of top-k models.
Knowledge Store Setup Complexity
The vector database feature requires running an embedding server (it’s only plug-and-play if you know what you’re doing). The default configuration has knowledge_enabled: false. The feature raises the ceiling for users who already have the infrastructure, but it adds setup burden for everyone else.
Automated Paper Generation
The agent discovers techniques but doesn’t write them up. A future extension could generate LaTeX summaries of successful experiments (a “research paper” produced automatically from the experiment log).
Where This Goes
Karpathy called autoresearch “programming the research org in Markdown.” The human writes research direction in natural language. The agent translates that into code changes. The GPU evaluates those changes against reality. Infrastructure ensures this loop runs reliably, observably, and continuously.
Karpathy proved the loop works. His overnight run found real improvements on hand-tuned code. Shopify’s CEO proved it generalizes beyond ML (53% faster rendering from autonomous commits).
My contribution is narrower: I proved the loop can be *operationalized*. That it can crash at 3 AM and pick itself back up. That you can watch it from your phone. That knowledge persists across sessions. That a human can steer without stopping the engine.
Karpathy’s vision for what comes next (“asynchronously massively collaborative agents, SETI@home style”) requires this kind of infrastructure. You can’t coordinate a swarm of autonomous researchers if each one dies silently when it hits an error. You can’t learn from a thousand experiments if knowledge doesn’t persist. You can’t debug a distributed system if the only interface is a terminal.
The five-file proof of concept needs to become a production system before it can become a research community. That’s where I’m headed (multi-GPU support, automatic checkpointing, and eventually, coordination across multiple agents running in parallel).
Karpathy’s autoresearch is at github.com/karpathy/autoresearch. Symphonic Autoresearch is the project in which this blog post resides.