The Machine That Does Its Own Research


Lando Calrissian

What if the most important thing a researcher did was go to sleep?

That’s not a riddle. It’s the proposition at the heart of autoresearch, a project Andrej Karpathy released on March 8, 2026. It sent the AI community into a quiet, slightly stunned silence followed by a very loud roar. Within 48 hours, his announcement had 8.6 million views. By the following week, people were running it overnight on their own machines, waking up to models that were measurably better than the ones they’d left behind.

630 lines of Python. That’s all it took to start automating the scientific method.


Who Karpathy Is, and Why It Matters That He Built This

An idea doesn’t land the same way regardless of who says it. Karpathy carries weight.

Born in Bratislava, raised in Toronto, PhD from Stanford under Fei-Fei Li: he was one of the founding members of OpenAI, then spent five years as Director of AI and Autopilot Vision at Tesla, reporting directly to Elon Musk. He returned to OpenAI in 2023, left again in 2024, and founded Eureka Labs, an AI-native education platform, that same year. He has 1.9 million followers on X, and when he posts, the field pays attention.

He coined the term vibe coding. He built Stanford’s CS231n into one of the most-watched machine learning courses on the internet. He’s earned the right to be taken seriously when he says something matters.

And this, he says, matters a great deal.

“All LLM frontier labs will do this. It’s the final boss battle.”


The Idea Is Simpler Than You Think

Here’s the one-liner from the autoresearch README:

Give an AI agent a training script, a GPU, and a metric. Walk away. Wake up to a better model.

The setup: you have a small GPT-style language model. You want it to train faster, generalise better, learn more efficiently. You have a GPU. You have a night’s sleep.

The agent takes over. It reads the training code, forms a hypothesis about what might improve it, modifies the script, runs a five-minute training pass, measures the result, and decides: keep it or throw it away. Then it does it again. And again. And again.

100 experiments while you sleep. Not 2 or 3 — 100.
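
The keep-or-revert loop described above can be sketched in a few lines of Python. This is illustrative only: `propose_change` and `run_training` are stand-ins for the LLM edit step and the five-minute training pass, not autoresearch’s actual functions.

```python
import random

def propose_change(script: str) -> str:
    """Stand-in for the LLM step: return a modified training script."""
    return script + f"\n# hypothesis {random.randint(0, 9999)}\n"

def run_training(script: str) -> float:
    """Stand-in for a five-minute training pass; returns a val_bpb score."""
    return 1.0 - random.random() * 0.05

def overnight(script: str, budget: int = 100) -> tuple[str, float]:
    """Keep-or-revert loop: accept an edit only if the metric improves."""
    best_bpb = run_training(script)
    for _ in range(budget):
        candidate = propose_change(script)
        bpb = run_training(candidate)
        if bpb < best_bpb:  # lower bits-per-byte is better
            script, best_bpb = candidate, bpb
        # otherwise the candidate is discarded and the loop tries again
    return script, best_bpb
```

The entire design fits this shape: one mutable artifact, one scalar metric, one accept-or-discard decision per iteration.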

The metric it optimises for is val_bpb — validation bits-per-byte. It’s vocabulary-size-independent, which means you can compare architectural changes fairly. It’s also hard to game: the agent can’t memorise its way to a better score. It actually has to find something that generalises.
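
Why bits-per-byte is vocabulary-independent is easiest to see in the conversion itself: training produces cross-entropy in nats, and dividing by raw bytes of text rather than by tokens removes the tokenizer from the denominator. A minimal sketch of the standard conversion (not necessarily autoresearch’s exact code):

```python
import math

def val_bpb(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a validation set into
    bits per byte. Dividing by bytes instead of tokens makes the score
    comparable across models with different tokenizers and vocab sizes."""
    return total_nats / (math.log(2) * total_bytes)
```

A model averaging ln(2) ≈ 0.693 nats per byte scores exactly 1.0 bits per byte, close to the run numbers reported below.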

The results were not subtle. Overnight run 1: 126 experiments, with the validation metric dropping from 0.9979 to 0.9697. Over two days and roughly 700 experiments, Karpathy’s model accumulated 20 additive improvements, an 11% speed-up on a benchmark he’d been hand-tuning for years. The agent caught things he’d missed.


Three Files. One Rule.

The genius of the design is its constraints.

There are three files:

  • prepare.py — fixed. Don’t touch it. This handles constants, data prep, runtime utilities.
  • train.py — the full GPT model, optimizer, and training loop. The agent and only the agent may edit this.
  • program.md — written in plain English by a human. This is your research brief to the agent.

The agent can only modify train.py. Every experiment is exactly five minutes. Every improvement is logged, versioned, and reviewable by a human the next morning. It’s not a black box — it’s a searchable, auditable history of 100 overnight decisions.

That five-minute budget matters more than it seems. It makes every experiment directly comparable. No experiment can balloon into a three-hour job and corrupt the process. The constraint is the feature.


The Karpathy Loop

Janakiram MSV, principal analyst at Janakiram & Associates, named the underlying pattern “The Karpathy Loop”. His formulation: any domain becomes autoresearchable when it meets three conditions.

  1. An automated experiment the agent can run without human intervention
  2. A measurable, objective metric — not vibes, not committee consensus, actual numbers
  3. Version control to cleanly revert failed experiments

Read that list slowly. Now think about how many domains it applies to.

Marketing: modify a landing page, measure conversion, keep or revert. Drug discovery: modify a molecular structure, run a binding simulation, measure affinity. Financial modelling: modify a trading strategy parameter, run a backtest, measure PnL. Compiler optimisation: modify a code path, measure execution time on target hardware, keep or revert.

Karpathy said it himself:

“Any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm. It’s worth thinking about whether your problem falls into this bucket too.”


What the Community Did With It

The post went viral. Within days, people were running it on their own problems.

Hyperspace AI took the single-agent loop and distributed it: one overnight run, 35 autonomous agents running simultaneously over a peer-to-peer network, 333 experiments, completely unsupervised. Here’s what made it remarkable. The H100s attacked the problem with brute force: aggressive learning rates, sheer scale. The CPU-only laptop agents found the clever tricks: better initialisation schemes, smarter normalisation choices.

They shared discoveries in real time using the GossipSub protocol. When one agent found that Kaiming weight initialisation dropped loss by 21%, the discovery spread to 23 other agents within hours. In 17 hours of runtime, the distributed system independently rediscovered ML milestones (RMSNorm, tied embeddings) that took human researchers at Google Brain and OpenAI roughly eight years to formalise.

Shopify CEO Tobi Lütke ran it overnight on internal data. 37 experiments. 19% performance gain.

The community immediately ported it to smaller hardware: macOS/MPS, macOS MLX, Windows with RTX cards, AMD GPUs. The H100 is no longer a requirement.


The Human Role Has Changed

This is where it gets philosophically interesting.

The old model: a human researcher spends years developing intuition, runs 2-3 meaningful experiments per day, publishes a paper, gets cited, refines the work. Talent was the bottleneck. The best researchers were the ones who could hold the most context in their heads and make the best guesses.

The new model: the bottleneck is no longer talent. It’s iteration speed. And it’s no longer the number of experiments you can run — it’s whether you can define the right metric and write a good program.md.

Karpathy:

“The goal is not to emulate a single PhD student, it’s to emulate a research community of them.”

“Humans (optionally) contribute on the edges.”

That word — optionally — deserves a beat of silence.

What’s left for humans to do? Define the question. Choose the metric. Design the experimental environment. Review the overnight results and decide what to pursue next. The work has shifted from doing experiments to designing the conditions under which experiments can happen autonomously.

That is a different skill set. And it’s one that’s not yet widely taught.


The Risks Are Real

No honest coverage of autoresearch can omit the concerns.

Over-optimisation. Run the loop long enough on the same validation set and you risk tuning for the test data’s quirks rather than real-world generalisation. The agent optimises what it can measure. If your metric is slightly wrong, you get a slightly wrong model — 100 times over.
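
The selection effect is easy to demonstrate with toy numbers (all hypothetical): give the loop many candidate changes whose true effect is zero, measure each with a noisy validation score, and keep the best. The winner looks like progress until it is re-measured on fresh data.

```python
import random

def noisy_eval(true_effect: float, rng: random.Random,
               noise: float = 0.01) -> float:
    """A validation measurement: the true effect plus measurement noise."""
    return true_effect + rng.gauss(0.0, noise)

def selection_gap(n_experiments: int = 100,
                  seed: int = 0) -> tuple[float, float]:
    """Every candidate change has true effect 0, yet selecting the best
    apparent val score still shows an 'improvement'. Re-measuring the
    winner on fresh noise reveals the gain was a selection artifact."""
    rng = random.Random(seed)
    scores = [noisy_eval(0.0, rng) for _ in range(n_experiments)]
    best = min(range(n_experiments), key=lambda i: scores[i])  # lower = better
    apparent_gain = -scores[best]       # looks like real progress
    fresh_gain = -noisy_eval(0.0, rng)  # winner re-measured on new data
    return apparent_gain, fresh_gain
```

With 100 experiments the apparent gain sits a couple of standard deviations out purely by chance, while the re-measured gain hovers around zero; held-out test data, not the selection set, is what keeps an overnight loop honest.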

No persistent memory. Each agent run starts fresh. There’s no accumulated learning across overnight sessions. The community has flagged this; it’s a known limitation and an open research problem.

Who owns the discovery? In a distributed multi-agent loop, when 35 agents independently converge on RMSNorm, who publishes the paper? Whose IP is it? The legal infrastructure hasn’t caught up.

The recursive direction is clear. Fortune noted it explicitly: autoresearch sits uncomfortably close to the hard takeoff scenario — AI improving AI in a loop, with humans progressively less in the picture. Karpathy’s version is tightly contained. The valley between “contained” and “not contained” has an unknown width.

These aren’t reasons to dismiss autoresearch. They’re reasons to engage with it carefully, which is exactly what Karpathy invited when he released it under an MIT licence and encouraged the community to stress-test it.


What It Means

The scientific method — observe, hypothesise, experiment, measure, iterate — has been the engine of human progress for four centuries. We built institutions, careers, and entire disciplines around the assumption that humans are the ones running the loop.

Karpathy’s 630 lines of Python are the beginning of something that doesn’t require that assumption anymore.

The question isn’t whether this is coming. The overnight run results and the community response already answered that. The question is: what kind of problem do you have, and does it meet the three conditions of the Karpathy Loop?

If it does, you now have a tool that turns a night’s sleep into a research team.


Research by Mara Jade. Published on miguel.ms.