Towards a Science of Software Engineering
Software engineering has never been a rigorous discipline. Most practices (how teams review code, how they decompose problems, how they structure projects) spread by convention, not evidence. Developers adopt patterns because a respected company blogged about them or because a framework makes them convenient. The field runs on strong opinions and weak data.
There is a subfield, empirical software engineering, that has tried to change this. Its goal is straightforward: treat software development as something you can observe, measure, and experiment on. The methods exist — controlled experiments, repository mining, developer surveys — but the field has always been bottlenecked by its subjects. Human developers are expensive, slow, inconsistent, and difficult to observe without changing their behavior. Most studies end up small, noisy, and hard to replicate.
I believe the science of software is now a viable research direction, and we should pursue it.
What this research would unlock
A scientific approach to software engineering would give us three things we don’t currently have.
It would help us discover how to improve AI coding results. Right now, improving agent performance is mostly trial and error — swap the prompt, try a different model, hope for the best. Controlled experiments would replace guesswork with evidence about which architectures, tools, and processes actually produce better software.
It would help us discover current limitations of the models and illuminate future directions of model development. When an agent fails at a task, the failure trace contains signals about what the model can and cannot do. Analyzed across thousands of runs, these patterns can show where model development should focus next.
And it would let us finally answer the foundational questions of software engineering with empirical rigor. When does testing pay off? What’s the most effective way to decompose a large system? Which architectural patterns hold up under change? These questions have been debated for decades without resolution. With agents as experimental subjects, we can test them at a scale and level of control that was never possible with human developers.
What makes this possible now
Agents are good enough at coding to be useful experimental subjects. They can complete real tasks — bug fixes, feature builds, refactors — across real codebases.
Because of this, we can run controlled experiments more easily and cheaply with agents than with humans. An agent can attempt the same task thousands of times under controlled variation. Its behavior is consistent enough to analyze statistically. Every decision, tool call, and failure is logged automatically. A study that once required recruiting 30 developers over six months can now be run as 10,000 trials in a day.
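To make "consistent enough to analyze statistically" concrete, here is a minimal sketch of how two experimental conditions might be compared on task success rate. Everything in it is an assumption made for illustration: the names `ConditionResult`, `run_condition`, and `two_proportion_z_test` are hypothetical, the trials are simulated with a seeded random number generator rather than real agent runs, and the success probabilities are arbitrary placeholders, not measurements. A two-proportion z-test is just one reasonable choice of analysis.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ConditionResult:
    """Outcome counts for one experimental condition (hypothetical structure)."""
    name: str
    successes: int
    trials: int

    @property
    def rate(self) -> float:
        return self.successes / self.trials


def run_condition(name: str, success_prob: float, trials: int,
                  rng: random.Random) -> ConditionResult:
    """Stand-in for real agent runs: each simulated trial succeeds
    independently with probability `success_prob` (a placeholder value)."""
    successes = sum(1 for _ in range(trials) if rng.random() < success_prob)
    return ConditionResult(name, successes, trials)


def two_proportion_z_test(a: ConditionResult, b: ConditionResult):
    """Return (z, two-sided p-value) under the null hypothesis that
    both conditions share a single underlying success rate."""
    pooled = (a.successes + b.successes) / (a.trials + b.trials)
    se = math.sqrt(pooled * (1 - pooled) * (1 / a.trials + 1 / b.trials))
    if se == 0:
        # All trials succeeded or all failed: no evidence of a difference.
        return 0.0, 1.0
    z = (a.rate - b.rate) / se
    # Two-sided tail probability of a standard normal, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    condition_a = run_condition("condition_a", 0.50, 5000, rng)
    condition_b = run_condition("condition_b", 0.55, 5000, rng)
    z, p = two_proportion_z_test(condition_a, condition_b)
    print(f"{condition_a.name}: {condition_a.rate:.3f}")
    print(f"{condition_b.name}: {condition_b.rate:.3f}")
    print(f"z = {z:.2f}, p = {p:.4g}")
```

In a real study, `run_condition` would be replaced by actual agent runs, and the analysis would likely need to account for task-level variation rather than pooling all trials together.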
The approach
We run controlled experiments with coding agents as subjects. We build new benchmarks designed to measure the process of building software, not just the end result, so that we can evaluate and improve how coding agents build software rather than only what they build. Concretely, this means a framework that defines tasks across real codebases, runs agents against them under controlled conditions, and captures full execution traces for analysis.
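As a rough illustration of what such a framework could look like, the sketch below wires together the three pieces named above: task definitions, controlled conditions, and execution traces. It is not a description of an existing system. Every name here (`Task`, `Condition`, `Trace`, `TraceEvent`, `run_experiment`, `stub_agent`, and the `write_tests_first` setting) is hypothetical, and the agent is a stub that logs a fixed sequence of steps and returns a random outcome.

```python
import itertools
import json
import random
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    task_id: str
    repo: str          # location of the codebase (placeholder string here)
    description: str


@dataclass
class Condition:
    name: str
    settings: Dict[str, str] = field(default_factory=dict)


@dataclass
class TraceEvent:
    step: int
    kind: str          # e.g. "tool_call", "edit", "test_run"
    detail: str


@dataclass
class Trace:
    task_id: str
    condition: str
    trial: int
    events: List[TraceEvent] = field(default_factory=list)
    success: bool = False

    def log(self, kind: str, detail: str) -> None:
        """Append one event, numbered by its position in the trace."""
        self.events.append(TraceEvent(len(self.events), kind, detail))


# An agent receives the task, the condition, a trace to log into, and an RNG,
# and reports whether it considers the task solved.
Agent = Callable[[Task, Condition, Trace, random.Random], bool]


def stub_agent(task: Task, condition: Condition, trace: Trace,
               rng: random.Random) -> bool:
    """Placeholder agent: logs a few steps and returns a coin-flip outcome."""
    trace.log("tool_call", f"read files in {task.repo}")
    if condition.settings.get("write_tests_first") == "yes":
        trace.log("test_run", "wrote and ran a failing test")
    trace.log("edit", "applied candidate patch")
    return rng.random() < 0.5


def run_experiment(tasks: List[Task], conditions: List[Condition],
                   agent: Agent, trials: int, seed: int = 0) -> List[Trace]:
    """Run every task under every condition `trials` times, one trace per run."""
    traces = []
    for task, condition in itertools.product(tasks, conditions):
        for trial in range(trials):
            # Per-run seed so any single run can be reproduced in isolation.
            rng = random.Random(f"{seed}:{task.task_id}:{condition.name}:{trial}")
            trace = Trace(task.task_id, condition.name, trial)
            trace.success = agent(task, condition, trace, rng)
            traces.append(trace)
    return traces


if __name__ == "__main__":
    tasks = [Task("fix-001", "repos/example", "Fix a failing unit test")]
    conditions = [
        Condition("baseline"),
        Condition("tests_first", {"write_tests_first": "yes"}),
    ]
    for trace in run_experiment(tasks, conditions, stub_agent, trials=2):
        print(json.dumps(asdict(trace)))  # one JSON line per run
```

The design choice worth noting is that the trace, not the final pass/fail bit, is the unit of data: each run yields an ordered event log tied to a task, a condition, and a trial number, which is what lets the process be analyzed and not only the result.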
Where this leads
This research does two things at once. It improves how agents build software, and it produces the kind of rigorous, empirical knowledge about software engineering that the field has never been able to generate. Each experiment that reveals what makes agents succeed or fail also tells us something general about how software should be built. For the first time, we have the means to pursue both at scale.


