Can AI Agents Truly Understand Code?
SaotriBench measures whether LLMs and code agents can build an internal model of a project — with layered constraints — and maintain it correctly across many iterations of evolving requirements.

Beyond One-Shot Coding
The SAOTRI Framework
State
Hidden environment constraints the agent can't see upfront
Actions
Code submissions the agent writes as solutions
Observations
Structured feedback and violation metrics returned after each attempt
Transitions
Non-stationary phase rules that deliberately break previous solutions
Resilience
Robustness measurement across accumulating requirements
Invariants
Safety and correctness guarantees that must hold throughout
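The six SAOTRI components can be read as a partially observable decision process. A minimal sketch of that framing, assuming a rule-based evaluator (all class and method names here are hypothetical, not the benchmark's actual API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    rule_id: str
    check: Callable[[str], bool]   # an Invariant: must hold for every submission

@dataclass
class Observation:
    status: str          # e.g. "valid" or "partially_valid"
    violations: List[str]
    coverage: float      # fraction of active rules satisfied

class SaotriEnv:
    """Hypothetical environment sketch: hidden State, non-stationary Transitions."""
    def __init__(self, phases: List[List[Rule]]):
        self.phases = phases   # hidden State: constraints revealed only via feedback
        self.phase = 0

    def step(self, solution_code: str) -> Observation:
        """Evaluate an Action (a code submission) against every rule so far."""
        # Rules accumulate across phases, so old constraints keep applying (Resilience)
        rules = [r for phase in self.phases[: self.phase + 1] for r in phase]
        violations = [r.rule_id for r in rules if not r.check(solution_code)]
        if not violations and self.phase < len(self.phases) - 1:
            self.phase += 1    # Transition: the next phase may break this solution
        coverage = (len(rules) - len(violations)) / len(rules)
        return Observation("valid" if not violations else "partially_valid",
                           violations, coverage)
```

The key design point is in `step`: passing the current phase advances the environment, so a solution that was just marked valid can immediately start violating newly activated rules.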
"Most benchmarks test if an AI can solve a coding problem once. SaotriBench tests something harder: can it discover hidden rules through feedback, retain context across dozens of iterations, and systematically refine solutions as requirements evolve — without forgetting what it already learned?"
Hidden requirement discovery
Inferring undisclosed constraints from structured feedback
Long-context retention
Maintaining state and hypotheses across many iterations
Iterative refinement
Systematically improving solutions based on violation signals
How It Works
Receive
Agent gets problem.md (minimal spec)
Solve
Agent writes solution.py
Feedback
Evaluator returns feedback.json
Transition
New phase — previous solution broken
Adapt
Agent adapts while retaining all rules
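The five steps above can be sketched as a harness loop. This is an illustrative outline only, with the agent and evaluator passed in as callables (the real benchmark exchanges problem.md, solution.py, and feedback.json on disk):

```python
def run_task(spec: str, agent, evaluator, max_attempts: int = 10) -> list:
    """Receive -> Solve -> Feedback -> Transition -> Adapt, until valid or budget spent."""
    history = []                              # retained across phases (long-context)
    for _ in range(max_attempts):
        code = agent(spec, history)           # Solve: produce the next solution
        feedback = evaluator(code)            # Feedback: structured violation report
        history.append(feedback)              # Adapt: keep every rule learned so far
        if feedback["status"] == "valid":     # Transition happens inside the evaluator
            break
    return history
```

Keeping `history` intact is the point: an agent that discards earlier feedback re-violates rules it already discovered.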
A sample feedback.json:

```json
{
  "status": "partially_valid",
  "violations": [
    {"rule_id": "r3", "scope": "divisible_by_7"}
  ],
  "summary": {
    "passed": 5,
    "failed": 2,
    "coverage": "71.4%"
  }
}
```

The Results Are In
Results shown for task_00_fizzbuzz (Easy). Full benchmark results coming soon.
| Agent / Model | Category | Phases | Attempts | Tokens | Time |
|---|---|---|---|---|---|
| Claude Sonnet 4 🏆 | Commercial (Strong) | 3/3 | 3 | 2,783 | 10.8s |
| Llama 3.3 70B | Open Source (Medium) | 1/3 | 6 | 5,100+ | 18.2s |
| Gemma 2 9B | Open Source (Weak) | 1/3 | 6 | 4,800+ | 15.6s |
Claude Sonnet 4 completed all three phases in the minimum number of attempts and with the lowest token usage — demonstrating strong hidden-requirement discovery and context retention.
Available Tasks
| Task | Difficulty | Phases | What It Tests |
|---|---|---|---|
| FizzBuzz | Easy | 3 | Hidden divisor rules |
| Transform List | Easy | 3 | Evolving number handling |
| Merge Dicts | Easy | 4 | Type-aware conflict resolution |
| Validate Brackets | Medium | 5 | Contract changes |
| Sort Objects | Medium | 6 | Evolving key formats |
| Text Processor | Medium | 7 | Whitespace, unicode, quotes, escapes |
| Cache Eviction | Medium | 8 | Evolving eviction policies |
| Expression Parser | Medium | 9 | Evolving math grammar |
| Access Control | Medium | 10 | Evolving policy semantics |
| Schedule Optimizer | Hard | 12 | Dependencies, resources, constraints |
| Data Pipeline | Hard | 12 | Progressive transformation steps |
| Version Resolver | Hard | 15 | Package version dependencies |
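As a concrete illustration of how phases accumulate, a late-phase FizzBuzz solution might look like this. The classic 3/5 rules and the divisor-7 rule come from the task description and the sample feedback's `divisible_by_7` violation; the output word for 7 is purely illustrative, not the benchmark's actual hidden rule:

```python
def fizzbuzz(n: int) -> str:
    """Illustrative solution after several phases of accumulating divisor rules."""
    out = ""
    if n % 3 == 0:
        out += "Fizz"          # phase 1: the classic rules
    if n % 5 == 0:
        out += "Buzz"
    if n % 7 == 0:
        out += "Bazz"          # later phase: hidden divisor discovered via feedback
    return out or str(n)       # earlier rules must still hold (Resilience)
```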
What We Learned
Context is king
Models that retained state across phases dramatically outperformed those that treated each attempt independently. Long-context retention is the strongest predictor of success.
Feedback literacy matters
The ability to parse structured violation feedback and translate it into targeted code fixes separated strong agents from weak ones. Raw coding ability alone isn't enough.
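Turning the structured feedback into working hypotheses might look like the sketch below. The input format follows the sample feedback.json shown earlier; the hypothesis wording is illustrative:

```python
import json

def violations_to_hypotheses(feedback_json: str) -> list:
    """Map each violation entry to a human-readable constraint hypothesis."""
    feedback = json.loads(feedback_json)
    hypotheses = []
    for v in feedback.get("violations", []):
        # Each scope names the hidden rule the last attempt broke
        hypotheses.append(f"rule {v['rule_id']}: handle inputs where {v['scope']}")
    return hypotheses
```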
Phase transitions are the real test
Most models handle the initial phase well. The critical differentiator is how gracefully they adapt when new requirements deliberately invalidate previous solutions.
Token efficiency correlates with understanding
Models that truly "got it" used fewer tokens. Brute-force approaches consumed 2–3x more tokens while achieving worse results — suggesting real comprehension vs. trial-and-error.
What's Coming
More Agents
Benchmarking Claude Code, Cursor, Windsurf, Devin, OpenAI Codex, and other leading agents
Multi-File Tasks
Expanding beyond single-file solutions to real-world multi-file project structures
Interactive Leaderboard
Live, community-driven leaderboard where anyone can submit agent results
Custom Task SDK
Tools for the community to create and contribute their own benchmark tasks
Research Paper
Formal publication of methodology and comprehensive findings
