Can AI Agents Truly Understand Code?
SaotriBench measures whether LLMs and code agents can build an internal model of a project — with layered constraints — and maintain it correctly across many iterations of evolving requirements.

Beyond One-Shot Coding
The SAOTRI Framework
State
Hidden environment constraints the agent can't see upfront
Actions
Code submissions the agent writes as solutions
Observations
Structured feedback and violation metrics returned after each attempt
Transitions
Non-stationary phase rules that deliberately break previous solutions
Resilience
Robustness measurement across accumulating requirements
Invariants
Safety and correctness guarantees that must hold throughout
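The six SAOTRI components can be read as a partially observable decision process. A minimal sketch of that framing, assuming a rule-based evaluator (all class and method names here are hypothetical, not the benchmark's actual API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    rule_id: str
    check: Callable[[str], bool]   # an Invariant: must hold for every submission

@dataclass
class Observation:
    status: str          # e.g. "valid" or "partially_valid"
    violations: List[str]
    coverage: float      # fraction of active rules satisfied

class SaotriEnv:
    """Hypothetical environment sketch: hidden State, non-stationary Transitions."""
    def __init__(self, phases: List[List[Rule]]):
        self.phases = phases   # hidden State: constraints revealed only via feedback
        self.phase = 0

    def step(self, solution_code: str) -> Observation:
        """Evaluate an Action (a code submission) against every rule so far."""
        # Rules accumulate across phases, so old constraints keep applying (Resilience)
        rules = [r for phase in self.phases[: self.phase + 1] for r in phase]
        violations = [r.rule_id for r in rules if not r.check(solution_code)]
        if not violations and self.phase < len(self.phases) - 1:
            self.phase += 1    # Transition: the next phase may break this solution
        coverage = (len(rules) - len(violations)) / len(rules)
        return Observation("valid" if not violations else "partially_valid",
                           violations, coverage)
```

The key design point is in `step`: passing the current phase advances the environment, so a solution that was just marked valid can immediately start violating newly activated rules.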
"Most benchmarks test if an AI can solve a coding problem once. SaotriBench tests something harder: can it discover hidden rules through feedback, retain context across dozens of iterations, and systematically refine solutions as requirements evolve — without forgetting what it already learned?"
Hidden requirement discovery
Inferring undisclosed constraints from structured feedback
Long-context retention
Maintaining state and hypotheses across many iterations
Iterative refinement
Systematically improving solutions based on violation signals
How It Works
Receive
Agent gets problem.md (minimal spec)
Solve
Agent writes solution.py
Feedback
Evaluator returns feedback.json
Transition
New phase — previous solution broken
Adapt
Agent adapts while retaining all rules
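The five steps above can be sketched as a harness loop. This is an illustrative outline only, with the agent and evaluator passed in as callables (the real benchmark exchanges problem.md, solution.py, and feedback.json on disk):

```python
def run_task(spec: str, agent, evaluator, max_attempts: int = 10) -> list:
    """Receive -> Solve -> Feedback -> Transition -> Adapt, until valid or budget spent."""
    history = []                              # retained across phases (long-context)
    for _ in range(max_attempts):
        code = agent(spec, history)           # Solve: produce the next solution
        feedback = evaluator(code)            # Feedback: structured violation report
        history.append(feedback)              # Adapt: keep every rule learned so far
        if feedback["status"] == "valid":     # Transition happens inside the evaluator
            break
    return history
```

Keeping `history` intact is the point: an agent that discards earlier feedback re-violates rules it already discovered.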
A sample feedback.json:

```json
{
  "status": "partially_valid",
  "violations": [
    {"rule_id": "r3", "scope": "divisible_by_7"}
  ],
  "summary": {
    "passed": 5,
    "failed": 2,
    "coverage": "71.4%"
  }
}
```

The Results Are In
Results shown for task_00_fizzbuzz (Easy). Full benchmark results coming soon.
| Agent / Model | Category | Phases | Attempts | Tokens | Time |
|---|---|---|---|---|---|
| Claude Sonnet 4 🏆 | Commercial (Strong) | 3/3 | 3 | 2,783 | 10.8s |
| Llama 3.3 70B | Open Source (Medium) | 1/3 | 6 | 5,100+ | 18.2s |
| Gemma 2 9B | Open Source (Weak) | 1/3 | 6 | 4,800+ | 15.6s |
Claude Sonnet 4 completed all three phases in the minimum number of attempts and with the lowest token usage — demonstrating strong hidden-requirement discovery and context retention.
Available Tasks
| Task | Difficulty | Phases | What It Tests |
|---|---|---|---|
| FizzBuzz | Easy | 3 | Hidden divisor rules |
| Transform List | Easy | 3 | Evolving number handling |
| Merge Dicts | Easy | 4 | Type-aware conflict resolution |
| Validate Brackets | Medium | 5 | Contract changes |
| Sort Objects | Medium | 6 | Evolving key formats |
| Text Processor | Medium | 7 | Whitespace, unicode, quotes, escapes |
| Cache Eviction | Medium | 8 | Evolving eviction policies |
| Expression Parser | Medium | 9 | Evolving math grammar |
| Access Control | Medium | 10 | Evolving policy semantics |
| Schedule Optimizer | Hard | 12 | Dependencies, resources, constraints |
| Data Pipeline | Hard | 12 | Progressive transformation steps |
| Version Resolver | Hard | 15 | Package version dependencies |
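As a concrete illustration of how phases accumulate, a late-phase FizzBuzz solution might look like this. The classic 3/5 rules and the divisor-7 rule come from the task description and the sample feedback's `divisible_by_7` violation; the output word for 7 is purely illustrative, not the benchmark's actual hidden rule:

```python
def fizzbuzz(n: int) -> str:
    """Illustrative solution after several phases of accumulating divisor rules."""
    out = ""
    if n % 3 == 0:
        out += "Fizz"          # phase 1: the classic rules
    if n % 5 == 0:
        out += "Buzz"
    if n % 7 == 0:
        out += "Bazz"          # later phase: hidden divisor discovered via feedback
    return out or str(n)       # earlier rules must still hold (Resilience)
```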
What We Learned
Context is king
Models that retained state across phases dramatically outperformed those that treated each attempt independently. Long-context retention is the strongest predictor of success.
Feedback literacy matters
The ability to parse structured violation feedback and translate it into targeted code fixes separated strong agents from weak ones. Raw coding ability alone isn't enough.
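Turning the structured feedback into working hypotheses might look like the sketch below. The input format follows the sample feedback.json shown earlier; the hypothesis wording is illustrative:

```python
import json

def violations_to_hypotheses(feedback_json: str) -> list:
    """Map each violation entry to a human-readable constraint hypothesis."""
    feedback = json.loads(feedback_json)
    hypotheses = []
    for v in feedback.get("violations", []):
        # Each scope names the hidden rule the last attempt broke
        hypotheses.append(f"rule {v['rule_id']}: handle inputs where {v['scope']}")
    return hypotheses
```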
Phase transitions are the real test
Most models handle the initial phase well. The critical differentiator is how gracefully they adapt when new requirements deliberately invalidate previous solutions.
Token efficiency correlates with understanding
Models that truly "got it" used fewer tokens. Brute-force approaches consumed 2–3x more tokens while achieving worse results — suggesting real comprehension vs. trial-and-error.
What's Coming
More Agents
Benchmarking Claude Code, Cursor, Windsurf, Devin, OpenAI Codex, and other leading agents
Multi-File Tasks
Expanding beyond single-file solutions to real-world multi-file project structures
Interactive Leaderboard
Live, community-driven leaderboard where anyone can submit agent results
Custom Task SDK
Tools for the community to create and contribute their own benchmark tasks
Research Paper
Formal publication of methodology and comprehensive findings
