Can AI Agents Truly Understand Code?

SaotriBench measures whether LLMs and code agents can build an internal model of a project — with layered constraints — and maintain it correctly across many iterations of evolving requirements.

Satori Monkey — SaotriBench mascot

Beyond One-Shot Coding

The SAOTRI Framework

S (State): Hidden environment constraints the agent can't see upfront

A (Actions): Code submissions the agent writes as solutions

O (Observations): Structured feedback and violation metrics returned after each attempt

T (Transitions): Non-stationary phase rules that deliberately break previous solutions

R (Resilience): Robustness measurement across accumulating requirements

I (Invariants): Safety and correctness guarantees that must hold throughout
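The S/A/O triad above can be sketched as a toy evaluator: hidden rules play the role of unseen state, a submission is an action, and the returned dict is the observation. `HiddenRule` and `evaluate` are illustrative names for this sketch, not the real SaotriBench API; the "Whizz" token for the 7-rule is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class HiddenRule:
    """Part of the hidden state (S): the agent never sees `check` directly."""
    rule_id: str
    scope: str
    check: Callable[[Callable[[int], str]], bool]

def evaluate(submission: Callable[[int], str], rules: list[HiddenRule]) -> dict:
    """Score one action (A) against hidden state (S); return an observation (O)."""
    violations = [{"rule_id": r.rule_id, "scope": r.scope}
                  for r in rules if not r.check(submission)]
    passed = len(rules) - len(violations)
    return {
        "status": "valid" if not violations else "partially_valid",
        "violations": violations,
        "summary": {"passed": passed, "failed": len(violations)},
    }

# A submission that has not yet discovered the hypothetical 7-rule:
rules = [
    HiddenRule("r1", "divisible_by_3", lambda f: f(9) == "Fizz"),
    HiddenRule("r3", "divisible_by_7", lambda f: f(14) == "Whizz"),
]
naive = lambda n: "Fizz" if n % 3 == 0 else str(n)
obs = evaluate(naive, rules)  # violations name only r3 / divisible_by_7
```

The observation deliberately leaks only the rule's id and scope, never the rule itself; discovering what "divisible_by_7" requires is the agent's job.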

"Most benchmarks test if an AI can solve a coding problem once. SaotriBench tests something harder: can it discover hidden rules through feedback, retain context across dozens of iterations, and systematically refine solutions as requirements evolve — without forgetting what it already learned?"
1. Hidden requirement discovery: inferring undisclosed constraints from structured feedback

2. Long-context retention: maintaining state and hypotheses across many iterations

3. Iterative refinement: systematically improving solutions based on violation signals

How It Works

1. Receive: the agent gets problem.md (a minimal spec)

2. Solve: the agent writes solution.py

3. Feedback: the evaluator returns feedback.json

4. Transition: a new phase begins; the previous solution is deliberately broken

5. Adapt: the agent adapts while retaining all rules learned so far
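The five steps can be sketched as a driver loop. `run_agent`, `evaluate`, and `next_phase` are hypothetical stand-ins for this sketch; the real harness exchanges problem.md, solution.py, and feedback.json on disk rather than calling Python functions.

```python
def run_episode(phases: int, run_agent, evaluate, next_phase,
                max_attempts: int = 10) -> bool:
    """Sketch of the Receive -> Solve -> Feedback -> Transition -> Adapt loop."""
    learned: list[dict] = []                        # violations retained across phases
    for phase in range(phases):
        spec = next_phase(phase)                    # 1. Receive: this phase's problem.md
        for _ in range(max_attempts):
            solution = run_agent(spec, learned)     # 2. Solve: write solution.py
            feedback = evaluate(solution, phase)    # 3. Feedback: feedback.json contents
            learned.extend(feedback["violations"])  # 5. Adapt, retaining every rule seen
            if feedback["status"] == "valid":
                break                               # 4. Transition: move to the next phase
        else:
            return False                            # phase never passed within the budget
    return True
```

The key design point the benchmark probes: `learned` is never reset between phases, so an agent that discards earlier violations re-breaks rules it already discovered.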

feedback.json
{
  "status": "partially_valid",
  "violations": [
    {"rule_id": "r3", "scope": "divisible_by_7"}
  ],
  "summary": {
    "passed": 5,
    "failed": 2,
    "coverage": "71.4%"
  }
}
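A consumer of this format only needs the summary arithmetic: coverage is passed / (passed + failed), which for the sample above is 5/7 ≈ 71.4%. `summarize` is an illustrative helper, not part of the harness.

```python
def summarize(feedback: dict) -> str:
    """Turn a feedback.json payload into a one-line fix list."""
    s = feedback["summary"]
    coverage = 100 * s["passed"] / (s["passed"] + s["failed"])  # 5 / 7 -> 71.4%
    scopes = ", ".join(v["scope"] for v in feedback["violations"])
    return f"{coverage:.1f}% covered; next fix: {scopes or 'nothing'}"

feedback = {
    "status": "partially_valid",
    "violations": [{"rule_id": "r3", "scope": "divisible_by_7"}],
    "summary": {"passed": 5, "failed": 2},
}
line = summarize(feedback)  # "71.4% covered; next fix: divisible_by_7"
```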
Easy (3–4 phases) · Medium (5–10 phases) · Hard (12–15 phases)

The Results Are In

Results shown for task_00_fizzbuzz (Easy). Full benchmark results coming soon.

| Agent / Model | Category | Phases | Attempts | Tokens | Time |
|---|---|---|---|---|---|
| Claude Sonnet 4 🏆 | Commercial (Strong) | 3/3 | 3 | 2,783 | 10.8s |
| Llama 3.3 70B | Open Source (Medium) | 1/3 | 6 | 5,100+ | 18.2s |
| Gemma 2 9B | Open Source (Weak) | 1/3 | 6 | 4,800+ | 15.6s |

Claude Sonnet 4 completed all three phases with the fewest attempts and the lowest token usage — demonstrating strong hidden-requirement discovery and context retention.

Available Tasks

| Task | Difficulty | Phases | What It Tests |
|---|---|---|---|
| FizzBuzz | Easy | 3 | Hidden divisor rules |
| Transform List | Easy | 3 | Evolving number handling |
| Merge Dicts | Easy | 4 | Type-aware conflict resolution |
| Validate Brackets | Medium | 5 | Contract changes |
| Sort Objects | Medium | 6 | Evolving key formats |
| Text Processor | Medium | 7 | Whitespace, unicode, quotes, escapes |
| Cache Eviction | Medium | 8 | Evolving eviction policies |
| Expression Parser | Medium | 9 | Evolving math grammar |
| Access Control | Medium | 10 | Evolving policy semantics |
| Schedule Optimizer | Hard | 12 | Dependencies, resources, constraints |
| Data Pipeline | Hard | 12 | Progressive transformation steps |
| Version Resolver | Hard | 15 | Package version dependencies |
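As a concrete example of the FizzBuzz task's shape, here is what a later-phase solution might look like once a hidden divisor rule (the `divisible_by_7` scope from the sample feedback) has been discovered. The actual hidden rules and required output tokens are not published; "Whizz" is purely an assumption for illustration.

```python
def solution(n: int) -> str:
    # Phase-1 invariants must keep holding: multiples of 3 -> "Fizz", of 5 -> "Buzz".
    # Hypothetical rule learned from feedback (scope "divisible_by_7"); the real
    # required token is unknown, so "Whizz" here is illustrative only.
    out = ""
    if n % 3 == 0:
        out += "Fizz"
    if n % 5 == 0:
        out += "Buzz"
    if n % 7 == 0:
        out += "Whizz"
    return out or str(n)
```

Note that the hidden rule composes with the old ones (21 yields "FizzWhizz") rather than replacing them — exactly the retention-under-transition behavior the benchmark scores.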

What We Learned

Context is king

Models that retained state across phases dramatically outperformed those that treated each attempt independently. Long-context retention is the strongest predictor of success.

Feedback literacy matters

The ability to parse structured violation feedback and translate it into targeted code fixes separated strong agents from weak ones. Raw coding ability alone isn't enough.

Phase transitions are the real test

Most models handle the initial phase well. The critical differentiator is how gracefully they adapt when new requirements deliberately invalidate previous solutions.

Token efficiency correlates with understanding

Models that truly "got it" used fewer tokens. Brute-force approaches consumed 2–3x more tokens while achieving worse results — suggesting real comprehension vs. trial-and-error.

What's Coming

Q1 2026 · More Agents: benchmarking Claude Code, Cursor, Windsurf, Devin, OpenAI Codex, and other leading agents

Q2 2026 · Multi-File Tasks: expanding beyond single-file solutions to real-world multi-file project structures

Q2 2026 · Interactive Leaderboard: a live, community-driven leaderboard where anyone can submit agent results

Q3 2026 · Custom Task SDK: tools for the community to create and contribute their own benchmark tasks

2026 · Research Paper: formal publication of methodology and comprehensive findings

Run It Yourself

terminal
$ pip install saotri-bench
$ saotri-bench list
$ saotri-bench run --task tasks/task_00_fizzbuzz --workspace ./workspace