Scout

Agentic Coding + Building Evals

Scout runs a patch tournament for AI-written code.

Scout is an eval-backed review layer for code produced by coding agents. It looks for hallucinated APIs, spec drift, and tests that pass without proving behavior. Then competing repair agents produce patches, a deterministic scorer ranks them, and a receipt captures the handoff.

Why this pathway

The selected hackathon path is Agentic Coding + Building Evals. Generic AI code review is already crowded. Scout is narrower: it evaluates the codebase you and your AI write together, and it can show a benchmark score against planted AI-code mistakes.

Pipeline

  1. Load a GitHub repo or the deterministic demo://ai-written-code-seed fixture.
  2. Run three specialist scouts in parallel.
  3. Judge the findings, dedupe overlaps, and label verdicts.
  4. Show seeded recall only when an answer key exists; otherwise show a live review summary.
  5. Spawn Conservative, Idiomatic, and Robust repair agents for any finding.
  6. Validate patch shape, apply candidates in a temp workspace, rank eligible repairs, and export a receipt.
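Steps 2 and 3 can be sketched as a small orchestration loop: the three specialists inspect the repo concurrently, then overlapping findings are dropped. The type names and the `(file, summary)` dedupe key below are illustrative, not Scout's real internals.

```typescript
// Sketch of steps 2-3: run three specialist scouts in parallel, then
// dedupe overlapping findings. All names here are illustrative.
type ScoutName = "hallucination" | "spec-drift" | "test-theater";

interface Finding {
  scout: ScoutName;
  file: string;
  summary: string;
}

type SpecialistScout = (repoFiles: string[]) => Promise<Finding[]>;

async function runScouts(
  scouts: Record<ScoutName, SpecialistScout>,
  repoFiles: string[],
): Promise<Finding[]> {
  // Step 2: all three specialists inspect the repo concurrently.
  const results = await Promise.all(
    Object.values(scouts).map((scout) => scout(repoFiles)),
  );
  // Step 3 (first half): flatten and drop duplicate (file, summary) pairs.
  const seen = new Set<string>();
  return results.flat().filter((f) => {
    const key = `${f.file}::${f.summary}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```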

Proof boundaries

Seeded demo

The offline fixture streams deterministic output over seven planted mistakes, so Scout can report recall, critical recall, precision, and gates.

Live target repo

Real model calls against a public repo that has a known answer key, so found and missed target issues stay visible.

Arbitrary live repo

No answer key is claimed. Scout reports confirmed, likely, and speculative findings without pretending they are benchmark recall.
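The three boundaries fall out naturally as a discriminated union: the UI can only compute benchmark metrics when the mode actually carries an answer key. The field names below are hypothetical.

```typescript
// Sketch of the three proof boundaries as a discriminated union, so the
// app can only claim benchmark metrics when an answer key actually exists.
// Field names are hypothetical.
type ProofMode =
  | { kind: "seeded-demo"; answerKey: string[] }  // offline fixture, 7 planted mistakes
  | { kind: "live-target"; answerKey: string[] }  // public repo with a known key
  | { kind: "live-arbitrary" };                   // no answer key is claimed

function canReportRecall(mode: ProofMode): boolean {
  // Recall is only meaningful when measured against an answer key.
  return mode.kind !== "live-arbitrary";
}
```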

Specialist scouts

Hallucination Scout

Finds fake imports, impossible APIs, and nonexistent helpers.

Spec Drift Scout

Finds comments, README claims, and names that lie about behavior.

Test Theater Scout

Finds tests that pass without proving meaningful behavior.

Seeded benchmark

Demo mode plants seven realistic AI-code mistakes: fake package import, nonexistent helper, raw email logging despite a redaction comment, permissive bearer parsing, missing rate limiting, a toBeTruthy() test, and a telemetry test that never checks whether PII is removed.

The judge separates confirmed, likely, and speculative findings so the demo can claim measured recall without hiding noise.
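Seeded scoring reduces to set arithmetic against the answer key. Matching findings to planted mistakes by id is an assumption here; Scout's real matcher may use fuzzier criteria.

```typescript
// Sketch of seeded scoring: recall and precision against a planted answer
// key. Matching by id is an assumption.
interface SeededScore {
  recall: number;    // planted mistakes found / planted mistakes
  precision: number; // findings matching a plant / all findings
}

function scoreAgainstKey(foundIds: string[], answerKey: string[]): SeededScore {
  const key = new Set(answerKey);
  const hits = new Set(foundIds.filter((id) => key.has(id)));
  return {
    recall: answerKey.length === 0 ? 0 : hits.size / answerKey.length,
    precision: foundIds.length === 0 ? 0 : hits.size / foundIds.length,
  };
}
```

With the seven planted mistakes, finding two of them plus one false positive would score recall 2/7 and precision 2/3.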

Repair agents

Conservative

Smallest possible diff. Surgical repair only.

Idiomatic

Align with existing contracts and repo conventions.

Robust

Fix the bug and prove the contract with tests.
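The tournament between the three strategies can be sketched as filter-then-rank: candidates that fail to apply are ineligible, and a deterministic scorer orders the rest. The scoring weights below are invented for illustration; Scout's real scorer lives in its scoring route.

```typescript
// Sketch of the patch tournament: each repair strategy yields a candidate,
// a deterministic scorer ranks the eligible ones. Weights are invented.
type Strategy = "conservative" | "idiomatic" | "robust";

interface Candidate {
  strategy: Strategy;
  diff: string;
  applies: boolean;     // did the patch apply cleanly in the temp workspace?
  checksPass: boolean;  // did checks pass after applying it?
  addedTests: boolean;  // does it prove the contract with tests?
}

function rankCandidates(candidates: Candidate[]): Candidate[] {
  const score = (c: Candidate) =>
    (c.checksPass ? 2 : 0) + (c.addedTests ? 1 : 0) - c.diff.length / 10_000;
  return candidates
    .filter((c) => c.applies) // ineligible candidates never win
    .sort((a, b) => score(b) - score(a));
}
```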

Context budget

Live review keeps static scout rules at the front of each prompt and repo-specific context at the end. The app shows inspected files, estimated input tokens, stable prompt cache keys, and OpenAI usage metadata when the stream returns it, including cached input tokens.
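The ordering matters because provider prompt caches key on a stable prefix. A minimal sketch, assuming a rough 4-characters-per-token estimate and a hash-of-prefix cache key (both assumptions, not Scout's actual formulas):

```typescript
import { createHash } from "node:crypto";

// Sketch of the context-budget idea: static scout rules lead the prompt so
// the provider's prompt cache can reuse them; repo-specific context trails.
// The 4-chars-per-token estimate and the key format are assumptions.
function buildPrompt(staticRules: string, repoContext: string) {
  const prompt = `${staticRules}\n\n${repoContext}`;
  return {
    prompt,
    estimatedInputTokens: Math.ceil(prompt.length / 4),
    // The cache key depends only on the static prefix, so it stays stable
    // across repos and turns.
    cacheKey: createHash("sha256").update(staticRules).digest("hex").slice(0, 16),
  };
}
```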

Patch safety gate

Scout rejects malformed patch output before it can win. The scoring route requires a plain unified diff, applies each valid candidate in a temporary workspace, and marks failed applies, unavailable repo context, failed checks, or unsafe check commands as ineligible.

Patch checks run with a stripped environment, so API keys and repository credentials are not inherited by candidate execution. The demo also includes a deterministic malformed-patch proof button, so disqualification can be shown without faking a model failure.
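Both halves of the gate are small in sketch form: a shape check that insists on plain unified-diff markers, and a child-process spawn whose environment carries nothing but `PATH`. These are illustrative stand-ins for what `patch-executor` actually does.

```typescript
import { spawnSync } from "node:child_process";

// Sketch of the patch safety gate. Both checks are illustrative.
function looksLikeUnifiedDiff(patch: string): boolean {
  // A plain unified diff marks file sections with ---/+++ and hunks with @@.
  return /^--- /m.test(patch) && /^\+\+\+ /m.test(patch) && /^@@ /m.test(patch);
}

function runCheckStripped(command: string, args: string[], cwd: string) {
  // Candidate checks inherit a stripped environment: no API keys, no
  // repository credentials, only PATH so tools still resolve.
  return spawnSync(command, args, { cwd, env: { PATH: process.env.PATH ?? "" } });
}
```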

MCP surface

Scout also runs as an MCP server over stdio, built on the official TypeScript SDK. Coding agents can call scout_review, scout_fix, scout_score_patch, scout_handoff, and scout_eval. The server additionally exposes native resources for the seeded manifest, seeded eval, and demo handoff prompt, plus native prompts for review, patch tournament, and Codex handoff workflows.

The seeded MCP eval is offline and deterministic. Live scout_review and scout_fix use the same bounded GitHub context and configured OpenAI model path as the web app. The repeatable live smoke command is npm run scout:mcp -- --smoke-live; it requires network access and OPENAI_API_KEY.
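On the wire, MCP is JSON-RPC 2.0, so a tool invocation from a coding agent is just a tools/call request. A sketch of what such a message looks like for scout_review; the argument shape is hypothetical:

```typescript
// Sketch of an MCP tools/call request as it crosses stdio. MCP uses
// JSON-RPC 2.0; the arguments passed here are hypothetical.
function buildToolCall(id: number, tool: string, args: Record<string, unknown>): string {
  return JSON.stringify({
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: tool, arguments: args },
  });
}

// e.g. buildToolCall(1, "scout_review", { repo: "demo://ai-written-code-seed" })
```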

Deterministic versus model output

  • Model-generated: live scout text, live finding candidates, and live patch text.
  • Deterministic: schema validation, judge grouping, seeded answer-key scoring, patch shape checks, patch apply eligibility, checksums, and receipts.
  • Heuristic: risk score and patch score. They are visible ranking signals, not a production security audit.
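The deterministic side of that split is what makes receipts verifiable: the same inputs must always hash to the same checksum. A minimal sketch, with illustrative receipt fields and a sort to make field order stable:

```typescript
import { createHash } from "node:crypto";

// Sketch of a deterministic receipt: stable field order in, sha256 out, so
// identical inputs always yield an identical checksum. Fields are illustrative.
interface Receipt {
  findings: string[];
  winningPatch: string;
  checksum: string;
}

function buildReceipt(findings: string[], winningPatch: string): Receipt {
  // Sort a copy so the checksum does not depend on finding order.
  const body = JSON.stringify({ findings: [...findings].sort(), winningPatch });
  return {
    findings,
    winningPatch,
    checksum: createHash("sha256").update(body).digest("hex"),
  };
}
```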

Code map

src/lib/demo-fixtures.ts   seeded benchmark and deterministic patches
src/lib/prompts.ts         specialist scout and repair prompts
src/lib/live-runner.ts     shared live OpenAI runner for API and MCP
src/lib/judge.ts           dedupe, verdicts, eval score
src/lib/patch-executor.ts  temp-workspace patch apply and safety checks
src/app/api/review         live or seeded review stream
src/app/api/fix            live or seeded repair stream
src/lib/context-budget.ts  token estimate, cache keys, usage telemetry
src/mcp/server.ts          official SDK MCP tools, resources, prompts
src/components/scout       modular product UI