Agentic Coding + Building Evals
Scout is an eval-backed review layer for code produced by coding agents. It looks for hallucinated APIs, spec drift, and tests that pass without proving behavior. Then competing repair agents produce patches, a deterministic scorer ranks them, and a receipt captures the handoff.
The selected hackathon path is Agentic Coding + Building Evals. Generic AI code review is already crowded. Scout is narrower: it evaluates the codebase you and your AI write together, and it can show a benchmark score against planted AI-code mistakes in the demo://ai-written-code-seed fixture.
Seeded demo
Streams an offline fixture with seven planted mistakes, so Scout can report recall, critical recall, precision, and gates.
Live target repo
Real model calls against a public repo with a known answer key, so both found and missed target issues stay visible.
Arbitrary live repo
No answer key is claimed. Scout reports confirmed, likely, and speculative findings without pretending they are benchmark recall.
Hallucination Scout
Finds fake imports, impossible APIs, and nonexistent helpers.
Spec Drift Scout
Finds comments, README claims, and names that lie about behavior.
Test Theater Scout
Finds tests that pass without proving meaningful behavior.
Demo mode plants seven realistic AI-code mistakes: fake package import, nonexistent helper, raw email logging despite a redaction comment, permissive bearer parsing, missing rate limiting, a toBeTruthy() test, and a telemetry test that never checks whether PII is removed.
The judge separates confirmed, likely, and speculative findings so the demo can claim measured recall without hiding noise.
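The verdict split could be sketched roughly as below. The Finding shape, evidence scores, and thresholds are assumptions for illustration, not Scout's actual judge.

```typescript
// Sketch of a judge that dedupes findings and buckets them into
// confirmed / likely / speculative. All names and cutoffs are assumed.
type Finding = { file: string; rule: string; evidence: number }; // evidence in [0, 1]
type Verdict = "confirmed" | "likely" | "speculative";

function judge(findings: Finding[]): Record<Verdict, Finding[]> {
  // Dedupe on (file, rule), keeping the strongest evidence per pair.
  const best = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}::${f.rule}`;
    const prev = best.get(key);
    if (!prev || f.evidence > prev.evidence) best.set(key, f);
  }
  const out: Record<Verdict, Finding[]> = { confirmed: [], likely: [], speculative: [] };
  for (const f of best.values()) {
    const verdict: Verdict =
      f.evidence >= 0.9 ? "confirmed" : f.evidence >= 0.5 ? "likely" : "speculative";
    out[verdict].push(f);
  }
  return out;
}
```

Keeping the speculative bucket visible is what lets recall be claimed without hiding noise.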
Conservative
Smallest possible diff. Surgical repair only.
Idiomatic
Align with existing contracts and repo conventions.
Robust
Fix the bug and prove the contract with tests.
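A deterministic scorer over these competing patches might look like the following sketch; the Candidate shape and the ranking criteria are assumptions, not Scout's real scoring route.

```typescript
// Sketch of deterministic patch ranking: ineligible candidates are dropped,
// then survivors are ordered by checks passed (desc), diff size (asc), and
// a stable id tiebreak so the result never depends on input order.
type Candidate = { id: string; eligible: boolean; checksPassed: number; diffLines: number };

function rankCandidates(cands: Candidate[]): Candidate[] {
  return cands
    .filter((c) => c.eligible)
    .sort(
      (a, b) =>
        b.checksPassed - a.checksPassed ||
        a.diffLines - b.diffLines ||
        a.id.localeCompare(b.id) // stable tiebreak keeps the ranking deterministic
    );
}
```

The id tiebreak matters: with everything else equal, two runs over the same candidates must crown the same winner.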
Live review keeps static scout rules at the front of each prompt and repo-specific context at the end. The app shows inspected files, estimated input tokens, stable prompt cache keys, and OpenAI usage metadata when the stream returns it, including cached input tokens.
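The prefix-first ordering is what makes prompt caching work: a sketch, assuming a sha256-based cache key and the common ~4-characters-per-token estimate, neither of which is confirmed as Scout's actual scheme.

```typescript
import { createHash } from "node:crypto";

// Sketch: static scout rules go first, repo-specific context last, so the
// static prefix is byte-identical across requests and cache-friendly.
function buildPrompt(staticRules: string, repoContext: string) {
  const prompt = `${staticRules}\n\n---\n\n${repoContext}`;
  // Key the cache on the static prefix only; it is stable across repos.
  const cacheKey = createHash("sha256").update(staticRules).digest("hex").slice(0, 16);
  // Rough token estimate: ~4 characters per token for English-like text.
  const estimatedTokens = Math.ceil(prompt.length / 4);
  return { prompt, cacheKey, estimatedTokens };
}
```

Two reviews of different repos share a cache key because only the trailing context differs.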
Scout rejects malformed patch output before it can win. The scoring route requires a plain unified diff, applies each valid candidate in a temporary workspace, and marks failed applies, unavailable repo context, failed checks, or unsafe check commands as ineligible.
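A pre-flight check of that shape could be sketched as below; these heuristics are an assumption about what "plain unified diff" validation looks like, not Scout's exact rules.

```typescript
// Sketch: reject malformed patch output before it reaches scoring.
// Accepts only raw unified diffs with file headers and at least one hunk.
function isPlainUnifiedDiff(candidate: string): boolean {
  const text = candidate.trim();
  if (text.startsWith("```")) return false; // markdown fence, not a raw diff
  const lines = text.split("\n");
  const hasHeaders =
    lines.some((l) => l.startsWith("--- ")) && lines.some((l) => l.startsWith("+++ "));
  const hasHunk = lines.some((l) => /^@@ -\d+(,\d+)? \+\d+(,\d+)? @@/.test(l));
  return hasHeaders && hasHunk;
}
```

Anything that fails this gate is marked ineligible before a workspace apply is even attempted.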
Patch checks run with a stripped environment, so API keys and repository credentials are not inherited by candidate execution. The demo also includes a deterministic malformed-patch proof button, so disqualification can be shown without faking a model failure.
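Environment stripping for child processes reduces to an allowlist; this is a minimal sketch, and the specific allowlisted variables are an assumption.

```typescript
// Sketch: build a stripped environment for candidate check execution so
// secrets like OPENAI_API_KEY or GITHUB_TOKEN are never inherited.
// The allowlist below is illustrative, not Scout's actual list.
const ENV_ALLOWLIST = ["PATH", "HOME", "LANG", "TMPDIR"];

function strippedEnv(
  parent: Record<string, string | undefined>
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of ENV_ALLOWLIST) {
    const value = parent[key];
    if (value !== undefined) env[key] = value;
  }
  return env; // pass as { env } to child_process.spawn for patch checks
}
```

An allowlist fails safe: a new secret added to the parent environment stays out by default, whereas a denylist would have to be updated to exclude it.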
Scout also runs as an official TypeScript SDK MCP server over stdio. Coding agents can call scout_review, scout_fix, scout_score_patch, scout_handoff, and scout_eval. The server also exposes native resources for the seeded manifest, seeded eval, and demo handoff prompt, plus native prompts for review, patch tournament, and Codex handoff workflows.
Seeded MCP eval is offline and deterministic. Live scout_review and scout_fix use the same bounded GitHub context and configured OpenAI model path as the web app. The repeatable live smoke command is npm run scout:mcp -- --smoke-live; it requires network access and OPENAI_API_KEY.
src/lib/demo-fixtures.ts - seeded benchmark and deterministic patches
src/lib/prompts.ts - specialist scout and repair prompts
src/lib/live-runner.ts - shared live OpenAI runner for API and MCP
src/lib/judge.ts - dedupe, verdicts, eval score
src/lib/patch-executor.ts - temp-workspace patch apply and safety checks
src/app/api/review - live or seeded review stream
src/app/api/fix - live or seeded repair stream
src/lib/context-budget.ts - token estimate, cache keys, usage telemetry
src/mcp/server.ts - official SDK MCP tools, resources, prompts
src/components/scout - modular product UI