I calibrated an LLM judge against 7 models. Then I transferred it to code review in an afternoon.
Context note. This post documents work done when the project was positioned as a B2B consulting practice. That direction has since been retired in favor of AI-engineering employment (ADR-0010). The calibration criterion and language are preserved verbatim below because the ρ = 0.841 measurement is tied to this exact prompt — rewriting it would invalidate the measurement.
Most LLM evaluation measures whether the model got the answer right. I needed to measure whether an AI-generated analytical essay was good enough for a B2B executive to take seriously. Those are different problems, and standard eval tools don't solve the second one.
So I built an evaluation harness. It took weeks. Then I adapted the same harness to evaluate Python code quality. That took an afternoon. The fact that the second one was trivial is the whole point of this post.
The problem: your quality metric doesn't measure quality
I was building a content engine that produces long-form analytical essays from a knowledge base of 305 behavioral-science findings. The engine worked — drafts scored 88/100 on voice compliance. But some essays had sections that read like a second essay crammed in. A piece about B2B procurement fear would suddenly veer into identity theory for three paragraphs.
The voice was right. The argument wasn't.
Standard automated metrics wouldn't catch this. RAGAS measures retrieval relevance and faithfulness; ROUGE and BERTScore measure overlap or similarity with a reference text. None of them checks argument structure, so a paragraph-shuffled draft would likely score about the same as a coherent one. I needed something that could tell me whether the essay held together as one argument.
Building the judge
I designed a 10-criterion rubric grounded in published frameworks: Minto Pyramid for argument structure, Toulmin for claim-grounds-warrant logic, BCG action-titles for headings, Berger-Milkman sharing research for "would someone forward this." Each criterion is scored 1-5 by an LLM judge using structured output. Three criteria are weighted 1.5x, chosen by which dimensions the research says matter most.
The judge uses Claude Opus at temperature 0 to keep scoring as repeatable as possible (temperature 0 reduces run-to-run variation but does not strictly guarantee identical output). I learned early that, in my pipeline, Opus serialized nested array fields with unescaped inner quotes, a bug that broke citation processing in another part of the pipeline. Flat parallel arrays work reliably. Small detail, but the kind of thing you only discover by building.
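As a rough illustration of the flat-parallel-array shape, here is a sketch that assumes hypothetical field names (`criterion_ids`, `scores`, `rationales`) and made-up criterion names. The real schema in the repo may differ.

```python
import json

# Hypothetical flat payload: three parallel arrays instead of a nested
# list of {criterion, score, rationale} objects.
raw = json.dumps({
    "criterion_ids": ["argument_structure", "claim_logic"],
    "scores": [4, 3],
    "rationales": ["Leads with the answer.", "Warrant is implied, not stated."],
})


def parse_flat_scores(payload: str) -> list[dict]:
    """Rebuild per-criterion records from flat parallel arrays."""
    data = json.loads(payload)
    ids = data["criterion_ids"]
    scores = data["scores"]
    rationales = data["rationales"]
    # Parallel arrays only line up if the lengths match, so fail loudly.
    if not (len(ids) == len(scores) == len(rationales)):
        raise ValueError("parallel arrays have mismatched lengths")
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
    return [
        {"criterion": c, "score": s, "rationale": r}
        for c, s, r in zip(ids, scores, rationales)
    ]


records = parse_flat_scores(raw)
```

The length check is the price of going flat: a nested object cannot lose its own score, but three separate arrays can drift out of step.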
Calibrating against a 7-model panel
A judge is only useful if it tracks what humans would say. I didn't have a panel of human editors. So I built the next best thing: 7 independent LLMs, each ranking the same 12 drafts on the same criterion.
The criterion: "Would a fractional CMO be comfortable forwarding this piece to their CEO as evidence of why to hire the firm that produced it?"
The panel: Gemini 2.5 Pro, Grok 4, DeepSeek V3, Mistral Large, GPT-5, Claude Deep Research, and Qwen3. I batched all 12 drafts into a single ~44K token prompt and pasted it into each model's web UI. Total cost: $0.
Mean pairwise Spearman ρ across all 7: 0.544. Not tight enough. But the structure told a story.
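For readers who want to reproduce this kind of number, here is a minimal sketch of mean pairwise Spearman ρ over a panel. It assumes each model returns a tie-free ranking (1 = best) of the same drafts. The toy rankings and the names `model_a` to `model_c` are illustrative, not the calibration data.

```python
from itertools import combinations


def spearman_rho(rank_a: list[int], rank_b: list[int]) -> float:
    """Spearman rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def mean_pairwise_rho(rankings: dict[str, list[int]]) -> float:
    """Average rho over every unordered pair of models (7 models -> 21 pairs)."""
    pairs = list(combinations(rankings, 2))
    total = sum(spearman_rho(rankings[a], rankings[b]) for a, b in pairs)
    return total / len(pairs)


# Toy panel: two models that mostly agree and one that ranks in reverse.
panel = {
    "model_a": [1, 2, 3, 4, 5],
    "model_b": [2, 1, 3, 4, 5],
    "model_c": [5, 4, 3, 2, 1],
}
overall = mean_pairwise_rho(panel)
```

A single dissenting model drags the mean down sharply in this toy panel, which is why the pairwise structure is worth inspecting and not just the average.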
Five models formed a tight cluster with internal agreement of ρ ≈ 0.83. Two were outliers — Deep Research was applying a second evaluation axis (penalizing execution polish, not just content quality), and Qwen produced inconsistent rankings that didn't track any principled axis. I dropped both and calibrated against the 5-model cluster.
The judge achieved Spearman ρ = 0.841 against this ground truth.
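One way to compute a figure like this is sketched below. It assumes the ground truth is the mean rank per draft across the retained cluster, which is my assumption for illustration and may not be how the repo builds it. Mean ranks can tie, so the sketch uses average ranks and a Pearson correlation on those ranks. All data here is toy data.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values ascending; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average position of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Tie-aware Spearman rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def consensus_rank(cluster: list[list[int]]) -> list[float]:
    """Mean rank per draft across the panel models (1 = best)."""
    return [mean(column) for column in zip(*cluster)]


# Toy data: 3 panel models ranking 4 drafts, plus hypothetical judge totals.
cluster = [[1, 2, 3, 4], [1, 3, 2, 4], [2, 1, 3, 4]]
judge_totals = [40.0, 38.5, 30.0, 22.0]  # higher = better

consensus = consensus_rank(cluster)
# Negate judge totals so both sequences run "lower = better".
rho = spearman(consensus, [-t for t in judge_totals])
```

The sign flip matters: panel ranks run low-is-good while judge totals run high-is-good, and forgetting that turns a strong agreement into a strong negative correlation.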
The part where I tell you what's wrong with that number
I chose to drop those two models after seeing which ones disagreed, not before. A pre-registered protocol would have specified the drop criteria in advance. The ρ = 0.841 is real, but it's inflated by post-hoc outlier selection. I'd design this differently next time.
I'm telling you this because it's the most important thing in this entire post. The number is a point estimate from a specific methodology with a specific flaw. Reporting the number without the flaw is how you get eval metrics that look great and mean nothing.
For comparison: G-Eval reports ρ = 0.514 on SummEval. Published human-human agreement on subjective code quality sits at ρ ≈ 0.55-0.7. My number is high, and I've told you exactly why you should discount it slightly.
What the harness actually found
Once the judge was calibrated, I used it to measure pipeline changes. Three findings:
Thesis-as-structural-schema produces an 8-point improvement. I encoded the brand thesis ("Buyers don't decide with logic. They decide with fear, then hire logic to testify") as a structural constraint in the outline stage — each section must instantiate the fear→testimony mechanism through specific stages, with Toulmin-complete arguments and a derivation check. Validated at N=50 topics.
CAG fails for long-form generation. Cache-Augmented Generation (stuffing the entire knowledge base, 115K tokens, into the context window) scored 26.3 vs 32.0 for focused approaches. Every published CAG evaluation I'm aware of is on QA tasks. This is a negative result on a generation task. The model appears to suffer from selection paralysis: too many findings available, not enough signal to discriminate.
Cross-domain delta, decomposed. On one cross-domain bakeoff topic (T3: B2B vendor lock-in × religious conversion), the full hybrid pipeline (Wiki retrieval + graph expansion + thesis-constrained outline) scored 36.0 vs the production retrieval pipeline's 23.0 — a +13 delta. Wiki retrieval alone accounted for +9 of those (32.0); graph expansion and thesis outline added the rest. Single-topic numbers (N=1). An N=50 replication showed cross-domain topics scored 32.2 on average with a wide confidence interval (27.7–36.7) — the effect persists but the single-topic delta overstated it.
Then I transferred the whole thing to code review in an afternoon
This is the part that surprised me.
I took the same harness infrastructure — structured rubric in YAML, LLM judge with per-criterion scoring, anchor exemplars at 1/3/5, weighted criteria, veto rules — and swapped the domain. Instead of 10 criteria for essay quality, I wrote 6 criteria for Python code quality:
- Naming clarity (1.5x weight) — do identifiers reveal intent?
- Readability & structure (1.5x weight) — consistent abstraction, clear control flow?
- Architectural fit — good module boundaries, low coupling?
- Documentation quality — docstrings explain contracts, comments explain WHY?
- Error handling — explicit failure modes, no silent swallowing?
- Testability — injectable dependencies, pure functions, observable side effects?
The rubric is grounded in Clean Code (Martin), A Philosophy of Software Design (Ousterhout), PEP 8/257, and Google's Python Style Guide. It deliberately excludes what linters already catch: formatting, import order, line length. It focuses on the subjective dimensions that have traditionally needed a human reviewer.
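The six criteria and weights listed above imply a maximum of 35 points: four criteria at weight 1.0 and two at 1.5, each scored 1-5. A minimal sketch of that arithmetic follows. The snake_case keys are hypothetical names for the listed criteria, and the example scores are invented to show the calculation, not taken from any real file.

```python
# Weights mirror the six criteria listed above; key names are hypothetical.
WEIGHTS = {
    "naming_clarity": 1.5,
    "readability_structure": 1.5,
    "architectural_fit": 1.0,
    "documentation_quality": 1.0,
    "error_handling": 1.0,
    "testability": 1.0,
}
MAX_PER_CRITERION = 5


def weighted_total(scores: dict[str, int]) -> float:
    """Sum of weight * score; requires exactly one score per criterion."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric criteria")
    return sum(WEIGHTS[name] * score for name, score in scores.items())


def max_total() -> float:
    """Best possible total: every criterion at the top of the 1-5 scale."""
    return MAX_PER_CRITERION * sum(WEIGHTS.values())


# Invented per-criterion scores, purely to illustrate the weighting.
example = {
    "naming_clarity": 4,
    "readability_structure": 5,
    "architectural_fit": 5,
    "documentation_quality": 4,
    "error_handling": 4,
    "testability": 4,
}
total = weighted_total(example)
```

Because two weights are 1.5, totals move in half-point steps, which is why half-point scores are possible on a rubric of integer ratings.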
I ran it on my own codebase using Gemini Flash (free tier, different model family from the generator). Results:
| File | Score |
|---|---|
| eval/judge.py | 35.0/35.0 |
| adversarial_critic.py | 34.0/35.0 |
| graph_expander.py | 30.5/35.0 |
| thesis_outline.py | 30.5/35.0 |
| revision_gate.py | 26.5/35.0 |
| topic_router.py | 25.5/35.0 |
The judge differentiated quality within the codebase in a way that matched my own expectations. The files I'd spent the most design care on scored highest. The ones I built fast scored lower, with specific per-criterion reasoning citing real patterns in the code.
Honest caveat: This is a sanity check, not calibration. I scored my own code and I'm the one interpreting whether the differentiation "makes sense." Proper calibration against human reviewers would strengthen the signal. That work hasn't been done yet for the code judge.
The methodology is the skill
It took weeks to build the essay evaluation harness. It took an afternoon to apply the same methodology to code review. The rubric is different. The criteria are different. The domain knowledge is different. But the infrastructure — structured scoring, anchor exemplars, weighted criteria, veto rules, deterministic temperature, multi-model calibration protocol — transferred without modification.
That's the thing I didn't expect. I thought the harness was tightly coupled to the content domain. It isn't. The methodology is domain-agnostic. What's domain-specific is the rubric — and a rubric is a YAML file you can write in an afternoon if you know the domain.
If you're building AI systems that produce output humans need to trust — essays, code, summaries, analyses, recommendations — the evaluation harness is the infrastructure that makes everything else measurable. Without it, you're iterating on vibes. With it, every change produces a number you can compare to the last number.
What I'd do differently
Pre-register the outlier criteria. Decide before running the panel which agreement threshold drops a model. Don't look at the results first.
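A pre-registered rule can be as small as a constant and a function committed before the panel runs. The sketch below is one possible rule, not the one used in this project: drop any model whose median ρ with the rest of the panel falls below a fixed threshold. The threshold value, the use of the median, and the model names are all assumptions for illustration.

```python
from statistics import median

# Hypothetical threshold; the point is to commit it before seeing any output.
DROP_THRESHOLD = 0.5


def models_to_drop(
    pairwise: dict[tuple[str, str], float],
    threshold: float = DROP_THRESHOLD,
) -> list[str]:
    """Return models whose median rho with the other models is below threshold."""
    agreement: dict[str, list[float]] = {}
    for (a, b), rho in pairwise.items():
        # Each pairwise rho counts toward both models in the pair.
        agreement.setdefault(a, []).append(rho)
        agreement.setdefault(b, []).append(rho)
    return sorted(
        name for name, rhos in agreement.items() if median(rhos) < threshold
    )


# Toy pairwise table: three models that agree and one that ranks in reverse.
toy_pairwise = {
    ("model_a", "model_b"): 0.9,
    ("model_a", "model_c"): 0.9,
    ("model_b", "model_c"): 0.8,
    ("model_a", "model_d"): -1.0,
    ("model_b", "model_d"): -0.9,
    ("model_c", "model_d"): -0.9,
}
dropped = models_to_drop(toy_pairwise)
```

The median is used here because a single strong dissenter pulls every other model's mean agreement down with it; in this toy table a mean-based rule at the same threshold would drop all four models.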
Get human ground truth. Even 10-20 human rankings on a subset would break the LLM-judging-LLM circularity. The 7-model panel is a strong proxy but it's still a proxy.
Report more than Spearman ρ. Add Kendall's τ and Krippendorff's α. Different agreement statistics surface different reliability properties.
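To show why a second statistic is worth reporting, here is a small sketch of Kendall's τ for tie-free rankings (the tau-a form). The rankings are toy data. Krippendorff's α is left out because it needs a fuller treatment of the measurement scale than fits here.

```python
from itertools import combinations


def kendall_tau(rank_a: list[int], rank_b: list[int]) -> float:
    """Kendall tau-a for two tie-free rankings of the same items."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # Positive product: both rankings order items i and j the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy rankings that differ by one adjacent swap.
judge_rank = [1, 2, 3, 4, 5]
panel_rank = [2, 1, 3, 4, 5]
tau = kendall_tau(judge_rank, panel_rank)
```

For these two rankings τ comes out at 0.8 while Spearman ρ is 0.9. τ counts swapped pairs and ρ weights squared rank distances, so the two can diverge further when disagreements involve large rank jumps.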
Run the code judge calibration for real. The essay judge has ρ = 0.841 (with the caveats above). The code judge has a rubric and a sanity check. The gap between those two is the gap between "I think this works" and "I can prove this works." Closing it is the next piece of work.
The repo
Everything is open: the essay rubric, the code quality rubric, the judge implementation, the calibration data, and the 20 research reports that grounded the design decisions. Built through AI pair programming with Claude Code.
github.com/tjkuhns/explodable · Live demo
Receipts
Every load-bearing claim in this post links to a canonical source in the repo. If something here doesn't match what the code or data says, the code and data win.
- ADR-0003 — why LLM-as-judge in place of a human panel
- ADR-0005 — wiki retrieval adopted as production, CAG retired as negative result
- ADR-0006 — why the code judge was proposed as a feature issue, not a cold PR
- docs/phase0_calibration_result.md — the ρ = 0.841 measurement with methodology disclosure
- docs/phase1_results.md — Wiki vs CAG vs Pipeline A bakeoff data
- logs/phase2_n50/_results.json — N=50 replication data
- src/content_pipeline/eval/judge.py — judge implementation
- config/rubrics/analytical_essay.yaml — essay rubric (10 criteria)
- config/rubrics/python_code_quality.yaml — code rubric (6 criteria)
- braintrustdata/autoevals#185 — code judge feature proposal