AI content engine for B2B buyer psychology

A multi-stage pipeline producing analytical essays from a structured knowledge base of behavioral-science findings. Two pipelines share the KB: a production graph that runs every draft, and an experimental graph used to measure architecture changes before promoting them.


+8 pts

Thesis-as-structural-schema vs standard prompting (validated at N=50)

+9 / +13

Cross-domain delta on one bakeoff topic (N=1): Wiki alone +9, full hybrid +13

ρ = 0.841

Judge calibration against a 5-model editorial panel

305

Behavioral-science findings indexed by anxiety, Panksepp circuit, and cultural domain

763

Typed cross-domain relationships — 85% connect findings across different domains

~$0.03

Per draft in the current (Wiki-based) production pipeline vs ~$0.30 in the prior pipeline it replaced — Phase 1 bakeoff, N=5 topics (ADR-0005)

Known methodology limits: the 5-model calibration cluster was derived by dropping 2 outlier models post-hoc; the cross-domain deltas (+9 Wiki-alone, +13 full hybrid) came from a single bakeoff topic, and the N=50 replication confirmed the effect with wide variance (mean 32.2, interval 27.7–36.7). Full disclosure →

Sample output

Generated by the hybrid pipeline from KB findings on scarcity cognition, committee dynamics, and loss aversion:

The Mid-Market Trap

The VP of Operations at a $25M logistics company stares at three SaaS demos on her laptop at 4:47 PM on a Thursday. She's been in meetings since 7 AM, the CEO is asking for a vendor recommendation by Friday, and her team of eight stakeholders can't agree on basic requirements. She closes the laptop and defaults to "let's table this until Q2" — the same decision she made last quarter.

The mid-market sits in a decision-making dead zone. Too big for intuitive founder choices. Too small for enterprise decision infrastructure. They don't decide with logic first, then hire fear to audit. They decide with fear first, then hire logic to testify.

Architecture

Two graphs share one knowledge base. The production graph runs every draft; the experimental graph is the measurement surface where architecture changes are evaluated before promotion. The split is deliberate — it lets architecture decisions be driven by measured results rather than opinion.

Production pipeline

src/content_pipeline/graph.py:671 · runs every draft · Wiki-based retrieval (ADR-0005)

calendar_trigger ──── scheduled topic from editorial calendar
       │
kb_retriever ──────── multi-query retrieval + decay-weighted scoring
       │
content_selector ──── ranks + enforces ≥1 cross-domain finding
       │
outline_generator ─── thesis-constrained outline
       │              (fear-commit → logic-recruit → testimony-deploy)
HITL gate ─────────── outline review
       │
draft_generator ───── voice profile, inline citation markers
       │
bvcs_scorer ───────── voice compliance (fail → revise, max 3 loops)
       │
HITL gate ─────────── draft review
       │
publisher_stub ────── write markdown to disk
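The bvcs_scorer step (fail → revise, max 3 loops) can be sketched as a bounded loop. This is a minimal illustration, not the code in graph.py: the function name, the `score` and `revise` callables, the 0.8 threshold, and the reading of "max 3" as three revisions are all assumptions.

```python
from typing import Callable

MAX_REVISIONS = 3  # from the diagram: "fail → revise, max 3 loops"


def revise_until_compliant(
    draft: str,
    score: Callable[[str], float],   # hypothetical voice-compliance scorer
    revise: Callable[[str], str],    # hypothetical revision step
    threshold: float = 0.8,          # assumed pass mark, not from the source
) -> tuple[str, bool, int]:
    """Return (final_draft, passed, revisions_used)."""
    revisions = 0
    while score(draft) < threshold:
        if revisions == MAX_REVISIONS:
            # Budget exhausted: hand the failing draft on (e.g. to the HITL gate).
            return draft, False, revisions
        draft = revise(draft)
        revisions += 1
    return draft, True, revisions
```

Capping the loop keeps cost per draft bounded and guarantees that a draft which never passes still reaches a human reviewer instead of cycling.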

Experimental pipeline

src/content_pipeline/experimental/hybrid_graph.py:596 · measurement surface, not production · adversarial critique + revision gating live here because they're being evaluated, not because they're live

topic_router ───────── classifies topic, routes to retrieval strategy
  ├─ wiki_selector ── reads KB index, picks findings across domains
  ├─ vector_retriever ── pgvector similarity search within cluster
  └─ graph_walker ──── PPR traversal + MMR diversity reranking
       │
graph_expander ────── adds cross-domain findings via relationship graph
       │
outline_generator ─── thesis-constrained
       │
draft_generator ───── focused context, voice profile, citations
       │
bvcs_scorer ───────── voice compliance (fail → revise, max 3)
       │
adversarial_critic ── a different model reads full KB + draft
       │
revision_gate ─────── Pareto filter: improve without regressing
       │
HITL gate ─────────── draft review → publisher
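One way to read the revision_gate step ("improve without regressing") is as a Pareto-dominance check over per-criterion scores. The sketch below assumes scores arrive as a dict keyed by criterion name; the function name and that data shape are hypothetical, not taken from hybrid_graph.py.

```python
def pareto_accepts(before: dict[str, float], after: dict[str, float]) -> bool:
    """Accept a revision only if it Pareto-dominates the previous draft:
    no criterion scores lower, and at least one scores strictly higher."""
    if before.keys() != after.keys():
        raise ValueError("both drafts must be scored on the same criteria")
    no_regression = all(after[k] >= before[k] for k in before)
    some_gain = any(after[k] > before[k] for k in before)
    return no_regression and some_gain


# Illustrative scores only; criterion names are made up for the example.
original = {"thesis": 3.0, "evidence": 4.0, "voice": 4.0}
revised = {"thesis": 4.0, "evidence": 4.0, "voice": 4.0}
print(pareto_accepts(original, revised))  # True: one gain, nothing lost
```

Under this rule a revision that trades a better thesis score for weaker voice compliance is rejected, which is stricter than comparing totals.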

The knowledge base

An anxiety-indexed knowledge graph modeling how fear drives executive decision-making. Not a document store — a structured model of buyer psychology.

Root anxieties

Helplessness · Insignificance · Isolation · Meaninglessness · Mortality

Affective circuits

Panksepp circuits mapped to buyer behavior

Cultural domains

Competitive systems · Wealth · Tribalism · Technology · Religion · +19 more

Relationship types

supports · extends · qualifies · subsumes · reframes · contradicts — each edge carries a rationale
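The structure described above could be modeled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical, and only the six relationship types, the three index dimensions, and the per-edge rationale come from the description. The helper shows how a figure like the 85% cross-domain share could be computed from such a model.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(str, Enum):
    SUPPORTS = "supports"
    EXTENDS = "extends"
    QUALIFIES = "qualifies"
    SUBSUMES = "subsumes"
    REFRAMES = "reframes"
    CONTRADICTS = "contradicts"


@dataclass(frozen=True)
class Finding:
    id: str
    anxiety: str   # one of the five root anxieties
    circuit: str   # Panksepp circuit
    domain: str    # cultural domain


@dataclass(frozen=True)
class Edge:
    source: str    # Finding.id
    target: str    # Finding.id
    relation: RelationType
    rationale: str  # every edge carries a rationale


def cross_domain_share(findings: list[Finding], edges: list[Edge]) -> float:
    """Fraction of edges whose endpoints sit in different cultural domains."""
    if not edges:
        return 0.0
    by_id = {f.id: f for f in findings}
    crossing = sum(
        1 for e in edges if by_id[e.source].domain != by_id[e.target].domain
    )
    return crossing / len(edges)
```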

Evaluation methodology

Every architectural decision was driven by measured results, not assumptions.

10-criterion rubric grounded in Minto Pyramid, BCG action-titles, Toulmin argument model, and Berger-Milkman sharing research.

7-model editorial panel: each model independently ranked 12 drafts on the same criterion. 2 outliers were dropped post-hoc; a pre-registered protocol would have specified drop criteria in advance (disclosed in the methodology writeup).

Spearman ρ = 0.841 against the 5-model tight cluster. Recalibrated on Sonnet (ρ = 0.782) for 8× cheaper ongoing scoring.
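For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values given their average rank. The dependency-free sketch below illustrates the calculation; the function names are hypothetical and the sample numbers are not the project's data.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy example: judge scores vs. a panel's consensus ranking for five drafts.
judge = [1.0, 2.0, 3.0, 4.0, 5.0]
panel = [2.0, 1.0, 4.0, 3.0, 5.0]
print(round(spearman_rho(judge, panel), 3))  # 0.8
```

Because only ranks matter, the judge and the panel can use different score scales and still be compared.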

N=50 validation across five density conditions with confidence intervals.
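A confidence interval for a mean score can be sketched as below. The source does not say which interval method was used, so this normal-approximation version is an assumption, as are the function name and the sample scores.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(
    scores: list[float], confidence: float = 0.95
) -> tuple[float, float]:
    """Normal-approximation interval for the mean of `scores`."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate variance")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    centre = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return centre - half_width, centre + half_width


# Illustrative scores only, not the project's measurements.
low, high = mean_confidence_interval([30.0, 32.0, 34.0])
print(f"{low:.2f} to {high:.2f}")
```

For small per-condition samples a Student-t critical value would give a wider, more conservative interval than the z value used here.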

CAG negative result: full-context stuffing scored 26.3 vs 32.0 for focused approaches. First measured result on a generation task.

Read the full methodology →

Writing


I calibrated an LLM judge against 7 models. Then I transferred it to code review in an afternoon.
Eval methodology, multi-model calibration, the pre-registration flaw, and what happened when the harness moved to a new domain. April 2026.

Built by Tom Kuhns

Agentic engineer building evaluated AI systems with Claude Code. Background in B2B SaaS, working with procurement committees, post-sale implementation teams, and the moments where pilots actually fail. The buyer psychology domain expertise comes from years inside the sales cycle.

The portfolio isn't just the code. It's the evaluation methodology, the architectural decisions, and the 20 research reports that ground every design choice.

Open to Applied AI, Agentic Engineer, Forward Deployed, and DevRel roles at AI-native companies. Remote from Youngstown, OH.
thomasjkuhns@gmail.com · LinkedIn

Claude Sonnet 4 · LangGraph · pgvector · PostgreSQL · Python · igraph · Supabase · Streamlit · Claude Code · LLM-as-judge eval harness · Spearman calibration