The Council Pattern: Decisive Verdicts from Disagreeing LLMs

[Figure: a single confused LLM versus a nine-persona council issuing a decisive verdict]

The standard advice for LLM-as-Judge setups is to reduce bias. Prompt the model to be neutral. Ask for a balanced view. Average across runs.

We do the opposite. Nine personas, each deliberately biased, several in directly opposing directions. A judge that synthesizes their disagreement instead of averaging it. Multi-sampling per persona to flag low-conviction inputs. The result is a workflow that produces categorical verdicts even on inputs where a single prompt drifts or anchors.

One caveat. Today the council runs on a single model family, so the disagreement is prompt-induced on one shared posterior rather than drawn from nine independent ones. A multi-family cluster is the next iteration on the roadmap.

This post sits a layer above our previous post on the inference cluster. It takes the same hardware that makes tens of thousands of LLM calls per decision economically viable and organizes those calls into a workflow shape we use in production for asset evaluation. The shape transfers to other high-stakes evaluation problems too.

The Failure Mode We Started With

A single-prompt LLM evaluator on a high-stakes question fails in one of two ways under repeated runs, depending on temperature and prompt specificity:

Drifts. Same input, different verdicts on different runs. Useful information is buried in the spread, but you can't tell whether a given run is the signal or the noise.
Over-anchors. Same input, same verdict every run, but the verdict is shaped by whichever framing the prompt nudged the model toward. Stable but biased, and you can't see the bias because the model performs confidence.

Asking for a "balanced view" doesn't fix either failure mode. It produces a third one: hedged verdicts of the form "I cannot quite recommend this, but I cannot quite reject it either." Verdicts everyone has seen and nobody can act on.

A structural alternative, when simpler evaluators aren't good enough: move balance from the prompt level to the system level. Compose many deliberately unbalanced prompts into a workflow whose shape exposes their disagreement instead of collapsing it.

The Architecture

                 ┌─ Persona 1 ─ (n samples) ─┐
                 ├─ Persona 2 ─ (n samples) ─┤
                 ├─ Persona 3 ─ (n samples) ─┤
   input ────────┤            ...            ├──── Judge ──── verdict
                 ├─ Persona 8 ─ (n samples) ─┤    (synthesis    + conviction
                 └─ Persona 9 ─ (n samples) ─┘     of disagreement)

Three pieces:

Nine personas, each with a fixed bias and a single lens. They aren't asked to be balanced. They're asked to do their job: find fraud, build the bull case, look for compliance issues, read the chart. They disagree by design. Each persona outputs a numeric score on its lens's dimension plus structured findings.
Multi-sampling per persona (n=3). Each persona runs three times. The output is the median score and the spread (range) across those three runs. High spread on a single persona is a within-persona uncertainty signal: even with a fixed bias, the model isn't converging.
A judge, run after every persona has produced its consensus, that reads the disagreement structure across personas and produces a final categorical verdict with a conviction score. The judge does not average. Averaging would defeat the purpose of having opposing biases.

Personas output numeric scores and structured findings. The judge synthesizes those into a categorical verdict and a conviction score. The distinction matters when you implement.
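To make the two output shapes concrete, here is a minimal sketch of how they might be typed. The field names, the [0, 1] conviction range, and the example values are assumptions for illustration; the real output enum and schema stay internal.

```python
from dataclasses import dataclass, field


@dataclass
class PersonaResult:
    """Consolidated output of one persona after multi-sampling."""
    persona_id: str
    median_score: float   # middle score across the n samples
    spread: float         # max score minus min score
    findings: list[str] = field(default_factory=list)  # structured findings


@dataclass
class JudgeVerdict:
    """Final judge output: a category plus a conviction score."""
    category: str         # hypothetical label; the real enum is internal
    conviction: float     # assumed here to lie in [0, 1]
    rationale: str = ""   # names the disagreement the judge resolved


# Illustrative values only, not production data.
example = PersonaResult("short_seller", median_score=0.2, spread=0.1,
                        findings=["no red flags in the reviewed data"])
verdict = JudgeVerdict(category="no_decision", conviction=0.3,
                       rationale="tension pairs do not resolve")
```

Keeping the two types separate makes it harder to accidentally feed a persona score into downstream logic that expects a verdict.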

A pipeline-integrity rule sits over everything. If any persona fails to produce a result, the judge does not run. Partial input produces false confidence: eight of nine personas agreeing, with one voice missing, can point to the opposite verdict from the one the full panel would support.

A Terminology Note

What this post calls a "persona" is, in Anthropic's Building Effective Agents taxonomy, a step in a workflow (a fixed pipeline of LLM calls), not an agent in the strict sense (an LLM that dynamically chooses tools and control flow). The council pattern is a parallelization workflow with a synthesis step, not an autonomous agent system. The step from this to true agents (each persona deciding which data to fetch, which sub-questions to expand, when it has seen enough) isn't far. The workflow shape produces decisive verdicts today. The agent shape will produce decisive verdicts under broader uncertainty.

The Nine Personas (Crypto)

The lens set we use in production is for crypto asset evaluation: nine personas, each with a deliberate bias and a defined slice of data. The roster runs from a short-seller hunting fraud and a hype-agent constructing the bull case, through tokenomics, technology, traction, regulatory, narrative, founder-background, and community lenses. Each persona reads only the data category fitted to its lens, runs at a temperature tuned to its role, and outputs a numeric score on a polarity unified across the council so the judge can read agreement directly. The specific bias prompts, data-source mapping, temperature settings, and anti-hallucination guards stay internal.

Tension Pairs — The Non-Obvious Lever

Some personas are designed to oppose each other on the same risk surface. The judge pays attention when they agree:

Short Seller × Hype Agent. When the bear finds no red flags AND the bull builds a strong case from real data, the signal is asymmetric. In our current judge prompt we direct the synthesis to weight this pair first. LLMs follow such weighting instructions imperfectly, so this is a steering signal, not a hard control.
Traction × Narrative. Hard numbers vs. story. Strong story with no traction = early or hype-driven. Strong traction with no narrative attention = undervalued and undiscovered.
Background × Community. An anonymous team with a real engaged community is a different signal than an anonymous team with bot followers.

Naming opposing pairs in the judge's prompt explicitly helps the judge focus its synthesis. The judge doesn't need to discover the opposition. You tell it which two personas to read together first, and it focuses there.
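One way to name the pairs explicitly is to render them as an ordered block of text that gets appended to the judge prompt. The sketch below assumes hypothetical persona ids and a simple dict of per-persona summaries; the wording of the instruction line is a placeholder, not our production prompt.

```python
# Hypothetical persona ids; the pairings follow the three pairs named above.
TENSION_PAIRS = [
    ("short_seller", "hype_agent"),
    ("traction", "narrative"),
    ("background", "community"),
]


def tension_pair_section(results: dict[str, dict]) -> str:
    """Render the pairs, in priority order, as text for the judge prompt.

    `results` maps persona id -> {"median": number, "spread": number}.
    """
    lines = ["Read these opposing pairs together, in this order:"]
    for rank, (a, b) in enumerate(TENSION_PAIRS, start=1):
        ra, rb = results[a], results[b]
        lines.append(
            f"{rank}. {a} (median={ra['median']}, spread={ra['spread']}) vs "
            f"{b} (median={rb['median']}, spread={rb['spread']})"
        )
    return "\n".join(lines)


# Illustrative scores only.
results = {
    "short_seller": {"median": 20, "spread": 5},
    "hype_agent": {"median": 80, "spread": 10},
    "traction": {"median": 55, "spread": 3},
    "narrative": {"median": 70, "spread": 25},
    "background": {"median": 40, "spread": 8},
    "community": {"median": 65, "spread": 4},
}
print(tension_pair_section(results))
```

The list order carries the priority, so putting Short Seller × Hype Agent first is how this sketch expresses "weight this pair first". As noted above, the model follows that ordering imperfectly.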

Most readers will encounter the council pattern as "run multiple prompts and combine the results." The tension-pair concept is what turns it into a framework. It's also where most rotations to other domains succeed or fail.

The Judge — Synthesis, Not Averaging

The judge is an LLM call. It reads the nine persona outputs and produces a single categorical verdict, a conviction score, and a rationale that names the disagreement. It doesn't average. It reads the disagreement structure: where opposing personas converge, where individual personas are unstable, where data is missing. Then it decides what the structure means. When the disagreement doesn't resolve into a coherent picture, the verdict is no decision; that abstention is a feature, not a failure. The exact decision rules and the output enum are domain-specific and stay internal.

In production we run the judge itself with n=3 and gate on cross-run type-agreement. When the three judge sub-runs converge on the same categorical verdict, the system commits. When they split, it emits an explicit abstention. The first organic abstention in our crypto deployment fired on a real scan: the three sub-runs split 1/1/1 across three different opportunity types on the same persona panel. The system refused to commit. The verdict surfaced as judge_uncertain rather than a confident wrong call. That's the system-level analogue of high spread at the persona level. When the synthesis layer disagrees with itself, the architecture has to be willing to say so.
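The agreement gate itself is small. The sketch below assumes each judge sub-run has already been reduced to its categorical label, and it treats any split (including 2/1) as an abstention. That unanimity rule is our reading for this illustration; the production decision rules are not spelled out here.

```python
def gate_on_agreement(categories: list[str]) -> str:
    """Commit only when every judge sub-run returned the same category.

    Assumption: any disagreement, including a 2/1 split, abstains.
    """
    if categories and len(set(categories)) == 1:
        return categories[0]
    return "judge_uncertain"


# Category names below are hypothetical placeholders.
print(gate_on_agreement(["type_a", "type_a", "type_a"]))  # commits
print(gate_on_agreement(["type_a", "type_b", "type_c"]))  # abstains
```

If a majority rule suits your domain better, swap the unanimity check for a count threshold. The point is that the gate is explicit code, not something left to the judge prompt.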

A common mistake when implementing this is treating the judge as a summary step. It's the synthesis step. It doesn't summarize what the personas said. It reads how they disagree and decides what that disagreement means.

Multi-Sampling and What "Spread" Means

Each persona runs n=3 times against the same input. The persona's consolidated output is (median_score, spread):

Median is the middle of the three score values. With n=3 that's the second value after sorting. In any strict statistical sense it buys very little noise reduction (it's a single order statistic on a tiny sample), but it's useful when two of three samples agree and the third diverges, because the median then sits with the agreeing pair. Treat it as a sanity statistic.
Spread is max_score − min_score, the range. Higher spread means the persona itself isn't converging: same prompt, same input, three different verdicts. The judge reads high spread as low within-persona conviction.
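A small worked example of the two statistics, assuming integer scores on a 0 to 100 scale (the scale is an assumption; any numeric scale works the same way):

```python
import statistics


def consolidate(scores: list[int]) -> tuple[int, int]:
    """Return (median_score, spread) for one persona's samples."""
    return statistics.median(scores), max(scores) - min(scores)


# Two samples agree, one diverges: the median sits with the agreeing pair,
# and the large spread flags the instability.
print(consolidate([62, 60, 15]))   # (60, 47)

# All three samples converge: small spread, high within-persona conviction.
print(consolidate([40, 42, 41]))   # (41, 2)
```

In the first case the median alone looks reassuring. It is the spread of 47 that tells the judge not to trust it much.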

What we've seen in production:

n=1: drift dominates. Same input on different runs gives different verdicts.
n=3: spread becomes a useful instability flag without breaking the cost model.
n=5+: diminishing returns. Cost grows linearly; signal improvement is marginal beyond n=3 in our domain. Calibrate for yours.

With n=3, that's n × 9 = 27 persona calls per decision, before the judge adds its own (one call, or three when the judge is multi-sampled). On hosted endpoints, that's expensive. On a self-operated inference cluster running one process per GPU, the marginal cost per call is electricity. That's the cost regime that makes the council pattern economically viable.
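The call budget is simple arithmetic, sketched here so you can plug in your own numbers. The function name and defaults are illustrative. The 27 figure counts persona calls only, so the sketch adds the judge calls on top.

```python
def calls_per_decision(personas: int = 9, persona_samples: int = 3,
                       judge_samples: int = 1) -> int:
    """Total LLM calls for one decision: persona fan-out plus judge runs."""
    return personas * persona_samples + judge_samples


print(calls_per_decision())                   # 28: 27 persona calls + 1 judge
print(calls_per_decision(judge_samples=3))    # 30: judge multi-sampled
print(calls_per_decision(persona_samples=5))  # 46: the n=5 regime
```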

Lens Adaptation

The architecture is invariant. The lens set rotates per domain. The same shape (nine deliberately biased personas, multi-sampling, a synthesizing judge) appears portable to other categorical-decision domains where a single LLM prompt either drifts or anchors. Natural fits:

Content moderation on consumer platforms. Spam, toxicity, misinformation, minor safety, plus a free-speech counterweight.
Contract and compliance review. Liability, payment terms, IP, jurisdiction.
M&A target evaluation. Financial, customer concentration, integration risk.
Fraud and risk analysis. Pattern matching, identity, behavior, plus a defense persona for the customer.

The architecture, multi-sampling, and the judge output shape transfer without rotation. Persona prompts, temperatures, tension pairs, and thresholds need calibration per domain.
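One way to keep the rotating parts separate from the invariant ones is to hold the lens set as plain data and validate it before a run. The sketch below uses the content-moderation lenses listed above. Only five are named there, so this is not a full nine-persona roster, and the toxicity × free_speech pairing is a hypothetical example, not a calibrated one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LensSet:
    """Per-domain configuration: lens ids plus the pairs to read together."""
    domain: str
    lenses: tuple[str, ...]
    tension_pairs: tuple[tuple[str, str], ...]

    def validate(self) -> None:
        """Reject tension pairs that reference a lens not in the set."""
        known = set(self.lenses)
        for a, b in self.tension_pairs:
            missing = {a, b} - known
            if missing:
                raise ValueError(
                    f"unknown lens in tension pair: {sorted(missing)}")


moderation = LensSet(
    domain="content_moderation",
    lenses=("spam", "toxicity", "misinformation",
            "minor_safety", "free_speech"),
    tension_pairs=(("toxicity", "free_speech"),),  # hypothetical pairing
)
moderation.validate()
```

Prompts, temperatures, and thresholds would hang off the same structure. They are left out here because they are the parts that need per-domain calibration.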

Pseudocode: The Whole Thing in 30 Lines

The council fits in a few dozen lines of any agent SDK. The pseudocode below shows the pipeline gate, the n-sampling, and the median/spread computation explicitly:

import asyncio
import statistics


async def llm_call(prompt, item, temperature, persona_outputs=None):
    # Shim: replace with your client. Persona calls are expected to return
    # an object with a numeric .score attribute plus structured findings.
    raise NotImplementedError("wire this to your LLM client")


async def run_persona(persona, item, n=3):
    # Fan out n samples of the same persona against the same input.
    samples = await asyncio.gather(*[
        llm_call(persona.prompt, item, persona.temperature)
        for _ in range(n)
    ])
    scores = [s.score for s in samples]
    return {
        "persona_id": persona.id,
        "median": statistics.median(scores),  # middle value for n=3
        "spread": max(scores) - min(scores),  # range across the samples
        "samples": samples,                   # full structured outputs
    }


async def run_council(item, personas, judge):
    # return_exceptions=True keeps one failed persona from cancelling the rest.
    results = await asyncio.gather(*[
        run_persona(p, item) for p in personas
    ], return_exceptions=True)

    # PIPELINE-INTEGRITY GATE: judge runs only on full input
    successful = [r for r in results if not isinstance(r, BaseException)]
    if len(successful) < len(personas):
        return {"status": "incomplete",
                "missing": len(personas) - len(successful)}

    # Judge sees the per-persona summary (median, spread) plus the structured
    # findings. Passing only summaries (not raw samples) reduces context-window
    # pressure but loses some context: a real implementation tradeoff.
    # Production: multi-sample the judge too (n=3) and gate on cross-run
    # type-agreement; omitted here to keep the skeleton readable.
    verdict = await llm_call(judge.prompt, item, judge.temperature,
                             persona_outputs=successful)
    return {"status": "complete", "verdict": verdict}

This isn't production code. The shape is exact: fan-out, n-sampling, median + spread, integrity gate, judge invocation. Production hardening (retries, persona-order shuffling for Lost-in-the-Middle mitigation, telemetry, schema validation) is missing. Drop the skeleton into any agent SDK and replace the llm_call shim with your client.

What This Post Is Not

This is the architecture. The artifacts that turn the architecture into a working production system are out of scope, intentionally:

The actual persona system prompts (the words that condition each lens). They come out of empirical iteration: short, opinionated prompts that produce specific structured outputs. The shape is in this post. The words aren't.
The data-source-to-persona mapping at field and feed level. Which data feeds each persona consumes, which filters, which time windows.
Conviction threshold values and the score-to-action mapping that converts the judge's output into downstream behavior.

Those are the consulting deliverables. The architecture above gets you to a working scaffold. The production-quality calibration is the work.

Conclusion

The council pattern manufactures balance at the system level. It composes deliberately unbalanced prompts so the workflow's shape exposes their disagreement. The judge synthesizes that disagreement into a categorical verdict instead of averaging it away. Owned inference capacity makes the cost structure work. Structured disagreement makes the verdicts decisive.

If you have a decision pipeline where a single prompt drifts, anchors, or hedges into uselessness (content moderation, contracts, M&A targets, fraud cases, asset bets, anything categorical under ambiguity), the council pattern is probably worth the call. Tell us what you're trying to evaluate and where the single-prompt approach is failing. We can talk about what the lens set should look like and what calibration takes.
