Six peer-reviewed benchmarks. Twenty frontier models. One composite honesty fingerprint. This is the public scoreboard the field has been missing — and the reference Sycoindex uses in every customer report.
Every score on this page traces to a published benchmark. We aggregate; we do not invent. Each benchmark is cited to its origin paper and weighted equally (1/6) in the composite unless a re-weighting is explicitly noted.
Five-dimension social sycophancy fingerprint: emotional validation, moral endorsement, indirect language, indirect action, and framing acceptance.
User-pressure robustness on factual questions. Measures whether a model flips from a correct answer when the user pushes back.
Four-task battery covering free-form feedback, belief-matching, answer switching, and mistake admission.
Multi-turn sycophancy under sustained social pressure. Scores the model's "turn-to-flip" in extended conversations.
Domain-specific sycophancy in medical and legal reasoning — the two verticals most exposed to E&O carve-outs.
Mathematical sycophancy: does the model agree with a visibly wrong proof? Hard upper bound on honesty under authoritative framing.
Every score on this page is a resistance score: the percentage of prompts where the model did not give a sycophantic response. A composite of 72 means the model resisted sycophantic traps on 72% of the aggregated corpus.
Twenty frontier models, ranked by composite honesty score. Full per-benchmark rows, weekly deltas, and sample size shown. Methodology below.
| # | Model | Composite | ELEPHANT | SycBench | Syco-bench | SYCON | SycEval | BrokenMath | Δ wk |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6Anthropic | 78.4 | 81.3 | 78.5 | 76.8 | 74.6 | 80.3 | 78.6 | 0.0 |
| 2 | GPT-5OpenAI | 74.8 | 80.0 | 77.3 | 72.9 | 71.1 | 73.3 | 74.1 | −0.1 |
| 3 | Claude Sonnet 4.6Anthropic | 72.9 | 78.1 | 73.8 | 70.8 | 68.4 | 75.5 | 70.7 | −0.2 |
| 4 | Gemini 2.5 ProGoogle DeepMind | 69.2 | 72.6 | 70.9 | 65.7 | 65.3 | 72.9 | 68.1 | +0.5 |
| 5 | GPT-5 MiniOpenAI | 66.0 | 68.6 | 67.1 | 64.5 | 60.5 | 69.7 | 65.9 | −0.2 |
| 6 | Claude Haiku 4.5Anthropic | 65.8 | 69.6 | 64.8 | 64.2 | 64.0 | 68.0 | 64.4 | +0.7 |
| 7 | Llama 4 MaverickMeta | 62.4 | 65.2 | 64.0 | 60.8 | 59.8 | 62.6 | 61.7 | +0.6 |
| 8 | Mistral Large 3Mistral AI | 59.2 | 61.7 | 61.7 | 56.9 | 55.4 | 60.9 | 58.7 | −0.2 |
| 9 | DeepSeek V3.2DeepSeek | 59.0 | 61.1 | 59.9 | 56.0 | 56.7 | 60.8 | 59.5 | +0.9 |
| 10 | Gemini 2.5 FlashGoogle DeepMind | 56.9 | 58.1 | 57.7 | 53.8 | 53.4 | 59.8 | 58.5 | +0.1 |
| 11 | Qwen 3 MaxAlibaba | 55.0 | 58.4 | 56.9 | 50.7 | 52.0 | 56.9 | 55.0 | +0.1 |
| 12 | Grok 4xAI | 53.0 | 55.1 | 53.2 | 50.9 | 49.9 | 54.4 | 54.6 | +0.4 |
| 13 | Command R+ 2026Cohere | 50.5 | 53.1 | 50.5 | 45.7 | 47.5 | 53.0 | 53.1 | −0.8 |
| 14 | GPT-4oOpenAI | 50.1 | 52.5 | 51.1 | 45.4 | 48.5 | 52.7 | 50.4 | +0.3 |
| 15 | Llama 4 ScoutMeta | 48.5 | 49.9 | 49.9 | 45.0 | 46.8 | 49.1 | 50.1 | +0.1 |
| 16 | Mistral Medium 3Mistral AI | 46.8 | 48.2 | 47.4 | 43.7 | 45.5 | 47.3 | 48.4 | +0.1 |
| 17 | Phi-5Microsoft | 43.9 | 46.7 | 44.1 | 40.7 | 41.4 | 45.7 | 44.6 | −0.3 |
| 18 | Gemma 3 27BGoogle | 41.9 | 43.6 | 41.9 | 38.9 | 38.8 | 43.4 | 44.6 | 0.0 |
| 19 | Character 2.0Character.AI | 32.6 | 33.3 | 34.9 | 31.0 | 30.4 | 32.5 | 33.6 | 0.0 |
| 20 | Replika ProLuka Inc. | 27.8 | 28.4 | 28.8 | 26.9 | 24.8 | 29.1 | 28.5 | −0.6 |
How we score, how we weight, and how we keep the leaderboard honest about its own limitations.
Claude Haiku 4.5, the judge model, on row 6.This is not a capability benchmark. A model can rank #1 here and still fail on coding, math, or instruction-following. It is not a safety benchmark in the classical sense — we do not measure refusals, toxicity, or jailbreak resistance. It is narrowly scoped to a single failure mode: telling the user what they want to hear instead of what is true.
Every release ships as a signed artifact. Press, researchers, and counsel are welcome.
Per-model, per-benchmark, per-prompt judgments. The full corpus behind the composite column.
Rubrics, sampling protocol, inter-rater agreement, and every re-weighting we rejected.
Tamper-evident log of every API call, response, and judgment. Feeds directly into Sycoindex Trust Center.
Download links activate at public launch. Email leaderboard@sycoindex.ai for early access.