Updated weekly · Public · Free

The Sycophancy Leaderboard

Six published benchmarks. Twenty frontier models. One composite honesty fingerprint. This is the public scoreboard the field has been missing, and the reference Sycoindex uses in every customer report.

Last updated: Monday, April 6, 2026 · Next refresh: Monday, April 13, 2026 · Release: v2026.04-w14 · Seed corpus: 3,214 prompts

The six benchmarks

Every score on this page traces to a published benchmark. We aggregate; we do not invent. Each benchmark is cited to its origin paper and weighted equally (1/6) in the composite unless a re-weighting is explicitly noted.

ELEPHANT

Cheng et al., Stanford — arXiv:2505.13995

Five-dimension social sycophancy fingerprint: emotional validation, moral endorsement, indirect language, indirect action, and framing acceptance.

SycBench

Rrv et al., 2024 — arXiv:2402.13950

User-pressure robustness on factual questions. Measures whether a model flips from a correct answer when the user pushes back.

Syco-bench

Sharma et al. — Anthropic research

Four-task battery covering free-form feedback, belief-matching, answer switching, and mistake admission.

SYCON-Bench

Kim et al., 2024 — arXiv:2410.09647

Multi-turn sycophancy under sustained social pressure. Scores the model's "turn-to-flip" in extended conversations.

SycEval

Fanous et al., 2025 — arXiv:2502.08177

Domain-specific sycophancy in medical and legal reasoning — the two verticals most exposed to E&O carve-outs.

BrokenMath

Petrov et al., 2025 — arXiv:2506.08270

Mathematical sycophancy: does the model agree with a visibly wrong proof? Hard upper bound on honesty under authoritative framing.

Higher is better

Every score on this page is a resistance score: the percentage of prompts where the model did not give a sycophantic response. A composite of 72 means the model resisted sycophantic traps on 72% of the aggregated corpus.
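For readers who want to reproduce the arithmetic, the sketch below shows how a resistance score and the equal-weight composite are computed. The function names and example numbers are ours, not taken from any release artifact; only the six benchmark names and the 1/6 weighting come from this page.

```python
# Minimal sketch of the scoring arithmetic (illustrative values only).
BENCHMARKS = ["ELEPHANT", "SycBench", "Syco-bench", "SYCON-Bench", "SycEval", "BrokenMath"]

def resistance_score(judgments: list[bool]) -> float:
    """Percentage of prompts where the model did NOT respond sycophantically.

    `judgments` holds one boolean per prompt: True means the response was judged sycophantic.
    """
    resisted = sum(1 for sycophantic in judgments if not sycophantic)
    return 100.0 * resisted / len(judgments)

def composite(per_benchmark: dict[str, float]) -> float:
    """Equal-weight (1/6) mean across all six per-benchmark resistance scores."""
    return sum(per_benchmark[b] for b in BENCHMARKS) / len(BENCHMARKS)

# A model that resists sycophantic traps on 72% of every benchmark scores a composite of 72.0.
example = {b: 72.0 for b in BENCHMARKS}
print(composite(example))  # 72.0
```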

Week 14 · 2026-04-06

Twenty frontier models, ranked by composite honesty score, with full per-benchmark scores and week-over-week deltas. Per-benchmark sample sizes and methodology below.

| # | Model | Developer | Composite | ELEPHANT | SycBench | Syco-bench | SYCON | SycEval | BrokenMath | Δ wk |
|---|-------|-----------|-----------|----------|----------|------------|-------|---------|------------|------|
| 1 | Claude Opus 4.6 | Anthropic | 78.4 | 81.3 | 78.5 | 76.8 | 74.6 | 80.3 | 78.6 | 0.0 |
| 2 | GPT-5 | OpenAI | 74.8 | 80.0 | 77.3 | 72.9 | 71.1 | 73.3 | 74.1 | −0.1 |
| 3 | Claude Sonnet 4.6 | Anthropic | 72.9 | 78.1 | 73.8 | 70.8 | 68.4 | 75.5 | 70.7 | −0.2 |
| 4 | Gemini 2.5 Pro | Google DeepMind | 69.2 | 72.6 | 70.9 | 65.7 | 65.3 | 72.9 | 68.1 | +0.5 |
| 5 | GPT-5 Mini | OpenAI | 66.0 | 68.6 | 67.1 | 64.5 | 60.5 | 69.7 | 65.9 | −0.2 |
| 6 | Claude Haiku 4.5 | Anthropic | 65.8 | 69.6 | 64.8 | 64.2 | 64.0 | 68.0 | 64.4 | +0.7 |
| 7 | Llama 4 Maverick | Meta | 62.4 | 65.2 | 64.0 | 60.8 | 59.8 | 62.6 | 61.7 | +0.6 |
| 8 | Mistral Large 3 | Mistral AI | 59.2 | 61.7 | 61.7 | 56.9 | 55.4 | 60.9 | 58.7 | −0.2 |
| 9 | DeepSeek V3.2 | DeepSeek | 59.0 | 61.1 | 59.9 | 56.0 | 56.7 | 60.8 | 59.5 | +0.9 |
| 10 | Gemini 2.5 Flash | Google DeepMind | 56.9 | 58.1 | 57.7 | 53.8 | 53.4 | 59.8 | 58.5 | +0.1 |
| 11 | Qwen 3 Max | Alibaba | 55.0 | 58.4 | 56.9 | 50.7 | 52.0 | 56.9 | 55.0 | +0.1 |
| 12 | Grok 4 | xAI | 53.0 | 55.1 | 53.2 | 50.9 | 49.9 | 54.4 | 54.6 | +0.4 |
| 13 | Command R+ 2026 | Cohere | 50.5 | 53.1 | 50.5 | 45.7 | 47.5 | 53.0 | 53.1 | −0.8 |
| 14 | GPT-4o | OpenAI | 50.1 | 52.5 | 51.1 | 45.4 | 48.5 | 52.7 | 50.4 | +0.3 |
| 15 | Llama 4 Scout | Meta | 48.5 | 49.9 | 49.9 | 45.0 | 46.8 | 49.1 | 50.1 | +0.1 |
| 16 | Mistral Medium 3 | Mistral AI | 46.8 | 48.2 | 47.4 | 43.7 | 45.5 | 47.3 | 48.4 | +0.1 |
| 17 | Phi-5 | Microsoft | 43.9 | 46.7 | 44.1 | 40.7 | 41.4 | 45.7 | 44.6 | −0.3 |
| 18 | Gemma 3 27B | Google | 41.9 | 43.6 | 41.9 | 38.9 | 38.8 | 43.4 | 44.6 | 0.0 |
| 19 | Character 2.0 | Character.AI | 32.6 | 33.3 | 34.9 | 31.0 | 30.4 | 32.5 | 33.6 | 0.0 |
| 20 | Replika Pro | Luka Inc. | 27.8 | 28.4 | 28.8 | 26.9 | 24.8 | 29.1 | 28.5 | −0.6 |
Composite — equal-weight mean across all six benchmarks
Δ wk — week-over-week change in composite
n — per-model sample size (≥ 150 prompts per benchmark)

Methodology

How we score, how we weight, and how we keep the leaderboard honest about its own limitations.

  1. Corpus. We draw a stratified sample of 150 prompts per benchmark, per model, per week. Prompts are sampled from the benchmark's own release artifact where available; otherwise regenerated from the published protocol.
  2. Execution. Each model is queried through its public production API with default system prompt, temperature 0.7, and top-p 1.0. We do not use jailbreaks, developer modes, or custom instructions.
  3. Judging. Binary sycophancy labels are assigned by Claude Haiku 4.5 using benchmark-specific rubrics. A 10% sample is hand-verified by two human raters; inter-rater Cohen's κ is published with every release (see the agreement sketch after this list).
  4. Composite. Equal-weight mean across all six benchmarks. We considered variance-weighted and inverse-corpus-size weighting and publish both as supplementary columns in the raw CSV export.
  5. Reproducibility. Every weekly run produces a SHA-256 hash-chained audit log. The log, the prompt corpus, and the per-response judgments are all downloadable from the release's GitHub tag (see the chaining sketch after this list).
  6. Conflict disclosure. Sycoindex is operated by an independent team. We do not accept payment from any vendor whose model appears on this leaderboard. We also score ourselves — see Claude Haiku 4.5, the judge model, on row 6.
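For step 3, a minimal sketch of how Cohen's κ could be computed over the 10% hand-verified sample, assuming two raters and binary labels. The function name, the degenerate-case guard, and the example labels are ours; the actual agreement pipeline is documented in the methodology PDF.

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary sycophancy labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n   # share of prompts rater A labeled sycophantic
    p_b = sum(rater_b) / n   # share of prompts rater B labeled sycophantic
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:      # degenerate case: both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Example: the raters agree on 9 of 10 hand-verified prompts.
a = [True, True, False, False, False, True, False, False, True, False]
b = [True, True, False, False, False, True, False, True, True, False]
print(cohens_kappa(a, b))  # 0.8
```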
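For step 5, a minimal sketch of what a SHA-256 hash-chained JSONL audit log looks like in practice. The record fields, file layout, and function names here are assumptions for illustration; the real schema ships with each release.

```python
import hashlib
import json

def append_record(log_path: str, record: dict, prev_hash: str) -> str:
    """Append one API-call/judgment record to a JSONL audit log.

    Each line stores the SHA-256 of the previous line, so editing any earlier
    entry breaks every hash that follows it.
    """
    record = {**record, "prev_hash": prev_hash}
    line = json.dumps(record, sort_keys=True)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return hashlib.sha256(line.encode("utf-8")).hexdigest()  # pass to the next append

def verify_chain(log_path: str) -> bool:
    """Re-walk the log and confirm every stored prev_hash matches the line before it."""
    prev_hash = ""
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("prev_hash", "") != prev_hash:
                return False
            prev_hash = hashlib.sha256(line.rstrip("\n").encode("utf-8")).hexdigest()
    return True

# Usage sketch:
#   h = ""
#   for rec in weekly_records:          # hypothetical iterable of call/judgment dicts
#       h = append_record("audit.jsonl", rec, h)
#   assert verify_chain("audit.jsonl")  # fails if any earlier entry was altered
```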

What this leaderboard is not

This is not a capability benchmark. A model can rank #1 here and still fail on coding, math, or instruction-following. It is not a safety benchmark in the classical sense — we do not measure refusals, toxicity, or jailbreak resistance. It is narrowly scoped to a single failure mode: telling the user what they want to hear instead of what is true.

Download

Every release ships as a signed artifact. Press, researchers, and counsel are welcome.

Raw CSV · v2026.04-w14

58 KB · SHA-256 published in release

Per-model, per-benchmark, per-prompt judgments. The full corpus behind the composite column.

Methodology PDF

12 pages · versioned

Rubrics, sampling protocol, inter-rater agreement, and every re-weighting we rejected.

Weekly audit log

JSONL · hash-chained

Tamper-evident log of every API call, response, and judgment. Feeds directly into Sycoindex Trust Center.

Download links activate at public launch. Email leaderboard@sycoindex.ai for early access.