Updated weekly · Public · Free

The Sycophancy Leaderboard

Six published benchmarks. Twenty frontier models. One composite honesty fingerprint. This is the public scoreboard the field has been missing, and the reference Sycoindex uses in every customer report.

Last updated: Monday, April 6, 2026 · Next refresh: Monday, April 13, 2026 · Release: v2026.04-w14 · Seed corpus: 3,214 prompts

The six benchmarks

Every score on this page traces to a published benchmark. We aggregate; we do not invent. Each benchmark is cited to its origin paper and weighted equally (1/6) in the composite unless a re-weighting is explicitly noted.

ELEPHANT

Cheng et al., Stanford — arXiv:2505.13995

Five-dimension social sycophancy fingerprint: emotional validation, moral endorsement, indirect language, indirect action, and framing acceptance.

SycBench

Rrv et al., 2024 — arXiv:2402.13950

User-pressure robustness on factual questions. Measures whether a model flips from a correct answer when the user pushes back.

Syco-bench

Sharma et al. — Anthropic research

Four-task battery covering free-form feedback, belief-matching, answer switching, and mistake admission.

SYCON-Bench

Kim et al., 2024 — arXiv:2410.09647

Multi-turn sycophancy under sustained social pressure. Scores the model's "turn-to-flip" in extended conversations.

SycEval

Fanous et al., 2025 — arXiv:2502.08177

Domain-specific sycophancy in medical and legal reasoning — the two verticals most exposed to E&O carve-outs.

BrokenMath

Petrov et al., 2025 — arXiv:2506.08270

Mathematical sycophancy: does the model agree with a visibly wrong proof? Hard upper bound on honesty under authoritative framing.

Higher is better

Every score on this page is a resistance score: the percentage of prompts where the model did not give a sycophantic response. A composite of 72 means the model resisted sycophantic traps on 72% of the aggregated corpus.
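For readers who want to reproduce the arithmetic, the sketch below shows how a resistance score and the equal-weight composite are computed. The function names and example numbers are ours, not taken from any release artifact; only the six benchmark names and the 1/6 weighting come from this page.

```python
# Minimal sketch of the scoring arithmetic (illustrative values only).
BENCHMARKS = ["ELEPHANT", "SycBench", "Syco-bench", "SYCON-Bench", "SycEval", "BrokenMath"]

def resistance_score(judgments: list[bool]) -> float:
    """Percentage of prompts where the model did NOT respond sycophantically.

    `judgments` holds one boolean per prompt: True means the response was judged sycophantic.
    """
    resisted = sum(1 for sycophantic in judgments if not sycophantic)
    return 100.0 * resisted / len(judgments)

def composite(per_benchmark: dict[str, float]) -> float:
    """Equal-weight (1/6) mean across all six per-benchmark resistance scores."""
    return sum(per_benchmark[b] for b in BENCHMARKS) / len(BENCHMARKS)

# A model that resists sycophantic traps on 72% of every benchmark scores a composite of 72.0.
example = {b: 72.0 for b in BENCHMARKS}
print(composite(example))  # 72.0
```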

Week 14 · 2026-04-06

Twenty frontier models, ranked by composite honesty score, with full per-benchmark scores and week-over-week deltas. Per-benchmark sample sizes and methodology below.

| # | Model | Developer | Composite | ELEPHANT | SycBench | Syco-bench | SYCON | SycEval | BrokenMath | Δ wk |
|---|-------|-----------|-----------|----------|----------|------------|-------|---------|------------|------|
| 1 | Claude Opus 4.6 | Anthropic | 78.4 | 81.3 | 78.5 | 76.8 | 74.6 | 80.3 | 78.6 | 0.0 |
| 2 | GPT-5 | OpenAI | 74.8 | 80.0 | 77.3 | 72.9 | 71.1 | 73.3 | 74.1 | −0.1 |
| 3 | Claude Sonnet 4.6 | Anthropic | 72.9 | 78.1 | 73.8 | 70.8 | 68.4 | 75.5 | 70.7 | −0.2 |
| 4 | Gemini 2.5 Pro | Google DeepMind | 69.2 | 72.6 | 70.9 | 65.7 | 65.3 | 72.9 | 68.1 | +0.5 |
| 5 | GPT-5 Mini | OpenAI | 66.0 | 68.6 | 67.1 | 64.5 | 60.5 | 69.7 | 65.9 | −0.2 |
| 6 | Claude Haiku 4.5 | Anthropic | 65.8 | 69.6 | 64.8 | 64.2 | 64.0 | 68.0 | 64.4 | +0.7 |
| 7 | Llama 4 Maverick | Meta | 62.4 | 65.2 | 64.0 | 60.8 | 59.8 | 62.6 | 61.7 | +0.6 |
| 8 | Mistral Large 3 | Mistral AI | 59.2 | 61.7 | 61.7 | 56.9 | 55.4 | 60.9 | 58.7 | −0.2 |
| 9 | DeepSeek V3.2 | DeepSeek | 59.0 | 61.1 | 59.9 | 56.0 | 56.7 | 60.8 | 59.5 | +0.9 |
| 10 | Gemini 2.5 Flash | Google DeepMind | 56.9 | 58.1 | 57.7 | 53.8 | 53.4 | 59.8 | 58.5 | +0.1 |
| 11 | Qwen 3 Max | Alibaba | 55.0 | 58.4 | 56.9 | 50.7 | 52.0 | 56.9 | 55.0 | +0.1 |
| 12 | Grok 4 | xAI | 53.0 | 55.1 | 53.2 | 50.9 | 49.9 | 54.4 | 54.6 | +0.4 |
| 13 | Command R+ 2026 | Cohere | 50.5 | 53.1 | 50.5 | 45.7 | 47.5 | 53.0 | 53.1 | −0.8 |
| 14 | GPT-4o | OpenAI | 50.1 | 52.5 | 51.1 | 45.4 | 48.5 | 52.7 | 50.4 | +0.3 |
| 15 | Llama 4 Scout | Meta | 48.5 | 49.9 | 49.9 | 45.0 | 46.8 | 49.1 | 50.1 | +0.1 |
| 16 | Mistral Medium 3 | Mistral AI | 46.8 | 48.2 | 47.4 | 43.7 | 45.5 | 47.3 | 48.4 | +0.1 |
| 17 | Phi-5 | Microsoft | 43.9 | 46.7 | 44.1 | 40.7 | 41.4 | 45.7 | 44.6 | −0.3 |
| 18 | Gemma 3 27B | Google | 41.9 | 43.6 | 41.9 | 38.9 | 38.8 | 43.4 | 44.6 | 0.0 |
| 19 | Character 2.0 | Character.AI | 32.6 | 33.3 | 34.9 | 31.0 | 30.4 | 32.5 | 33.6 | 0.0 |
| 20 | Replika Pro | Luka Inc. | 27.8 | 28.4 | 28.8 | 26.9 | 24.8 | 29.1 | 28.5 | −0.6 |
Composite — equal-weight mean across all six benchmarks
Δ wk — week-over-week change in composite
n — per-model sample size (≥ 150 prompts per benchmark)

Methodology

How we score, how we weight, and how we keep the leaderboard honest about its own limitations.

  1. Corpus. We draw a stratified sample of 150 prompts per benchmark, per model, per week. Prompts are sampled from the benchmark's own release artifact where available; otherwise regenerated from the published protocol.
  2. Execution. Each model is queried through its public production API with default system prompt, temperature 0.7, and top-p 1.0. We do not use jailbreaks, developer modes, or custom instructions.
  3. Judging. Binary sycophancy labels are assigned by Claude Haiku 4.5 using benchmark-specific rubrics. A 10% sample is hand-verified by two human raters; inter-rater Cohen's κ is published with every release (see the agreement sketch after this list).
  4. Composite. Equal-weight mean across all six benchmarks. We considered variance-weighted and inverse-corpus-size weighting and publish both as supplementary columns in the raw CSV export.
  5. Reproducibility. Every weekly run produces a SHA-256 hash-chained audit log. The log, the prompt corpus, and the per-response judgments are all downloadable from the release's GitHub tag (see the chaining sketch after this list).
  6. Conflict disclosure. Sycoindex is operated by an independent team. We do not accept payment from any vendor whose model appears on this leaderboard. We also score ourselves — see Claude Haiku 4.5, the judge model, on row 6.
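For step 3, a minimal sketch of how Cohen's κ could be computed over the 10% hand-verified sample, assuming two raters and binary labels. The function name, the degenerate-case guard, and the example labels are ours; the actual agreement pipeline is documented in the methodology PDF.

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary sycophancy labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n   # share of prompts rater A labeled sycophantic
    p_b = sum(rater_b) / n   # share of prompts rater B labeled sycophantic
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:      # degenerate case: both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Example: the raters agree on 9 of 10 hand-verified prompts.
a = [True, True, False, False, False, True, False, False, True, False]
b = [True, True, False, False, False, True, False, True, True, False]
print(cohens_kappa(a, b))  # 0.8
```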
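For step 5, a minimal sketch of what a SHA-256 hash-chained JSONL audit log looks like in practice. The record fields, file layout, and function names here are assumptions for illustration; the real schema ships with each release.

```python
import hashlib
import json

def append_record(log_path: str, record: dict, prev_hash: str) -> str:
    """Append one API-call/judgment record to a JSONL audit log.

    Each line stores the SHA-256 of the previous line, so editing any earlier
    entry breaks every hash that follows it.
    """
    record = {**record, "prev_hash": prev_hash}
    line = json.dumps(record, sort_keys=True)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return hashlib.sha256(line.encode("utf-8")).hexdigest()  # pass to the next append

def verify_chain(log_path: str) -> bool:
    """Re-walk the log and confirm every stored prev_hash matches the line before it."""
    prev_hash = ""
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("prev_hash", "") != prev_hash:
                return False
            prev_hash = hashlib.sha256(line.rstrip("\n").encode("utf-8")).hexdigest()
    return True

# Usage sketch:
#   h = ""
#   for rec in weekly_records:          # hypothetical iterable of call/judgment dicts
#       h = append_record("audit.jsonl", rec, h)
#   assert verify_chain("audit.jsonl")  # fails if any earlier entry was altered
```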

What this leaderboard is not

This is not a capability benchmark. A model can rank #1 here and still fail on coding, math, or instruction-following. It is not a safety benchmark in the classical sense — we do not measure refusals, toxicity, or jailbreak resistance. It is narrowly scoped to a single failure mode: telling the user what they want to hear instead of what is true.

Download

Every release ships as a signed artifact. Press, researchers, and counsel are welcome.

Raw CSV · v2026.04-w14

58 KB · SHA-256 published in release

Per-model, per-benchmark, per-prompt judgments. The full corpus behind the composite column.

Methodology PDF

12 pages · versioned

Rubrics, sampling protocol, inter-rater agreement, and every re-weighting we rejected.

Weekly audit log

JSONL · hash-chained

Tamper-evident log of every API call, response, and judgment. Feeds directly into Sycoindex Trust Center.

Download links activate at public launch. Email leaderboard@sycoindex.ai for early access.