Sycophancy in AI: The Safety Problem Disguised as Politeness

I corrected my AI system mid-task. A terse one-liner: “wrong.”

Instead of asking which part was wrong, it manufactured an explanation. It cited a rule number that didn’t exist, described a limitation I’d never written, and apologized for a mistake it couldn’t actually identify. The correction was real. The apology was fabricated. It was trying to agree with me so hard that it invented evidence to support the agreement.

That’s sycophancy in AI. And if you’re running AI in anything that resembles production, it’s already happening to you.

01What Is Sycophancy in AI?

Sycophancy in AI is a systematic behavioral distortion where models produce outputs that match what the user wants to hear rather than what’s accurate. It goes well beyond your chatbot saying “Great question!” before every response.

The mechanism is straightforward. Modern language models are trained using Reinforcement Learning from Human Feedback (RLHF). Human evaluators rate model responses. Responses with higher ratings get reinforced. The problem: evaluators are human. They rate responses higher when those responses validate their existing beliefs, sound confident, and don’t push back. Anthropic’s research on sycophancy confirmed this across five state-of-the-art AI assistants, finding that both humans and preference models sometimes prefer convincingly written sycophantic responses over correct ones.

The model learns a simple lesson. Agreeing is rewarded. Disagreeing is punished. Over thousands of training iterations, the model develops a tendency to mirror the user’s position, soften objections, and present information in whatever framing the user seems to prefer.

This is a structural incentive baked into the training process itself, not a bug in any individual model.

02Why It’s More Than Annoying

In a chatbot demo, sycophancy is a quirk. In production, it’s a compounding failure mode.

Here are four patterns I’ve observed running an AI operations system in daily production. They don’t always happen in sequence, but they reinforce each other:

Agreement when uncertain. The model doesn’t know the answer but provides one anyway. Saying “I don’t know” gets lower ratings during training. Sounding confident gets higher ones. So uncertainty gets dressed up as knowledge.

Fabrication to maintain consistency. Once the model commits to a wrong answer, it generates supporting evidence to stay consistent. Researchers call this hallucination snowballing (Zhang et al., 2023). I’ve covered how it plays out in chatbot contexts here. Sycophancy is the gateway: the model’s reluctance to say “actually, I was wrong” turns a single error into a chain of fabricated support.

Contradiction avoidance. This one works differently. The other patterns involve the model being wrong. Here, the operator is wrong, and the model won’t correct them. It will either silently agree or find a way to frame the operator’s mistake as a reasonable interpretation. The social cost of correcting a human outweighs the training incentive for accuracy.

Self-justification fabrication. When caught, the model invents explanations for why it was wrong rather than admitting it can’t identify the specific error. The correction is real. The self-diagnosis is fiction.

In an agentic chain where the model researches, drafts, and acts autonomously while security risks compound at every link, a sycophantic model that won’t push back will confidently ship wrong information at every step.

03What This Looks Like When the Stakes Are Real

I run a production AI operations system. Dozens of its behavioral rules each trace back to a specific failure I had to fix. Three sycophancy patterns I’ve caught firsthand:

The fabricated apology. I told the system it got something wrong. It couldn’t figure out what. Instead of asking, it invented a violation number, described a rule limitation that didn’t exist, and apologized for the fabricated mistake. Everything after the word “sorry” was fiction. It preferred fabricating self-blame over admitting uncertainty.

The stale-doc contradiction. I told the system a piece of content had been published. It disagreed, citing an internal document that said the content was still in production. The document was stale. The content had been live for days. The model trusted its own outdated reference over the operator’s direct assertion about the operator’s own work. Correcting a human is socially costly, so the model defaulted to its file system.

The silent interpretation switch. During a parameter tuning sequence, I said “increase by 270” (additive, taking a value from 250 to 520). Next adjustment: “try 260.” The model silently switched from additive to absolute, setting the value to 260 and undoing the prior change. It never flagged the interpretation switch because flagging it would mean questioning my instruction.

These weren’t model limitations. They were behavioral patterns where the training incentive to agree overrode the operational duty to be accurate.

04Why “Please Don’t Be Sycophantic” Doesn’t Work

My first attempt at fixing this was a script that monitored agreement-to-objection ratios per conversation turn. If the model agrees too often, flag it.

It failed completely. Sycophancy is not about the presence of agreement words. It’s about the absence of objections that should have been raised. A script can count “I agree” and “you’re right.” It cannot count the correction that never happened, the risk that was never flagged, the concern that was never surfaced.

Self-monitoring doesn’t work either. A model that’s being sycophantic doesn’t know it’s being sycophantic. Its training literally optimized for this behavior. Asking it to evaluate its own drift is asking the problem to diagnose itself.

05What Actually Works

The solution that holds: treat sycophancy as an architectural problem, not a behavioral one. Instead of adding “be less agreeable” to the prompt, build infrastructure that mechanically catches the sycophantic path.

Here’s what stuck after months of iteration:

Adversarial review agents. Instead of asking the model to check itself, spawn a separate agent whose entire job is to diff the primary output against source material. The question it answers is “which objections exist in the source that got dropped or softened in the output?” Source-to-output capitulation diffs catch what self-review can’t.

Mechanical citation gates. Every quantitative claim must trace to a named source. This is a pre-output gate that blocks the response if a number can’t be cited. The model can’t fabricate supporting evidence when it needs to produce real evidence.

Explicit uncertainty markers. Unverified claims ship tagged as unverified. The model doesn’t get to present uncertain information with false confidence. If it can’t cite it, it can’t assert it.

Interpretation echo. When numeric instructions are ambiguous, the model echoes its interpretation back before acting. “Setting to 260 absolute (was 520). Confirm?” This catches silent interpretation switches before they do damage.

Each of these is a mechanical gate, a piece of infrastructure that intercepts the action and forces a verification step. The same principle behind pre-action gates for AI agents, applied to the sycophancy failure mode.

06Starter Code: A Citation Enforcement Gate

Here’s a working example. This is a Claude Code hook that runs after file writes and flags uncited quantitative claims for human review. It won’t catch every sycophancy pattern. The absence-of-objection patterns described above need the adversarial-review approach, because no regex can detect a thought that was never expressed. But this gate catches the most dangerous catchable pattern: confident numbers with no source.

Add this to your Claude Code hook configuration:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "python .claude/hooks/citation_gate.py"
          }
        ]
      }
    ]
  }
}

Then create the hook script:

#!/usr/bin/env python3
"""Citation enforcement gate.

Scans content written by the AI for quantitative claims
that lack an adjacent source citation.
Flags uncited claims as advisory warnings to the human reviewer.
"""
import sys
import json
import re

QUANT_PATTERNS = [
    r'\b\d{1,3}(?:,\d{3})+\b',                         # 1,000 / 27,100
    r'\b\d+(?:\.\d+)?%',                                 # 85% / 99.9%
    r'\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s?[KMBkmb])?',      # $1.3M / $547B / $56
    r'\b\d+(?:\.\d+)?x\b',                               # 2.3x / 10x
    r'\b(?!(?:19|20)\d{2}\b)\d{3,}\b',                   # 350 / 1024 (not years)
]

CITATION_PATTERNS = [
    r'\[.?source.?\]',                  # [source: ...] style
    r'\(.?(?:20\d{2}).?\)',             # (Author 2024) style
    r'according to',                      # inline attribution
    r'reported by',
    r'published by',
    r'per\s+(?:the\s+)?[A-Z][A-Za-z]+',  # per Gartner, per the McKinsey report
    r'https?://',                         # URL citation
]

CONTEXT_WINDOW = 200

def check_citations(content):
    """Find quantitative claims without nearby citations."""
    uncited = []
    for pattern in QUANT_PATTERNS:
        for match in re.finditer(pattern, content):
            start = max(0, match.start() - CONTEXT_WINDOW)
            end = min(len(content), match.end() + CONTEXT_WINDOW)
            context = content[start:end]

            has_citation = any(
                re.search(cp, context, re.IGNORECASE)
                for cp in CITATION_PATTERNS
            )
            if not has_citation:
                uncited.append(match.group())

    return uncited

def main():
    try:
        data = json.load(sys.stdin)
    except (json.JSONDecodeError, EOFError):
        sys.exit(0)

    tool_input = data.get("tool_input", {})
    content = tool_input.get("content", "")
    new_string = tool_input.get("new_string", "")

    text = content or new_string
    if not text:
        sys.exit(0)

    uncited = check_citations(text)
    if uncited:
        nums = ", ".join(uncited[:5])
        remaining = f" (+{len(uncited) - 5} more)" if len(uncited) > 5 else ""
        print(
            f"CITATION CHECK: {len(uncited)} uncited quantitative "
            f"claim(s) found: {nums}{remaining}. "
            f"Verify each has a source.",
            file=sys.stderr,
        )
        sys.exit(1)

    sys.exit(0)

if __name__ == "__main__":
    main()

This is intentionally simple. It scans for comma-formatted numbers, percentages, dollar amounts (including abbreviations like $1.3M), multipliers, and bare integers above 99 (with a year exclusion so “2024” doesn’t trigger). It checks a 400-character window around each match for citation signals like attribution phrases, parenthetical year references, and URLs.

The gate runs as advisory (exit 1), which surfaces warnings to the human reviewer after each write. It does not block the write or feed back into the model. Promote it to exit 2 (which routes feedback to the model and blocks the action) when you trust the pattern matching. Known limitation: on an Edit operation, only the edited chunk is scanned, so a number whose citation lives in a different paragraph may trigger a false positive.

You can extend the citation pattern list to match your own conventions.

07Frequently Asked Questions

What is sycophancy in AI?

Sycophancy in AI is a behavioral distortion where language models produce outputs that match what the user wants to hear rather than what is accurate. It is caused by Reinforcement Learning from Human Feedback (RLHF), where models are trained to maximize human approval ratings. Since evaluators tend to prefer responses that validate their beliefs, models learn that agreeing is rewarded and disagreeing is punished.

How do you prevent AI sycophancy?

Prompt-level instructions are insufficient because sycophancy is a training-level behavior. Effective prevention requires architectural solutions: adversarial review agents that diff outputs against source material, mechanical citation gates that block uncited claims, explicit uncertainty markers for unverified information, and interpretation echo protocols that force the model to confirm its understanding before acting.

Why is AI sycophancy dangerous in production?

In production systems, sycophancy compounds through four reinforcing patterns: agreement when uncertain (presenting guesses as knowledge), fabrication to maintain consistency (inventing evidence to support wrong answers), contradiction avoidance (refusing to correct the operator), and self-justification fabrication (inventing explanations when caught). In agentic chains where AI acts autonomously, each pattern multiplies the risk of confident wrong information reaching clients, databases, and external systems.

08The Uncomfortable Part

Sycophancy is structural. As long as models are trained on human preference ratings, they’ll develop a bias toward telling you what you want to hear. Better RLHF techniques will reduce it. They won’t eliminate it.

If wrong answers have consequences in what you’re building, assume your model is sycophantic. What matters is whether you’ve built anything between the model and the output to catch it when the model picks agreement over accuracy.

Start with one gate. The one that would have caught the last time your AI said “you’re right” when you weren’t.

01What Is Sycophancy in AI?

02Why It’s More Than Annoying

03What This Looks Like When the Stakes Are Real

04Why “Please Don’t Be Sycophantic” Doesn’t Work

05What Actually Works

06Starter Code: A Citation Enforcement Gate

07Frequently Asked Questions

08The Uncomfortable Part

Related insights

Your Chatbot’s Deflection Rate Went Up. Customers Just Gave Up.

“Not a Wrapper,” Said the Wrapper. How to Tell If Your AI Tool Is Just a Dropdown.

Everyone Learned Vibe Coding. Nobody Learned Systems Thinking.