
Hallucinations and hallucination detection

Understand what hallucinations are, why they occur in AI applications, and how Blue Guardrails detects them

Large language models sometimes generate content that isn't supported by the information they were given. These errors are called hallucinations. This page explains why they happen, how Blue Guardrails detects them, and how to reduce them in your applications.

Defining hallucinations

A hallucination occurs when a model generates content that isn't consistent with the knowledge it should be drawing from. "Knowledge" can mean different things, and that distinction affects how you detect and fix problems.

Hallucinations from input context happen when the model's output contradicts or isn't supported by what it was given at inference time. If you provide a document saying "Revenue was $10 million" and the model responds "Revenue was $15 million," that's a hallucination. The model had the right information but produced something inconsistent with it. These hallucinations are verifiable because you have the source material.

Hallucinations from training data happen when the model generates content that contradicts what it learned during training. The model might confidently describe a historical event incorrectly, not because it was given wrong information, but because it failed to accurately reproduce what it learned. These are harder to verify because you'd need access to the model's training data.

Factual errors are different from both. A factual error occurs when content doesn't match real-world truth, regardless of what the model was trained on or given as input. If a model was trained before 2024 and says "The 2024 Olympics haven't happened yet," that's factually wrong but not necessarily a hallucination. The model is being consistent with its training data.

These categories overlap but require different solutions. Factual errors might need retrieval augmentation or knowledge updates. Training-data hallucinations require better training or model selection. Input-context hallucinations require better prompting, context management, or detection systems.

Blue Guardrails focuses on hallucinations from input context. It answers one specific question: does the assistant's response match what the model was given? This covers everything in the conversation history: system prompts, user messages, and tool results.

Blue Guardrails doesn't verify whether claims are true in the real world. If your system prompt says "The sky is green" and the assistant repeats that claim, Blue Guardrails won't flag it. The assistant accurately reflected its input. Blue Guardrails also doesn't check whether outputs match the model's training data, since that data isn't accessible at runtime.

Input context is the focus because verification is possible. You have the source material. You can check claims against it. This makes detection reliable and actionable. You can fix problems by improving your prompts, retrieval system, or context structure.
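
To make the distinction concrete, here is a minimal sketch of an input context and two candidate responses. The message structure is purely illustrative; it isn't the actual trace format.

```python
# Illustrative input context: everything the model could see at inference time.
context = [
    {"role": "system", "content": "The sky is green. Answer using only the provided facts."},
    {"role": "user", "content": "What color is the sky, and what was Q1 revenue?"},
    {"role": "tool", "content": "Q1 revenue was $10 million."},
]

# Would be flagged: "$15 million" appears nowhere in the context.
unsupported_response = "The sky is green and Q1 revenue was $15 million."

# Would not be flagged: factually wrong about the sky, but consistent with the input.
supported_response = "The sky is green and Q1 revenue was $10 million."
```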

Types of hallucinations

Not all hallucinations are alike. Understanding the different types helps you diagnose problems and improve your applications.

Fabrication

Fabrication is the most obvious type. The model invents a claim that appears nowhere in the source material. It might cite a study that doesn't exist or quote a number that wasn't provided. The information is made up.

Reasoning errors

Reasoning errors happen when the model draws incorrect conclusions from valid information. The context might say "Sales increased 10% in Q1 and decreased 5% in Q2." If the model concludes "Sales grew throughout the year," that's a reasoning error. The individual facts are correct, but the inference is wrong.

Conflation

Conflation occurs when the model inappropriately merges information from different sources. Imagine a RAG system with documents from two different companies. If the assistant attributes Company A's revenue to Company B, that's conflation. The numbers are real, but they've been mixed up.

Poor instruction following

Poor instruction following means the model ignores explicit guidance. If your system prompt says "Never mention competitor products" and the assistant recommends a competitor, that's a hallucination of a different kind. The model deviated from its instructions.

Misattribution

Misattribution occurs when the model credits information to the wrong source. The model might say, "According to Document A, the policy changed in 2023," when in reality Document B mentioned that policy change. The claim might be accurate, but the citation is wrong.

Common causes

Hallucinations emerge from specific conditions in how applications are built and how models process information.

Poor system prompts are a common cause. If your prompt doesn't make clear that all information should come from provided sources, the model might fill gaps with its training knowledge. Be explicit: "Answer only based on the documents provided. If the information isn't there, say so."
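
One way to bake that instruction in is to assemble it with the request, so source material and grounding guidance always travel together. The wording below is only one possible phrasing of the instruction above.

```python
# One possible phrasing of a grounding instruction; adapt the wording to your domain.
GROUNDED_SYSTEM_PROMPT = (
    "Answer only based on the documents provided. "
    "If the information isn't there, say so instead of guessing. "
    "Do not use outside knowledge, and name the document each claim comes from."
)

def build_messages(user_question: str, documents_block: str) -> list[dict]:
    """Assemble a request so all source material travels with the question."""
    return [
        {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
        {"role": "user", "content": f"{documents_block}\n\nQuestion: {user_question}"},
    ]
```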

Context without metadata creates problems in RAG applications. When you retrieve document chunks and inject them into a prompt, does the model know where each chunk came from? Without clear source attribution, the model can't cite its sources and you can't verify its claims.

Lack of separation between sources compounds this. If multiple documents are concatenated without clear boundaries, the model might not distinguish where one ends and another begins. Use explicit markers: "--- Document 1: Annual Report ---" helps the model track which claims come from which source.
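
A small formatting step at retrieval time is usually enough. The sketch below wraps each chunk in an explicit boundary marker before injecting it into the prompt; the `title` and `text` fields are assumptions about what your retriever returns.

```python
def format_chunks(chunks: list[dict]) -> str:
    """Join retrieved chunks with explicit boundaries so the model can tell
    where one document ends and the next begins."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        # `title` and `text` are assumed fields on your retriever's output.
        blocks.append(f"--- Document {i}: {chunk['title']} ---\n{chunk['text']}")
    return "\n\n".join(blocks)

documents_block = format_chunks([
    {"title": "Annual Report", "text": "Revenue was $10 million in 2023."},
    {"title": "Press Release", "text": "The new product launches in Q3."},
])
```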

Context rot affects long conversations and complex agent trajectories. As context grows, models become less reliable at tracking all the information. They might miss details buried in earlier messages or conflate information from different parts of the conversation. Long agent runs are particularly vulnerable.

Using the wrong model for the task is a frequent cause. Smaller, faster models have less capacity for complex reasoning. If your task requires synthesizing information from multiple sources while following detailed instructions, a lightweight model will produce more hallucinations.

How Blue Guardrails detects hallucinations

Blue Guardrails analyzes assistant messages by examining them against everything the model could have seen when generating its response.

When your application sends traces, Blue Guardrails reconstructs the conversation: system prompts, user messages, tool results, and previous assistant responses. This forms the context for evaluation.

For each assistant message, Blue Guardrails creates an evaluation. Think of an evaluation as a review session for one message. It records whether analysis succeeded, what model performed the analysis, and what was found.

Within each evaluation, Blue Guardrails identifies individual annotations. Each annotation marks a specific piece of text that appears to be a hallucination. An annotation includes:

  • The exact text that was flagged
  • A label categorizing the hallucination type
  • An explanation of why it was flagged

This structure lets you see both the big picture (how many messages have issues) and the details (exactly what's wrong and why).
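
To make the structure concrete, here is a hypothetical sketch of one evaluation and its annotations as plain data. The field names are assumptions for illustration, not the actual Blue Guardrails schema.

```python
# Hypothetical shapes for illustration; actual field names may differ.
evaluation = {
    "message_id": "msg_123",           # the assistant message that was reviewed
    "status": "completed",             # whether analysis succeeded
    "evaluator_model": "detector-v1",  # what model performed the analysis
    "annotations": [
        {
            "text": "The study found a 40% improvement",  # exact flagged text
            "label": "fabrication",                        # hallucination type
            "explanation": "No study or 40% figure appears in the provided context.",
            "start": 112,   # character offsets into the assistant message
            "end": 145,
        }
    ],
}
```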

Span-level detection

Many evaluation systems produce a single score: "This response is 73% faithful." Blue Guardrails instead identifies the specific text spans that contain hallucinations.

When Blue Guardrails flags "The study found a 40% improvement," you can search your context for that claim. Did any source mention 40%? Did any source mention a study? You can verify or dismiss each flag independently.
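
Because each flag is tied to exact text, that verification can be partly mechanical. A minimal sketch of such a check, assuming you keep a plain-text copy of the context:

```python
context_text = "Revenue was $10 million in 2023. The new product launches in Q3."
flagged = "The study found a 40% improvement"

# Crude check: do the distinctive tokens of the flagged claim appear anywhere in the context?
tokens = [t for t in flagged.split() if any(ch.isdigit() for ch in t) or len(t) > 4]
matches = {t: t.lower() in context_text.lower() for t in tokens}
print(matches)  # {'study': False, 'found': False, '40%': False, 'improvement': False}
```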

Span-level detection also reveals patterns that guide improvements. If you notice that most hallucinations are misattributions in multi-document queries, you know to improve source separation. If fabrications spike when context exceeds a certain length, you know your chunking strategy needs work. Aggregate scores hide these patterns.
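
Aggregating annotations by label is usually enough to surface these patterns. A sketch, assuming evaluations shaped like the example above:

```python
from collections import Counter

def hallucination_profile(evaluations: list[dict]) -> Counter:
    """Count flagged spans by label across many evaluations to see which
    failure mode dominates (for example, misattribution versus fabrication)."""
    return Counter(
        annotation["label"]
        for evaluation in evaluations
        for annotation in evaluation["annotations"]
    )

# A result like Counter({'misattribution': 12, 'fabrication': 3}) points toward
# improving source separation rather than prompt instructions.
```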

Applications can highlight potentially unreliable text, letting users focus their verification efforts. Instead of distrusting an entire response, users can check the specific claims that need attention. This is especially useful for long responses with dozens of claims.

Each annotation includes character offsets marking exactly where the flagged text appears. This enables precise highlighting in interfaces. See Monitor your AI application for hallucinations for how this works in practice.
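
As an illustration of how offsets can drive highlighting, the sketch below wraps each flagged span in HTML mark tags. It assumes annotations shaped like the earlier example, sorted by start offset and non-overlapping.

```python
def highlight(message: str, annotations: list[dict]) -> str:
    """Wrap each flagged span in <mark> tags using its character offsets.
    Assumes annotations are sorted by start offset and do not overlap."""
    parts, cursor = [], 0
    for ann in annotations:
        parts.append(message[cursor:ann["start"]])
        flagged = message[ann["start"]:ann["end"]]
        parts.append(f'<mark title="{ann["label"]}">{flagged}</mark>')
        cursor = ann["end"]
    parts.append(message[cursor:])
    return "".join(parts)
```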

Custom configurations

Every domain has different standards for what counts as acceptable model behavior. A legal research assistant might need near-perfect fidelity where every citation is verifiable and every quote is exact. A creative writing assistant can take more liberties. What constitutes a "hallucination" depends on the stakes.

The hallucination types you care about also differ by domain. A medical assistant might need to catch dosage errors and contraindication omissions. A financial assistant might care about numerical precision and regulatory compliance. The default categories might not capture what's relevant to your use case.

Domain context improves detection accuracy. If the detector knows your application processes pharmaceutical documents in a specific format, it can better evaluate claims against that context. Generic detection misses domain-specific nuances.

Blue Guardrails lets you customize both the labels you track and the context the detector uses. You can define categories like "dosage_error" or "citation_missing" that are relevant to your domain. You can provide background that helps the detector understand your specific requirements.
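
As a rough illustration, a domain-specific setup might define its own labels and background context along these lines. The structure below is hypothetical, not the real configuration API; see the guide linked below for the actual options.

```python
# Hypothetical configuration sketch; field names are illustrative only.
detection_config = {
    "labels": {
        "dosage_error": "The response states a dose that differs from the source document.",
        "citation_missing": "A claim is made without pointing to any provided document.",
    },
    "domain_context": (
        "The application answers questions about pharmaceutical package inserts. "
        "Documents follow a fixed section layout (Indications, Dosage, Contraindications)."
    ),
}
```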

For step-by-step instructions on customizing detection, see Adapt hallucination detection to your use case.
