
Use experiments to compare model hallucination rates

Run conversations through different models and compare their hallucination behavior.

This guide shows you how to compare hallucination rates across different models using experiments. Experiments replay your traced conversations through a target model and analyze the responses for hallucinations. They use n-1 replay: the conversation, minus its final assistant message, is sent to the target model, which generates its own response in place of the original. This lets you test how different models perform on your actual production conversations before switching.
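
To make the replay mechanic concrete, the sketch below illustrates the n-1 idea in Python: drop the conversation's final assistant message and have the target model regenerate it. The message format, function name, and conversation contents are illustrative assumptions, not the product's internal implementation.

    # Illustrative sketch of n-1 replay; message format and data are assumptions.
    def build_replay_input(conversation):
        """Return the conversation with its final assistant message removed."""
        if conversation and conversation[-1]["role"] == "assistant":
            return conversation[:-1]
        return conversation

    traced_conversation = [
        {"role": "user", "content": "When was the Eiffel Tower completed?"},
        {"role": "assistant", "content": "It was completed in 1889."},  # original traced reply
    ]

    # The target model receives everything except the last assistant message,
    # generates its own reply, and that generated reply is what gets analyzed
    # for hallucinations.
    replay_input = build_replay_input(traced_conversation)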

Prerequisites

  • A Blue Guardrails account with a workspace
  • Traced conversations in your workspace
  • Sufficient credits in your balance to run the experiment

Create an experiment

  1. Click Experiments in the sidebar.
  2. Click New Experiment.
  3. Enter a name (e.g., "GPT-4o Baseline Test").
  4. Select a model from the dropdown.
  5. If you want to filter conversations, set start and end dates using the calendar pickers. You can also use quick presets like "Last 7 Days" or "Last 30 Days".
  6. Check the preview to see how many conversations match your filters. Experiments include up to 100 conversations.
  7. Create the experiment.

Monitor experiment progress

The experiments list shows each experiment's status and progress.

  • Running means the experiment is still processing conversations.
  • The progress column shows completed items (e.g., "45/100").
  • The page auto-updates while experiments run.

Wait for the status to change to Successful, Partially Failed, or Failed.

Review experiment results

  1. Click an experiment row to open its detail view.
  2. Review the Hallucinations card (the sketch after this list shows how these figures are derived):
    • Total hallucinations - count across all responses
    • Hallucination rate - percentage of responses with at least one hallucination
    • Avg. hallucinations per prompt - average number of hallucinations per conversation
    • Hallucination-free responses - responses with zero hallucinations
  3. Review the Tokens card:
    • Total input tokens and Total output tokens - overall token usage
    • Avg. input tokens and Avg. output tokens - averages per generated response
  4. Review the Cost & Latency card:
    • Cost per 1k prompts - estimated cost to run 1,000 similar conversations through this model
    • Total generation time - time to generate all responses with the experiment's model
    • Avg. response latency and Max. response latency - how long the experiment's model took to generate responses
  5. Check the Hallucination types chart to see which categories appear most.
  6. If you want to investigate specific responses, click a conversation row to see the original input and the model's response with highlighted hallucinations.
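
The figures on the Hallucinations and Cost & Latency cards are aggregates of the per-response results. The sketch below uses made-up numbers to show how the hallucination metrics and cost per 1k prompts relate to each other; the product's exact formulas may differ.

    # Made-up per-response results for a five-conversation experiment.
    hallucination_counts = [0, 2, 0, 1, 3]  # hallucinations detected in each generated response
    total_cost_usd = 0.042                  # assumed total cost of the experiment's generations

    total_hallucinations = sum(hallucination_counts)                                           # 6
    hallucination_rate = sum(c > 0 for c in hallucination_counts) / len(hallucination_counts)  # 0.6 -> 60%
    avg_per_prompt = total_hallucinations / len(hallucination_counts)                          # 1.2
    hallucination_free = sum(c == 0 for c in hallucination_counts)                             # 2
    cost_per_1k_prompts = total_cost_usd / len(hallucination_counts) * 1000                    # $8.40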

Compare multiple experiments

To compare hallucination rates across models, create experiments that target different models but use the same date filters, so each experiment runs on the same set of conversations.

  1. Select up to 3 completed experiments using the checkboxes on the left.
  2. Click Compare.
  3. Review the comparison dashboard:
    • Metric cards show each experiment's model, conversation count, hallucination rate, and cost per 1k prompts side by side.
    • The hallucination types chart shows the distribution for each experiment.
  4. Click a row in the comparison table to see the conversation and each model's generated response with detected hallucinations side by side.
