Experiments and comparisons

Understand how Blue Guardrails replays conversations through different models to compare hallucination behavior

Choosing the right model for your application involves tradeoffs. Faster models cost less but might hallucinate more. Newer models promise better accuracy but behave differently. The challenge is comparing models fairly on your actual workload.

Experiments solve this by replaying your actual production conversations through different models and measuring the results.

The case for model comparison

Different models hallucinate at different rates. A model that performs well on benchmarks might struggle with your specific use case. The only way to know is to test on your data.

Model choice affects more than accuracy. Cost per request, response latency, and output style all vary. A model that's twice as expensive might not be twice as accurate for your needs. You need data to make informed decisions.

Testing on synthetic data doesn't reflect production behavior. Your real conversations have specific patterns: the way users phrase questions, the complexity of tool results, the length of context windows. Synthetic tests miss these nuances.

Experiment workflow

An experiment takes conversations you've already traced and replays them through a target model. Here's the process:

  1. Select conversations from your traced data using date filters
  2. Send each conversation's context to the target model
  3. Have the target model generate a new response
  4. Run hallucination detection on each generated response
  5. Calculate aggregate metrics across all responses

The result is a detailed view of how the target model performs on your actual workload.
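
The sketch below walks through the same five steps in Python. The helper functions are placeholders for work Blue Guardrails performs internally, and the message format is the familiar chat-completion style; none of this is the product's actual API.

```python
from typing import Dict, List

# Sketch of the experiment loop. Conversations are assumed to be lists of
# chat-style message dicts; the two placeholder functions stand in for work
# Blue Guardrails performs internally.

def replay_with_target_model(context: List[Dict], target_model: str) -> str:
    """Steps 2-3: send the context to the target model, return its response."""
    raise NotImplementedError("model provider call goes here")

def count_hallucinations(response: str, context: List[Dict]) -> int:
    """Step 4: run hallucination detection on the generated response."""
    raise NotImplementedError("hallucination detection goes here")

def run_experiment(conversations: List[List[Dict]], target_model: str) -> float:
    """Steps 1-5: replay each selected conversation and aggregate a rate."""
    flagged = 0
    for messages in conversations:
        context = messages[:-1]   # everything except the final assistant message
        response = replay_with_target_model(context, target_model)
        if count_hallucinations(response, context) > 0:
            flagged += 1
    # Step 5: fraction of replayed responses with at least one hallucination.
    return flagged / len(conversations)
```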

The n-1 replay method

Experiments use a technique called n-1 replay. The name describes what happens: all messages except the final assistant message get sent to the new model.

Consider a traced conversation with five messages:

┌─────────────────────────────────────────┐
│ 1. System: "You are a helpful..."       │
├─────────────────────────────────────────┤
│ 2. User: "What's in this document?"     │
├─────────────────────────────────────────┤
│ 3. Assistant: [tool_call: read_doc]     │
├─────────────────────────────────────────┤
│ 4. Tool: {"title": "Q3 Report", ...}    │
├─────────────────────────────────────────┤
│ 5. Assistant: "The document shows..."   │  ← Original response
└─────────────────────────────────────────┘

In an experiment, messages 1 through 4 become the context. The target model generates its own version of message 5:

┌─────────────────────────────────────────┐
│ Context (messages 1-4)                  │
│ ┌─────────────────────────────────────┐ │
│ │ System + User + Tool call + Result  │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│ Target model generates new response     │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│ Blue Guardrails detects hallucinations  │
└─────────────────────────────────────────┘

The target model sees everything the original model saw at the moment it generated its response: system prompts, user messages, tool calls, tool results, and any prior assistant messages in multi-turn conversations. Nothing is hidden or modified.
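
A minimal sketch of how the n-1 context could be assembled, assuming messages are stored as chat-style dicts with a role field (the helper name is illustrative):

```python
from typing import Dict, List

def build_replay_context(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return every message up to, but not including, the final assistant message."""
    # Eligible conversations always contain at least one assistant message.
    last_assistant = max(
        i for i, m in enumerate(messages) if m["role"] == "assistant"
    )
    return messages[:last_assistant]

# The five-message conversation from the diagram above:
traced = [
    {"role": "system", "content": "You are a helpful..."},
    {"role": "user", "content": "What's in this document?"},
    {"role": "assistant", "content": "", "tool_calls": "[read_doc]"},
    {"role": "tool", "content": '{"title": "Q3 Report"}'},
    {"role": "assistant", "content": "The document shows..."},  # to be regenerated
]

context = build_replay_context(traced)
assert len(context) == 4   # messages 1-4 become the target model's input
```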

Benefits of n-1 replay

This method creates a controlled comparison. Both models face the exact same decision point with identical information. Any difference in their responses comes from the models themselves, not from variations in input.

Tool results are particularly valuable here. When a tool returns structured data like {"revenue": 15000000}, that's verifiable ground truth. If the target model says "revenue was $15 million," that's accurate. If it says "revenue was $20 million," that's a detectable hallucination.
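
As an illustration of why structured tool results make detection tractable, the toy check below compares a claimed figure against the tool's JSON output. A real hallucination detector is far more sophisticated; this only shows the grounding idea.

```python
import json
import re

# Toy grounding check against the tool result from the example above.
tool_result = json.loads('{"revenue": 15000000}')

def revenue_claim_is_grounded(response: str) -> bool:
    """Return True when the dollar figure in the response matches the tool result."""
    match = re.search(r"\$(\d+)\s*million", response)
    if match is None:
        return True   # no explicit figure to verify
    claimed = int(match.group(1)) * 1_000_000
    return claimed == tool_result["revenue"]

print(revenue_claim_is_grounded("Revenue was $15 million."))  # True  -> accurate
print(revenue_claim_is_grounded("Revenue was $20 million."))  # False -> hallucination
```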

The full conversation history is equally important. In multi-turn conversations, earlier messages establish context that shapes how the model should respond. By preserving this history, experiments test how well models track information across turns.

Conversation selection

Experiments draw from your traced conversations. These are real interactions your application has already processed, captured through OpenTelemetry tracing.
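
As a rough sketch, a traced model call might look like the following, using the standard OpenTelemetry Python API. The span name and attribute keys here are illustrative; follow whatever conventions your instrumentation already uses.

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-app")

# Record one model call as a span so it is captured as a traced conversation.
with tracer.start_as_current_span("llm.chat") as span:
    span.set_attribute("llm.model", "original-model")
    span.set_attribute("llm.prompt", "What's in this document?")
    response = "The document shows..."   # stand-in for the real model call
    span.set_attribute("llm.completion", response)
```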

You can filter conversations by date range to focus on specific periods. If you're comparing models after a prompt change, filter to conversations before and after the change. If you're investigating a spike in hallucinations, filter to that time window.

Only conversations with at least one assistant message are eligible. The experiment needs something to regenerate. Conversations that ended before the model responded aren't useful for comparison.

When many conversations match your filters, experiments sample randomly. This gives you a representative slice without processing every conversation. The sample size balances statistical confidence with processing time and cost.
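
A sketch of that selection logic, with illustrative field names and an arbitrary sample size:

```python
import random
from datetime import datetime
from typing import Dict, List

def select_conversations(
    conversations: List[Dict],
    start: datetime,
    end: datetime,
    sample_size: int = 200,   # arbitrary illustrative value
) -> List[Dict]:
    """Filter traced conversations by date and sample when too many match."""
    eligible = [
        c for c in conversations
        if start <= c["timestamp"] <= end
        # The experiment needs an assistant message to regenerate.
        and any(m["role"] == "assistant" for m in c["messages"])
    ]
    if len(eligible) <= sample_size:
        return eligible
    # Random sampling gives a representative slice without replaying everything.
    return random.sample(eligible, sample_size)
```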

Comparing experiments

A single experiment tells you how one model performs. Comparisons reveal how models differ.

Create experiments with the same date filters but different target models. This ensures each experiment processes the same set of conversations, making the results directly comparable.
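
A small sketch of that setup. The create_experiment helper is hypothetical, standing in for however you create experiments; the point is that only the target model differs between the two.

```python
from datetime import date

# Hypothetical helper standing in for however you create experiments.
def create_experiment(target_model: str, start: date, end: date) -> dict:
    return {"target_model": target_model, "start": start, "end": end}

shared_filters = {"start": date(2024, 5, 1), "end": date(2024, 5, 8)}

experiment_a = create_experiment("model-a", **shared_filters)
experiment_b = create_experiment("model-b", **shared_filters)
# Same date window, so both experiments draw from the same conversation pool.
```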

Comparisons show several dimensions of difference.

Hallucination rates reveal which model produces fewer errors. A model with a 5% hallucination rate is more reliable than one with 15%, assuming similar workloads.

Hallucination types show what kind of errors each model makes. One model might fabricate facts while another misattributes sources. The type of error can be more relevant than the rate for your use case.

Cost and latency help you weigh accuracy against operational concerns. A model that's 2% more accurate but costs 5x more might not be the right choice.

Comparison data

The comparison view presents data at three levels of detail.

The metrics summary shows aggregate statistics for each experiment side by side: hallucination rate, total count, and cost per thousand prompts.
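
One way such aggregates could be computed from per-response results, shown here with an illustrative result shape rather than Blue Guardrails' own schema:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ReplayResult:
    hallucination_count: int   # hallucinations detected in one generated response
    cost_usd: float            # cost of generating that response

def summarize(results: List[ReplayResult]) -> Dict[str, float]:
    """Aggregate the side-by-side statistics shown in the metrics summary."""
    total = len(results)
    flagged = sum(1 for r in results if r.hallucination_count > 0)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "hallucination_rate": flagged / total,
        "hallucination_count": sum(r.hallucination_count for r in results),
        "cost_per_thousand_prompts": 1000 * total_cost / total,
    }
```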

The conversation matrix lists individual conversations with hallucination counts per experiment. You can see which specific conversations caused more problems for which models. This reveals patterns: maybe one model struggles with long tool results while another handles them well.

The detail view shows the actual messages. Pick a conversation and see the original context alongside each model's generated response. Detected hallucinations are highlighted in the text.

When to use experiments

Run experiments before switching models in production. Test the new model on your actual conversations before committing. A few hours of experiment data can prevent production incidents.

Run experiments when evaluating new model releases. Model providers release updates regularly. Experiments verify that improvements on benchmarks translate to improvements on your workload.

Run experiments to validate model choice for specific use cases. If you're building a new feature with different requirements, test candidate models against relevant conversations.

Run experiments after changing prompts or context structure. Prompt changes affect model behavior. Experiments let you measure the impact across your conversation base.

For step-by-step instructions, see Use experiments to compare model hallucination rates.
