Tracing and OpenTelemetry
Understand how distributed tracing works and how Blue Guardrails uses OpenTelemetry to capture LLM conversations
Blue Guardrails uses distributed tracing to capture every interaction between your application and LLMs. This page explains how tracing works and how Blue Guardrails uses it.
Tracing basics
Tracing records the path of a request through a system. Think of it like tracking a package through multiple shipping facilities. Each stop gets logged with a timestamp and relevant details.
In software, distributed tracing rose to prominence with microservices architectures. When a single user request touches multiple services, tracing connects the dots. It answers: what happened, where, and how long did it take?
For LLM applications, tracing serves a similar purpose. Instead of tracking HTTP requests across services, you're tracking conversations with AI models. Every prompt and response gets recorded.
Traces, spans, and their relationship
A trace represents one complete operation from start to finish. It's the full journey of a request through your system. Every trace has a unique identifier called a trace_id.
A span is a single unit of work within a trace. Spans have a name, start time, end time, and key-value attributes that describe what happened. Each span has its own span_id.
Spans nest inside each other. A parent span might represent an entire conversation, while child spans represent individual model calls. This creates a tree structure showing how operations relate.
```
Trace (trace_id: abc123)
├── Span: "user-conversation" (parent)
│   ├── Span: "llm-call-1" (child)
│   └── Span: "llm-call-2" (child)
```
The tree structure reveals dependencies. If a parent span fails, you can see which child operation caused the problem.
Tracing benefits
In traditional observability, tracing helps you:
- Find slow operations by seeing which component took the longest
- Trace errors back to their source
- Understand how services depend on each other
- Correlate logs from different systems
For LLM applications, you get additional benefits:
- See every message sent to and received from models
- Track token usage and costs across conversations
- Debug unexpected model behavior by reviewing exact inputs
- Reconstruct multi-turn conversations for analysis
- Detect hallucinations by comparing outputs to source data
OpenTelemetry: the standard
OpenTelemetry (OTel) is an open standard for observability. It provides standardized APIs, SDKs, and data formats for collecting telemetry data.
Before OpenTelemetry, each vendor had its own format. You'd instrument your code for one vendor and get locked in. Switching providers meant rewriting instrumentation.
OpenTelemetry solves this. Instrument once, send data anywhere. The same code works with any compatible backend.
OTLP (OpenTelemetry Protocol) is how telemetry data gets sent over the network. It defines how traces, metrics, and logs get serialized.
Blue Guardrails accepts OTLP traces at /v1/traces. Any OTel-compatible SDK works without custom integration code.
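For example, with the OpenTelemetry Python SDK you point the OTLP exporter at that endpoint. The host below is a placeholder; use your deployment's URL:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/HTTP to Blue Guardrails' trace endpoint.
exporter = OTLPSpanExporter(endpoint="https://your-blue-guardrails-host/v1/traces")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```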
GenAI semantic conventions
OpenTelemetry defines semantic conventions for GenAI operations. These conventions standardize how you record LLM interactions.
Standardization solves a real problem: different LLM providers use different terminology. OpenAI calls it "messages," Anthropic uses "content." Without conventions, every tool would need custom parsers for each provider.
Semantic conventions define attribute names that work across providers:
| Attribute | Purpose | Example |
|---|---|---|
| gen_ai.system | Provider name | openai, anthropic |
| gen_ai.request.model | Model identifier | gpt-4, claude-3-opus |
| gen_ai.input.messages | Messages sent to the model | Array of message objects |
| gen_ai.output.messages | Model's response | Array of message objects |
| gen_ai.usage.input_tokens | Tokens in the prompt | 1523 |
| gen_ai.usage.output_tokens | Tokens generated | 847 |
| gen_ai.response.id | Provider's response ID | chatcmpl-abc123 |
| gen_ai.conversation.id | Multi-turn conversation ID | conv-xyz789 |
Blue Guardrails follows these conventions. When you send traces using any compliant SDK, Blue Guardrails understands them automatically.
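As an illustration, here's a minimal sketch that records these attributes on a manually created span with the OpenTelemetry Python API. The values are made up, and the message array is JSON-encoded by hand because OTel attribute values must be primitives; real instrumentation libraries handle that encoding for you:

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("docs-example")

with tracer.start_as_current_span("chat gpt-5-mini") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-5-mini")
    span.set_attribute("gen_ai.usage.input_tokens", 42)
    span.set_attribute("gen_ai.usage.output_tokens", 156)
    # Structured messages serialized to a JSON string for this sketch.
    span.set_attribute(
        "gen_ai.input.messages",
        json.dumps([
            {"role": "user", "parts": [{"type": "text", "content": "Hi"}]},
        ]),
    )
```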
A GenAI span example
Here's what a span looks like when a user asks a question and receives a response:
```json
{
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "1a2b3c4d...",
  "name": "chat gpt-5-mini",
  "start_time": "2024-01-15T10:30:00Z",
  "end_time": "2024-01-15T10:30:02Z",
  "attributes": {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-5-mini",
    "gen_ai.usage.input_tokens": 42,
    "gen_ai.usage.output_tokens": 156,
    "gen_ai.input.messages": [
      {
        "role": "system",
        "parts": [{"type": "text", "content": "You are a helpful assistant."}]
      },
      {
        "role": "user",
        "parts": [{"type": "text", "content": "What is the capital of France?"}]
      }
    ],
    "gen_ai.output.messages": [
      {
        "role": "assistant",
        "parts": [{"type": "text", "content": "The capital of France is Paris."}]
      }
    ]
  }
}
```
This span shows OpenAI's GPT-5 Mini processed 42 input tokens and generated 156 output tokens. The conversation includes the system prompt, user question, and assistant response. The call took about 2 seconds.
Tool calls in spans
When models call tools, the span captures both the request and response.
A tool call appears in gen_ai.output.messages:
```json
{
  "role": "assistant",
  "parts": [
    {
      "type": "tool_call",
      "id": "call_abc123",
      "name": "get_weather",
      "arguments": {"location": "Paris"}
    }
  ]
}
```
The tool response appears in gen_ai.input.messages of the next span:
```json
{
  "role": "tool",
  "parts": [
    {
      "type": "tool_call_response",
      "id": "call_abc123",
      "name": "get_weather",
      "response": "{\"temp\": 18, \"conditions\": \"sunny\"}"
    }
  ]
}
```
The id field links tool calls to their responses. This lets Blue Guardrails reconstruct multi-step conversations where the model uses tools to gather information before responding.
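A sketch of that linking logic in Python, purely illustrative (this is not Blue Guardrails' implementation):

```python
def link_tool_calls(output_messages: list[dict], input_messages: list[dict]) -> list[tuple]:
    """Pair tool_call parts with their tool_call_response parts by id."""
    # Index every tool call the model made, keyed by its id.
    calls = {
        part["id"]: part
        for message in output_messages
        for part in message["parts"]
        if part["type"] == "tool_call"
    }
    # Match each tool response back to the call that produced it.
    pairs = []
    for message in input_messages:
        for part in message["parts"]:
            if part["type"] == "tool_call_response" and part["id"] in calls:
                pairs.append((calls[part["id"]], part))
    return pairs
```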
Instrumentation
Instrumentation is code that creates spans. You have two options:
Manual instrumentation means writing code to create spans explicitly. You control exactly what gets recorded. Use this for custom operations.
Auto-instrumentation uses libraries that wrap SDK calls automatically. When you call an LLM, the library creates a span without extra code from you. It's faster to set up and covers common cases.
For LLM applications, auto-instrumentation typically captures:
- Input messages and system prompts
- Model selection and parameters (temperature, max tokens)
- Output messages and tool calls
- Token counts
- Timing and errors
Start with auto-instrumentation. Add manual spans for custom logic.
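For example, a manual span can wrap your own logic while auto-instrumentation records the LLM calls made inside it. Here, call_model is a hypothetical helper around your LLM SDK:

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-app")

def answer_question(question: str) -> str:
    # Manual parent span for the whole operation; auto-instrumented
    # LLM calls made inside it appear as child spans.
    with tracer.start_as_current_span("user-conversation") as span:
        span.set_attribute("app.question_length", len(question))
        return call_model(question)  # hypothetical, auto-instrumented elsewhere
```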
OTel SDKs and Logfire
To send traces, you need an SDK. OpenTelemetry provides official SDKs for Python, JavaScript, Go, Java, and more. These handle span creation, context propagation, and export.
Logfire is an observability platform from the Pydantic team. It's built on OpenTelemetry and provides a polished SDK with excellent GenAI support out of the box.
Blue Guardrails works well with Logfire because Logfire includes pre-built instrumentation for popular LLM providers and follows OpenTelemetry's semantic conventions for GenAI.
You can configure Logfire to send traces to Blue Guardrails instead of (or alongside) the Logfire platform. Point the OTLP exporter at Blue Guardrails' /v1/traces endpoint.
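A minimal sketch, assuming Logfire picks up the standard OTel exporter environment variables (its documented route for alternative backends); the endpoint URL is a placeholder:

```python
# Shell, before starting your app:
#   export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=https://your-blue-guardrails-host/v1/traces
import logfire

logfire.configure(send_to_logfire=False)  # export only to the OTLP endpoint above
logfire.instrument_openai()               # auto-instrument OpenAI SDK calls
```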
Any OTel-compatible SDK works. Logfire is a good choice if you want quick setup with minimal configuration.
Trace processing
Blue Guardrails only processes spans containing GenAI attributes. HTTP spans, database queries, and other traditional telemetry pass through without special handling.
When Blue Guardrails receives a trace, it:
- Stores the raw span data for debugging
- Identifies spans with gen_ai.* attributes (sketched after this list)
- Normalizes different SDK formats to a consistent structure
- Extracts messages, token counts, and metadata
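Conceptually, the identification step looks something like this. This is an illustration of the rule described above, not Blue Guardrails' actual code:

```python
def is_genai_span(attributes: dict) -> bool:
    # Treat a span as GenAI if any attribute key is in the gen_ai.* namespace.
    return any(key.startswith("gen_ai.") for key in attributes)
```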
Blue Guardrails supports traces from:
- Pydantic-AI
- Logfire with OpenAI instrumentation
- Logfire with Anthropic instrumentation
- Google GenAI
- LangChain/LangSmith
- Haystack (via Logfire instrumentation of underlying SDKs)
Different SDKs structure data differently. Blue Guardrails normalizes these variations so your dashboards and analysis work consistently, regardless of which SDK you use. Other frameworks might work too. Try instrumenting your code and submit in-app feedback if your conversations don't show up as expected.
From traces to conversations
Blue Guardrails groups spans into conversations. A conversation contains all messages exchanged in a session, ordered chronologically.
The grouping logic works like this (the first two rules are sketched after the list):
- If spans have a gen_ai.conversation.id, they're grouped together
- Otherwise, spans sharing a trace_id form a conversation
- Tool calls get linked across spans using their id field
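A conceptual sketch of the first two rules, illustrative only:

```python
from collections import defaultdict

def group_spans(spans: list[dict]) -> dict[str, list[dict]]:
    # Conversation id wins when present; trace_id is the fallback key.
    conversations = defaultdict(list)
    for span in spans:
        key = span["attributes"].get("gen_ai.conversation.id") or span["trace_id"]
        conversations[key].append(span)
    # Order each conversation chronologically by span start time.
    for grouped in conversations.values():
        grouped.sort(key=lambda s: s["start_time"])
    return dict(conversations)
```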
Once grouped, Blue Guardrails:
- Assembles the full message history
- Identifies assistant responses
- Queues messages for hallucination detection
This means your existing tracing setup feeds directly into hallucination detection.