Tracing and OpenTelemetry
Understand how distributed tracing works and how Blue Guardrails uses OpenTelemetry to capture LLM conversations
Blue Guardrails uses distributed tracing to capture every interaction between your application and LLMs. This page explains how tracing works and how Blue Guardrails uses it.
Tracing basics
Tracing records the path of a request through a system. Think of it like tracking a package through multiple shipping facilities. Each stop gets logged with a timestamp and relevant details.
In software, distributed tracing rose to prominence with microservices architectures. When a single user request touches multiple services, tracing connects the dots. It answers: what happened, where, and how long did it take?
For LLM applications, tracing serves a similar purpose. Instead of tracking HTTP requests across services, you're tracking conversations with AI models. Every prompt and response gets recorded.
Traces, spans, and their relationship
A trace represents one complete operation from start to finish. It's the full journey of a request through your system. Every trace has a unique identifier called a trace_id.
A span is a single unit of work within a trace. Spans have a name, start time, end time, and key-value attributes that describe what happened. Each span has its own span_id.
Spans nest inside each other. A parent span might represent an entire conversation, while child spans represent individual model calls. This creates a tree structure showing how operations relate.
```
Trace (trace_id: abc123)
├── Span: "user-conversation" (parent)
│   ├── Span: "llm-call-1" (child)
│   └── Span: "llm-call-2" (child)
```
The tree structure reveals dependencies. If a parent span fails, you can see which child operation caused the problem.
Tracing benefits
In traditional observability, tracing helps you:
- Find slow operations by seeing which component took the longest
- Trace errors back to their source
- Understand how services depend on each other
- Correlate logs from different systems
For LLM applications, you get additional benefits:
- See every message sent to and received from models
- Track token usage and costs across conversations
- Debug unexpected model behavior by reviewing exact inputs
- Reconstruct multi-turn conversations for analysis
- Detect hallucinations by comparing outputs to source data
OpenTelemetry: the standard
OpenTelemetry (OTel) is an open standard for observability. It provides standardized APIs, SDKs, and data formats for collecting telemetry data.
Before OpenTelemetry, each vendor had its own format. You'd instrument your code for one vendor and get locked in. Switching providers meant rewriting instrumentation.
OpenTelemetry solves this. Instrument once, send data anywhere. The same code works with any compatible backend.
OTLP (OpenTelemetry Protocol) is how telemetry data gets sent over the network. It defines how traces, metrics, and logs get serialized.
Blue Guardrails accepts OTLP traces at /v1/traces. Any OTel-compatible SDK works without custom integration code.
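For example, with the OpenTelemetry Python SDK you point the OTLP exporter at that endpoint. The host below is a placeholder; use your deployment's URL:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/HTTP to Blue Guardrails' trace endpoint.
exporter = OTLPSpanExporter(endpoint="https://your-blue-guardrails-host/v1/traces")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```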
GenAI semantic conventions
OpenTelemetry defines semantic conventions for GenAI operations. These conventions standardize how you record LLM interactions.
Standardization solves a real problem: different LLM providers use different terminology. OpenAI calls it "messages," Anthropic uses "content." Without conventions, every tool would need custom parsers for each provider.
Semantic conventions define attribute names that work across providers:
| Attribute | Purpose | Example |
|---|---|---|
| gen_ai.system | Provider name | openai, anthropic |
| gen_ai.request.model | Model identifier | gpt-4, claude-3-opus |
| gen_ai.input.messages | Messages sent to the model | Array of message objects |
| gen_ai.output.messages | Model's response | Array of message objects |
| gen_ai.usage.input_tokens | Tokens in the prompt | 1523 |
| gen_ai.usage.output_tokens | Tokens generated | 847 |
| gen_ai.response.id | Provider's response ID | chatcmpl-abc123 |
| gen_ai.conversation.id | Multi-turn conversation ID | conv-xyz789 |
Blue Guardrails follows these conventions. When you send traces using any compliant SDK, Blue Guardrails understands them automatically.
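As an illustration, here's a minimal sketch that records these attributes on a manually created span with the OpenTelemetry Python API. The values are made up, and the message array is JSON-encoded by hand because OTel attribute values must be primitives; real instrumentation libraries handle that encoding for you:

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("docs-example")

with tracer.start_as_current_span("chat gpt-5-mini") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-5-mini")
    span.set_attribute("gen_ai.usage.input_tokens", 42)
    span.set_attribute("gen_ai.usage.output_tokens", 156)
    # Structured messages serialized to a JSON string for this sketch.
    span.set_attribute(
        "gen_ai.input.messages",
        json.dumps([
            {"role": "user", "parts": [{"type": "text", "content": "Hi"}]},
        ]),
    )
```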
A GenAI span example
Here's what a span looks like when a user asks a question and receives a response:
```json
{
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "1a2b3c4d...",
  "name": "chat gpt-5-mini",
  "start_time": "2024-01-15T10:30:00Z",
  "end_time": "2024-01-15T10:30:02Z",
  "attributes": {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-5-mini",
    "gen_ai.usage.input_tokens": 42,
    "gen_ai.usage.output_tokens": 156,
    "gen_ai.input.messages": [
      {
        "role": "system",
        "parts": [{"type": "text", "content": "You are a helpful assistant."}]
      },
      {
        "role": "user",
        "parts": [{"type": "text", "content": "What is the capital of France?"}]
      }
    ],
    "gen_ai.output.messages": [
      {
        "role": "assistant",
        "parts": [{"type": "text", "content": "The capital of France is Paris."}]
      }
    ]
  }
}
```
This span shows OpenAI's GPT-5 Mini processed 42 input tokens and generated 156 output tokens. The conversation includes the system prompt, user question, and assistant response. The call took about 2 seconds.
Tool calls in spans
When models call tools, the span captures both the request and response.
A tool call appears in gen_ai.output.messages:
```json
{
  "role": "assistant",
  "parts": [
    {
      "type": "tool_call",
      "id": "call_abc123",
      "name": "get_weather",
      "arguments": {"location": "Paris"}
    }
  ]
}
```
The tool response appears in gen_ai.input.messages of the next span:
```json
{
  "role": "tool",
  "parts": [
    {
      "type": "tool_call_response",
      "id": "call_abc123",
      "name": "get_weather",
      "response": "{\"temp\": 18, \"conditions\": \"sunny\"}"
    }
  ]
}
```
The id field links tool calls to their responses. This lets Blue Guardrails reconstruct multi-step conversations where the model uses tools to gather information before responding.
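A sketch of that linking logic in Python, purely illustrative (this is not Blue Guardrails' implementation):

```python
def link_tool_calls(output_messages: list[dict], input_messages: list[dict]) -> list[tuple]:
    """Pair tool_call parts with their tool_call_response parts by id."""
    # Index every tool call the model made, keyed by its id.
    calls = {
        part["id"]: part
        for message in output_messages
        for part in message["parts"]
        if part["type"] == "tool_call"
    }
    # Match each tool response back to the call that produced it.
    pairs = []
    for message in input_messages:
        for part in message["parts"]:
            if part["type"] == "tool_call_response" and part["id"] in calls:
                pairs.append((calls[part["id"]], part))
    return pairs
```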
Instrumentation
Instrumentation is code that creates spans. You have two options:
Manual instrumentation means writing code to create spans explicitly. You control exactly what gets recorded. Use this for custom operations.
Auto-instrumentation uses libraries that wrap SDK calls automatically. When you call an LLM, the library creates a span without extra code from you. It's faster to set up and covers common cases.
For LLM applications, auto-instrumentation typically captures:
- Input messages and system prompts
- Model selection and parameters (temperature, max tokens)
- Output messages and tool calls
- Token counts
- Timing and errors
Start with auto-instrumentation. Add manual spans for custom logic.
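For example, a manual span can wrap your own logic while auto-instrumentation records the LLM calls made inside it. Here, call_model is a hypothetical helper around your LLM SDK:

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-app")

def answer_question(question: str) -> str:
    # Manual parent span for the whole operation; auto-instrumented
    # LLM calls made inside it appear as child spans.
    with tracer.start_as_current_span("user-conversation") as span:
        span.set_attribute("app.question_length", len(question))
        return call_model(question)  # hypothetical, auto-instrumented elsewhere
```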
OTel SDKs and Logfire
To send traces, you need an SDK. OpenTelemetry provides official SDKs for Python, JavaScript, Go, Java, and more. These handle span creation, context propagation, and export.
Logfire is an observability platform from the Pydantic team. It's built on OpenTelemetry and provides a polished SDK with excellent GenAI support out of the box.
Blue Guardrails works well with Logfire because Logfire includes pre-built instrumentation for popular LLM providers and follows OpenTelemetry's semantic conventions for GenAI.
You can configure Logfire to send traces to Blue Guardrails instead of (or alongside) the Logfire platform. Point the OTLP exporter at Blue Guardrails' /v1/traces endpoint.
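A minimal sketch, assuming Logfire picks up the standard OTel exporter environment variables (its documented route for alternative backends); the endpoint URL is a placeholder:

```python
# Shell, before starting your app:
#   export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=https://your-blue-guardrails-host/v1/traces
import logfire

logfire.configure(send_to_logfire=False)  # export only to the OTLP endpoint above
logfire.instrument_openai()               # auto-instrument OpenAI SDK calls
```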
Any OTel-compatible SDK works. Logfire is a good choice if you want quick setup with minimal configuration.
Trace processing
Blue Guardrails only processes spans containing GenAI attributes. HTTP spans, database queries, and other traditional telemetry pass through without special handling.
When Blue Guardrails receives a trace, it:
- Stores the raw span data for debugging
- Identifies spans with gen_ai.* attributes (sketched after this list)
- Normalizes different SDK formats to a consistent structure
- Extracts messages, token counts, and metadata
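Conceptually, the identification step looks something like this. This is an illustration of the rule described above, not Blue Guardrails' actual code:

```python
def is_genai_span(attributes: dict) -> bool:
    # Treat a span as GenAI if any attribute key is in the gen_ai.* namespace.
    return any(key.startswith("gen_ai.") for key in attributes)
```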
Blue Guardrails supports traces from:
- Pydantic-AI
- Logfire with OpenAI instrumentation
- Logfire with Anthropic instrumentation
- Google GenAI
- LangChain/LangSmith
- Haystack (via Logfire instrumentation of underlying SDKs)
Different SDKs structure data differently. Blue Guardrails normalizes these variations so your dashboards and analysis work consistently, regardless of which SDK you use. Other frameworks might work too. Try instrumenting your code and submit in-app feedback if your conversations don't show up as expected.
From traces to conversations
Blue Guardrails groups spans into conversations. A conversation contains all messages exchanged in a session, ordered chronologically.
The grouping logic works like this (the first two rules are sketched after the list):
- If spans have a gen_ai.conversation.id, they're grouped together
- Otherwise, spans sharing a trace_id form a conversation
- Tool calls get linked across spans using their id field
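A conceptual sketch of the first two rules, illustrative only:

```python
from collections import defaultdict

def group_spans(spans: list[dict]) -> dict[str, list[dict]]:
    # Conversation id wins when present; trace_id is the fallback key.
    conversations = defaultdict(list)
    for span in spans:
        key = span["attributes"].get("gen_ai.conversation.id") or span["trace_id"]
        conversations[key].append(span)
    # Order each conversation chronologically by span start time.
    for grouped in conversations.values():
        grouped.sort(key=lambda s: s["start_time"])
    return dict(conversations)
```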
Once grouped, Blue Guardrails:
- Assembles the full message history
- Identifies assistant responses
- Queues messages for hallucination detection
This means your existing tracing setup feeds directly into hallucination detection.