Debugging LLM Agent Tool Calls with Distributed Traces (Run IDs, Spans, Failures) | Jaeger

Jaeger is one of the easiest ways to see what your LLM agent actually did in production. When an agent fails, the final answer rarely tells you the real story. The story is in the timeline: planning, tool selection, retries, RAG retrieval, and downstream service latency.


In this guide, we’ll build a practical Jaeger workflow for debugging tool calls and multi-step agent runs using OpenTelemetry. We’ll focus on what teams need in real systems: searchability (run_id), safe logging, and fast incident triage.

TL;DR

  • Trace = 1 user request / 1 agent run.
  • Span = each step (plan, tool call, retrieval, final).
  • Add run_id, tool.name, llm.model, prompt.version as span attributes so Jaeger search works.
  • Keep 100% of error traces (tail sampling) and downsample the rest.
  • Don’t store raw prompts/tool args in production by default; store summaries/hashes + strict RBAC.


What Jaeger is (and what it is not)

Jaeger is an open-source distributed tracing backend. It stores traces (spans), provides a UI to explore timelines, and helps you understand request flows across services.

Jaeger is not a complete observability platform by itself. Most teams pair it with metrics (Prometheus/Grafana) and logs (ELK/OpenSearch/Loki). For LLM agents, Jaeger is the best “trace-first” entry point because timelines are how agent failures present.

Why Jaeger is great for agent debugging

  • Request narrative: agents are sequential + branching systems. Traces show the narrative.
  • Root-cause speed: instantly spot if the tool call timed out vs. the model stalled.
  • Cross-service visibility: planner service → tool service → DB → third-party API, all in one view.

Span model for tool calling and RAG

Start with a consistent span naming convention. Example:

trace (run_id=R123)
  span: agent.plan
  span: llm.generate (model=gpt-4.1)
  span: tool.search (tool.name=web_search)
  span: tool.search.result (http.status=200)
  span: rag.retrieve (top_k=10)
  span: final.compose

Recommended attributes (keep them structured):

  • run_id (critical: makes incident triage fast)
  • tool.name, tool.type, tool.status, http.status_code
  • llm.provider, llm.model, llm.tokens_in, llm.tokens_out
  • prompt.version or prompt.hash
  • rag.top_k, rag.source, rag.hit_count (avoid raw retrieved content)

The cleanest workflow is: your app logs a run_id for each user request, and Jaeger traces carry the same attribute. Then you can search Jaeger by run_id and open the exact trace in seconds.

  • Log run_id at request start and return it in API responses for support tickets.
  • Add run_id as a span attribute on the root span (and optionally all spans).
  • Use Jaeger search to filter by run_id, error=true, or tool.name.
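The request-handler side of this workflow can be sketched in a few lines (names like `handle_request` and `run_agent` are hypothetical stand-ins for your own entry points):

```python
import uuid

def run_agent(payload: dict) -> str:
    # Stand-in for your agent entry point.
    return "done"

def handle_request(payload: dict) -> dict:
    # Mint a run_id at request start so support tickets can quote it.
    run_id = f"R{uuid.uuid4().hex[:8]}"
    # With the OTel SDK active, you would also tag the root span:
    # trace.get_current_span().set_attribute("run_id", run_id)
    return {"run_id": run_id, "answer": run_agent(payload)}

resp = handle_request({"q": "hello"})
print(resp["run_id"])  # e.g. "R3f9a12bc"
```

With the attribute in place, a support ticket quoting `R3f9a12bc` turns into a single Jaeger tag search.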

Common failure patterns Jaeger reveals

1) Broken context propagation (fragmented traces)

If tool calls run as separate services, missing trace propagation breaks the timeline. You’ll see disconnected traces instead of one end-to-end trace. Fix: propagate trace headers (W3C Trace Context) into tool HTTP calls or internal RPC.
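For reference, the `traceparent` header defined by W3C Trace Context is just `version-traceid-parentid-flags`. A stdlib-only sketch of building one (in practice, `opentelemetry.propagate.inject(headers)` does this for you from the active span context):

```python
import secrets

def traceparent_header(trace_id: str, span_id: str) -> str:
    # W3C Trace Context format: version-traceid-parentid-flags
    # "01" flags = sampled.
    return f"00-{trace_id}-{span_id}-01"

trace_id = secrets.token_hex(16)  # 32 hex chars
span_id = secrets.token_hex(8)    # 16 hex chars
headers = {"traceparent": traceparent_header(trace_id, span_id)}
# Attach `headers` to the outgoing tool HTTP call, e.g.
# requests.post(tool_url, json=args, headers=headers)
print(headers["traceparent"])
```

If the tool service's OTel instrumentation sees this header on the incoming request, its spans join the same trace and the timeline stays whole.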

2) “Tool call succeeded” but agent still failed

This often indicates parsing/validation issues (schema mismatch), prompt regression, or poor retrieval. The trace shows tool latency is fine; failure happens in the LLM generation span or post-processing span.

3) Slow runs caused by retries

Retries add up. In Jaeger, you’ll see repeated tool spans. Add attributes like retry.count and retry.reason to make it obvious.
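One way to make retries visible is a thin wrapper that counts attempts and records the last failure reason (a sketch; `call_tool_with_retries` is a hypothetical helper, and in real code the collected attributes would be set on the tool span via `span.set_attribute`):

```python
import time

def call_tool_with_retries(call, max_attempts=3, record=None):
    """Retry a tool call, collecting retry.count / retry.reason attributes."""
    record = record if record is not None else {}
    for attempt in range(1, max_attempts + 1):
        try:
            result = call()
            record["retry.count"] = attempt - 1
            return result, record
        except Exception as exc:
            record["retry.count"] = attempt
            record["retry.reason"] = type(exc).__name__
            if attempt == max_attempts:
                raise
            time.sleep(0)  # backoff elided for brevity

# A flaky tool that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("tool timed out")
    return "ok"

result, attrs = call_tool_with_retries(flaky)
print(result, attrs)  # ok {'retry.count': 2, 'retry.reason': 'TimeoutError'}
```

In Jaeger, `retry.count=2` on the span turns "why are there three tool.search spans?" into an answer you can read off the attributes.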

Setup overview: OTel → Collector → Jaeger

A simple production-friendly architecture is:

Agent Runtime (OTel SDK)  ->  OTel Collector  ->  Jaeger (storage + UI)

Export OTLP from your agent to the Collector, apply tail sampling + redaction there, and export to Jaeger.
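A minimal Collector config for this pipeline might look like the following. This is a sketch assuming the `opentelemetry-collector-contrib` distribution (which ships the `tail_sampling` processor) and Jaeger's native OTLP ingest on port 4317; endpoints and thresholds are placeholders to tune for your system:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors          # 100% of failed runs
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow            # 100% of slow runs
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample-rest          # downsample the healthy majority
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/jaeger]
```

The three policies implement the TL;DR rule directly: every error or slow trace survives, everything else is sampled down.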

Privacy + redaction guidance

  • Do not store raw prompts/tool arguments by default in production traces.
  • Store summaries, hashes, or classified metadata (e.g., “contains_pii=true”) instead.
  • Keep detailed logging behind feature flags, short retention, and strict RBAC.
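The "hash instead of store" idea can be sketched as a small attribute filter (illustrative only: the field names `llm.prompt` and `tool.args` are assumptions, and in a real deployment this logic would live in a span processor or a Collector transform rather than application code):

```python
import hashlib

SENSITIVE_KEYS = {"llm.prompt", "tool.args"}

def redact_attrs(attrs: dict) -> dict:
    """Replace sensitive fields with a short hash + length before export."""
    out = {}
    for key, value in attrs.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:16]
            out[key + ".sha256"] = digest    # correlate without revealing
            out[key + ".length"] = len(str(value))
        else:
            out[key] = value
    return out

safe = redact_attrs({
    "tool.name": "web_search",
    "tool.args": '{"q": "alice@example.com"}',
})
print(sorted(safe))
```

The hash still lets you tell "same prompt, different outcome" apart from "prompt changed" during triage, without putting PII in the trace store.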


Production checklist

  • Define a span naming convention + attribute schema (run_id, tool attributes, model info).
  • Propagate trace context into tool calls (headers/middleware).
  • Use tail sampling to keep full traces for failures/slow runs.
  • Redact PII/secrets and restrict access to sensitive trace fields.
  • Train the team on a basic incident workflow: “get run_id → find trace → identify slow/error span → fix.”

FAQ

Jaeger vs Tempo: which should I use?

If you want a straightforward tracing backend with a classic trace UI, Jaeger is a strong default. If you expect very high volume and want object-storage economics, Tempo can be a better fit (especially with Grafana).


Jaeger for LLM agents helps you debug tool calls, measure per-step latency, and keep distributed context across services.

Author’s Bio

Vineet Tiwari

Vineet Tiwari is an accomplished Solution Architect with over 5 years of experience in AI, ML, Web3, and Cloud technologies. Specializing in Large Language Models (LLMs) and blockchain systems, he excels in building secure AI solutions and custom decentralized platforms tailored to unique business needs.

Vineet’s expertise spans cloud-native architectures, data-driven machine learning models, and innovative blockchain implementations. Passionate about leveraging technology to drive business transformation, he combines technical mastery with a forward-thinking approach to deliver scalable, secure, and cutting-edge solutions. With a strong commitment to innovation, Vineet empowers businesses to thrive in an ever-evolving digital landscape.
