Debugging LLM Agent Tool Calls with Distributed Traces (Run IDs, Spans, Failures) | Jaeger

Jaeger is one of the easiest ways to see what your LLM agent actually did in production. When an agent fails, the final answer rarely tells you the real story. The story is in the timeline: planning, tool selection, retries, RAG retrieval, and downstream service latency.


In this guide, we’ll build a practical Jaeger workflow for debugging tool calls and multi-step agent runs using OpenTelemetry. We’ll focus on what teams need in real systems: searchability (run_id), safe logging, and fast incident triage.

TL;DR

  • Trace = 1 user request / 1 agent run.
  • Span = each step (plan, tool call, retrieval, final).
  • Add run_id, tool.name, llm.model, prompt.version as span attributes so Jaeger search works.
  • Keep 100% of error traces (tail sampling) and downsample the rest.
  • Don’t store raw prompts/tool args in production by default; store summaries/hashes + strict RBAC.


What Jaeger is (and what it is not)

Jaeger is an open-source distributed tracing backend. It stores traces (spans), provides a UI to explore timelines, and helps you understand request flows across services.

Jaeger is not a complete observability platform by itself. Most teams pair it with metrics (Prometheus/Grafana) and logs (ELK/OpenSearch/Loki). For LLM agents, Jaeger is the best “trace-first” entry point because timelines are how agent failures present.

Why Jaeger is great for agent debugging

  • Request narrative: agents are sequential + branching systems. Traces show the narrative.
  • Root-cause speed: instantly spot if the tool call timed out vs. the model stalled.
  • Cross-service visibility: planner service → tool service → DB → third-party API, all in one view.

Span model for tool calling and RAG

Start with a consistent span naming convention. Example:

trace (run_id=R123)
  span: agent.plan
  span: llm.generate (model=gpt-4.1)
  span: tool.search (tool.name=web_search)
  span: tool.search.result (http.status=200)
  span: rag.retrieve (top_k=10)
  span: final.compose

Recommended attributes (keep them structured):

  • run_id (critical: makes incident triage fast)
  • tool.name, tool.type, tool.status, http.status_code
  • llm.provider, llm.model, llm.tokens_in, llm.tokens_out
  • prompt.version or prompt.hash
  • rag.top_k, rag.source, rag.hit_count (avoid raw retrieved content)

The cleanest workflow is: your app logs a run_id for each user request, and Jaeger traces carry the same attribute. Then you can search Jaeger by run_id and open the exact trace in seconds.

  • Log run_id at request start and return it in API responses for support tickets.
  • Add run_id as a span attribute on the root span (and optionally all spans).
  • Use Jaeger search to filter by run_id, error=true, or tool.name.
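The request-handler side of this workflow can be sketched in a few lines (names like `handle_request` and `run_agent` are hypothetical stand-ins for your own entry points):

```python
import uuid

def run_agent(payload: dict) -> str:
    # Stand-in for your agent entry point.
    return "done"

def handle_request(payload: dict) -> dict:
    # Mint a run_id at request start so support tickets can quote it.
    run_id = f"R{uuid.uuid4().hex[:8]}"
    # With the OTel SDK active, you would also tag the root span:
    # trace.get_current_span().set_attribute("run_id", run_id)
    return {"run_id": run_id, "answer": run_agent(payload)}

resp = handle_request({"q": "hello"})
print(resp["run_id"])  # e.g. "R3f9a12bc"
```

With the attribute in place, a support ticket quoting `R3f9a12bc` turns into a single Jaeger tag search.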

Common failure patterns Jaeger reveals

1) Broken context propagation (fragmented traces)

If tool calls run as separate services, missing trace propagation breaks the timeline. You’ll see disconnected traces instead of one end-to-end trace. Fix: propagate trace headers (W3C Trace Context) into tool HTTP calls or internal RPC.
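For reference, the `traceparent` header defined by W3C Trace Context is just `version-traceid-parentid-flags`. A stdlib-only sketch of building one (in practice, `opentelemetry.propagate.inject(headers)` does this for you from the active span context):

```python
import secrets

def traceparent_header(trace_id: str, span_id: str) -> str:
    # W3C Trace Context format: version-traceid-parentid-flags
    # "01" flags = sampled.
    return f"00-{trace_id}-{span_id}-01"

trace_id = secrets.token_hex(16)  # 32 hex chars
span_id = secrets.token_hex(8)    # 16 hex chars
headers = {"traceparent": traceparent_header(trace_id, span_id)}
# Attach `headers` to the outgoing tool HTTP call, e.g.
# requests.post(tool_url, json=args, headers=headers)
print(headers["traceparent"])
```

If the tool service's OTel instrumentation sees this header on the incoming request, its spans join the same trace and the timeline stays whole.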

2) “Tool call succeeded” but agent still failed

This often indicates parsing/validation issues (schema mismatch), prompt regression, or poor retrieval. The trace shows tool latency is fine; failure happens in the LLM generation span or post-processing span.

3) Slow runs caused by retries

Retries add up. In Jaeger, you’ll see repeated tool spans. Add attributes like retry.count and retry.reason to make it obvious.
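One way to make retries visible is a thin wrapper that counts attempts and records the last failure reason (a sketch; `call_tool_with_retries` is a hypothetical helper, and in real code the collected attributes would be set on the tool span via `span.set_attribute`):

```python
import time

def call_tool_with_retries(call, max_attempts=3, record=None):
    """Retry a tool call, collecting retry.count / retry.reason attributes."""
    record = record if record is not None else {}
    for attempt in range(1, max_attempts + 1):
        try:
            result = call()
            record["retry.count"] = attempt - 1
            return result, record
        except Exception as exc:
            record["retry.count"] = attempt
            record["retry.reason"] = type(exc).__name__
            if attempt == max_attempts:
                raise
            time.sleep(0)  # backoff elided for brevity

# A flaky tool that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("tool timed out")
    return "ok"

result, attrs = call_tool_with_retries(flaky)
print(result, attrs)  # ok {'retry.count': 2, 'retry.reason': 'TimeoutError'}
```

In Jaeger, `retry.count=2` on the span turns "why are there three tool.search spans?" into an answer you can read off the attributes.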

Setup overview: OTel → Collector → Jaeger

A simple production-friendly architecture is:

Agent Runtime (OTel SDK)  ->  OTel Collector  ->  Jaeger (storage + UI)

Export OTLP from your agent to the Collector, apply tail sampling + redaction there, and export to Jaeger.
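A minimal Collector config for this pipeline might look like the following. This is a sketch assuming the `opentelemetry-collector-contrib` distribution (which ships the `tail_sampling` processor) and Jaeger's native OTLP ingest on port 4317; endpoints and thresholds are placeholders to tune for your system:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors          # 100% of failed runs
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow            # 100% of slow runs
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample-rest          # downsample the healthy majority
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/jaeger]
```

The three policies implement the TL;DR rule directly: every error or slow trace survives, everything else is sampled down.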

Privacy + redaction guidance

  • Do not store raw prompts/tool arguments by default in production traces.
  • Store summaries, hashes, or classified metadata (e.g., “contains_pii=true”) instead.
  • Keep detailed logging behind feature flags, short retention, and strict RBAC.
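The "hash instead of store" idea can be sketched as a small attribute filter (illustrative only: the field names `llm.prompt` and `tool.args` are assumptions, and in a real deployment this logic would live in a span processor or a Collector transform rather than application code):

```python
import hashlib

SENSITIVE_KEYS = {"llm.prompt", "tool.args"}

def redact_attrs(attrs: dict) -> dict:
    """Replace sensitive fields with a short hash + length before export."""
    out = {}
    for key, value in attrs.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:16]
            out[key + ".sha256"] = digest    # correlate without revealing
            out[key + ".length"] = len(str(value))
        else:
            out[key] = value
    return out

safe = redact_attrs({
    "tool.name": "web_search",
    "tool.args": '{"q": "alice@example.com"}',
})
print(sorted(safe))
```

The hash still lets you tell "same prompt, different outcome" apart from "prompt changed" during triage, without putting PII in the trace store.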


Production checklist

  • Define a span naming convention + attribute schema (run_id, tool attributes, model info).
  • Propagate trace context into tool calls (headers/middleware).
  • Use tail sampling to keep full traces for failures/slow runs.
  • Redact PII/secrets and restrict access to sensitive trace fields.
  • Train the team on a basic incident workflow: “get run_id → find trace → identify slow/error span → fix.”

FAQ

Jaeger vs Tempo: which should I use?

If you want a straightforward tracing backend with a classic trace UI, Jaeger is a strong default. If you expect very high volume and want object-storage economics, Tempo can be a better fit (especially with Grafana).


Jaeger for LLM agents helps you debug tool calls, measure per-step latency, and keep distributed context across services.

Author’s Bio

Vineet Tiwari

Vineet Tiwari is an accomplished Solution Architect with over 5 years of experience in AI, ML, Web3, and Cloud technologies. Specializing in Large Language Models (LLMs) and blockchain systems, he excels in building secure AI solutions and custom decentralized platforms tailored to unique business needs.

Vineet’s expertise spans cloud-native architectures, data-driven machine learning models, and innovative blockchain implementations. Passionate about leveraging technology to drive business transformation, he combines technical mastery with a forward-thinking approach to deliver scalable, secure, and cutting-edge solutions. With a strong commitment to innovation, Vineet empowers businesses to thrive in an ever-evolving digital landscape.
