OpenTelemetry Collector for LLM agents
The OpenTelemetry Collector is the most underrated piece of an LLM agent observability stack. Instrumenting your agent runtime is step 1. Step 2 (the step most teams miss) is operationalizing telemetry: routing, batching, sampling, redaction, and exporting traces/metrics/logs to the right backend without rewriting every service.

If you are building agents with tool calling, RAG, retries, and multi-step plans, your system generates a lot of spans. The Collector lets you keep what matters (errors/slow runs) while controlling cost and enforcing governance centrally.
TL;DR
- Think of the Collector as a programmable telemetry router: OTLP in → processors → exporters out.
- For LLM agents, the Collector is where you enforce consistent attributes like run_id, tool.name, prompt.version, llm.model, and tenant.
- Use tail sampling so you keep full traces for failed/slow runs and downsample successful runs.
- Implement redaction at the Collector layer so you never leak PII/secrets into your trace backend.
- Export via OTLP/Jaeger/Tempo/Datadog/New Relic without touching app code.
Table of Contents
- What is the OpenTelemetry Collector?
- Why LLM agents need the Collector (not just SDK instrumentation)
- Collector architecture: receivers → processors → exporters
- A practical telemetry model for LLM agents
- Recommended pipelines for agents (traces, metrics, logs)
- Tail sampling patterns for agent runs
- Redaction, governance, and safe logging
- Deployment options and scaling
- Troubleshooting and validation
- Tools & platforms (official + GitHub links)
- Production checklist
- FAQ
What is the OpenTelemetry Collector?
The OpenTelemetry Collector is a vendor-neutral service that receives telemetry (traces/metrics/logs), processes it (batching, filtering, sampling, attribute transforms), and exports it to one or more observability backends.
Instead of configuring exporters inside every microservice/agent/tool, you standardize on sending OTLP to the Collector. From there, your team can change destinations, apply policy, and manage cost in one place.
Why LLM agents need the Collector (not just SDK instrumentation)
- Central policy: enforce PII redaction, attribute schema, and retention rules once.
- Cost control: agents generate high span volume; the Collector is where sampling and filtering become practical.
- Multi-backend routing: send traces to Tempo for cheap storage, but also send error traces to Sentry/Datadog/New Relic (see the fan-out sketch after this list).
- Reliability: buffer/batch/queue telemetry so your app doesn’t block on exporter issues.
- Consistency: align tool services, background workers, and the agent runtime under one trace model.
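As a sketch of the multi-backend idea (the exporter names and vendor endpoint below are placeholders): a traces pipeline fans out its output to every exporter listed for it, so Tempo and a vendor backend can receive the same stream without touching app code. Restricting the vendor copy to error traces would mean adding a filter or sampling step in front of it.
exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  otlphttp/vendor:
    # placeholder OTLP/HTTP endpoint for a commercial backend (Datadog, New Relic, ...)
    endpoint: https://otlp.vendor.example.com:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/tempo, otlphttp/vendor]  # fan-out: both destinations get the pipeline output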
Collector architecture: receivers → processors → exporters
The Collector is configured as pipelines:
receivers (OTLP in) -> processors (policy) -> exporters (destinations)
Typical building blocks you’ll use for agent systems:
- Receivers: otlp (gRPC/HTTP), sometimes jaeger or zipkin for legacy sources.
- Processors: batch, attributes, transform, tail_sampling, memory_limiter.
- Exporters: otlp / otlphttp to Tempo/OTel backends, the Jaeger exporter, vendor exporters.
A practical telemetry model for LLM agents
Before you write Collector config, define a small attribute schema. This makes traces searchable and makes sampling rules possible.
- Trace = 1 user request / 1 agent run
- Span = a step (plan, tool call, retrieval, final response)
- Key attributes (examples):
  - run_id: stable id you also log in your app
  - tenant / org_id: for multi-tenant systems
  - tool.name, tool.type, tool.status, tool.latency_ms
  - llm.provider, llm.model, llm.tokens_in, llm.tokens_out
  - prompt.version or prompt.hash
  - rag.top_k, rag.source, rag.hit_count (avoid raw content)
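One way to keep this schema honest is to backfill or normalize keys at the Collector with an attributes processor, so search and sampling rules never hit empty values. A minimal sketch; the default value and the legacy key tool_name are illustrative assumptions, not fixed conventions:
processors:
  attributes/schema:
    actions:
      # Backfill prompt.version when the SDK did not set it
      - key: prompt.version
        action: insert
        value: unversioned
      # Copy a legacy key (hypothetical: tool_name) into the schema key tool.name
      - key: tool.name
        action: insert
        from_attribute: tool_name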
Recommended pipelines for agents (traces, metrics, logs)
Most agent teams should start with traces first, then add metrics/logs once the trace schema is stable.
Minimal traces pipeline (starter)
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 2s
    send_batch_size: 2048

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/tempo]
Agent-ready traces pipeline (attributes + tail sampling)
This is where the Collector starts paying for itself: you keep the traces that matter.
processors:
  # Enforce a standard service.name (a resource attribute, so use the resource processor)
  resource/agent:
    attributes:
      - key: service.name
        action: upsert
        value: llm-agent
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 200
    policies:
      # Keep all error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep slow runs (e.g., total run > 8s)
      - name: slow
        type: latency
        latency:
          threshold_ms: 8000
      # Otherwise sample successful runs at 5%
      - name: probabilistic-success
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
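These processors only take effect once they are listed in a pipeline. A minimal wiring sketch, reusing the otlphttp/tempo exporter from the starter config and keeping memory_limiter first and batch last:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource/agent, tail_sampling, batch]
      exporters: [otlphttp/tempo]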
Tail sampling patterns for agent runs
Agent systems are spiky: a single run can generate dozens of spans (planner + multiple tool calls + retries). Tail sampling helps because it decides after it sees how the trace ended.
- Keep 100% of traces where error=true or the span status is ERROR.
- Keep 100% of traces where a tool returned 401/403/429/500 or timed out (see the policy sketch after this list).
- Keep 100% of traces where the run latency exceeds a threshold.
- Sample the rest (e.g., 1-10%) for baseline performance monitoring.
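The tool-error rule above can be added alongside the policies shown earlier as a numeric_attribute policy, assuming your tool spans carry an http.status_code attribute (substitute whichever status attribute your instrumentation actually emits, e.g. the newer http.response.status_code):
tail_sampling:
  policies:
    # Keep any trace that contains a span with a 4xx/5xx HTTP status code
    - name: tool-http-errors
      type: numeric_attribute
      numeric_attribute:
        key: http.status_code
        min_value: 400
        max_value: 599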
Redaction, governance, and safe logging
LLM systems deal with sensitive inputs (customer text, internal docs, credentials). Your tracing stack must be designed for safety. Practical rules:
- Never export secrets: API keys, tokens, cookies. Log references (key_id) only.
- Redact PII: emails, phone numbers, addresses. Avoid raw prompts/tool arguments in production (a redaction sketch follows this list).
- Separate data classes: store aggregated metrics longer; store raw prompts/traces on short retention.
- RBAC: restrict who can view tool arguments, retrieved snippets, and prompt templates.
- Auditability: keep enough metadata to answer “who/what/when” without storing raw payloads.
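As one way to enforce the "no raw payloads" rule centrally, an attributes processor can delete or hash sensitive keys before anything is exported. The attribute names below (prompt.text, tool.arguments, user.input) are hypothetical; substitute whatever your instrumentation emits.
processors:
  attributes/redact:
    actions:
      # Drop raw prompt and tool payloads entirely
      - key: prompt.text
        action: delete
      - key: tool.arguments
        action: delete
      # Keep only a fingerprint of the user input for correlation
      - key: user.input
        action: hash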
Deployment options and scaling
- Sidecar: best when you want per-service isolation; simpler network policies.
- DaemonSet (Kubernetes): good default; each node runs a Collector instance.
- Gateway: centralized Collectors behind a load balancer; good for advanced routing and multi-tenant setups.
Also enable memory_limiter + batch to avoid the Collector becoming the bottleneck.
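For gateway (and any high-throughput) deployments, exporter-side queueing and retries keep the Collector from dropping data during backend hiccups. A sketch with a placeholder endpoint and starting-point numbers, not tuned recommendations:
exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc:4317  # placeholder gateway address
    tls:
      insecure: true          # assuming an in-cluster, non-TLS gateway
    sending_queue:
      enabled: true
      queue_size: 5000        # spans buffer here while the backend is slow or unavailable
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s  # give up after 5 minutes of retries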
Troubleshooting and validation
- Verify your app exports OTLP: you should see spans in the backend within seconds.
- If traces are missing, check network (4317 gRPC / 4318 HTTP) and service discovery.
- Add a temporary debug exporter (the successor to the older logging exporter) in non-prod to confirm the Collector receives data (see the sketch after this list).
- Ensure context propagation works across tools; otherwise traces will fragment.
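A minimal way to do that non-prod check: wire the built-in debug exporter alongside your real exporter and watch the Collector's stdout for incoming spans.
exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlphttp/tempo]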
Tools & platforms (official + GitHub links)
- OpenTelemetry: opentelemetry.io
- OpenTelemetry Collector: GitHub
- Jaeger: jaegertracing.io | GitHub
- Grafana Tempo: grafana.com/oss/tempo | GitHub
- Zipkin: zipkin.io | GitHub
Production checklist
- Define a stable trace/attribute schema for agent runs (run_id, tool spans, prompt version).
- Route OTLP to the Collector (don’t hard-code exporters per service).
- Enable batching + memory limits.
- Implement tail sampling for errors/slow runs and downsample success.
- Add redaction rules + RBAC + retention controls.
- Validate end-to-end trace continuity across tool services.
FAQ
Do I need the Collector if I already use an APM like Datadog/New Relic?
Often yes. The Collector lets you enforce sampling/redaction and route telemetry cleanly. You can still export to your APM; it becomes one destination rather than the only architecture.
Should I store prompts and tool arguments in traces?
In production, avoid raw payloads by default. Store summaries/hashes and only enable detailed logging for short-lived debugging with strict access control.
Related reads on aivineet
- LLM Agent Tracing & Distributed Context | OpenTelemetry (OTel)
- LLM Agent Observability & Audit Logs
- Tool Calling Reliability for LLM Agents
OpenTelemetry Collector for LLM agents is especially useful for agent systems where you need to debug tool calls and control telemetry cost with tail sampling.




