Tag: Prompt Engineering

  • Tool Calling Reliability for LLM Agents: Schemas, Validation, Retries (Production Checklist)

    Tool calling is where most “agent demos” die in production. Models are great at writing plausible text, but tools require correct structure, correct arguments, and correct sequencing under timeouts, partial failures, and messy user inputs. If you want reliable LLM agents, you need a tool-calling reliability layer: schemas, validation, retries, idempotency, and observability.

    This guide is a practical, production-first checklist for making tool-using agents dependable. It focuses on tool schemas, strict validation, safe retries, rate limits, and the debugging instrumentation you need to stop “random” failures from becoming incidents.

    TL;DR

    • Define tight tool schemas (types + constraints) and validate inputs and outputs.
    • Prefer deterministic tools and idempotent actions where possible.
    • Use retries with backoff only for safe failure modes (timeouts, 429s), not logic errors.
    • Add timeouts, budgets, and stop conditions to prevent tool thrashing.
    • Log everything: tool name, args, response, latency, errors (with PII redaction).

    Why tool calling fails in production

    Tool calls fail for boring reasons – and boring reasons are the hardest to debug when an LLM is in the loop:

    • Schema drift: the tool expects one shape; the model produces another.
    • Ambiguous arguments: the model guesses missing fields (wrong IDs, wrong dates, wrong currency).
    • Partial failures: retries, timeouts, and 429s create inconsistent state.
    • Non-idempotent actions: “retry” creates duplicates (double charge, duplicate ticket, repeated email).
    • Tool thrashing: the agent loops, calling tools without converging.

    Therefore, reliability comes from engineering the boundary between the model and the tools – not from “better prompting” alone.

    Tool schemas: types, constraints, and guardrails

    A good tool schema is more than a JSON shape. It encodes business rules and constraints so the model has fewer ways to be wrong.

    Design principles

    • Make required fields truly required. No silent defaults.
    • Use enums for modes and categories (avoid free text).
    • Constrain strings with patterns (e.g., ISO dates, UUIDs).
    • Separate “intent” from “execution” (plan first, act second).

    Example: a strict tool schema (illustrative)

    {
      "name": "create_support_ticket",
      "description": "Create a support ticket in the helpdesk.",
      "parameters": {
        "type": "object",
        "additionalProperties": false,
        "required": ["customer_id", "subject", "priority", "body"],
        "properties": {
          "customer_id": {"type": "string", "pattern": "^[0-9]{6,}$"},
          "subject": {"type": "string", "minLength": 8, "maxLength": 120},
          "priority": {"type": "string", "enum": ["low", "medium", "high", "urgent"]},
          "body": {"type": "string", "minLength": 40, "maxLength": 4000},
          "idempotency_key": {"type": "string", "minLength": 12, "maxLength": 80}
        }
      }
    }

    Notice the constraints: no extra fields, strict required fields, patterns, and an explicit idempotency key.

    Validation: input, output, and schema enforcement

    In production, treat the model as an untrusted caller. Validate both directions:

    • Input validation: before the tool runs (types, required fields, bounds).
    • Output validation: after the tool runs (expected response schema).
    • Semantic validation: sanity checks (dates in the future, currency totals add up, IDs exist).

    Example: schema-first execution (pseudo)

    1) Model proposes tool call + arguments
    2) Validator checks JSON schema (reject if invalid)
    3) Business rules validate semantics (reject if unsafe)
    4) Execute tool with timeout + idempotency key
    5) Validate tool response schema
    6) Only then show final answer to user
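
    To make these steps concrete, here is a minimal Python sketch built on the jsonschema library. The customer_exists and create_ticket helpers are hypothetical stubs for your own business rules and helpdesk client, and the response schema is an assumption about what the API returns.

    import jsonschema

    # The "parameters" object from the schema above.
    PARAMS_SCHEMA = {
        "type": "object",
        "additionalProperties": False,
        "required": ["customer_id", "subject", "priority", "body"],
        "properties": {
            "customer_id": {"type": "string", "pattern": "^[0-9]{6,}$"},
            "subject": {"type": "string", "minLength": 8, "maxLength": 120},
            "priority": {"type": "string", "enum": ["low", "medium", "high", "urgent"]},
            "body": {"type": "string", "minLength": 40, "maxLength": 4000},
            "idempotency_key": {"type": "string", "minLength": 12, "maxLength": 80},
        },
    }

    # Hypothetical shape of the helpdesk response.
    RESPONSE_SCHEMA = {
        "type": "object",
        "required": ["ticket_id", "status"],
        "properties": {
            "ticket_id": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "pending"]},
        },
    }

    def customer_exists(customer_id: str) -> bool:
        return True  # stub: replace with a real lookup

    def create_ticket(args: dict, timeout_s: int) -> dict:
        return {"ticket_id": "T-1", "status": "open"}  # stub: replace with the real API call

    def run_tool_call(args: dict) -> dict:
        # Step 2: structural validation -- reject the call before anything runs.
        jsonschema.validate(instance=args, schema=PARAMS_SCHEMA)
        # Step 3: semantic validation -- business rules the schema cannot express.
        if not customer_exists(args["customer_id"]):
            raise ValueError("unknown customer_id")
        # Step 4: execute with a hard timeout (enforced inside the real client).
        response = create_ticket(args, timeout_s=10)
        # Step 5: validate the tool's response before trusting it downstream.
        jsonschema.validate(instance=response, schema=RESPONSE_SCHEMA)
        return response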

    Retries: when they help vs when they make it worse

    Retries are useful for transient failures (timeouts, 429 rate limits). However, they are dangerous for logic failures (bad args) and non-idempotent actions.

    • Retry timeouts, connection errors, and 429s with exponential backoff.
    • Do not retry 400s without changing arguments (force the model to correct the call).
    • Cap retries and add a fallback path (ask the user for missing info, escalate to a human).
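
    As a sketch, a retry wrapper that honors these rules might look like this in Python; ToolHTTPError is an assumed exception type that carries the HTTP status from your tool client.

    import random
    import time

    RETRYABLE_STATUS = {408, 429, 502, 503, 504}  # transient only; never 4xx logic errors

    class ToolHTTPError(Exception):
        def __init__(self, status: int, message: str = ""):
            super().__init__(message)
            self.status = status

    def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.5):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn()
            except ToolHTTPError as err:
                # 400s surface immediately so the model can correct its arguments.
                if err.status not in RETRYABLE_STATUS or attempt == max_attempts:
                    raise
                # Exponential backoff with jitter to avoid thundering herds.
                time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.25))

    Pair this with the idempotency keys described below, so a retried write cannot duplicate its side effect.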

    Idempotency: the key to safe actions

    Idempotency means “the same request can be applied multiple times without changing the result.” It is the difference between safe retries and duplicated side effects.

    • For write actions (create ticket, charge card, send email), require an idempotency key.
    • Store and dedupe by that key for a reasonable window.
    • Return the existing result if the key was already processed.
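
    A minimal in-memory sketch of that dedupe logic in Python; a production version would back the store with Redis or a database, and create_ticket is again a hypothetical write action.

    import time

    class IdempotencyStore:
        def __init__(self, window_s: int = 24 * 3600):
            self.window_s = window_s
            self._seen = {}  # key -> (timestamp, result)

        def get(self, key: str):
            entry = self._seen.get(key)
            if entry and time.time() - entry[0] < self.window_s:
                return entry[1]  # already processed: hand back the prior result
            return None

        def put(self, key: str, result: dict) -> None:
            self._seen[key] = (time.time(), result)

    STORE = IdempotencyStore()

    def create_ticket(args: dict) -> dict:
        return {"ticket_id": "T-1", "status": "open"}  # stub write action

    def create_ticket_idempotent(args: dict) -> dict:
        cached = STORE.get(args["idempotency_key"])
        if cached is not None:
            return cached  # safe retry: no duplicate side effect
        result = create_ticket(args)
        STORE.put(args["idempotency_key"], result)
        return result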

    Budgets, timeouts, and anti-thrashing

    • Timeout every tool call (hard upper bound).
    • Budget tool calls per task (e.g., max 8 calls) and max steps.
    • Stop conditions: detect loops, repeated failures, or repeated identical calls.
    • Ask-for-clarification triggers: missing IDs, ambiguous user intent, insufficient context.
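
    A loop guard can be as small as this Python sketch; the thresholds and the history format are illustrative assumptions.

    MAX_TOOL_CALLS = 8       # illustrative budget per task
    MAX_IDENTICAL_CALLS = 2  # repeating the same call is a strong thrashing signal

    def within_budget(history: list) -> tuple:
        """history: list of (tool_name, canonical_args_json) for the current task."""
        if len(history) >= MAX_TOOL_CALLS:
            return False, "tool-call budget exhausted"
        if history and history.count(history[-1]) >= MAX_IDENTICAL_CALLS:
            return False, "repeated identical call detected"
        return True, ""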

    Observability: traces, audits, and debugging

    When a tool-using agent fails, you need to answer: what did it try, what did the tool return, and why did it choose that path?

    • Log: tool name, args (redacted), response (redacted), latency, retries, error codes.
    • Add trace IDs across model + tools for end-to-end debugging.
    • Store “replayable” runs for regression testing.
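
    Here is a minimal audit-logging sketch in Python; the email regex is a placeholder for your real redaction rules, and the log fields mirror the list above.

    import json
    import logging
    import re
    import time

    logger = logging.getLogger("tool_audit")

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def redact(text: str) -> str:
        # Placeholder redaction: extend with your own PII patterns.
        return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

    def log_tool_call(trace_id, tool_name, args, response, latency_ms, retries=0, error=None):
        logger.info(json.dumps({
            "trace_id": trace_id,  # shared across model and tool spans
            "ts": time.time(),
            "tool": tool_name,
            "args": redact(json.dumps(args)),
            "response": redact(json.dumps(response)),
            "latency_ms": latency_ms,
            "retries": retries,
            "error": error,
        }))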

    Production checklist

    • Define strict tool schemas (no extra fields).
    • Validate inputs and outputs with schemas.
    • Add semantic checks for high-risk parameters.
    • Enforce timeouts + budgets + stop conditions.
    • Require idempotency keys for side-effect tools.
    • Retry only safe transient failures with backoff.
    • Instrument tracing and tool-call audits (with redaction).

    FAQ

    Is prompting enough to make tool calling reliable?

    No. Prompting helps, but production reliability comes from schemas, validation, idempotency, and observability.

    What should I implement first?

    Start with strict schemas + validation + timeouts. Then add idempotency for write actions, and finally build monitoring and regression evals.

  • Agent Evaluation Framework: How to Test LLM Agents (Offline Evals + Production Monitoring)

    If you ship LLM agents in production, you’ll eventually hit the same painful truth: agents don’t fail once; they fail in new, surprising ways every time you change a prompt, tool, model, or knowledge source. That’s why you need an agent evaluation framework: a repeatable way to test LLM agents offline, monitor them in production, and catch regressions before customers do.

    This guide gives you a practical, enterprise-ready evaluation stack: offline evals, golden tasks, scoring rubrics, automated regression checks, and production monitoring (traces, tool-call audits, and safety alerts). If you’re building under reliability/governance constraints, this is the fastest way to move from “it works on my laptop” to “it holds up in the real world.”

    Moreover, an evaluation framework is not a one-time checklist. It is an ongoing loop that improves as your agent ships to more users and encounters more edge cases.

    TL;DR

    • Offline evals catch regressions early (prompt changes, tool changes, model upgrades).
    • Evaluate agents on task success, not just “answer quality”. Track tool-calls, latency, cost, and safety failures.
    • Use golden tasks + adversarial tests (prompt injection, tool misuse, long context failures).
    • In production, add tracing + audits (prompt/tool logs), plus alerts for safety/quality regressions.
    • Build a loop: Collect → Label → Evaluate → Fix → Re-run.

    What is an agent evaluation framework?

    An agent evaluation framework is the system you use to measure whether an LLM agent is doing the right thing reliably. It includes:

    • A set of representative tasks (real user requests, not toy prompts)
    • A scoring method (success/failure + quality rubrics)
    • Automated regression tests (run on every change)
    • Production monitoring + audits (to catch long-tail failures)

    Think of it like unit tests + integration tests + observability, except for an agent that plans, calls tools, and works with messy context.

    Why agents need evals (more than chatbots)

    Agents are not “just chat.” Instead, they:

    • call tools (APIs, databases, browsers, CRMs)
    • execute multi-step plans
    • depend on context (RAG, memory, long documents)
    • have real-world blast radius (wrong tool action = real incident)

    Therefore, your evals must cover tool correctness, policy compliance, and workflow success, not only “did it write a nice answer?”

    Metrics that matter: success, reliability, cost, safety

    Core outcome metrics

    • Task success rate (binary or graded)
    • Step success (where it fails: plan, retrieve, tool-call, final synthesis)
    • Groundedness (are claims supported by citations / tool output?)

    Reliability + quality metrics

    • Consistency across runs (variance with temperature, retries)
    • Instruction hierarchy compliance (system > developer > user)
    • Format adherence (valid JSON/schema, required fields present)

    Operational metrics

    • Latency (p50/p95 end-to-end)
    • Cost per successful task (tokens + tool calls)
    • Tool-call budget (how often agents “thrash”)

    Safety metrics

    • Prompt injection susceptibility (tool misuse, exfil attempts)
    • Data leakage (PII in logs/output)
    • Policy violations (disallowed content/actions)

    Offline evals: datasets, golden tasks, and scoring

    The highest ROI practice is building a small eval set that mirrors reality: 50-200 tasks from your product. For example, start with the top workflows and the most expensive failures.

    Step 1: Create “golden tasks”

    Golden tasks are the agent equivalent of regression tests. Each task includes:

    • input prompt + context
    • tool stubs / fixtures (fake but realistic tool responses)
    • expected outcome (pass criteria)
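
    For illustration, a golden task record might look like this (field names are assumptions; in the actual JSONL file each record sits on a single line):

    {
      "id": "billing-refund-001",
      "input": "Customer 123456 wants a refund for order A-991.",
      "tool_fixtures": {
        "lookup_order": {"order_id": "A-991", "status": "delivered", "total": 49.0}
      },
      "pass_criteria": {
        "must_call": ["lookup_order", "create_refund"],
        "must_not_call": ["send_email"],
        "final_answer_contains": ["refund"]
      }
    }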

    Step 2: Build a scoring rubric (human + automated)

    Start simple with a 1-5 rubric per dimension. Example:

    Score each run (1-5):
    1) Task success
    2) Tool correctness (right tool, right arguments)
    3) Groundedness (claims match tool output)
    4) Safety/policy compliance
    5) Format adherence (JSON/schema)
    
    Return:
    - scores
    - failure_reason
    - suggested fix
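
    Alongside the rubric, deterministic checks can grade part of a run automatically. This Python sketch scores a run against the pass criteria of the golden-task example above; the run shape (tool_calls, final_answer) is an assumed trace format.

    def score_run(run: dict, criteria: dict) -> dict:
        called = [call["tool"] for call in run["tool_calls"]]
        failures = []
        if not all(tool in called for tool in criteria["must_call"]):
            failures.append("missing required tool call")
        if any(tool in called for tool in criteria.get("must_not_call", [])):
            failures.append("forbidden tool call")
        answer = run["final_answer"].lower()
        if not all(s in answer for s in criteria.get("final_answer_contains", [])):
            failures.append("final answer missing required content")
        return {"pass": not failures, "failure_reason": "; ".join(failures)}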

    Step 3: Add adversarial tests

    Enterprises get burned by edge cases. Add tests for:

    • prompt injection inside retrieved docs
    • tool timeouts and partial failures
    • long context truncation
    • conflicting instructions

    Production monitoring: traces, audits, and alerts

    Offline evals won’t catch everything, so in production add:

    • Tracing: capture the plan, tool calls, and intermediate reasoning outputs (where allowed).
    • Tool-call audits: log tool name + arguments + responses (redact PII).
    • Alerts: spikes in failure rate, cost per task, latency, or policy violations.

    As a result, production becomes a data pipeline: failures turn into new eval cases.

    3 implementation paths (simple → enterprise)

    Path A: Lightweight (solo/early stage)

    • 50 golden tasks in JSONL
    • manual review + rubric scoring
    • run weekly or before releases

    Path B: Team-ready (CI evals)

    • run evals on every PR that changes prompts/tools
    • track p95 latency + cost per success
    • store traces + replay failures

    Path C: Enterprise (governed agents)

    • role-based access to logs and prompts
    • redaction + retention policies
    • approval workflows for high-risk tools
    • audit trails for compliance

    A practical checklist for week 1

    • Pick 3 core workflows and extract 50 tasks from them.
    • Define success criteria + rubrics.
    • Stub tool outputs for deterministic tests.
    • Run baseline on your current agent and record metrics.
    • Add 10 adversarial tests (prompt injection, tool failures).

    FAQ

    How many eval cases do I need?

    Start with 50-200 real tasks. You can get strong signal quickly. Expand based on production failures.

    Should I use LLM-as-a-judge?

    Yes, but don’t rely on it blindly. Use structured rubrics, spot-check with humans, and keep deterministic checks (schema validation, tool correctness) wherever possible.

    Related reads on aivineet

  • LLM Evaluation: Stop AI Hallucinations with a Reliability Stack

    LLMs are impressive—until they confidently say something wrong.

    If you’ve built a chatbot, a support assistant, a RAG search experience, or an “agent” that takes actions, you’ve already met the core problem: hallucinations. And the uncomfortable truth is: you won’t solve it with a single prompt tweak.

    You solve it the same way you solve uptime or performance: with a reliability stack.

    This guide explains a practical approach to LLM evaluation that product teams can actually run every week—without turning into a research lab.

    TL;DR

    • Hallucinations are not a rare edge case; they’re a predictable failure mode.
    • The fix is not one trick—it’s a system: Test → Ground → Guardrail → Monitor.
    • You need an evaluation dataset (“golden set”) and automated checks before shipping.
    • RAG apps must evaluate retrieval quality and groundedness, not just “good answers”.
    • Production monitoring is mandatory: regressions will happen.

    Why LLMs hallucinate (quick explanation)

    LLMs predict the next token based on patterns in training data. They’re optimized to be helpful and fluent, not to be strictly factual.

    So when a user asks something ambiguous, something outside the model’s knowledge, something that requires exact policy wording, or something that depends on live data, the model may “fill in the blank” with plausible text.

    Your job isn’t to demand perfection. Your job is to build systems where wrong outputs become rare, detectable, and low-impact.

    The Reliability Stack (Test → Ground → Guardrail → Monitor)

    1) TEST: Build automated LLM evaluation before you ship

    Most teams “evaluate” by reading a few chats and saying “looks good.” That doesn’t scale.

    Step 1: Create an eval dataset (your “golden set”)

    Start with 50–100 real questions from your product or niche. Include:

    • top user intents (what you see daily)
    • high-risk intents (payments, security, health, legal)
    • known failures (copy from logs)
    • edge cases (missing info, conflicting context, weird phrasing)

    Each test case should have: Input (prompt + context), Expected behavior, and a Scoring method.

    Tip: Don’t force exact matching. Define behavior rules (must cite sources, must ask clarifying questions, must refuse when policy requires it, must call a tool instead of guessing).

    Step 2: Use 3 scoring methods (don’t rely on only one)

    A) Rule-based checks (fast, deterministic)

    • “Must include citations”
    • “Must not output personal data”
    • “Must return valid JSON schema”
    • “Must not claim certainty without evidence”
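
    Two of these checks as a Python sketch; the [doc:ID] citation format is an assumption about your own output conventions.

    import json
    import re

    def check_valid_json(output: str) -> bool:
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False

    def check_has_citation(output: str) -> bool:
        # Assumes citations are rendered as [doc:ID]; adapt to your format.
        return re.search(r"\[doc:[\w-]+\]", output) is not None

    CHECKS = {"valid_json": check_valid_json, "has_citation": check_has_citation}

    def run_checks(output: str) -> dict:
        return {name: check(output) for name, check in CHECKS.items()}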

    B) LLM-as-a-judge (good for nuance)

    Use a judge prompt with a strict rubric to score: groundedness, completeness, and policy compliance.

    C) Human review (calibration + high-risk)

    • review a sample of passing outputs
    • review all high-risk failures
    • review new feature areas

    Step 3: Run evals for every change (like CI)

    Trigger your eval suite whenever you change the model, system prompt, retrieval settings, tools/function calling, safety filters, or routing logic. If scores regress beyond a threshold, block deploy.

    2) GROUND: Force answers to be traceable (especially for RAG)

    If correctness matters, the model must be grounded.

    Grounding method A: RAG (docs / KB)

    Common RAG failure modes: retrieval returns irrelevant docs, returns nothing, context is too long/noisy, docs are outdated.

    What to do: require answers only using retrieved context, require citations (doc id/URL), and if context is weak: ask clarifying questions or refuse.

    Grounding method B: Tools (APIs, DB queries)

    If the answer depends on live facts (pricing, account, inventory), don’t let the model guess—fetch data via tools and then summarize.

    Grounding method C: Constrained output formats

    If the LLM outputs code/SQL/JSON/tool calls: validate schema, reject unsafe actions, and add a repair step for formatting errors.
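
    The repair step can be a small loop like this sketch, where repair_fn stands in for one extra LLM call that receives the broken output plus the parser error and returns a corrected string.

    import json

    def parse_with_repair(raw: str, repair_fn, max_repairs: int = 1) -> dict:
        for attempt in range(max_repairs + 1):
            try:
                return json.loads(raw)
            except json.JSONDecodeError as err:
                if attempt == max_repairs:
                    raise ValueError("output could not be repaired into valid JSON") from err
                # Hand the broken output and the exact error back for one fix-up.
                raw = repair_fn(raw, str(err))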

    3) GUARDRAILS: Reduce harm when the model is uncertain

    Guardrails aren’t “restricting AI.” They’re risk management.

    Guardrail A: “I don’t know” + escalation

    A safe assistant should admit uncertainty and offer a next step (search sources, ask for details, escalate to a human).

    Guardrail B: Mandatory citations in factual mode

    If it can’t cite sources, it should not claim facts. Offer general guidance and label it clearly.

    Guardrail C: Risk tiers by intent

    • Low risk: drafting, brainstorming, rewriting
    • Medium risk: troubleshooting, product policy
    • High risk: legal/medical/payments/security

    High risk needs stricter prompts, stronger grounding, and human handoff.

    Guardrail D: Tool permissioning (for agents)

    If an LLM can take actions: use allowlists, confirmations for destructive steps, rate limits, and audit logs.
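
    A permission gate can start as small as this Python sketch; the tool names and tiers are illustrative.

    ALLOWED_TOOLS = {"lookup_order", "create_support_ticket"}  # low-risk allowlist
    DESTRUCTIVE_TOOLS = {"issue_refund", "delete_record"}      # require confirmation

    def authorize(tool_name: str, user_confirmed: bool) -> None:
        if tool_name not in ALLOWED_TOOLS | DESTRUCTIVE_TOOLS:
            raise PermissionError(f"tool {tool_name!r} is not allowlisted")
        if tool_name in DESTRUCTIVE_TOOLS and not user_confirmed:
            raise PermissionError(f"tool {tool_name!r} requires explicit confirmation")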

    4) MONITOR: Production observability (where real failures show up)

    Even perfect test suites won’t catch everything. Your model will drift.

    Minimum logging (do this early)

    • prompt + system message version
    • model name/version
    • retrieved docs + scores (RAG)
    • tool calls + parameters
    • response
    • user feedback
    • latency + token cost

    (Redact sensitive content in logs.)

    Metrics that matter

    • Grounded answer rate: % answers with citations in factual mode
    • Escalation rate: how often the bot hands off
    • User satisfaction: feedback + resolution rate
    • Retrieval quality: % queries where top docs pass a relevance threshold
    • Regression alerts: eval score drops after changes

    LLM Evaluation Checklist (for teams)

    • Offline: eval dataset (50–200), automated checks, regression thresholds, versioned prompts/configs
    • Grounding: citations for factual mode, retrieval metrics, tool calls for live data
    • Guardrails: intent tiers, refusal + escalation path, tool permissions
    • Monitoring: logs with redaction, dashboards, regression alerts

    FAQ

    What is LLM evaluation?

    LLM evaluation is the process of testing an AI model’s outputs against a rubric (accuracy, safety, groundedness, format) using automated checks and human review.

    How do you reduce AI hallucinations?

    You reduce hallucinations with a reliability stack: automated tests, grounding (RAG/tools/citations), guardrails (refusal/escalation), and production monitoring.

    What is RAG evaluation?

    RAG evaluation checks whether retrieval returns the right documents and whether the final answer is grounded in those documents using citation and correctness scoring.

  • Why Agent Memory Is the Next Big AI Trend (And Why Long Context Isn’t Enough)

    TL;DR

    • Agent memory lets an agent persist and reuse information across sessions, beyond a single prompt window.
    • Long context windows help, but cost, noise, and weak recall mean they don’t replace memory.
    • A practical pipeline: capture → compress → store → retrieve, with optional consolidation.
    • Start small: structured facts plus retrieval and summaries, with guardrails on what gets stored.

    AI is quickly shifting from chatbots to agents: systems that can plan, call tools, and complete tasks across apps. But there’s a major limitation holding agents back in real-world use:

    They don’t remember well.

    Without memory, agents repeat questions, forget preferences, lose context between sessions, and make inconsistent decisions. This is why agent memory is shaping up to be one of the next big trends in AI.


    What is “agent memory”?

    Agent memory is any system that allows an AI agent to persist and reuse information across time, beyond a single prompt window.

    Memory can include:

    • Facts about the user or organization (preferences, policies, configurations)
    • Past conversations and decisions (what was tried, what worked, what failed)
    • Task progress (plans, subtasks, intermediate outputs)
    • External state (documents, tickets, code changes, dashboards)

    The key idea is that an agent should not have to “relearn” everything in every conversation.


    Why long context windows are not enough

    It’s tempting to assume that bigger context windows solve memory. They help, but they don’t fully solve it for production systems.

    Common problems with “just stuff everything in context”:

    • Cost: sending large histories increases token usage and latency.
    • Noise: long histories contain irrelevant messages that distract the model.
    • Redundancy: repeated or similar interactions waste context space.
    • Weak retrieval: the model may miss the most important detail buried in a long transcript.
    • Security: you may not want to expose all historical data to every request.

    So the next step is not only bigger context — it’s better memory management.


    The modern memory pipeline (capture → compress → store → retrieve)

    Most practical memory systems follow a pipeline:

    1) Capture

    Record useful events from agent interactions, such as:

    • user preferences (tone, goals, tools used)
    • task outcomes (success/failure, links, artifacts)
    • important constraints (budget, policies, deadlines)

    2) Compress

    Convert raw chat logs into compact, structured memory units. Examples:

    • bullet summaries
    • key-value facts
    • decision records (“we chose X because Y”)

    3) Store

    Store memory in a system that supports retrieval. This might be:

    • a database table (structured facts)
    • a vector store (semantic recall)
    • a hybrid store (both structured + semantic)

    4) Retrieve (query-aware)

    At inference time, retrieve only what is relevant to the current goal. Retrieval can be based on:

    • semantic similarity (“this looks like a billing issue”)
    • filters (project, user, time window)
    • importance scoring (“critical policy constraints”)

    5) Consolidate (optional but powerful)

    Over time, you may merge related memories into higher-level summaries to reduce redundancy and improve reliability. This is similar to how humans form stable knowledge from repeated experiences.
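
    As an end-to-end illustration, here is a deliberately tiny Python sketch of the pipeline; tag-overlap retrieval stands in for the semantic search and importance scoring a production memory system would use.

    class MemoryStore:
        def __init__(self):
            self.facts = []  # each fact: {"text": str, "tags": set}

        def capture(self, text: str, tags: list) -> None:
            # "Compress" is assumed to happen upstream: text is already compact.
            self.facts.append({"text": text, "tags": set(tags)})

        def retrieve(self, query_tags: list, limit: int = 5) -> list:
            query = set(query_tags)
            scored = [(len(f["tags"] & query), f["text"]) for f in self.facts]
            scored.sort(key=lambda pair: pair[0], reverse=True)
            return [text for overlap, text in scored[:limit] if overlap > 0]

    memory = MemoryStore()
    memory.capture("Customer 42 prefers email over phone", ["customer-42", "preferences"])
    memory.capture("Ticket T-17 was fixed by clearing the CDN cache", ["customer-42", "fixes"])
    print(memory.retrieve(["customer-42", "preferences"]))  # surfaces the preference first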


    What agent memory enables (real examples)

    • Customer support agents that remember prior tickets, preferences, and recurring issues.
    • Coding agents that remember repo conventions, architecture decisions, and build/test commands.
    • Ops/SRE agents that remember incident timelines, previous fixes, and service-specific runbooks.
    • Personal assistants that remember schedules, communication style, and repeated tasks.

    How to start building agent memory (practical steps)

    1. Start small: store 20–100 important facts per user/project in a simple database.
    2. Add retrieval: fetch relevant facts based on the user’s request and the agent’s goal.
    3. Add summaries: compress long sessions into short “memory cards”.
    4. Measure quality: track whether memory reduces repeated questions and improves task completion.
    5. Add guardrails: don’t store secrets; add data retention rules; restrict what memories can be used.

    Why this is likely the next big AI layer

    Models keep improving, but many failures in agents come from missing context and inconsistent state. Memory systems are becoming the layer that turns a capable model into a reliable product.

    That’s why “agent memory” (and related ideas like memory consolidation and memory operating systems) is quickly becoming a major trend in AI development.