Voice Agents in Production: The LangSmith Debugging Playbook (Turns, Traces, Audio).

Trace voice agents end-to-end with LangSmith + Pipecat + OTEL. Debug turns, STT/LLM/TTS latency, tool errors, and attach audio safely in production.

Tega Adeyemi

Trace every STT → LLM → TTS hop (with turn boundaries and optional audio), replay real sessions, and turn “why did it say that?” from a Slack mystery into a link you can inspect.

Voice agents are the most unfair type of AI system.

A chatbot fails quietly in text.
A voice agent fails out loud, in real time, while the user says:

User: “Hello?”
Agent: “Absolutely—here’s a detailed explanation of quantum tunneling.”
User: “…I asked if you ship to Athens.”

So if we’re building voice in production, we need observability that understands voice-native workflows: turns, conversations, audio artifacts, and the full chain from mic → STT → LLM → TTS.

This guide is a practical walkthrough of debugging voice agents with LangSmith using Pipecat and OpenTelemetry (OTEL) tracing, based on LangSmith’s official Pipecat tracing doc. The snippets are meant to be copy/paste-safe once you fill in the version-specific pieces.

Table of Contents

  1. Why voice agents are uniquely hard to debug
  2. The debugging stack we recommend
  3. Correct setup: OTEL → LangSmith (with “EU endpoint” notes)
  4. Pipecat tracing quickstart (with version-safe guidance)
  5. Reading a voice trace like a detective
  6. The 8 failure modes (and how traces prove them)
  7. Audio-aware debugging (attach recordings safely)
  8. Shipping-grade tips: sampling, performance, privacy
  9. Comparisons: LangSmith vs Langfuse vs Phoenix vs Helicone vs generic APM
  10. Key takeaways

1 Why voice agents are uniquely hard to debug

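Voice agents aren’t “an LLM call.” They’re a pipeline: mic → STT → LLM → TTS.

When something goes wrong, it’s rarely “the model hallucinated.” It’s usually a fault elsewhere in the chain: a clipped or partial transcript, a mis-split turn, a failed tool call, or TTS that was never canceled. Section 6 walks through each case.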

So we trace voice like a pipeline, not like a single request.

2 The debugging stack we recommend

The stack:

  1. OpenTelemetry traces (standard, vendor-agnostic telemetry)
  2. Pipecat pipeline spans (turns + STT/LLM/TTS steps)
  3. LangSmith for LLM/agent-native trace visualization (messages, turns, artifacts, evaluation)

LangSmith’s “Trace Pipecat applications” guide uses OTEL plus a custom span processor that maps Pipecat spans into LangSmith’s trace format (including conversation/turn structure and optional audio attachment).

3 Correct setup: OpenTelemetry → LangSmith

Install LangSmith OTEL support (important)

LangSmith’s OpenTelemetry tracing docs explicitly reference installing langsmith[otel] (and using recent versions).

pip install "langsmith[otel]" opentelemetry-exporter-otlp python-dotenv

Environment variables (US + EU endpoints)

LangSmith’s OTEL ingestion endpoint ends in /otel. EU-hosted organizations use the eu.api.smith.langchain.com host instead of api.smith.langchain.com.

# --- LangSmith OTEL Ingestion (US) ---
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.smith.langchain.com/otel

# --- OR if your LangSmith org is EU-hosted ---
# OTEL_EXPORTER_OTLP_ENDPOINT=https://eu.api.smith.langchain.com/otel

# Headers: x-api-key authenticates OTEL ingestion.
# Optionally route traces to a project with the Langsmith-Project header
# (header names use hyphens, not underscores; exact format may vary by SDK/tooling).
OTEL_EXPORTER_OTLP_HEADERS=x-api-key=<YOUR_LANGSMITH_API_KEY>,Langsmith-Project=pipecat-voice

# Your model keys, as needed:
OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
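A misconfigured endpoint or header fails silently (spans simply never arrive), so it can help to check the environment at startup. The following is a minimal sketch, not part of LangSmith or OpenTelemetry: the helper names `parse_otlp_headers` and `check_otel_env` are hypothetical, and the checks only cover the two settings described above.

```python
def parse_otlp_headers(raw: str) -> dict:
    """Parse 'k1=v1,k2=v2' into a dict with lowercased header names."""
    headers = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            headers[key.strip().lower()] = value.strip()
    return headers


def check_otel_env(env: dict) -> list:
    """Return a list of human-readable problems; empty means the basics look right."""
    problems = []

    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    if not endpoint:
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
    elif not endpoint.rstrip("/").endswith("/otel"):
        problems.append("endpoint does not end in /otel")

    headers = parse_otlp_headers(env.get("OTEL_EXPORTER_OTLP_HEADERS", ""))
    api_key = headers.get("x-api-key", "")
    if not api_key or api_key.startswith("<"):
        problems.append("x-api-key header is missing or still a placeholder")

    return problems


if __name__ == "__main__":
    import os

    for problem in check_otel_env(dict(os.environ)):
        print("OTEL config problem:", problem)
```

Running this once at process start turns "no traces in prod" into an explicit log line. It does not verify that the key is valid or that the region matches your organization; only a real export can confirm that.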

Two important guardrails:

4 Pipecat tracing quickstart with version-safe guidance

Pipecat is evolving fast. Import paths can drift between releases. So we’ll do this in a way that doesn’t break your weekend:

Install Pipecat + required extras

LangSmith’s Pipecat tracing guide references installing Pipecat plus extras depending on the services you use.

pip install langsmith pipecat-ai opentelemetry-exporter-otlp python-dotenv

If you’re using audio recording features, you may also need additional packages (e.g., numpy, scipy). The LangSmith Pipecat guide mentions extra dependencies for recordings.

Add LangSmith’s Pipecat span processor

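LangSmith’s guide uses a custom span processor (often provided as langsmith_processor.py) that maps Pipecat spans into LangSmith’s trace format, including conversation/turn structure and optional audio attachment.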

Recommendation: vendor that file into your repo and treat it like production code (version it, test it, review diffs when LangSmith updates it).

Minimal “runs-and-traces” skeleton

Below is a structure-first skeleton. The tracing wiring is the part to copy; the make_* helpers are placeholders, so it will not run until you plug in the specific Pipecat service classes for your STT/LLM/TTS and your chosen transport.

import asyncio
import uuid
from dotenv import load_dotenv

load_dotenv()

# ✅ Import the LangSmith span processor from the official guide implementation
# (Typically a local file: langsmith_processor.py)
from langsmith_processor import span_processor  # noqa: F401

# NOTE: Pipecat imports vary by version.
# Use the correct imports for your installed release:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

# Also import:
# - your Transport (mic/speaker or WebRTC)
# - your STT service
# - your LLM service
# - your TTS service
# - (optional) audio recorder


async def main():
    conversation_id = str(uuid.uuid4())

    # 1) Create your transport + services (version-specific).
    #    The make_* helpers below are placeholders you define yourself;
    #    they are not Pipecat APIs.
    transport = make_transport_somehow()
    stt = make_stt_service()
    llm = make_llm_service()
    tts = make_tts_service()

    # 2) Build pipeline in the “voice order”
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,          # or your context aggregator → llm chain
        tts,
        transport.output(),
    ])

    # 3) Create task with tracing + turn tracking
    task = PipelineTask(
        pipeline,
        params=PipelineParams(enable_metrics=True),
        enable_tracing=True,
        enable_turn_tracking=True,
        conversation_id=conversation_id,
    )

    # 4) Run
    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())

Why we wrote it this way: the wiring (tracing flags, conversation_id, pipeline ordering) is stable and matches LangSmith’s guide; the service/transport imports are the piece most likely to drift across Pipecat versions.

5 Reading a voice trace like a detective

When a trace shows up in LangSmith, don’t start with the final assistant message. That’s how we get emotionally manipulated by confident audio.

Read it like this:

  1. Conversation span
    • Is this the right session? Check conversation_id.
  2. Turn spans
    • Did the system split turns correctly? Are there overlaps?
  3. STT spans
    • What did we transcribe (partial vs final)? Any missing words?
  4. LLM spans
    • What context did the LLM actually see (system prompt, previous turns, tool outputs)?
  5. TTS spans
    • Was synthesis delayed? Was it interrupted/canceled correctly?

Once we read traces like a pipeline, debugging becomes… boring. And boring is good.
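The reading order above can be backed by a quick per-turn timing check on exported span data. This is a sketch under stated assumptions: the span dictionaries (with `turn`, `stage`, `start_ms`, `end_ms` keys) are a hypothetical flattened shape, not LangSmith's or Pipecat's actual export format, so adapt the field names to whatever your export produces.

```python
from collections import defaultdict


def stage_durations_by_turn(spans):
    """Sum span durations (in ms) per stage within each turn."""
    turns = defaultdict(lambda: defaultdict(int))
    for span in spans:
        turns[span["turn"]][span["stage"]] += span["end_ms"] - span["start_ms"]
    return {turn: dict(stages) for turn, stages in sorted(turns.items())}


def slowest_stage(stages):
    """Return the name of the stage with the largest total duration."""
    return max(stages, key=stages.get)


# Hypothetical spans for two turns of one conversation.
spans = [
    {"turn": 1, "stage": "stt", "start_ms": 0, "end_ms": 300},
    {"turn": 1, "stage": "llm", "start_ms": 300, "end_ms": 1500},
    {"turn": 1, "stage": "tts", "start_ms": 1500, "end_ms": 1900},
    {"turn": 2, "stage": "stt", "start_ms": 5000, "end_ms": 5900},
    {"turn": 2, "stage": "llm", "start_ms": 5900, "end_ms": 6400},
    {"turn": 2, "stage": "tts", "start_ms": 6400, "end_ms": 6800},
]

for turn, stages in stage_durations_by_turn(spans).items():
    print(f"turn {turn}: {stages} slowest={slowest_stage(stages)}")
```

In this made-up data the LLM dominates turn 1 and STT dominates turn 2, which is the kind of per-turn difference that an average over the whole conversation would hide.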

6 The 8 failure modes and how traces prove them

1 “It ignored what I said”

Usually: VAD clipped speech or STT produced a partial transcript that never got corrected.

Trace proof: compare STT output vs expected utterance.

Fix: tune VAD; gate on final transcript; merge partials safely.
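One way to gate on final transcripts is a small buffer that treats partials as replaceable and only releases text marked final. This is an illustrative sketch, not Pipecat's API: the `TranscriptGate` class and its `on_transcript(text, is_final)` callback are hypothetical, and real STT services signal finality in their own ways.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptGate:
    """Keeps only the latest partial and releases text only on a final result."""

    latest_partial: str = ""
    finals: list = field(default_factory=list)

    def on_transcript(self, text: str, is_final: bool):
        """Return the text to forward to the LLM, or None if nothing should be sent."""
        text = text.strip()
        if not is_final:
            # Partials overwrite each other; they are never appended or forwarded.
            self.latest_partial = text
            return None
        self.latest_partial = ""
        if not text:
            return None
        self.finals.append(text)
        return text

    def utterance(self) -> str:
        """All final segments received so far, joined into one utterance."""
        return " ".join(self.finals)


gate = TranscriptGate()
gate.on_transcript("do you", is_final=False)
gate.on_transcript("do you ship", is_final=False)
forwarded = gate.on_transcript("do you ship to Athens", is_final=True)
print(forwarded)
```

Because partials replace rather than accumulate, an early guess such as "do you" never reaches the LLM alongside the corrected final text.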

2 “It answered the previous question”

Usually: turn tracking split incorrectly or aggregation appended the wrong message.

Trace proof: turn spans + messages shown to LLM.

Fix: keep enable_turn_tracking=True and validate aggregation behavior.

3 “It hallucinated a tool result”

Usually: tool failed or timed out and LLM improvised.

Trace proof: missing tool output in LLM inputs.

Fix: write tool failures into context explicitly (“ToolError: …”) instead of swallowing them.
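A sketch of that fix: wrap each tool call so that a timeout or exception becomes an explicit message in the LLM context rather than a silent gap. The function name `call_tool_for_context` and the message shape (`role`, `name`, `content`) are assumptions for illustration; use whatever message format your LLM service expects.

```python
import asyncio


async def call_tool_for_context(name, tool, timeout_s=2.0):
    """Run an async tool and always return a context message, even on failure."""
    try:
        result = await asyncio.wait_for(tool(), timeout=timeout_s)
        content = str(result)
    except asyncio.TimeoutError:
        content = f"ToolError: {name} timed out after {timeout_s}s"
    except Exception as exc:  # surface every failure to the model
        content = f"ToolError: {name} failed: {exc}"
    return {"role": "tool", "name": name, "content": content}


async def shipping_lookup():
    """Hypothetical tool that fails, standing in for a flaky backend."""
    raise RuntimeError("upstream returned 503")


message = asyncio.run(call_tool_for_context("shipping_lookup", shipping_lookup))
print(message["content"])
```

With the error text in context, the trace shows the failure in the LLM inputs, and the model has a chance to say it could not check instead of inventing a result.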

4 “Latency spikes randomly”

Usually: STT chunking, network jitter, cold starts, or TTS bottlenecks.

Trace proof: per-span timing (STT vs LLM vs TTS).

Fix: cache, prewarm, and reduce tool calls in the critical path.
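"Random" spikes usually show up as a gap between the median and the tail for one stage. A minimal sketch, assuming you have already collected per-stage durations in milliseconds from your traces (the `samples_ms` data here is made up, and nearest-rank is just one of several percentile definitions):

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms):
    """Map each stage to its p50 and p95 latency."""
    return {
        stage: {"p50": percentile(values, 50), "p95": percentile(values, 95)}
        for stage, values in samples_ms.items()
    }


# Hypothetical per-stage durations gathered from ten turns.
samples_ms = {
    "stt": [280, 300, 310, 290, 305, 295, 300, 315, 285, 300],
    "llm": [900, 950, 1000, 920, 980, 940, 960, 930, 4200, 970],
}

for stage, stats in latency_report(samples_ms).items():
    print(stage, stats)
```

Here STT is tight while the LLM stage has one slow outlier that dominates its p95, which points the investigation at the LLM hop (cold start, tool call, or provider latency) rather than at the whole pipeline.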

5 “It talks over me”

Usually: TTS cancellation isn’t wired to user speech/VAD events.

Trace proof: overlapping TTS spans and new user turn spans.

Fix: adopt an interruption policy that cancels TTS on user speech and marks an interruption event.
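The trace proof for this failure mode is an interval-overlap check. The sketch below assumes a hypothetical flattened span shape (`id`, `start_ms`, `end_ms`) rather than any real export format, and reports how long the agent kept speaking after the user started.

```python
def find_talk_over(tts_spans, user_spans):
    """Return (tts_id, user_id, overlap_ms) for every TTS span that overlaps user speech."""
    hits = []
    for tts in tts_spans:
        for user in user_spans:
            overlap_start = max(tts["start_ms"], user["start_ms"])
            overlap_end = min(tts["end_ms"], user["end_ms"])
            if overlap_start < overlap_end:  # touching endpoints do not count
                hits.append((tts["id"], user["id"], overlap_end - overlap_start))
    return hits


# Hypothetical spans: the agent is still speaking when turn-2 begins.
tts_spans = [{"id": "tts-1", "start_ms": 1000, "end_ms": 4000}]
user_spans = [
    {"id": "turn-2", "start_ms": 2500, "end_ms": 3500},
    {"id": "turn-3", "start_ms": 4000, "end_ms": 5000},
]

print(find_talk_over(tts_spans, user_spans))
```

A non-empty result means cancellation did not fire, or fired late; the overlap duration tells you which of the two you are looking at.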

6 “Correct in text, wrong in voice”

Usually: STT mishears domain terms.

Fix: vocabulary biasing, post-STT correction, confirm key entities.
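Post-STT correction can be as simple as fuzzy-matching each transcribed word against a short list of domain terms. This sketch uses `difflib` from the Python standard library; the `DOMAIN_TERMS` list and the 0.8 cutoff are illustrative assumptions, and a low cutoff will "correct" ordinary words, so tune it against real transcripts.

```python
import difflib

# Hypothetical domain vocabulary; replace with your own product and place names.
DOMAIN_TERMS = ["Athens", "Pipecat", "LangSmith"]


def correct_terms(transcript, terms=DOMAIN_TERMS, cutoff=0.8):
    """Replace words that closely match a known domain term with its canonical spelling."""
    lookup = {term.lower(): term for term in terms}
    corrected = []
    for word in transcript.split():
        matches = difflib.get_close_matches(word.lower(), list(lookup), n=1, cutoff=cutoff)
        corrected.append(lookup[matches[0]] if matches else word)
    return " ".join(corrected)


print(correct_terms("do you ship to athans"))
```

This only handles single-word terms and ignores punctuation attached to words. For entities that drive actions (addresses, order numbers), confirming with the user is still the safer path.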

7 “Works locally, not in prod”

Usually: wrong OTEL endpoint/headers, missing exporter, or EU/US mismatch.

Trace proof: no spans arriving; missing exporter config.

Fix: verify endpoint and x-api-key header; set EU endpoint if needed.

8 “We can’t reproduce it”

Usually: no replay (audio/transcript), missing correlation IDs.

Fix: attach audio (with privacy controls), log conversation_id everywhere.

7 Audio-aware debugging

LangSmith’s Pipecat guide includes patterns for recording a conversation’s audio and attaching the recording to its trace.

Here’s the conceptual pattern you should implement (names may vary by your Pipecat version, but the steps matter):

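# Fragment: runs inside async main() from the skeleton above and reuses
# conversation_id, transport, stt, llm, tts, and span_processor.
from pathlib import Path

recordings_dir = Path("./recordings")
recordings_dir.mkdir(parents=True, exist_ok=True)

recording_path = recordings_dir / f"{conversation_id}.wav"
audio_recorder = AudioRecorder(str(recording_path))  # Pipecat recorder class (version-specific)

# ✅ Register the recording so LangSmith can attach it to the trace
span_processor.register_recording(conversation_id, str(recording_path))

pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    tts,
    audio_recorder,      # ensure recorder is in the pipeline
    transport.output(),
])

# Build the task from THIS pipeline, with the same tracing flags as the skeleton
task = PipelineTask(
    pipeline,
    params=PipelineParams(enable_metrics=True),
    enable_tracing=True,
    enable_turn_tracking=True,
    conversation_id=conversation_id,
)

runner = PipelineRunner()
await runner.run(task)

# ✅ IMPORTANT: Save BEFORE the conversation span fully closes (guide warns about timing)
audio_recorder.save_recording()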

Privacy note: attach audio only when you truly need it—voice logs are extremely sensitive.
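One way to make "only when you truly need it" concrete is an explicit attach policy evaluated per conversation. The policy below (require consent, always keep audio for errored sessions, otherwise sample a small deterministic fraction) is an assumption for illustration, not something the LangSmith guide prescribes; the function name and parameters are hypothetical.

```python
import hashlib


def should_attach_audio(conversation_id, *, user_consented, had_error, sample_rate=0.05):
    """Decide whether to attach a recording to this conversation's trace."""
    if not user_consented:
        return False  # never attach without consent, even for errors
    if had_error:
        return True  # errored sessions are the ones worth replaying
    # Deterministic sampling: the same conversation_id always gets the same answer.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < round(sample_rate * 10_000)


print(should_attach_audio("3f2a-example", user_consented=True, had_error=True))
```

Hashing the conversation_id instead of calling a random number generator keeps the decision stable across retries and across services, so a session is either recorded everywhere or nowhere.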

8 Shipping-grade tips: sampling, performance, privacy

Sampling: trace smarter, not louder

Performance: don’t let observability become the bottleneck

Security & privacy

9 Comparisons

LangSmith

Best if you want LLM/agent-native trace visualization: messages, turns, artifacts, and evaluation.

Langfuse (open source)

Strong choice for open-source LLM observability; supports OpenTelemetry integration.

Phoenix (Arize, open source)

Open-source tracing/evaluation workflows; supports LLM trace patterns and OTEL-based approaches.

Helicone

Great for gateway/proxy-style LLM observability and integrations (including OpenLLMetry paths), but don’t assume it’s identical to a generic OTLP backend.

Generic APM (Datadog, etc.)

Excellent infrastructure visibility and OTEL pipelines—often missing “turns/messages/evals” semantics unless you build them.

10 Key takeaways

Tega Adeyemi · February 16, 2026