Mastering LangSmith: Observability and Evaluation for LLM Applications

Building with LLMs is powerful, but unpredictable. LangSmith brings order to the chaos with tools for observability, evaluation, and optimization. See what your models are doing, measure how they’re performing, and deploy with confidence.

LangSmith is a robust platform that helps you build production-grade LLM applications, providing observability, evaluation, and optimization capabilities. It goes beyond traditional tools by giving developers meaningful insights into their application’s performance and reliability, enabling rapid iteration and enhancement. In this blog, we'll dive deep into LangSmith's features, covering setup, observability, evaluation, and advanced capabilities like automated testing and A/B experiments. Whether you're building from scratch or refining an existing application, LangSmith will help you deploy with confidence.

Why LangSmith?

LLMs are non-deterministic by nature, which makes observability and rigorous evaluation crucial for maintaining control over their behavior. Unlike traditional software, LLMs can produce varied outputs even for similar inputs, making debugging and quality assurance more complex. LangSmith addresses these challenges by offering LLM-native observability, evaluation workflows, and powerful debugging tools, ensuring your models are predictable and reliable.

And the best part? Even if you’re not using LangChain, LangSmith works independently, so you can easily integrate it into your own LLM workflows.

Getting Started with LangSmith

Installation and Setup

Getting started is easy. LangSmith provides SDKs for both Python and TypeScript; for Python, install it with pip:

pip install -U langsmith

Once installed, set up your environment with an API key:

  • First, create an API key by navigating to the settings page.
  • Set environment variables:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export OPENAI_API_KEY=<your-openai-api-key>  # Optional if you're using OpenAI
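
If you prefer to configure things from Python (for example, in a notebook), you can set the same variables with os.environ before creating any clients. A minimal sketch:

import os

# Same settings as the shell exports above, set from Python (e.g., in a notebook)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # Optional if you're using OpenAI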

Logging Your First Trace

If your application is built on LangChain, you don’t need the LangSmith SDK directly—you can enable tracing through LangChain itself. However, LangSmith also provides a direct way to log traces, which is useful if you’re integrating with custom or non-LangChain workflows.

Here's a quick example to add tracing in Python:

import openai
from langsmith.wrappers import wrap_openai
from langsmith import traceable

# Wrap the OpenAI client for tracing
oai_client = wrap_openai(openai.Client())

@traceable
def example_pipeline(user_input: str):
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_input}]
    )
    return response.choices[0].message.content

example_pipeline("Hello, world!")

This snippet automatically traces the function and logs the LLM interaction with LangSmith, giving you deep visibility into how the model processes user inputs.

Run Your First Evaluation

Evaluations are key to ensuring your application delivers consistent and expected outcomes. In LangSmith, you can run evaluations using datasets and built-in or custom evaluators.

Here's how to run a simple evaluation using LangSmith's Python SDK:

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Create a dataset
dataset = client.create_dataset("Sample Dataset", description="A sample dataset in LangSmith.")
client.create_examples(
  inputs=[{"postfix": "to LangSmith"}, {"postfix": "to Evaluations in LangSmith"}],
  outputs=[{"output": "Welcome to LangSmith"}, {"output": "Welcome to Evaluations in LangSmith"}],
  dataset_id=dataset.id,
)

# Define evaluator
def exact_match(run, example):
    return {"score": run.outputs["output"] == example.outputs["output"]}

experiment_results = evaluate(
  lambda input: "Welcome " + input['postfix'],  # Your AI system goes here
  data="Sample Dataset",  # The dataset to evaluate on
  evaluators=[exact_match],
  experiment_prefix="sample-experiment",
  metadata={"version": "1.0.0"},
)

With this approach, you can measure how well your application’s output matches expected results, allowing iterative improvements.
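
Exact match is just one option. An evaluator is simply a function that receives a run and its reference example and returns a score, so you can encode whatever criteria matter to your application. Here is a minimal sketch of a custom, case-insensitive evaluator (illustrative only) that you could pass to evaluate alongside exact_match:

def case_insensitive_match(run, example):
    # Compare the prediction and reference while ignoring casing and surrounding whitespace
    prediction = run.outputs["output"].strip().lower()
    reference = example.outputs["output"].strip().lower()
    return {"key": "case_insensitive_match", "score": prediction == reference}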

Observability in LLM Applications

Observability is crucial throughout all stages of LLM application development—from prototyping to beta testing to production. LLMs, by nature, can produce unexpected outputs, making debugging more challenging compared to traditional software systems.

Prototyping

During prototyping, having detailed observability allows you to iterate much more quickly on prompts, data inputs, and model settings. Set up observability from the very beginning:

  • Wrap your OpenAI client with LangSmith’s wrap_openai so every model call is traced.
  • Use the traceable decorator for function-level tracing, giving you insight into the entire LLM pipeline.
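
For example, decorating each step of a pipeline with traceable produces a nested trace tree, with retrieval and generation visible as separate child runs. Here is a minimal sketch; the retrieval logic is a placeholder you would swap for a real lookup:

import openai
from langsmith import traceable
from langsmith.wrappers import wrap_openai

oai_client = wrap_openai(openai.Client())

@traceable(run_type="retriever")
def retrieve_docs(query: str):
    # Placeholder retrieval step -- replace with your vector store lookup
    return ["LangSmith records nested functions as child runs."]

@traceable
def answer_question(query: str):
    docs = retrieve_docs(query)  # shows up as a child run in the trace
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using this context: " + " ".join(docs)},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

answer_question("What does LangSmith trace?")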

Beta Testing

During beta testing, you’ll start seeing varied usage patterns from users. LangSmith helps you collect real-time feedback and metadata, making it easy to analyze user interactions and improve the application.

For example, you can attach a known run ID to a traced call and then log user feedback against that run:

import uuid
from langsmith import Client

client = Client()

# Pre-generate a run ID and pass it to the traced function so feedback can reference it
run_id = str(uuid.uuid4())
example_pipeline("Hello, world!", langsmith_extra={"run_id": run_id})

# Collect feedback from users for that run
client.create_feedback(
    run_id,
    key="user-score",
    score=1.0,  # Assume this is collected from the user via some interface
)

Production

Once your application is in production, LangSmith’s observability features allow you to monitor key metrics like response times, feedback trends, and error rates. You can also use LangSmith’s monitoring dashboards to drill down into specific metrics or anomalies.
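
Beyond the dashboards, the SDK also lets you pull run data programmatically, which is handy for custom reports or alerting. A minimal sketch that lists runs from the last 24 hours that ended in an error (the project name is illustrative):

from datetime import datetime, timedelta
from langsmith import Client

client = Client()

# Fetch runs from the last 24 hours that ended in an error
failed_runs = client.list_runs(
    project_name="my-production-app",  # illustrative project name
    start_time=datetime.now() - timedelta(days=1),
    error=True,
)

for run in failed_runs:
    print(run.name, run.error)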

Getting Started: Using LangChain with LangSmith

To see LangSmith's features in action, let’s walk through an example using LangChain. Suppose you want to build a RAG (Retrieval-Augmented Generation) application to answer questions from The Odyssey. With LangSmith, you can easily add evaluation and tracing capabilities.

1. Install LangChain and other necessary packages:

pip install -U langchain langchain_community chromadb bs4

2. Load documents and split them into smaller chunks to manage input size limitations:

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://www.gutenberg.org/files/1727/1727-h/1727-h.htm")
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

3. Store the document chunks in a vector database using embeddings, and trace the retrieval interactions:

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Requires a local Ollama server with the nomic-embed-text embedding model pulled
oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=all_splits, embedding=oembed)

4. Retrieve relevant chunks for a question and trace the retrieval step:

from langsmith import traceable

@traceable
def qa_chain(query: str):
    retriever = vectorstore.as_retriever()
    docs = retriever.invoke(query)  # fetch the chunks most relevant to the query
    return docs

response = qa_chain("Who is Neleus and who is in Neleus' family?")

This gives you visibility into how your retrieval step is performing, along with full tracing data for every query.
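
To close the loop from retrieval to an actual answer, you can feed the retrieved chunks to an LLM inside the same traced function. A minimal sketch using a local Ollama model through langchain_community; the model name is illustrative and must already be pulled in Ollama:

from langchain_community.llms import Ollama
from langsmith import traceable

llm = Ollama(base_url="http://localhost:11434", model="llama3")  # illustrative model name

@traceable
def rag_chain(query: str):
    # Retrieval step: fetch the most relevant chunks
    docs = vectorstore.as_retriever().invoke(query)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Generation step: answer the question using the retrieved context
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt)

answer = rag_chain("Who is Neleus and who is in Neleus' family?")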

Final Thoughts

LangSmith is an essential toolkit for building, debugging, and optimizing LLM-based applications. By providing observability, rigorous evaluation, and a set of powerful tools for monitoring and tracing, LangSmith ensures that your applications can move from prototype to production seamlessly and reliably.

Whether you're a developer looking to debug unexpected outputs or a data scientist trying to optimize your model’s performance, LangSmith provides the features you need to deploy high-quality LLM applications confidently.

Ready to give LangSmith a try? Install it today and start building with best-in-class observability and evaluation tools for your LLM projects.

Cohorte Team

November 14, 2024