Shipping Agents That Think in Code: A Practical, Opinionated Guide to Hugging Face smolagents

A practical, code-heavy guide to Hugging Face’s smolagents: how CodeAgent and ToolCallingAgent actually work, how to run them safely in production, and how they compare to LangChain, LangGraph, CrewAI, and friends—with concrete patterns, gotchas, and implementation tips for developers and AI leaders.

1. Why “agents that think in code” is a big deal

Most “AI agent” frameworks today do something like this:

LLM → JSON tool calls → glue code → more LLM calls.

It works… until we want:

Non-trivial control flow (loops, branches, retries)
Composing tools in complex ways
Experimenting quickly without drowning in abstractions

smolagents takes a different stance:

The agent’s “brain” is Python code.

Instead of asking a model to emit opaque tool-call blobs, a CodeAgent writes and runs Python directly to call tools, loop, branch, and transform data. The framework is thin: the core agent logic is roughly 1,000 lines of code.

For us as engineers, this has two huge implications:

We can read the agent’s reasoning as code, line by line.
We can control the execution environment like any other Python runtime (security, sandboxing, observability).

In this guide, we’ll go deep on:

How smolagents is structured (CodeAgent vs ToolCallingAgent)
Realistic implementation patterns
Security / sandboxing options (E2B, Modal, Docker, Blaxel)
Where it shines vs other frameworks
How to avoid common pitfalls

We’ll assume you’re comfortable with Python, LLM APIs, and modern agent frameworks.

2. Quick mental model of smolagents

From the official docs:

"smolagents is an open-source Python library designed to make it extremely easy to build and run agents using just a few lines of code."

Core concepts:

Model abstraction: InferenceClientModel, LiteLLMModel, TransformersModel, OpenAIModel, etc. You plug in any LLM: HF Inference, OpenAI, local Transformers, or providers such as Anthropic and Ollama via LiteLLM.
Agents:
- CodeAgent: lets the model write Python code that gets executed.
- ToolCallingAgent: standard JSON tool calling (OpenAI-style).
Tools: simple Python functions decorated with @tool or subclasses of Tool, automatically described to the LLM.
Execution backends: local sandboxed Python, or remote sandboxes via Modal, Blaxel, E2B, Docker, Pyodide+Deno WebAssembly.
Multi-agent orchestration: build manager/worker graphs, evaluators, etc., with minimal boilerplate.

Installation

pip install "smolagents[toolkit]"  # includes default tools like web search

This matches the official quickstart.

3. CodeAgent vs ToolCallingAgent (and when to use which)

CodeAgent – the “think in code” agent

A CodeAgent does this:

The LLM receives the tools’ APIs and the task.
It proposes Python code that:
- Calls tools
- Does control flow
- Aggregates results
The framework executes that code in a sandboxed Python environment.

Minimal example from the quickstart, adapted:

from smolagents import CodeAgent, InferenceClientModel

model = InferenceClientModel()  # HF Inference, default model
agent = CodeAgent(tools=[], model=model)

print(agent.run("Calculate the sum of numbers from 1 to 10"))

This matches the official docs structure and APIs.

When to use CodeAgent:

You want complex control flow or data wrangling.
You want to inspect / debug the generated reasoning as Python.
You want maximal flexibility to combine tools, APIs, and logic.

ToolCallingAgent – classic JSON tool calling

A ToolCallingAgent behaves more like OpenAI tool calling:

The LLM returns structured tool calls.
smolagents handles the JSON → Python function plumbing.
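As a rough illustration of what "JSON → Python function plumbing" means, the sketch below parses one structured tool call and dispatches it to a registered function. The `TOOLS` registry, the `add` tool, and the `{"name": ..., "arguments": ...}` shape are assumptions made for this example; this is not how smolagents implements it internally.

```python
import json

def add(a: int, b: int) -> int:
    """Toy tool used only for this illustration."""
    return a + b

# Registry mapping tool names to callables (hypothetical structure).
TOOLS = {"add": add}

def dispatch(tool_call_json: str):
    """Parse one JSON tool call and invoke the matching Python function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# What a model's structured output might look like for "add 24 and 7".
result = dispatch('{"name": "add", "arguments": {"a": 24, "b": 7}}')
print(result)
```

Each call is one isolated step, which is easier to constrain but means any looping or branching has to happen across multiple model turns.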

From the guided tour:

from smolagents import ToolCallingAgent, InferenceClientModel

model = InferenceClientModel()
agent = ToolCallingAgent(tools=[], model=model)

agent.run("Explain how you would solve 24 * 7 without a calculator.")

When to use ToolCallingAgent:

You’re integrating with existing tool-calling strategies.
You want less power than arbitrary Python execution (easier to reason about).
You’re mostly orchestrating external APIs, not doing heavy local logic.

How we think about choosing

Prototyping + research → start with CodeAgent (fast to iterate and extremely expressive).
Production flows that require strict control → often combine:
- ToolCallingAgent for constrained external calls, and
- CodeAgent in a stricter sandbox for heavier computation.

We’ll show an example of this pattern later.

4. Getting started: a web-searching code agent

Let’s wire up a basic agent that uses web search plus Python reasoning.

From the docs, there’s a built-in DuckDuckGo search tool:

from smolagents import CodeAgent, InferenceClientModel, DuckDuckGoSearchTool

model = InferenceClientModel()  # uses your HF_TOKEN under the hood

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=model,
)

answer = agent.run("What's the capital of Nigeria, and what's 3x its population (approx)?")
print(answer)

What happens:

The agent writes code that calls DuckDuckGoSearchTool.
It parses the result, extracts numbers, and does the math in Python.
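You get a single final answer, plus step-by-step logs you can introspect (agent.logs; recent releases expose them as agent.memory.steps).

Developer tip:
Use agent.logs and agent.write_inner_memory_from_logs() when debugging; in recent releases the equivalents are agent.memory.steps and agent.write_memory_to_messages(). The latter converts logs into LLM-readable messages so a second pass can “reflect” on a run.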

5. Building your own tools (the right way)

You’ll spend most of your time here.

Option 1: Decorated function with `@tool`

This is the idiomatic pattern from the docs:

from huggingface_hub import list_models
from smolagents import tool

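@tool
def most_downloaded_model(task: str) -> str:
    """
    Return the most downloaded model name for a given task on the HF Hub.

    Args:
        task: The task tag to filter by, e.g. "text-classification".
    """
    # @tool needs a description for every argument, hence the Args section.
    models = list_models(filter=task, sort="downloads", direction=-1)
    # list_models returns an iterable, so wrap it in iter() before next().
    model = next(iter(models))
    return model.id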

Then:

from smolagents import CodeAgent, InferenceClientModel  # model abstraction for HF Inference

model = InferenceClientModel()  # named HfApiModel in older smolagents releases
agent = CodeAgent(tools=[most_downloaded_model], model=model)

print(agent.run(
    "Which model has the most downloads for the 'text-classification' task?"
))

smolagents inspects the function signature and docstring to generate the tool schema the LLM will see.
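To give a feel for what "inspecting the signature and docstring" can look like, here is a minimal stdlib-only sketch. `describe_function` and the output dictionary layout are hypothetical and much simpler than the schema smolagents actually builds; treat it as an illustration of the idea, not the library's implementation.

```python
import inspect
from typing import get_type_hints

def describe_function(fn) -> dict:
    """Build a small tool description from a function's name, hints, and docstring."""
    hints = get_type_hints(fn)
    params = {
        name: getattr(hints.get(name), "__name__", "any")
        for name in inspect.signature(fn).parameters
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "inputs": params,
        "output": getattr(hints.get("return"), "__name__", "any"),
    }

def word_count(text: str) -> int:
    """Count whitespace-separated words in a text."""
    return len(text.split())

schema = describe_function(word_count)
print(schema)
```

This is why type hints and a clear docstring matter: they are the only description of the tool the model ever sees.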

Option 2: Subclass `Tool`

Use this if:

You need heavy initialization (e.g. DB connection, embedding model).
You want custom validation, timeouts, caching, etc.

Pattern (simplified from docs):

from smolagents import Tool
from sqlalchemy import text  # assumes `engine` is a SQLAlchemy Engine

class SqlQueryTool(Tool):
    name = "sql_query"
    description = "Run a read-only SQL query against the analytics DB."

    inputs = {"query": {"type": "string", "description": "The SQL query to run"}}
    output_type = "string"

    def __init__(self, engine):
        super().__init__()
        self.engine = engine

    def forward(self, query: str) -> str:
        # IMPORTANT: enforce read-only, parameterized queries, etc.
        with self.engine.connect() as conn:
            # you'd typically parse & validate the query here
            # SQLAlchemy 2.x requires raw SQL strings to be wrapped in text()
            rows = conn.execute(text(query)).fetchmany(50)
        return str(rows)

We strongly recommend hard-coding any security constraints (read-only, whitelists, row limits) inside the tool.
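A minimal sketch of such a guardrail, using the standard library's sqlite3 so it runs anywhere. The `run_read_only` helper, the `events` table, and the prefix-based check are assumptions for illustration; the check is deliberately conservative (it also rejects semicolons inside string literals), and in a real deployment we'd pair it with a read-only database role rather than rely on string inspection alone.

```python
import sqlite3

MAX_ROWS = 50  # hard row limit enforced by the tool, not by the prompt

def run_read_only(conn: sqlite3.Connection, query: str) -> list[tuple]:
    """Run a single SELECT statement and return at most MAX_ROWS rows."""
    stripped = query.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    if not stripped.lower().startswith("select"):
        raise ValueError("only SELECT queries are allowed")
    return conn.execute(stripped).fetchmany(MAX_ROWS)

# Demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "signup"), (2, "login")])

rows = run_read_only(conn, "SELECT id, name FROM events ORDER BY id")
try:
    run_read_only(conn, "DROP TABLE events")
    rejected = False
except ValueError:
    rejected = True
print(rows, rejected)
```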

6. Secure code execution: how to not brick your infra

Let’s address the scary part: we’re letting an LLM write code and then executing it.

smolagents tackles this with multiple layers:

Sandboxed executor
The LLM’s code runs in a dedicated Python environment with:
- Restricted imports
- Strict network & filesystem controls (depending on backend)
- Timeouts and error handling
Whitelisted imports via additional_authorized_imports

from smolagents import CodeAgent, InferenceClientModel

model = InferenceClientModel()  # named HfApiModel in older smolagents releases
agent = CodeAgent(
    tools=[most_downloaded_model],
    model=model,
    additional_authorized_imports=["requests", "bs4"],  # explicit whitelist
)

Configurable executor backends via executor_type

In recent versions, you configure the execution environment with executor_type:

agent = CodeAgent(
    tools=[most_downloaded_model],
    model=model,
    executor_type="e2b",           # or "docker", "blaxel", "modal", "local"
    additional_authorized_imports=[],
)

Rough mental model:

"local" (or default): local Python interpreter with restricted imports (fast, but it shares your machine and is not a full security boundary).
"e2b" / "blaxel" / "modal" / "docker": remote or containerized execution with tighter OS-level isolation.
There’s also a WebAssembly-based Pyodide+Deno sandbox option for browser-like security in some configs.

Illegal operations fail fast

Attempts to:

Access disallowed paths
Open external sockets (if disabled)
Import non-whitelisted modules

…will fail with an exception that you can catch and log or surface to the user.
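To illustrate the import-allowlist idea in isolation, the sketch below statically scans generated code for imports outside an allowed set. `find_disallowed_imports` is a hypothetical helper, not part of smolagents, and a static scan like this cannot catch dynamic tricks such as `__import__`; it is a pre-check you might add on top of the executor's own enforcement, not a replacement for it.

```python
import ast

def find_disallowed_imports(code: str, allowed: set[str]) -> list[str]:
    """Return top-level module names imported by `code` that are not in `allowed`."""
    disallowed = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os"
            if root not in allowed:
                disallowed.append(root)
    return sorted(set(disallowed))

generated = "import math\nimport subprocess\nfrom os import path\n"
blocked = find_disallowed_imports(generated, allowed={"math", "json"})
print(blocked)
```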

Practical security checklist

For a production-ish environment, we’d:

Use executor_type="docker" or "e2b"/"blaxel" for strong isolation.
Start with additional_authorized_imports=[] and only add what you truly need.
Wrap any sensitive capability (DB, filesystem, external APIs) in narrow tools with guardrails inside the tool implementation.
Log all code that gets executed (or at least hashes / metadata) for auditability.
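For the last item, a hedged sketch of what a hash-based audit record could look like. The field names and the `audit_record` helper are assumptions; in practice you would pull the code string from the agent's step logs and ship the line to your logging stack.

```python
import hashlib
import json
import time

def audit_record(run_id: str, step: int, code: str) -> str:
    """Return a JSON log line with a SHA-256 digest of the executed code."""
    record = {
        "run_id": run_id,
        "step": step,
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "code_chars": len(code),
        "ts": time.time(),
    }
    return json.dumps(record, sort_keys=True)

line = audit_record("run-001", 1, "print(sum(range(1, 11)))")
print(line)
```

Storing the digest rather than the code keeps log volume down while still letting you prove later which snippet ran, provided the full code is archived somewhere keyed by the same hash.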

7. Example: a developer “research & prototype” agent

Let’s build something a senior engineer would actually use:

“Given a GitHub repo and a natural language request, figure out where to make changes and propose a patch.”

We’ll keep it simple but realistic:

Tool 1: GitHub file fetcher (stubbed)
Tool 2: Embeddings-based snippet retriever (stubbed)
A CodeAgent that orchestrates both and writes a patch.

Tools (simplified)

from smolagents import tool

@tool
def fetch_repo_file(repo: str, path: str, ref: str = "main") -> str:
    """
    Fetch the content of a file from a GitHub repo.
    Args:
        repo: "owner/repo" string.
        path: file path in the repo.
        ref: branch or commit sha.
    """
    # In production, use GitHub API with proper auth & rate limiting.
    import requests

    url = f"https://raw.githubusercontent.com/{repo}/{ref}/{path}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text


@tool
def search_codebase(repo: str, query: str, top_k: int = 5) -> str:
    """
    Return up to top_k code snippets that seem related to `query`.
    Currently a stub; in prod, you'd use an embedding-based index.
    Args:
        repo: "owner/repo" string.
        query: natural-language or keyword search query.
        top_k: maximum number of snippets to return.
    """
    # For now, we just say "not implemented" and let the agent handle it.
    return f"Search for '{query}' in repo '{repo}' is not implemented yet."

Agent

from smolagents import CodeAgent, InferenceClientModel

model = InferenceClientModel(
    model_id="Qwen/Qwen2.5-Coder-32B-Instruct"  # strong coding model
)

dev_agent = CodeAgent(
    tools=[fetch_repo_file, search_codebase],
    model=model,
    executor_type="docker",  # safer than local for arbitrary code
    additional_authorized_imports=["requests"],
)

task = """
We want to add basic rate limiting to our FastAPI endpoints in the repo
`cohorte/example-api`. Find the main router file and propose a patch
that adds a simple in-memory rate limiter for the `/v1/completions` route.
Return a unified diff.
"""

print(dev_agent.run(task))

Why this is nice for developers:

The LLM can:
- Call search_codebase (eventually embedding-backed).
- Use fetch_repo_file to fetch specific files.
- Write Python code to transform strings into unified diffs.
You can inspect the exact code it ran to build that patch (critical for trust).
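The diff-building step needs nothing exotic. Here is a small sketch of the kind of code the agent could write, using the standard library's difflib; the file path and the before/after contents are made up for illustration.

```python
import difflib

def make_unified_diff(original: str, updated: str, path: str) -> str:
    """Return a unified diff between two versions of one file."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)

# Hypothetical file contents before and after the proposed change.
before = "def handler():\n    return run()\n"
after = "def handler():\n    check_rate_limit()\n    return run()\n"
patch = make_unified_diff(before, after, "app/router.py")
print(patch)
```

Note that difflib is part of the standard library, but the agent's executor still has to permit the import, so it may need to be listed in additional_authorized_imports.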

8. Example: multi-agent system with a manager and specialists

smolagents doesn’t force a huge graph engine on you, but it’s perfectly capable of multi-agent setups. Many GAIA benchmark projects use smolagents as a base: a manager agent, retrieval specialist, logic specialist, browser specialist, etc.

Let’s sketch a minimal version:

Three agents

ManagerAgent: breaks down tasks and delegates.
ResearchAgent: web search + summarization.
CodingAgent: writes Python or shell-like plans.

We’ll use:

ToolCallingAgent for the manager (no code execution).
CodeAgent for research & code.

Research agent

from smolagents import CodeAgent, InferenceClientModel, DuckDuckGoSearchTool

research_model = InferenceClientModel()
research_agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=research_model,
    executor_type="local",  # fine for simple, trusted tools, but the least isolated option
)

Coding agent

from smolagents import CodeAgent, InferenceClientModel

coding_model = InferenceClientModel(
    model_id="meta-llama/Llama-3.3-70B-Instruct"
)

coding_agent = CodeAgent(
    tools=[],  # you can add internal tools here
    model=coding_model,
    executor_type="e2b",  # strong sandbox for arbitrary code
)

Manager (ToolCallingAgent calling agents as tools)

We expose research_tool and coding_tool as tools that invoke the two other agents. This pattern is used in real multi-agent repos built on smolagents.

from smolagents import ToolCallingAgent, tool, InferenceClientModel

@tool
def research_tool(question: str) -> str:
    """
    Ask the research agent to investigate a question using web search and
    return a concise, cited summary.

    Args:
        question: The question to research.
    """
    return research_agent.run(
        f"Research this question and answer concisely with sources:\n{question}"
    )

@tool
def coding_tool(spec: str) -> str:
    """
    Ask the coding agent to write Python code that implements the spec.
    Return the code only.

    Args:
        spec: Plain-language description of what the code should do.
    """
    return coding_agent.run(
        f"Write Python code only, no explanation, for this spec:\n{spec}"
    )

manager_model = InferenceClientModel()
manager_agent = ToolCallingAgent(
    tools=[research_tool, coding_tool],
    model=manager_model,
)

print(manager_agent.run(
    "Build a small Python script that prints the latest AI news headline "
    "and the number of characters in it."
))

Flow:

Manager decides: call research_tool to get latest AI news.
Manager then uses coding_tool with a spec describing what to code.
You, or another orchestrator, decide how to execute that code (e.g., via the coding agent, CI pipeline, or human review).

9. How smolagents compares to other frameworks

We use a bunch of frameworks in practice; here’s how we’d compare them, focusing on developer ergonomics and control, not who’s “best”.

vs LangChain

LangChain:

Huge ecosystem (RAG, tools, retrievers, memory, eval, integrations).
Heavy abstraction—very powerful, but can feel “enterprisey”.

smolagents:

Much smaller surface area; easier to grok end-to-end.
Tight focus on agents + tools + code execution.
You’d typically pair it with LangChain for RAG, or export LangChain tools as MCP / Python tools.

Rule of thumb:

Use LangChain when you need sophisticated retrieval, vector stores, and a big ecosystem.
Use smolagents when you want crisp, inspectable agent logic that literally runs as code.

vs LangGraph

LangGraph:

Graph engine for LLM workflows (nodes, edges, state machines, retries).
Amazing for long-running workflows, human-in-the-loop, and complex DAGs.

smolagents:

Doesn’t try to be a full DAG engine.
You can still build multi-agent graphs, but using plain Python orchestration.
Great for “I just want a powerful agent right now without designing a whole graph”.

We’ve seen teams use LangGraph to orchestrate multiple smolagents as “leaf” nodes where code execution is needed.

vs CrewAI / AutoGen / custom frameworks

CrewAI / AutoGen:

Strong focus on multi-agent collaboration and role-based agents.
Often include planning templates, conversational protocols, etc.

smolagents:

More “bring your own collaboration patterns”.
Manager/worker, GAIA-style pipelines, deep research agents already exist in public repos, but they’re still just Python using CodeAgent/ToolCallingAgent.

If you want:

Prescriptive multi-agent UX → frameworks like CrewAI can be nice.
Maximum flexibility with real code as the final source of truth → smolagents shines.

10. Implementation tips that save you hours

10.1 Choose models explicitly

The defaults are fine for demos, but in practice:

from smolagents import InferenceClientModel, LiteLLMModel, OpenAIModel

# HF Inference - good general choice, variety of OSS models
hf_model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")

# OpenAI via dedicated wrapper (needs smolagents[openai])
openai_model = OpenAIModel(model_id="gpt-4.1-mini")

# Or via LiteLLM (smolagents[litellm]) for many providers in one
litellm_model = LiteLLMModel(model_id="gpt-4.1-mini")

This matches the model abstractions in the docs.

Tip: use cheaper models for “glue” agents (manager, routing, simple classification) and focus spend on the code-heavy or research-heavy agents.

10.2 Telemetry & observability from day one

smolagents gives you:

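agent.logs: per-step info (LLM prompts, tool calls, outputs); recent releases expose this as agent.memory.steps.
agent.write_inner_memory_from_logs() (agent.write_memory_to_messages() in recent releases): condensed “memory transcript” you can feed into another LLM to reflect, debug, or summarize an agent run.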

Patterns we like:

Persist logs to your existing logging stack (OpenTelemetry, Datadog, etc.).
Sample 10–20% of production runs and store full logs for offline analysis.
Build a small internal “agent run explorer” that shows:
- Steps
- Generated code
- Errors
- Tool outputs

This pays off fast when something behaves weirdly in production.
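One way to implement the sampling pattern is to hash the run id, so the keep/drop decision is deterministic and every service agrees on which runs are sampled. The sketch below assumes a string run id and a 15% rate; both are illustrative choices.

```python
import hashlib

def should_keep_full_logs(run_id: str, sample_rate: float = 0.15) -> bool:
    """Deterministically pick about `sample_rate` of runs based on a hash of the run id."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform value in [0, 1)
    return bucket < sample_rate

# Quick check of the effective rate over synthetic run ids.
kept = sum(should_keep_full_logs(f"run-{i}") for i in range(10_000))
print(kept / 10_000)
```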

10.3 Memory & state: don’t overcomplicate it early

smolagents has a memory tutorial and utilities for persistent state.

Our take:

Start stateless (task in, answer out).
Then layer:
- a short-term memory (last N interactions)
- plus explicit tools for loading & saving user or project state.
Only later add “always-on long-term memory” if you have a validated use case.

Over-eager memory tends to create debugging nightmares.
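A short-term memory can be as small as a bounded queue kept outside the agent. The sketch below is a plain-Python assumption of ours (the `ShortTermMemory` class is not a smolagents API): it keeps the last N turns and renders them as a prefix you could prepend to the next task.

```python
from collections import deque

class ShortTermMemory:
    """Keep only the last `max_turns` (task, answer) pairs."""

    def __init__(self, max_turns: int = 3):
        self._turns = deque(maxlen=max_turns)  # old turns fall off automatically

    def add(self, task: str, answer: str) -> None:
        self._turns.append((task, answer))

    def as_context(self) -> str:
        """Render remembered turns as a text prefix for the next task."""
        return "\n".join(f"Q: {t}\nA: {a}" for t, a in self._turns)

memory = ShortTermMemory(max_turns=2)
memory.add("first task", "first answer")
memory.add("second task", "second answer")
memory.add("third task", "third answer")
context = memory.as_context()
print(context)
```

Keeping this state in your own code, rather than inside the agent, makes it trivial to inspect or reset when a run goes wrong.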

10.4 When in doubt, push complexity into tools

If your agent’s prompts get long and fragile, that’s usually a smell that a tool should exist.

Examples:

API clients (Hub, Jira, Notion, internal services)
DB access (with strong guardrails)
Retrieval (vector DBs, search engines)
Heavy pure-Python data transformations

Let the LLM:

Decide which tool(s) to call in what order.
Decide how to glue their outputs together.

But don’t let it re-invent your business logic or security posture.

11. Common pitfalls & how to avoid them

Pitfall 1: “It’s fine, it’s just a prototype” (no sandbox)

Issue: Running CodeAgent with local execution and broad imports on a shared dev machine or server.

Fix:

For any environment with real data or credentials:
- Use executor_type="docker" or "e2b"/"blaxel".
- Keep additional_authorized_imports strict.
- Wrap secrets behind tools.

Pitfall 2: Giving the agent infinite power via imports

Issue: Letting the agent import subprocess, os, shutil, etc., directly.

Fix:

Don’t allow sensitive imports.
Expose safe wrappers as tools for any low-level operations you actually need.
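As an example of such a wrapper, here is a sketch of a file reader confined to one directory. The function name, the character cap, and the temporary workspace are assumptions for illustration; in a real project you would wrap the function as a tool and point it at a dedicated working directory.

```python
import tempfile
from pathlib import Path

def read_text_within(base_dir: Path, relative_path: str, max_chars: int = 10_000) -> str:
    """Read a text file only if it resolves to a location inside base_dir."""
    base = base_dir.resolve()
    target = (base / relative_path).resolve()
    if not target.is_relative_to(base):  # Python 3.9+
        raise PermissionError(f"path escapes the allowed directory: {relative_path}")
    return target.read_text(encoding="utf-8")[:max_chars]

with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "notes.txt").write_text("hello", encoding="utf-8")
    content = read_text_within(workspace, "notes.txt")
    try:
        read_text_within(workspace, "../outside.txt")
        escaped = True
    except PermissionError:
        escaped = False
print(content, escaped)
```

The agent gets the capability it needs (reading project files) without ever importing os or shutil itself.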

Pitfall 3: Debugging by vibes only

Issue: “Sometimes it works, sometimes it doesn’t, not sure why.”

Fix:

Inspect agent.logs after runs.
Promote “debug transcripts” to a standard debugging artifact.
For complex systems, build a small CLI/Gradio UI that shows each step and code snippet.

Pitfall 4: Trying to build a full orchestration layer inside a single agent

Issue: Giant prompts and monolithic agents that handle planning, execution, evaluation, and retries.

Fix:

Split responsibilities:
- Planner / manager
- Executors (CodeAgent, ToolCallingAgent)
- Evaluators / critics (possibly also agents)
Orchestrate them with plain Python at first.
If complexity explodes, consider layering in LangGraph.
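"Plain Python orchestration" can be a single function. The sketch below wires a planner, an executor, and a critic together with a bounded retry; all three are stubs here, and in real code each would wrap an agent call instead. The function names and the retry policy are assumptions, not a prescribed smolagents pattern.

```python
from typing import Callable

def orchestrate(
    task: str,
    plan: Callable[[str], list[str]],
    execute: Callable[[str], str],
    approve: Callable[[str, str], bool],
    max_retries: int = 2,
) -> list[str]:
    """Run planner -> executor -> critic, retrying a step the critic rejects."""
    outputs = []
    for step in plan(task):
        for _ in range(max_retries + 1):
            result = execute(step)
            if approve(step, result):
                outputs.append(result)
                break
        else:  # no attempt was approved
            raise RuntimeError(f"step failed after {max_retries + 1} attempts: {step}")
    return outputs

# Stubs standing in for agents.
attempts = {"count": 0}

def plan(task: str) -> list[str]:
    return [f"research: {task}", f"write code: {task}"]

def execute(step: str) -> str:
    attempts["count"] += 1
    return f"done({step}) try {attempts['count']}"

def approve(step: str, result: str) -> bool:
    return not result.endswith("try 1")  # reject the very first attempt

results = orchestrate("rate limiter", plan, execute, approve)
print(results)
```

If this function starts growing state machines and checkpoints, that is usually the signal to move to a graph engine.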

12. Key takeaways (for busy AI VPs & tech leads)

If you skimmed everything else, here’s the short version:

CodeAgent is a superpower
Letting agents “think in code” gives you natural loops, conditionals, and function composition, while keeping everything auditable as Python.
Security is manageable with the right setup
Use executor_type with isolated sandboxes (Docker, Modal, E2B, Blaxel), strict import whitelists, and narrow tools around sensitive resources.
smolagents plays well with the rest of your stack
It’s model-agnostic and tool-agnostic (MCP, LangChain tools, Spaces as tools), so you don’t have to pick a side in “framework wars”.
Great fit for high-leverage teams
- Research & prototyping agents
- Advanced RAG agents (GAIA-style reasoning, open-deep-research setups)
- Multi-agent architectures where you really want to inspect and control each step
The winning pattern is boring and reliable
- Tools encapsulate security & business logic.
- CodeAgents glue those tools together with code you can read.
- Orchestration is plain Python; evaluation and telemetry are first-class citizens.

If we had to sum it up:

Use smolagents when you want agents to be code, not just call code.

It’s minimal enough that your team can understand it in an afternoon, and powerful enough to power serious GAIA-level multi-agent systems with the right models and tooling.

Tega Adeyemi, December 8, 2025.
