Using Ollama with Python: Step-by-Step Guide

Overview of Ollama
Ollama is an open-source tool that makes it easy to run large language models (LLMs) on your local machine. Its purpose is to streamline the use of open-source LLMs (such as Llama 2, Mistral, Falcon, etc.) without complex setup. Ollama provides a user-friendly platform that bundles model weights, configurations, and even datasets into a unified package (managed by a Modelfile) for each model. In essence, it works similarly to Docker for AI models – you can pull pre-built model packages and run them locally with minimal hassle. Key features of Ollama include a library of ready-to-use models, a simple command-line interface (CLI) and REST API, and support for customization (you can even create your own model packages). This enables developers to harness powerful LLMs entirely offline and under their control.
Benefits of Using Ollama
Why should developers use Ollama? Running LLMs locally with Ollama offers several advantages, especially in terms of efficiency, flexibility, and scalability:
• Efficiency (Speed & Cost): By keeping inference on local hardware, Ollama eliminates network latency and can deliver faster responses for LLM queries. There’s no need to send data to external servers, which not only speeds up interactions but also cuts costs – you avoid paying for cloud API usage or subscriptions. Ollama is optimized for resource usage (e.g. via llama.cpp under the hood), enabling quick, lightweight execution of models on CPUs or GPUs. In short, you get low-latency performance and cost-efficient AI since you leverage your own machine’s resources.
• Flexibility & Control: Ollama gives you full control over your models and data. You can choose from a wide range of open-source models or fine-tune and swap models as needed, all within your environment. Because everything runs locally, data privacy is ensured – sensitive information never leaves your machine, an important benefit for healthcare, finance, or other data-sensitive domains. You can also integrate Ollama with other tools and frameworks (for example, using it alongside TensorFlow, PyTorch, or LangChain) to fit it into your existing workflow. This means developers can customize how the LLM operates, define special system prompts or rules, and even extend the model’s capabilities with plugins or functions (as we’ll see later). The result is a highly adaptable setup tailored to your project’s needs.
• Scalability: Even though Ollama runs on a local machine, it is designed to scale with your requirements. As your needs grow, you can upgrade hardware or deploy Ollama on multiple machines (or containers) to handle bigger models or higher loads. There’s no dependency on a third-party service’s limits – you can scale up by pulling more capable models (from 7B parameters to 70B or more) or running several model instances in parallel. Ollama’s efficient local deployment can also be used in cloud or on-premise servers for enterprise scaling. In fact, it’s possible to integrate Ollama into a Kubernetes or server cluster for a robust, scalable AI solution that you control (some community projects even explore an “Ollama Cloud” for cluster deployments). This flexibility means you can start small and expand your LLM-powered application without completely changing your stack.
In summary, Ollama’s local-first approach provides enhanced privacy, no recurring API fees, and the freedom to experiment or customize deeply. These benefits make it an attractive choice for developers who want to incorporate AI capabilities efficiently and reliably in their applications.
Getting Started
Getting started with Ollama involves two parts: installing the Ollama engine (which runs the models) and setting up the Python library to interface with it.
Installation of Ollama (Engine/CLI): Ollama supports macOS, Linux, and Windows. For macOS and Linux, installation is straightforward. You can download the installer from the official Ollama website or use a one-line shell command. For example, on macOS you can use Homebrew, or on Linux run the provided script:
# On macOS (using Homebrew):
brew install ollama
# On Linux (using the official install script):
curl -fsSL https://ollama.com/install.sh | sh
This will install the Ollama CLI. After installation, start the Ollama service (on macOS you might run brew services start ollama, or simply running any Ollama command will start its background server). You can verify it’s installed by checking the version:
ollama --version
Next, download a model to run locally. For example, to fetch Llama 2:
ollama pull llama2
This command downloads the model weights and sets the model up locally (the first time you run a model, Ollama will pull it automatically if not present). Make sure you have enough disk space and RAM for the model you choose; for instance, a 7B model like Llama 2 needs roughly 8 GB of available RAM, and larger models need proportionally more. You can list available models with ollama list and see details of a model with ollama show <model-name>.
Installing the Python Library: With the Ollama engine ready and a model available, the next step is to install the Python SDK for Ollama. This library allows Python code to communicate with the Ollama backend via its REST API. Install it using pip:
pip install ollama
This gives you the ollama Python package (make sure you’re using Python 3.8+ as required). Now you can interact with the local models from your Python scripts or applications.
Running the First Example: Let’s test that everything is set up correctly by running a simple generation. In a Python interpreter or script, try the following:
import ollama
# Use the generate function for a one-off prompt
result = ollama.generate(model='llama2', prompt='Why is the sky blue?')
print(result['response'])
In this code, we call ollama.generate with a model name (here "llama2", which we pulled earlier) and a prompt string. The model processes the prompt, and the result contains the model’s answer under the 'response' key (recent versions of the library return a typed response object that still supports dictionary-style access). For example, the output might be a scientific explanation of why the sky appears blue. If everything is configured properly, you should see the answer printed in the console.
Using the Chat API: Ollama also provides a chat interface for conversation-style interactions. For instance:
from ollama import chat
conversation = [
    {"role": "user", "content": "Hello, how are you?"}
]
reply = chat(model='llama2', messages=conversation)
print(reply.message.content)
This uses a chat-based model (like an instruction-tuned Llama 2) to respond to a user message. The messages parameter takes a list of dialogue turns with roles ("user", "assistant", etc.). The ollama.chat function returns a response object where reply.message.content holds the assistant model’s latest reply. Using this, you can build interactive applications easily. (We will expand on chatbots in the next section.)
Note: The first time you use a model, there may be a download delay. After that, responses will stream directly from your local machine. You can also enable streaming in the Python API by setting stream=True if you want token-by-token output for responsiveness.
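For instance, a minimal streaming sketch might look like this (each chunk carries a partial piece of the reply under its message content):
import ollama

# Stream the reply token-by-token instead of waiting for the full response
stream = ollama.chat(
    model='llama2',
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)
for chunk in stream:
    # Print each partial piece of the assistant's message as it arrives
    print(chunk['message']['content'], end='', flush=True)
print()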
Example Use Cases
Ollama can power various AI applications. Below we explore two common Python-based use cases to demonstrate its capabilities.
Use Case 1: AI Chatbot
One of the most straightforward uses of Ollama is to create an AI chatbot. Because Ollama can run conversational models (like Llama-2-Chat or other instruction-tuned models) locally, you can build a private ChatGPT-style assistant.
Scenario: Imagine building a customer support chatbot or a personal assistant that runs entirely offline. Ollama’s local LLM will handle the natural language understanding and response generation.
How to build it: You would maintain a conversation history and continually send it to ollama.chat as the user interacts. For example:
import ollama

# Choose a chat-capable model (ensure it is pulled)
model_name = 'llama2'

# Initialize conversation with a system prompt (optional) and a user message
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# First response from the bot
response = ollama.chat(model=model_name, messages=messages)
print("Bot:", response.message.content)
messages.append({"role": "assistant", "content": response.message.content})

# Continue the conversation:
while True:
    user_input = input("You: ")
    if not user_input:
        break  # exit loop on empty input
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(model=model_name, messages=messages)
    answer = response.message.content
    print("Bot:", answer)
    messages.append({"role": "assistant", "content": answer})
In this code:
• We start with a system role that defines the assistant’s behavior (here simply instructing it to be helpful) and an initial user greeting.
• We call ollama.chat with the conversation so far, and print the bot’s reply.
• Then we enter a loop to keep reading user input and sending updated messages back to ollama.chat. We always append the newest user message and the assistant’s reply to the message list to maintain context.
• The loop breaks on empty input (as a way to exit).
This simple chatbot will allow you to have a back-and-forth dialogue with the LLM entirely through Python. Each call to ollama.chat returns the model’s newest reply given the conversation context. Because the model runs locally, responses come as fast as your hardware allows, and you have full privacy in your conversations. Such a chatbot could be extended with a GUI or a web interface, but the core logic would remain the same. The Ollama Python library abstracts the heavy lifting, so building a basic chat application is only a few lines of code.
Use Case 2: Automation in Workflows
Beyond chatbots, Ollama can be used to automate tasks and enhance workflows by leveraging AI capabilities. For example, you might use an LLM to summarize documents, generate reports, assist in coding, or interpret commands in an automation script. Essentially, Ollama lets you embed an AI “brain” into your Python projects.
Scenario: Suppose you have a daily workflow of reading through lengthy log files or meeting transcripts. You can use Ollama to automate summarization of these texts. This saves time by having the AI highlight key points for you.
How to do it: You can prompt an LLM to summarize or analyze text and integrate that into your pipeline. For instance:
import ollama
# Example: Summarize a paragraph of text
text = """
Ollama is a newly introduced open-source tool that lets users run large language models on local machines.
This approach emphasizes privacy and control, as data does not leave the user's environment.
Developers can leverage various open-source models through a simple interface, improving efficiency and reducing costs.
"""
prompt = f"Summarize the following text in one sentence:\n\"\"\"\n{text}\n\"\"\""
result = ollama.generate(model='llama2', prompt=prompt)
print("Summary:", result['response'])
Here we took a piece of text and constructed a prompt asking the model to summarize it in one sentence. The ollama.generate function returns the summary, which might be something like: “Ollama is a new tool that enables running large language models locally, giving users more privacy, control, and cost efficiency.” This kind of automation can be plugged into a larger script – for example, iterating over multiple documents and writing summaries to a file, as sketched below.
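A minimal sketch of that batch idea (the folder and file names here are hypothetical placeholders) could loop over a directory of text files and write one-line summaries to a single output file:
import ollama
from pathlib import Path

# Hypothetical locations; adjust to your own workflow
input_dir = Path("logs")
output_file = Path("summaries.txt")

with output_file.open("w", encoding="utf-8") as out:
    for path in sorted(input_dir.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        prompt = f"Summarize the following text in one sentence:\n\"\"\"\n{text}\n\"\"\""
        result = ollama.generate(model='llama2', prompt=prompt)
        # One line per document: "filename: summary"
        out.write(f"{path.name}: {result['response']}\n")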
Another automation example could be code generation or assistance. Suppose you want to automate writing boilerplate code or configuration files. You could prompt the model with instructions and have it output code, which your Python program then saves to disk. For instance:
code_prompt = "Write a Python function that checks if a number is prime."
response = ollama.generate(model='codellama', prompt=code_prompt)
print(response['response'])
Using a code-specialized model (like codellama in this example) would return the code for a prime-checking function, which you could then use in your project. This shows how Ollama can enhance developer workflows by automating parts of coding or documentation tasks.
In general, integrating Ollama into automation means you can have AI-driven features in any Python workflow without external API calls. The flexibility of the Python SDK (with features like ollama.list(), ollama.pull(), ollama.delete(), etc.) also means your program can manage models on the fly – for example, pulling a required model at runtime or switching models for different tasks. This makes it possible to build intelligent agents, batch processors, or AI-powered assistants that streamline complex tasks.
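As a small sketch of that model management idea (the exact shape of ollama.list()’s return value varies slightly between library versions), a helper could make sure a model is present before your workflow uses it:
import ollama

def ensure_model(name: str) -> None:
    # ollama.list() reports locally installed models; in recent library
    # versions each entry exposes its name via the .model attribute
    # (older releases return plain dictionaries instead).
    installed = [m.model for m in ollama.list().models]
    if not any(existing and existing.startswith(name) for existing in installed):
        print(f"Model '{name}' not found locally, pulling it...")
        ollama.pull(name)

ensure_model('llama2')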
Step-by-Step Example: Building a Simple Agent with Ollama
To bring everything together, let’s walk through building a simple AI agent using Ollama. Our agent will be a chatbot that can not only converse, but also perform a simple calculation using a Python function as a tool. This demonstrates how to use Ollama’s advanced features (function calling) in a step-by-step manner.
Step 1: Setup the Environment
Make sure Ollama is installed and running (as described in Getting Started). Also, ensure you have a suitable model pulled. For this agent, we’ll use an instruction-following model (for general Q&A) and enable it to use a tool for math. The model must support tool calling, so use one like llama3.1 (older models such as llama2 do not support tools). If not already done, run ollama pull llama3.1 in the terminal beforehand. In your Python script or notebook, import the Ollama library and any other modules you need:
import ollama
If you plan to use a function from another library as a tool (for example, one that calls an external API via the requests library), you would import it as well. For now, our agent’s tool will be a simple internal function.
Step 2: Define a Tool Function (Optional)
Ollama’s Python library allows you to pass in Python functions as tools that the model can call during a chat. This is great for creating agents that can take actions or fetch information. We’ll define a basic math function for our agent to use:
""" Tool function: add two numbers """
def add_two_numbers(a: int, b: int) -> int:
"""
Add two numbers and return the result.
"""
return a + b
This function simply takes two integers and returns their sum. (Notice we included type hints and a docstring; providing this metadata can help the LLM understand the tool’s purpose better.) We could register more functions if needed, but we’ll keep it simple.
Step 3: Initialize the Agent Conversation
Now, let’s set up the initial context for the agent. We’ll give it a system instruction that informs the model about the available tool and when to use it. For example, the system message might say: “You are a math assistant. If the user asks a math question, you can call the add_two_numbers function.” This guides the model to utilize the tool appropriately. We’ll also prepare a user query that will require the tool:
""" System prompt to inform the model about the tool is usage """
system_message = {
"role": "system",
"content": "You are a helpful assistant. You can do math by calling a function 'add_two_numbers' if needed."
}
# User asks a question that involves a calculation
user_message = {
"role": "user",
"content": "What is 10 + 10?"
}
messages = [system_message, user_message]
Step 4: Chat with the Model and Provide the Tool
We now call the Ollama chat API, providing the model name, the conversation messages, and our tool function in the tools parameter:
response = ollama.chat(
    model='llama3.1',
    messages=messages,
    tools=[add_two_numbers]  # pass the actual function object as a tool
)
When we include tools=[add_two_numbers], under the hood, the Ollama library makes the function’s signature and docstring available to the model. The model, upon seeing the user question “What is 10 + 10?”, can decide to call the add_two_numbers tool instead of trying to do math itself. Recent versions of Ollama support this kind of function calling out of the box for tool-enabled models such as Llama 3.1.
Step 5: Handle the Tool Response
The result we get from ollama.chat is a response object that may include a tool call. We need to check if the model indeed requested to use our function. The response will have a property response.message.tool_calls, which is a list of any tool invocations the model decided to make. We can process it like so:
if response.message.tool_calls:
    for tool_call in response.message.tool_calls:
        func_name = tool_call.function.name    # e.g., "add_two_numbers"
        args = tool_call.function.arguments    # e.g., {"a": 10, "b": 10}
        # If the function name matches and we have it in our tools, execute it:
        if func_name == "add_two_numbers":
            result = add_two_numbers(**args)
            print("Function output:", result)
In this snippet, we loop through any tool calls (there could be multiple, but in our case we expect one). We match the function name and then call add_two_numbers with the arguments provided by the model. The result (in this case 20) is printed out or could be fed back to the model.
What about the model’s own answer? Typically, when an LLM uses function calling, it might initially respond with a placeholder or a reasoning step like: “I will use the add_two_numbers tool.” After executing the tool, you might send the result back into the conversation for the model to generate a final answer. For simplicity, we can assume the model’s answer will be completed after the function call. If we want the agent to explicitly return the final answer to the user, we could append a new message with the function result and prompt the model to conclude the answer.
Putting it all together, a simple agent loop might look like:
""" (Continuing from previous code) """
available_functions = {"add_two_numbers": add_two_numbers}
""" Model's initial response after possibly invoking the tool """
assistant_reply = response.message.content
print("Assistant (initial):", assistant_reply)
""" If a tool was called, handle it """
for tool_call in (response.message.tool_calls or []):
func = available_functions.get(tool_call.function.name)
if func:
result = func(**tool_call.function.arguments)
# Provide the result back to the model in a follow-up message
messages.append({"role": "assistant", "content": f"The result is {result}."})
follow_up = ollama.chat(model='llama2', messages=messages)
print("Assistant (final):", follow_up.message.content)
This way, the agent uses the tool and then the model concludes with the final answer. When you run the full script, the output might look like:
Assistant (initial): Let me calculate that for you...
Function output: 20
Assistant (final): The result is 20.
The agent successfully answered “What is 10 + 10?” by using a Python function for the calculation. You have effectively created a simple AI agent that can extend its capabilities beyond the LLM’s built-in knowledge, all running locally. You could add more tools or more complex logic similarly. For instance, you could integrate an API call (like a weather API) by providing a tool function that fetches weather data, and the model can decide to call it when the user asks about the weather.
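As a sketch of that weather idea (the endpoint, query parameters, and response fields below are hypothetical placeholders, not a real weather API), such a tool function might look like this:
import requests

def get_current_weather(city: str) -> str:
    """Return a short weather description for the given city."""
    # Hypothetical endpoint; substitute a real weather API and credentials here
    resp = requests.get("https://api.example.com/weather", params={"q": city}, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    return f"{city}: {data.get('description', 'unknown')}, {data.get('temp_c', '?')} °C"

# Passed to the chat call just like add_two_numbers:
# response = ollama.chat(model='llama3.1', messages=messages,
#                        tools=[add_two_numbers, get_current_weather])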
Through this step-by-step example, we saw how to set up Ollama, load a model, and use the Python SDK to implement an agent that combines AI reasoning with real actions. The Ollama Python library makes it straightforward to take a concept from an idea (AI agent) to a working prototype with just a few dozen lines of code.
Final Thoughts
In this guide, we covered the fundamentals of using Ollama with Python: from understanding what Ollama is and why it’s beneficial, to setting it up, exploring key use cases, and building a simple agent. With Ollama, developers can run powerful language models locally and integrate them into applications with ease. The Python API is intuitive – you can get started with basic generate or chat calls, and then explore advanced features like custom system prompts, streaming responses, or function tools as needed.
By leveraging Ollama, you gain privacy (your data stays local), flexibility in choosing or even fine-tuning models, and potentially lower costs and latency. We demonstrated how to make an AI chatbot and automate a workflow task using Ollama. These examples just scratch the surface.
For next steps, you might want to experiment with different models from the Ollama library (e.g., try a code-focused model for programming assistance, or a larger 13B+ parameter model for more nuanced conversations). You can also integrate Ollama into existing AI frameworks – for example, using LangChain’s Ollama integration to build more complex agents or chains. The community and documentation (check out the official Ollama docs and GitHub) have many more examples, such as employing embeddings for semantic search or running Ollama in a web server mode to serve API requests.
With this solid foundation, you can expand your AI projects confidently. Whether it’s a personal assistant, a data analysis tool, or a custom chatbot for your website, Ollama empowers you to develop and scale these solutions on your own terms. Happy building!
Cohorte Team
March 3, 2025