
Part 3: Ollama for AI Model Serving

Ollama isn’t just an interactive tool—it can be a full-fledged AI service. In this article, we explore how to set up Ollama for model serving, turning it into a continuously running API that processes requests like OpenAI’s service—except on your own infrastructure. You’ll learn how to optimize performance, implement a simple serving setup with code, and discover real-world use cases where this approach makes sense. Let’s dive in.

Tega Adeyemi

Thus far, we’ve focused on using Ollama interactively and in various integrations. In this article, we zero in on using Ollama as a service for AI model serving and hosting. This means running Ollama so that it continuously serves requests (potentially from other applications or users), much as an API service like OpenAI’s operates, but on your own infrastructure. We will explore how to set up Ollama for model serving, cover strategies to optimize performance, and walk through a step-by-step implementation of a simple model serving scenario with code. We’ll also discuss real-world applications and scenarios where serving a model via Ollama makes sense.


AI model serving typically involves making your model available as a service (usually via an API) so that it can be integrated into applications (web apps, chatbots, etc.) or accessed by multiple clients. Ollama is well-suited to this role on a small to medium scale, given its built-in API endpoints and lightweight footprint. Let’s dive into how to leverage Ollama in this way.

Using Ollama for AI Model Serving and Hosting

To serve an AI model means to have it running and ready to accept queries at any time, much like a web server continuously running to handle web requests. With Ollama, the core piece of this is the ollama serve command. When you run ollama serve, you launch the Ollama server process that listens for API requests on a port (default is 11434). Essentially, Ollama becomes an HTTP server that you can send requests to in order to generate text from a model.

Key points about Ollama’s serving mode and API:

ollama serve

In summary, using Ollama for model serving involves running it as a persistent service and sending it HTTP requests. It transforms from a CLI tool you use manually into a backend service continuously listening for work.
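Before sending prompts, it can help to confirm that the server is reachable and that the model you expect has been pulled. The sketch below assumes the default address http://localhost:11434 and Ollama’s /api/tags endpoint, which lists locally available models. The helper names (parse_model_names, list_models, has_model) are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # assumed default address


def parse_model_names(tags_payload: dict) -> list:
    """Extract model names from the JSON body returned by /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_models(base_url: str = OLLAMA_BASE_URL, timeout: float = 5.0) -> list:
    """Ask a running Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))


def has_model(names: list, wanted: str) -> bool:
    """True if `wanted` matches a pulled model, with or without its tag suffix."""
    return any(n == wanted or n.split(":")[0] == wanted for n in names)
```

With the server running, has_model(list_models(), "mistral") tells you whether a request for that model can succeed before any client depends on it.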

Optimization Techniques for Better Performance

Serving a model efficiently requires some optimization to ensure responses are as fast as possible and the system is stable. Here are important optimization techniques and considerations when using Ollama:

By applying these techniques, you can significantly improve Ollama’s performance as a model server. Many users have reported that with quantization and a decent CPU, they can get response times on the order of a couple of seconds for smaller models, which is sufficient for interactive use. With a GPU, responses can be near real-time for medium models. Tuning is an iterative process: you might experiment with different model variants or settings to hit the balance of speed and output quality that your application needs.
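Iterative tuning is easier with a number to compare between runs. A non-streaming /api/generate response includes timing fields such as eval_count, eval_duration, load_duration, and total_duration, with durations reported in nanoseconds. The sketch below turns those into tokens per second. The function name generation_stats is an assumption made for illustration.

```python
NANOSECONDS_PER_SECOND = 1e9


def generation_stats(resp: dict) -> dict:
    """Summarize the timing fields of a non-streaming /api/generate response."""
    eval_count = resp.get("eval_count", 0)       # number of generated tokens
    eval_ns = resp.get("eval_duration", 0)       # time spent generating, in ns
    eval_seconds = eval_ns / NANOSECONDS_PER_SECOND
    return {
        # Guard against a missing or zero duration to avoid dividing by zero
        "tokens_per_second": eval_count / eval_seconds if eval_ns else 0.0,
        "load_seconds": resp.get("load_duration", 0) / NANOSECONDS_PER_SECOND,
        "total_seconds": resp.get("total_duration", 0) / NANOSECONDS_PER_SECOND,
    }
```

Running the same prompt against two model variants and comparing tokens_per_second gives a rough but repeatable basis for choosing between them.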

Step-by-Step Implementation Guide (with Code Examples)

Let’s walk through setting up a simple AI model serving scenario with Ollama. In this example, we’ll assume you want to serve a model that answers questions (a Q&A or chat model) and that you’ll interact with it via HTTP requests (say from a web app or a curl command).

Step 1: Start the Ollama Server

First, make sure you have pulled the model you want to serve (for example, “mistral” or “llama2”), then start the server:

$ ollama pull mistral
$ ollama serve

This will start Ollama in server mode, typically bound to localhost:11434. You might run this on a server machine, in which case consider running it as a background process. If using Linux, you could do:

$ nohup ollama serve &> ollama.log &

This will run it in the background and log output to ollama.log. Once running, the server is awaiting requests.
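When the server is started in the background, a deployment script usually needs to wait until it is actually accepting connections. The sketch below polls the server’s root URL, which answers with HTTP 200 once Ollama is up. The names probe and wait_until_ready, and the retry settings, are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str = "http://localhost:11434/") -> bool:
    """Return True if the server answers the root URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all as "not ready yet"
        return False


def wait_until_ready(check: Callable[[], bool] = probe,
                     attempts: int = 30,
                     delay: float = 1.0) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False
```

Passing the check as a parameter keeps the retry loop easy to test without a live server.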

Step 2: Construct a Request

We will use the API to ask the model something. The simplest way to test is using curl from a terminal or a tool like Postman.

Using Ollama’s own API endpoint (/api/generate):

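curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral",
        "prompt": "What is the capital of France?",
        "system": "",
        "stream": false,
        "options": {
            "temperature": 0.7,
            "num_predict": 100
        }
      }'

Let’s break down this JSON:

- model: the name of the pulled model that should handle the request.
- prompt: the text sent to the model.
- system: an optional system prompt, left empty here.
- stream: set to false so the server returns a single JSON object; by default, /api/generate streams the answer as a series of partial JSON objects.
- options: generation settings. temperature controls randomness, and num_predict caps the number of generated tokens (Ollama’s native API uses num_predict where OpenAI’s API uses max_tokens).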

This curl command posts the request and should get back a JSON response, which will contain the model’s answer. A possible response might look like:

{
  "response": "The capital of France is Paris.",
  "model": "mistral",
  "created_at": "2025-03-17T08:48:00Z"
}

The exact format can vary, but essentially you get the text in a field (here "response"). If there was an error (like model not found or out-of-memory), you’d get an error code and message.
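One way the format varies: unless the request sets "stream": false, /api/generate replies with newline-delimited JSON, one object per line, each carrying a fragment of the answer in "response" and a "done" flag on the last one. The sketch below reassembles such a stream and surfaces an "error" field if one appears. The function name iter_response_text is an assumption for illustration.

```python
import json
from typing import Iterable, Iterator, Union


def iter_response_text(lines: Iterable[Union[bytes, str]]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream of generate chunks."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.strip():
            continue  # skip keep-alive or blank lines
        chunk = json.loads(line)
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final chunk marks the end of the answer
```

With the requests library, the lines could come from requests.post(url, json=payload, stream=True).iter_lines(), and "".join(iter_response_text(lines)) gives the full answer.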

Using the OpenAI-compatible endpoint (/v1/chat/completions):

Alternatively, do:

curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
        "model": "mistral",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
      }'

Here we include an Authorization header because the OpenAI API usually requires one, but Ollama will ignore the token value. In the JSON, we specify the model and provide a conversation as a list of messages with roles (just one user message in this case). The response will likely be in the format:

{
  "id": "cmpl-xyz123...",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "mistral",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ]
}

This mirrors OpenAI’s response format, which is nice if you have existing code expecting that format. The important part is in choices[0].message.content.
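Because the chat endpoint is stateless, a multi-turn conversation means resending the earlier messages with each request. The sketch below shows one way to pull the answer out of a response and carry the history forward. The helper names (extract_answer, next_request, record_turn) are hypothetical and exist only for this illustration.

```python
def extract_answer(completion: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["message"]["content"]


def next_request(model: str, history: list, user_text: str) -> dict:
    """Build the request body for the next turn without mutating `history`."""
    return {
        "model": model,
        "messages": history + [{"role": "user", "content": user_text}],
    }


def record_turn(history: list, user_text: str, completion: dict) -> list:
    """Return a new history that includes the user turn and the model's reply."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": extract_answer(completion)},
    ]
```

Returning new lists instead of mutating the history keeps each conversation’s state explicit, which matters once several users share one server.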

Step 3: Integrate into an Application

Now that we can curl the server, integrating into an app is straightforward. For example, in Python using the requests library:

import requests

url = "http://localhost:11434/api/generate"
payload = {
    "model": "mistral",
    "prompt": "Tell me a joke about cats.",
    "stream": False,  # return one JSON object instead of a stream of chunks
    "options": {"num_predict": 50},  # cap the number of generated tokens
}
# json= serializes the payload and sets the Content-Type header
response = requests.post(url, json=payload, timeout=60)
if response.status_code == 200:
    data = response.json()
    print("Model says:", data.get("response"))
else:
    print("Error:", response.status_code, response.text)

This snippet sends a prompt to the model and prints the response. In a web app (say Flask or Node.js), you would do similar calls on the server side whenever you need the model’s output. For instance, an endpoint /ask in your Flask app can take a question from the frontend, call the Ollama API, then return the answer to the frontend.

Step 4: Enable Streaming (if needed)

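To stream the response (receive it token by token) through the OpenAI-compatible endpoint, you can use OpenAI’s Python SDK (version 1.0 or later):

from openai import OpenAI

# Point the SDK at Ollama's OpenAI-compatible API.
# The SDK requires a key, but Ollama ignores its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue  # skip chunks that carry no choices
    # Each chunk holds an incremental delta; its content may be None
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()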

This prints the answer as it arrives (an OpenAI-style stream returns incremental delta objects). In your actual app, streaming could mean pushing partial results to a web client over WebSockets or server-sent events, or simply showing a typing indicator until the answer is complete.
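To forward partial results to a browser, the text fragments can be wrapped as server-sent events. The sketch below only does the formatting. The to_sse name, the {"text": ...} payload shape, and the closing "done" event are assumptions chosen for this example, not a fixed protocol.

```python
import json
from typing import Iterable, Iterator


def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text fragments as server-sent events, ending with a 'done' event."""
    for text in chunks:
        if not text:
            continue  # nothing to send for empty deltas
        payload = json.dumps({"text": text})  # JSON keeps newlines in text safe
        yield f"data: {payload}\n\n"
    yield "event: done\ndata: {}\n\n"
```

In a Flask backend, a generator like this could be returned as Response(to_sse(fragments), mimetype="text/event-stream"), with the frontend reading it through the browser’s EventSource API.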

Step 5: Optimize as per earlier section

If you find the responses slow, try some optimizations:

Step 6: Scale if necessary

If you deploy this and start getting more traffic:

Example Real-world Implementation:

Imagine you build a simple QA web service. A user goes to a webpage, enters a question, and hits ask. Your frontend JavaScript sends an AJAX request to your backend /ask with the question. Your backend (say a Python Flask app) receives it and calls Ollama’s API as shown above, then sends the answer back to the frontend, which displays it. If you implement streaming, you could open a WebSocket or server-sent events so the answer appears word-by-word to the user. The user experiences something very akin to using a cloud AI service, but everything is powered by your Ollama instance.

It’s straightforward but powerful: with under 100 lines of code, you can create your own mini ChatGPT-like service internally.

To tie this together, here is a concise Flask example:

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "mistral"

@app.route('/ask', methods=['POST'])
def ask():
    # Tolerate a missing or non-JSON body instead of raising
    body = request.get_json(silent=True) or {}
    user_question = body.get('question')
    if not user_question:
        return jsonify({"error": "Missing 'question' field"}), 400
    # "stream": False makes Ollama return one JSON object rather than a stream of chunks
    payload = {"model": MODEL_NAME, "prompt": user_question, "stream": False}
    try:
        resp = requests.post(OLLAMA_URL, json=payload, timeout=30)
        resp.raise_for_status()
        return jsonify({"answer": resp.json().get("response", "")})
    except requests.RequestException as e:
        # Covers connection failures, timeouts, and HTTP error statuses
        return jsonify({"error": str(e)}), 502

# Running the Flask app (in production, use gunicorn or similar)
if __name__ == "__main__":
    app.run(port=5000)

This creates an /ask endpoint. A client can POST {"question": "What is 2+2?"} and get back {"answer": "4"} (for example). Of course, in practice you'd add security, user auth, etc., but the core is just forwarding the prompt to Ollama.
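As a starting point for the security mentioned above, a shared API key can be checked before a request is forwarded to Ollama. The sketch below is a minimal illustration under the assumption that clients send the key in an Authorization: Bearer header. The helper names are hypothetical, and a real deployment would likely need per-user credentials and rate limiting as well.

```python
import hmac
from typing import Mapping, Optional


def bearer_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' header, if present."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Compare the presented token to the expected key in constant time."""
    token = bearer_token(headers)
    if token is None:
        return False
    return hmac.compare_digest(token.encode(), expected_key.encode())
```

In the Flask handler, a call such as is_authorized(request.headers, key) could run at the top of ask() and return a 401 response when it fails, with the key loaded from configuration instead of being hard-coded.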

Real-World Applications of AI Model Serving with Ollama

By serving models via Ollama, you unlock numerous application possibilities. Some real-world applications include:

It’s inspiring to see the creative ways people apply local model serving. Many started doing so because of either privacy or cost motivations, but it has opened up new possibilities (like truly offline AI experiences).

One real example to highlight: There have been projects using local LLMs to control robotics. A research team connected an Ollama-served model to a robot arm’s control system. They would describe a task in natural language, the model would generate step-by-step instructions, and those instructions would be executed by the robot (with some safety checks). Running the model locally was crucial for real-time control and also to ensure safety (no latency or unpredictability due to network issues). This is a niche application, but it shows that once you have an LLM available as a service on a machine, you can connect it to just about anything.

Another notable application: combining with RAG (Retrieval Augmented Generation). If you serve both an embedding model and a chat model via Ollama, you can create a pipeline where a user’s query is first used to fetch relevant documents (via vector search using the embedding model), then those documents are given to the chat model to formulate an answer. This could be the backend of a custom search engine or a specialized assistant (like legal document Q&A, or technical documentation assistant). All of this can be done with Ollama serving the necessary models and some glue code for the retrieval part.
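The glue code for the retrieval step can be quite small. The sketch below ranks documents by cosine similarity to the query and builds a prompt from the best matches. The embed argument stands in for a call to an embedding model served by Ollama and is deliberately left as a parameter; the function names and the prompt wording are assumptions for illustration. A real system would precompute document embeddings and store them in a vector index instead of embedding every document per query.

```python
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: str, docs: Sequence[str], embed: Embedder, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_docs: Sequence[str]) -> str:
    """Combine retrieved documents and the question into one prompt for the chat model."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

The string returned by build_prompt is what you would send as the prompt (or user message) to the chat model.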

Final Thoughts

Ollama’s ability to serve AI models transforms it from a developer’s tool into a full-fledged AI service platform on a small scale. We’ve discussed how to set up the serving functionality, optimize it, and integrate it into applications. By hosting models through Ollama, you effectively become your own AI provider, which comes with responsibilities (maintaining the service) but also tremendous benefits (privacy, control, customization).

The real-world applications are diverse – from enterprise solutions to innovative consumer devices, from educational tools to research projects. Many of these would have been difficult or impossible without a local serving capability, especially where internet is unreliable or data must remain secure.

It’s important to remember that while Ollama can handle the serving role, you should always test and ensure it meets the demands of your particular application. If your needs grow, you might combine Ollama with more scalable infrastructure or look at dedicated model serving frameworks. But in a large number of cases, Ollama is perfectly sufficient and significantly simpler to get running.

To wrap up, if you’ve followed along from Article 1 through 3, you’ve essentially learned to: set up and use Ollama, integrate it with other tools, and deploy it as a service. The final article in this series will target developers and ML engineers specifically, discussing how Ollama fits into their workflows and how it can streamline development and deployment of AI models.

Tega Adeyemi, March 19, 2025