
Run LLMs Locally with Ollama: Privacy-First AI for Developers in 2025

Run LLMs locally in 2025 with full data control. Explore Ollama’s latest features, real Python examples, and GDPR-ready AI workflows that scale.

Tega Adeyemi

Local large language models (LLMs) are rapidly gaining traction as businesses seek production-ready, privacy-conscious AI solutions. Especially in Europe, where GDPR and data sovereignty are paramount, keeping data on-premise has huge appeal. This guide introduces Ollama – a lightweight framework for running LLMs locally – focusing on its latest releases v0.8.0 and v0.9.0. We’ll explore how Ollama empowers developers and AI leaders to rapidly prototype and confidently deploy AI, all while maintaining full control over data. Along the way, we’ll include practical Python examples (for private chatbots, RAG pipelines, fine-tuning), compare Ollama to other local frameworks (like LM Studio, LMQL, OpenDevin) and hosted APIs (OpenAI, Claude, Mistral), and share tips for dev and production settings. Let’s dive into the world of local LLMs with Ollama – where your data stays home and your ideas go big!

Running a model locally is the easy part; productionising the stack around it is what we cover in Cohorte's AI Engineering Foundations course (E1).

What is Ollama?

Ollama is an open-source, extensible platform designed to simplify running open-source LLMs on your local machine. Think of it as a Docker-like toolkit for AI models – it packages everything needed (model weights, configuration, system prompts, etc.) into a single Modelfile, making model management and deployment straightforward. With a simple CLI and built-in REST API, Ollama allows you to pull pre-built models or create custom ones and start generating text immediately. It supports macOS, Linux, and Windows, so whether you’re a developer prototyping on a laptop or an engineering team deploying to a server, Ollama fits the bill.
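To make the "built-in REST API" concrete, here is a minimal sketch that talks to a local Ollama server using only the Python standard library. It assumes the server listens on Ollama's default address (http://localhost:11434) and that a model named llama3.2 has already been pulled; the helper names build_generate_request and generate_text are ours, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed: Ollama's default local address


def build_generate_request(model, prompt):
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_text(model, prompt, timeout=120.0):
    """Send the request to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs `ollama serve` running and the model pulled):
# print(generate_text("llama3.2", "Say hello in one sentence."))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a server.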

Key characteristics of Ollama include:

In short, Ollama lets you run and manage LLMs locally with ease, offering a simple developer experience akin to using a cloud API – but everything stays on hardware you control.

What’s New in v0.8.0 and v0.9.0

Ollama’s recent versions 0.8.0 and 0.9.0 bring important enhancements geared towards better interactivity and debuggability – critical for both rapid prototyping and production use.

In summary, v0.8.0 and v0.9.0 enhance Ollama’s real-time capabilities (streaming tools, chain-of-thought) and continue to refine its robustness (logging, more models). These features target the needs of developers iterating quickly and enterprises pushing into production – faster feedback loops, better transparency, and more model choices.

Rapid Prototyping with Ollama

One of Ollama’s strengths is how quickly you can go from zero to a working LLM-powered prototype. If you’re a developer or an AI VP overseeing a team, you’ll appreciate that setup and iteration cycles are short. Here’s why Ollama shines for prototyping:

$ ollama pull llama2:7b  
$ ollama run llama2:7b  
>>> Hello, world!
  • This downloads the model (if not already on disk) and runs an interactive chat. No lengthy environment configs or cloud credentials – it just works. Need a different model? Pull it, run it. This lets you A/B test models (say, compare Llama 3 vs Mistral for quality) easily in the early stages.
  • Interactive Tuning: Because it’s local, you can play with prompt formats, system messages, and parameters on the fly. Modify your Modelfile to adjust the system prompt or temperature, recreate the model, and test again within seconds. For example, to prototype a fun chatbot persona, you could do:
    # Modelfile
    FROM llama3.2
    SYSTEM "You are an expert travel guide with a sense of humor."
    PARAMETER temperature 0.7
  • Then create a new model with ollama create travelbot -f Modelfile and start chatting: ollama run travelbot. This quick cycle (edit config -> run) is great for prompt engineering and trying out domain-specific behaviors without waiting on an external service deployment.
  • Local Speed & Iteration: While massive cloud models can be slow or have rate limits, a decent local model running on your hardware gives immediate feedback. Especially for smaller prototypes or internal tools, the latency is low (no network calls) and you’re only limited by your hardware. You can also work offline, which can be a lifesaver when prototyping on the go or when cloud access is restricted.
  • Safe Experimentation: During prototyping, you might use sensitive data or try unconventional prompts. With Ollama, none of that data leaves your environment. You can experiment freely without worrying about NDA content hitting an external API or violating data protection rules. For European teams, this means you can ideate with real (or realistic) data early on, staying compliant with GDPR because the personal data isn’t being sent to third-party servers. It’s easier to convince stakeholders to prototype AI features when you can assure them “all data stays on our machines.” 🔒
  • Python Integration for Notebooks: If you like working in Jupyter or similar, the Python SDK (ollama package) allows you to call the local models directly in your notebooks or scripts. This means you can incorporate LLM calls into data science experiments or small apps seamlessly. For example, using the Python API:
    from ollama import chat
    
    messages = [
        {"role": "user", "content": "Explain the benefits of local LLM deployment."}
    ]
    response = chat(model="llama3.3", messages=messages)
    print(response.message.content)
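When you iterate on personas or temperatures as described in the Modelfile step above, it can help to generate the Modelfile text from code so each variant is reproducible. The following is a minimal sketch; render_modelfile is a hypothetical helper of ours, and it only covers the FROM, SYSTEM, and PARAMETER instructions used in this article.

```python
def render_modelfile(base, system, parameters):
    """Render Modelfile text from a base model, a system prompt, and parameters."""
    if '"' in system:
        # Keep the sketch simple: we do not attempt any quoting or escaping.
        raise ValueError("system prompt must not contain double quotes")
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "llama3.2",
    "You are an expert travel guide with a sense of humor.",
    {"temperature": 0.7},
)
print(modelfile_text)
```

Write the result to a file named Modelfile and run ollama create travelbot -f Modelfile as before; changing one argument and re-running gives you the next variant to compare.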

    In short, Ollama accelerates the prototyping phase by combining ease of use with the freedom of local operation. You get the agility of a startup hacker with the compliance of an enterprise – a rare combo that means more ideas can be tested and refined quickly, without red tape.

    Production Deployment with Ollama

    Beyond just tinkering, Ollama is built with production-readiness in mind. Developers and AI leaders can transition from prototype to production smoothly, thanks to features and practices that make Ollama suitable for real-world applications. Here’s how to leverage it in dev, staging, and prod:

    Tip: Treat your Ollama instance as another microservice in your architecture. Secure it (bind to localhost or internal network, use firewall rules), monitor it (logs, healthchecks), and scale it just like you would a database or web service. Many community integrations (like AnythingLLM or Ollama RAG Chatbot) already package Ollama in user-friendly ways for production deployments (with UIs, orchestration, etc.), so exploring those can jumpstart your deployment.
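As one way to implement the healthcheck mentioned in the tip, the sketch below asks the server which models it has and reports any that are missing. It assumes the default address and the /api/tags endpoint, which lists locally available models; check_ollama and list_model_names are illustrative names of our own.

```python
import json
import urllib.request


def list_model_names(tags_payload):
    """Extract model names (for example "llama3.2:latest") from a /api/tags response."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def check_ollama(base_url="http://localhost:11434", required=(), timeout=2.0):
    """Report whether the server answers and which required models are missing."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            names = list_model_names(json.loads(resp.read()))
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return {"healthy": False, "missing": list(required)}
    missing = [name for name in required if name not in names]
    return {"healthy": not missing, "missing": missing}


# Example call from a monitoring job (needs a running server):
# print(check_ollama(required=["llama3.2:latest"]))
```

A monitoring job or container healthcheck could call check_ollama on a schedule and alert when healthy is False.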

    Example: Building a Private Chatbot (Python)

    Let’s build a simple private chatbot using Ollama, step by step. Our goal is a conversational assistant that runs locally, ensuring that conversation logs and user inputs never leave our server. We’ll use Python to interact with Ollama’s API.

    1. Set up Ollama and a model: First, make sure the Ollama server is running. From a shell, you can start it with:

    ollama serve  # start the Ollama daemon (if not already running)

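    Next, choose a model. For a chatbot, a good choice is an instruction-tuned model (e.g., Vicuna, Llama-2-chat, etc.). Let’s assume we want Llama 3.3. It is a 70B-parameter model and needs substantial memory, so on a laptop you may prefer to substitute a smaller model such as llama3.2 throughout the examples. Pull it if not present: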

    ollama pull llama3.3

    This downloads the model weights to your machine (once). Now our local “AI brain” is ready.

    2. Python API usage: Install the ollama Python package if you haven’t: pip install ollama. Then in Python:

    from ollama import chat
    
    # Define the conversation messages
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi, I have a question about data privacy with AI."}
    ]
    
    # Send the conversation to the local Ollama server and read the reply
    response = chat(model="llama3.3", messages=conversation)
    print(response.message.content)

    This sends the conversation to the local model and prints the assistant’s reply. The message format (a list of role/content dictionaries) follows the same convention as OpenAI’s chat API, so it will feel familiar. Unlike OpenAI’s response, though, Ollama’s chat response carries a single message rather than a list of choices, so we read response.message.content directly. Running this would yield an answer, for example: “Hello! I’m glad you’re asking about data privacy in AI. What would you like to know?” (The actual output depends on the model.) All of this happened locally, with no external API calls.

    3. Streaming responses (optional): In a chatbot, streaming the answer token-by-token gives a better UX. We can use Ollama’s streaming by setting stream=True and iterating:

    from ollama import chat  # module-level helper from the ollama package
    
    messages = [{"role": "user", "content": "Explain GDPR in simple terms."}]
    stream = chat(model="llama3.3", messages=messages, stream=True)
    for chunk in stream:
        print(chunk.message.content, end="", flush=True)

    This will print the assistant’s answer as it’s being generated (word by word or token by token). You could use this in a web app to update the UI progressively. If the model decides to use a tool or function call (and you allowed tools), those would appear in the chunk.message.tool_calls field, which you can intercept. For instance, if it called get_current_weather, you’d see that and could fetch the weather result, then continue streaming.
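To show how tool calls might be intercepted while streaming, here is a minimal sketch that walks over chat chunks, accumulates the text, and collects any tool calls. It assumes each chunk is a plain dictionary shaped like the JSON lines of Ollama's streaming chat API (a message with content and an optional tool_calls list); consume_stream is our own illustrative helper, and the fake chunks stand in for a real stream.

```python
def consume_stream(chunks):
    """Accumulate streamed text and collect tool calls from chat chunks."""
    text_parts, tool_calls = [], []
    for chunk in chunks:
        message = chunk.get("message", {})
        text_parts.append(message.get("content") or "")
        for call in message.get("tool_calls") or []:
            function = call["function"]
            tool_calls.append((function["name"], function.get("arguments", {})))
    return "".join(text_parts), tool_calls


# Hand-written chunks for illustration only; a real stream comes from the server.
fake_chunks = [
    {"message": {"role": "assistant", "content": "Checking "}},
    {"message": {"role": "assistant", "content": "", "tool_calls": [
        {"function": {"name": "get_current_weather", "arguments": {"city": "Paris"}}}
    ]}},
    {"message": {"role": "assistant", "content": "now."}, "done": True},
]
text, calls = consume_stream(fake_chunks)
print(text, calls)
```

In an application you would execute each collected call, append its result to the conversation as a tool message, and ask the model to continue.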

    4. Maintaining conversation state: A private chatbot usually needs memory of past messages. With Ollama, you manage that at the application level: keep a list of messages (as we did with a system and a user message). After each response, append it as {"role": "assistant", "content": <model_reply>} to the list for the next turn. This way, the context grows and the model “remembers” the conversation (within the context length). Many models available through Ollama support fairly long contexts (4k, 16k tokens or more depending on the model variant), which is sufficient for a decent dialog before you need to summarize or trim. Keep in mind that Ollama applies its own default context window, which can be smaller than the model’s maximum; raise it with the num_ctx option if long histories get cut off. The logging improvements in v0.8.0 can help with sizing here: the memory usage info gives a rough indication of how much context your hardware can accommodate.
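A minimal sketch of this application-level memory follows. The Conversation class is hypothetical, and it uses a character budget as a crude stand-in for a token budget, which is an assumption made to keep the example dependency-free.

```python
class Conversation:
    """Keeps the full message history and returns a trimmed window for each call."""

    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

    def window(self, max_chars):
        """Return the system message plus the newest turns that fit in max_chars."""
        system, turns = self.messages[0], self.messages[1:]
        kept, used = [], len(system["content"])
        for message in reversed(turns):
            used += len(message["content"])
            if used > max_chars and kept:
                break  # budget exceeded; always keep at least the newest turn
            kept.append(message)
        return [system] + kept[::-1]


convo = Conversation("sys")
convo.add("user", "aaaa")
convo.add("assistant", "bbbb")
convo.add("user", "cc")
print([m["content"] for m in convo.window(max_chars=10)])
```

Pass the result of window() as the messages argument of each chat call, and append the model's reply with add("assistant", ...) afterwards.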

    5. Privacy considerations: Since this chatbot is local, users’ questions and the model’s answers never traverse an external network. For additional safety, ensure your Ollama server is not exposed publicly (bind it to localhost or a secure internal network). Also note that by default Ollama doesn’t phone home or share usage data, so you’re running a truly isolated service. Logging is local; if logs contain chat content (for debugging), handle them per your data policies (you might disable verbose logs in production if chats are sensitive). But crucially, no GDPR-sensitive data leaves your servers, fulfilling data residency requirements by design.

    Real-world tip: Some European companies pair an Ollama-based chatbot with an audit trail: since everything’s in-house, they log each query and response securely for compliance auditing. This would be impossible or risky with a third-party API, but with local AI it’s feasible to log interactions and prove no data was shared externally.
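One possible shape for such an audit trail is an append-only JSON Lines file, sketched below. The field names and the audit_record and append_audit helpers are our own assumptions; a real deployment would add access controls and a retention policy that match your data-protection rules.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(user_id, prompt, reply, model):
    """Build one audit entry with a UTC timestamp and a content hash."""
    digest = hashlib.sha256((prompt + "\n" + reply).encode("utf-8")).hexdigest()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model,
        "prompt": prompt,
        "reply": reply,
        "sha256": digest,
    }


def append_audit(path, record):
    """Append the record as a single JSON line to a local file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because each line is independent JSON, the log can be tailed, rotated, or loaded into an analysis tool without custom parsing.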

    By following these steps, we built a basic private chatbot. We can extend it with more features – for example, add tools (functions) for the model to call if needed (like a database lookup function), or integrate with a UI. Many community projects (e.g., ChatOllama or LibreChat frontends) can provide a chat interface that connects to your Ollama backend. The result is a fully self-contained chatbot: users get the convenience of an AI assistant, while you maintain full control over data and costs.

    Example: Retrieval-Augmented Generation (RAG) Pipeline

    Retrieval-Augmented Generation is a popular technique to give LLMs access to a knowledge base (documents, FAQs, etc.) while keeping responses grounded. Let’s sketch how you can build a RAG pipeline with Ollama entirely locally:

    Scenario: Imagine you have a trove of internal documents (say company policies or product manuals) that you want an assistant to answer questions from. These documents must remain on-premise for confidentiality. We’ll use an embedding-based retrieval to find relevant text and then an Ollama-served model to generate an answer using that text.

    1. Index your documents: Use a text embedding model to vectorize your docs and store them in a local vector database (e.g., FAISS or similar). For example, using Python’s sentence-transformers:

    from sentence_transformers import SentenceTransformer
    import faiss
    import numpy as np
    
    # Prepare documents
    docs = [
        "Doc1: Our privacy policy states that user data is stored in EU data centers...",
        "Doc2: GDPR stands for General Data Protection Regulation, introduced in 2018...",
        # ... more docs
    ]
    # Compute embeddings (e.g., using a local model if available)
    embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    doc_embeddings = embedder.encode(docs)  # shape: (num_docs, dim)
    # Build a FAISS inner-product index (matches cosine similarity only if the embeddings are unit-normalized)
    index = faiss.IndexFlatIP(doc_embeddings.shape[1])
    index.add(np.array(doc_embeddings, dtype='float32'))
    

    (In a real setup, you might use a larger, domain-specific embedding model. Ollama can also serve embedding models, such as nomic-embed-text, through its embeddings endpoint, which keeps the whole stack on one runtime. But here any local embedding works.)
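If you would rather keep embedding generation on Ollama too, a request might look like the sketch below. It assumes an embedding model (here nomic-embed-text) has been pulled, that the server runs on the default address, and that the /api/embed endpoint accepts an input list and returns an embeddings list; the helper names are ours.

```python
import json
import urllib.request


def build_embed_request(texts, model="nomic-embed-text",
                        base_url="http://localhost:11434"):
    """Build a POST request asking Ollama to embed a batch of texts."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_body):
    """Extract the list of vectors from the JSON response body."""
    return json.loads(response_body)["embeddings"]


def embed(texts, timeout=60.0, **kwargs):
    """Embed texts with a running Ollama server (one vector per input text)."""
    request = build_embed_request(texts, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_embeddings(resp.read())
```

The returned vectors can be converted to a float32 NumPy array and added to the FAISS index in place of the sentence-transformers output.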

    2. Retrieve relevant context: When a query comes in, embed the query and find similar docs:

    query = "Where is user data stored according to our policy?"
    q_emb = embedder.encode([query])
    D, I = index.search(np.array(q_emb, dtype='float32'), k=2)  # top-2 docs
    retrieved_texts = [docs[i] for i in I[0]]
    context = "\n".join(retrieved_texts)
    

    Now context contains the text snippets most likely to contain the answer (e.g., excerpts about data storage location and GDPR).
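To make clear what the index search is doing, here is a dependency-free sketch of the same idea: normalize the vectors, score each document by inner product with the query, and keep the top k. The toy two-dimensional vectors are made up for illustration; FAISS performs the equivalent computation far more efficiently.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k highest cosine similarities."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(d))) for d in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # illustrative vectors only
hits = top_k([1.0, 0.1], toy_docs, k=2)
print([index for index, _ in hits])
```

The indices returned play the same role as I[0] in the FAISS snippet: they select which documents go into the context.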

    3. Generate answer with Ollama: Construct a prompt that feeds the context to the model along with the question. For instance:

    prompt = f""" 
    Use the following context to answer the question.
    
    Context:
    \"\"\"
    {context}
    \"\"\"
    
    Question: {query}
    Answer:"""

    Then call the local model via Ollama:

    from ollama import generate

    # generate() takes a single prompt rather than a list of chat messages
    result = generate(model="llama3.3", prompt=prompt)
    print(result.response)  # model's answer based on the provided context

    Here we used the ollama package’s generate function, which takes a single prompt instead of a list of chat messages, to ask the model. The model (which has no internet and only its training data) will rely on the provided context to answer. This way, even if the base model didn’t know specifics of your documents, it can draw from them.

    4. Iterate and refine: You might need to refine the prompt format or how much context to include (be mindful of token limits). Ollama supports models with long contexts (e.g., 16k or even 100k for some specialized models), but feeding too much irrelevant text can confuse the model. Empirically, it helps to include a clear instruction like “use the context above only” to prevent the model from hallucinating beyond given info.
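One way to keep the context within bounds is to assemble the prompt under an explicit size budget, as in this sketch. build_rag_prompt is a hypothetical helper; it assumes the snippets arrive ordered best-first and, for simplicity, measures the budget in characters rather than tokens.

```python
def build_rag_prompt(question, snippets, max_context_chars=4000):
    """Assemble a grounded prompt from ranked snippets within a size budget."""
    chosen, used = [], 0
    for snippet in snippets:  # assumed to be ordered from most to least relevant
        if used + len(snippet) > max_context_chars:
            break
        chosen.append(snippet)
        used += len(snippet)
    context = "\n\n".join(chosen)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt_text = build_rag_prompt("Where?", ["aaaa", "bbbb", "cccc"], max_context_chars=8)
print(prompt_text)
```

The explicit "say you do not know" instruction gives the model an alternative to guessing when retrieval comes back empty or off-topic.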

    5. Privacy win: Note that the entire pipeline – embedding generation, vector search, and LLM inference – happens locally. No document data or queries are sent to an external service, preserving confidentiality. This is a massive advantage for GDPR and trade-secret scenarios. Even if you have thousands of documents, you can handle them on infrastructure under your control. Plus, you avoid hefty costs: doing the same with a paid API would incur significant token fees for indexing and querying, whereas here after the initial setup the per-query cost is negligible (just compute).

    6. Productionizing RAG: You can wrap this logic into a simple web service for internal use. The flow: user question -> REST endpoint -> Python code does retrieval -> calls Ollama -> returns answer. Tools like LangChain can simplify building this pipeline. In fact, LangChain provides an OllamaLLM integration that makes the call to the local model feel just like any other LLM in its framework. You could combine OllamaLLM with a FAISS vectorstore and RetrievalQA chain to implement the above in a few lines. For example, LangChain’s documentation shows using OllamaLLM(model="llama3.1") as the LLM for a QA chain and notes that it’s excellent for integrating local models into such pipelines.
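The request flow described above can be sketched without any framework. In the example below, answer_question wires a retriever and a generator together, and make_handler exposes it over HTTP using the standard library; both names are ours, and retrieve and generate are placeholders for your vector search and your Ollama call.

```python
import json
from http.server import BaseHTTPRequestHandler


def answer_question(question, retrieve, generate):
    """Retrieve context, ask the model, and return the answer with its sources."""
    snippets = list(retrieve(question))
    prompt = (
        "Use only this context to answer.\n\nContext:\n"
        + "\n".join(snippets)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return {"answer": generate(prompt).strip(), "sources": snippets}


def make_handler(retrieve, generate):
    """Create a request handler class that answers POSTed {"question": ...} bodies."""

    class AskHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            question = json.loads(self.rfile.read(length))["question"]
            body = json.dumps(answer_question(question, retrieve, generate)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return AskHandler
```

To serve it on an internal interface, pass the handler to http.server.HTTPServer, for example HTTPServer(("127.0.0.1", 8000), make_handler(my_retrieve, my_generate)).serve_forever(). Returning the sources alongside the answer is what makes the audit trail mentioned later possible.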

    Real-world tip: Several community projects use Ollama for RAG-based chatbots. One example is Ollama RAG Chatbot, which allows chatting with multiple PDFs locally using Ollama as the backend. Such solutions typically handle document parsing, vector search, and then defer to Ollama for answer generation. Studying these can provide architecture inspiration (e.g., how to chunk documents, how to update context window as conversation continues). Importantly, RAG adds interpretability – you can log which documents were retrieved to answer a question, providing an audit trail of why the model said what it did (useful for compliance and debugging).
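On the chunking question, a common starting point is fixed-size windows with a small overlap so that sentences straddling a boundary appear in both neighbours. The sketch below is one simple way to do this; chunk_text and its default sizes are illustrative assumptions rather than recommendations from any of the projects mentioned.

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]


print(chunk_text("abcdefghij", size=4, overlap=1))
```

Each chunk is then embedded and indexed separately, and the chunk's position can be stored next to it so that retrieved passages can be traced back to their source document.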

    In summary, building a RAG pipeline with Ollama involves combining local vector search with local LLM inference. It’s a potent pattern for enterprise AI: you get up-to-date, factual answers from your private data, boosted by a language model – all under your control and within GDPR bounds. 💡

    Fine-Tuning Workflows with Ollama

    Pre-trained models are great, but sometimes you need to fine-tune an LLM on your proprietary data or for a specific style. Ollama can serve fine-tuned models efficiently, and its Modelfile system even helps in applying fine-tuned LoRA adapters with ease. Let’s break down how you can incorporate fine-tuning into your Ollama workflow:

    FROM llama3.2
    ADAPTER ./my-tune.gguf

    This tells Ollama: take the base llama3.2 model and apply this LoRA adapter on load. You can add other lines too (SYSTEM prompts, parameters) as needed, but the critical part is the ADAPTER line pointing to your fine-tuned weights. With this Modelfile, run:

    ollama create my-custom-model -f Modelfile
  • Ollama will load the base model, apply the LoRA, and save this new composed model internally (so you can refer to it as my-custom-model). This feature allows quick deployment of fine-tunes without merging weights manually. Note that Ollama expects the adapter in a format it supports (a GGUF adapter file or a Safetensors adapter), and the adapter should have been trained against the same base model named in FROM; otherwise the output can be erratic. Once created, use ollama run my-custom-model, or pass model="my-custom-model" through the API, to get responses from the fine-tuned model, just as you would with any stock model.
  • Example Use Case: Suppose you fine-tuned a model on your company’s support chat transcripts so it learns your product specifics and tone. After creating it in Ollama, you could run a private Q&A chatbot that is far more tailored than a generic model. And since fine-tuning might contain proprietary data, serving it locally ensures that fine-tuned knowledge stays in-house. You’re not uploading your domain data into someone else’s platform – the model is yours to keep.
  • Integration with Pipelines: In Python, using a fine-tuned model is no different than a base model. If you created my-custom-model as above, you just call:

    from ollama import chat

    response = chat(model="my-custom-model", messages=[...])

    and get results. This means you can slot fine-tuned models into your RAG pipeline or chatbot with one config change. For instance, in LangChain you’d do OllamaLLM(model="my-custom-model") to use it.

    Real-world tip: Keep your base models and adapters organized and versioned. Ollama’s ollama list will show all models you have. A naming convention like model-company-v1, model-company-v2 can help track iterations. Because Ollama only downloads the diff when updating a model, maintaining updated versions is bandwidth-efficient. Also, remember you can distribute your fine-tuned models within your org easily – just share the model file or the Modelfile recipe. Since it’s all local, even sharing stays internal.
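A naming convention is easier to keep when a script picks the next name for you. The sketch below assumes names of the form prefix-vN, optionally followed by a tag such as :latest as shown by ollama list; next_version and the example names are hypothetical.

```python
import re


def next_version(existing_names, prefix):
    """Return the next `<prefix>-vN` name given the model names already present."""
    pattern = re.compile(rf"^{re.escape(prefix)}-v(\d+)(?::.*)?$")
    versions = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    return f"{prefix}-v{max(versions, default=0) + 1}"


names = ["support-acme-v1:latest", "support-acme-v2:latest", "llama3.2:latest"]
print(next_version(names, "support-acme"))
```

The returned name can be passed straight to ollama create so that each fine-tune iteration gets its own entry instead of overwriting the previous one.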

    In summary, the training itself happens outside Ollama, but integrating the result is easy. The Modelfile’s ADAPTER instruction acts as a plug-and-play mechanism for custom model weights. This empowers you to customize models for your needs and deploy them privately, combining the benefits of open-source model flexibility with enterprise-grade confidentiality.

    Comparing Ollama to Other LLM Solutions

    Ollama isn’t the only player in local or private LLM deployment. Let’s compare it with a few notable alternatives, both local frameworks and cloud services, focusing on privacy, cost, and flexibility:

    Ollama vs LM Studio

    LM Studio is another popular way to run local LLMs. It provides an all-in-one desktop GUI, whereas Ollama is primarily CLI/API-based. Key differences:

    Verdict: If your priority is ease-of-use via GUI and quick local testing, LM Studio is a great choice. But for privacy-focused deployment, automation, and open-source flexibility, Ollama is the winner in a production context. In practice, some users even use both: LM Studio to find and test models, then switch to Ollama when integrating into an application or backend service.

    Ollama vs LMQL

    LMQL (Language Model Query Language) is a different beast – it’s a specialized programming language for crafting constrained LLM prompts and decoding strategies. The comparison is a bit apples-to-oranges:

    Verdict: Ollama and LMQL solve different problems. They can complement: for example, use LMQL to prototype a constrained prompt flow, and use Ollama as the model runtime for it, keeping everything local for privacy. Organizations focused on GDPR would still lean on Ollama for actual model execution, possibly with LMQL on top for logic. If you don’t need LMQL’s specific features, you’ll find Ollama alone sufficient and simpler.

    Ollama vs OpenDevin

    OpenDevin (since renamed OpenHands) is an open-source platform aimed at creating an “AI software engineer” – essentially an autonomous coding assistant that can build entire apps. It’s inspired by a closed-source tool named Devin. Comparing with Ollama:

    Verdict: Ollama vs OpenDevin is not either-or. If you need an agentic coding assistant, OpenDevin is a project to consider, and you’d probably use Ollama to supply the LLM it needs (ensuring privacy). If you only need a conversational or Q&A assistant (not a full agent solving tasks), you might not need OpenDevin’s complexity at all – a simpler Ollama-based solution could suffice. So, Ollama remains the core building block; specialized frameworks like OpenDevin sit at a higher layer.

    Ollama vs Hosted APIs (OpenAI, Anthropic Claude, Mistral AI)

    This is where privacy and cost considerations are stark:

    Verdict: Choosing between Ollama (local) and hosted APIs often boils down to priorities. If privacy, control, and long-term cost savings are critical – which is frequently the case in GDPR-sensitive and enterprise environments – Ollama or similar local solutions are superior. If the absolute bleeding-edge accuracy is needed and you’re willing to navigate data compliance with a third party, you might consider an API for those cases, but eyes wide open to the risks. Many companies are surprised how capable modern open models are when fine-tuned to their domain; the gap to the proprietary giants has narrowed. And the peace of mind from owning your AI stack is invaluable. As one blog succinctly noted, self-hosting models gives full control over infrastructure, costs, and data security, with no dependence on third-party AI services. This control is increasingly not just a technical preference, but a governance requirement.
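The long-term cost argument can be checked with simple arithmetic once you plug in your own figures. The sketch below computes how many months it takes for purchased hardware to pay for itself compared with per-token API pricing; break_even_months is our own helper and every number in the example call is a placeholder, not a quoted price.

```python
def break_even_months(hardware_cost, monthly_running_cost,
                      monthly_tokens, api_price_per_million_tokens):
    """Months until local hardware is cheaper than per-token billing (None if never)."""
    monthly_api_cost = monthly_tokens / 1_000_000 * api_price_per_million_tokens
    monthly_saving = monthly_api_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None  # at this volume the API stays cheaper
    return hardware_cost / monthly_saving


# Placeholder inputs: 6000 hardware, 100/month running cost,
# 200M tokens/month, 5 per million tokens.
months = break_even_months(6000, 100, 200_000_000, 5)
print(months)
```

The calculation leaves out staff time and model-quality differences, so treat it as a first filter rather than a full business case.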

    Conclusion

    Ollama, especially in its latest versions (v0.8.0 and v0.9.0), emerges as a compelling solution for teams that need private, flexible, and production-ready AI. It marries the convenience of a unified framework (easy model management, one-command deployments, simple APIs) with the assurances of local deployment (data stays in-house, compliance is simplified, and you’re not locked into any vendor). We’ve seen how its new features like streaming tool support and thinking mode enhance the development experience – making interactions more real-time and debugging more transparent. We’ve walked through examples from building a private chatbot, to a RAG pipeline, to handling fine-tuned models – illustrating that Ollama isn’t just a toy, but a tool ready for serious applications.

    For AI leaders, the message is clear: you can accelerate innovation (through rapid prototyping) and uphold the highest standards of data privacy by leveraging local LLM frameworks like Ollama. The usual trade-off between agility and compliance fades away – you get both. In Europe, where GDPR enforcement is strict, this approach can be the difference between having AI features or not (since many cloud-based AI ideas get nixed by compliance early). With Ollama, you have a path forward: deploy AI services that are GDPR-friendly by design and scalable as your usage grows.

    In comparing Ollama to other solutions, we’ve noted that each has its place, but Ollama’s blend of developer-centric design and open-source ethos gives it an edge for those building AI into products and platforms. Whether you’re running it on a developer’s laptop for a quick prototype or on a secure server cluster serving thousands of requests, the experience is consistent and reliable.

    Final tips for success: Keep an eye on Ollama’s release log (as we did) – the pace of improvement is rapid. New model support and features are added frequently, driven by an active community. Experiment with different models to find the best fit (quality vs speed) for your use case. Use Modelfiles to codify your customizations so they’re reproducible. And don’t hesitate to tap into the community integrations – many clever tools (from UIs to orchestration plugins) can complement Ollama and save you time.

    By adopting a local LLM solution like Ollama, you’re future-proofing your AI strategy: you gain full control, flexibility to adapt, and the confidence that user data remains safe. It’s a paradigm shift akin to the early days of hosting your own servers vs relying on external services – for those who need it, the control is empowering.

    So go ahead – pull that model, fire up Ollama, and build something amazing, all while keeping your data right where it belongs. Happy coding, and happy prompting!


    Tega Adeyemi, May 30, 2025