Part 1: Ollama Overview and Getting Started

Ollama is an open-source tool for running large language models (LLMs) locally on your own hardware. In simple terms, it provides a convenient way to download and run advanced AI models (like LLaMA, Mistral, etc.) directly on your machine instead of relying on cloud APIs. This approach is appealing for developers, researchers, and businesses that care about data control and privacy, since no data needs to leave the local environment. By running models offline, you maintain full ownership of your data and eliminate potential security risks of sending sensitive information to third-party servers. Another benefit of local execution is reduced latency – responses can be faster and more reliable without network delays. Overall, Ollama’s purpose is to make it easy to experiment with and use advanced language models on personal or on-premise systems, giving you more control over both the models and your data.
Beyond privacy, Ollama also helps avoid ongoing costs associated with cloud AI services. You don’t incur per-query fees because the inference runs on your own hardware. This can lead to significant savings for heavy usage. Ollama builds upon the efficient llama.cpp backend, meaning it’s optimized to run LLMs with modest resource requirements by leveraging techniques like quantization for smaller memory footprints. In summary, Ollama’s key purpose is to simplify local LLM deployment, enabling a wide range of users to run advanced AI models on laptops, desktops, or servers without needing extensive AI infrastructure.
Benefits of Using Ollama
Using Ollama comes with several important benefits:
• Data Privacy and Security: All computations happen locally, so your prompts and the model’s outputs are not sent to external servers. This is crucial for sensitive applications (medical, legal, corporate data, etc.) where sharing data with a cloud service is undesirable. You maintain full data ownership and compliance with any data governance requirements by keeping everything in-house.
• Offline Capability: Ollama does not require an internet connection once models are downloaded. This means you can use advanced AI anywhere, even in air-gapped or offline environments, which is not possible with cloud-based AI. It also adds robustness—your AI functionality isn’t subject to internet outages or API downtime.
• Low Latency & Performance: By running on local hardware, inference latency can be lower since there’s no network overhead. Especially for iterative interactions or real-time applications, removing the round-trip to a server can make responses feel instantaneous. Performance can further be improved if you have a strong GPU, as Ollama supports acceleration on NVIDIA and (in preview) AMD GPUs to speed up model processing.
• Cost Savings: Once you’ve set up your hardware, using Ollama is cost-effective for heavy usage. You’re not paying per query or token. If you already have capable hardware (e.g. a machine with sufficient RAM or a good GPU), running open-source models locally can be much cheaper in the long run than paying for API calls to a proprietary model. Many research groups find it cost-effective to use Ollama on their own servers rather than constantly hitting paid APIs.
• Control and Customization: You have fine-grained control over the models and their versions. Ollama lets you easily download, update, or remove models via simple commands. You can even maintain multiple versions of a model and switch or revert as needed, which is useful for testing and comparisons in research or production scenarios. This level of control is hard to achieve with remote services. Moreover, because the tool is open-source, advanced users can inspect or modify how the model is run, integrate new models, or adjust parameters for their specific needs.
• Flexibility and Integration: Ollama supports use through a command-line interface (CLI) and also works with third-party GUI tools. This means both terminal enthusiasts and those who prefer graphical interfaces can benefit. It’s also cross-platform – it runs on macOS, Linux, and Windows (Windows support is currently in preview). This multi-platform support ensures you can integrate Ollama into various environments, from a developer’s laptop to a headless Linux server or even a Windows workstation.
In essence, Ollama provides a balance of convenience and control: convenience in that it simplifies the complex task of running LLMs locally, and control in that you decide how and where your models operate.
Installation and Setup Guide
Getting started with Ollama is straightforward. The installation process varies slightly by operating system, but the project provides easy installers for each:
• macOS: You can download a macOS installer from the official website or use Homebrew. For example, on macOS you may run brew install ollama to install it via Homebrew. The downloaded app/installer will set up Ollama on your system (requires macOS 11 Big Sur or later).
• Linux: For Linux systems, Ollama offers a one-line shell installation. Open your terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
This script will automatically download the latest Ollama build and set it up on your machine. Alternatively, you can follow manual install instructions (available on the official GitHub) if you prefer to see the steps or customize the installation. After installation, the ollama command will be available in your terminal.
• Windows: Windows users can download the installer (an .exe or .msi) from the official site. Running the installer will install Ollama (Windows 10 or later is required). Note that Windows support is still in preview, which means some features may be experimental. Make sure Windows is up to date, and enable virtualization or WSL2 only if the documentation recommends it for your setup.
System Requirements: Running LLMs can be resource-intensive. As a general guideline, you should have at least 8 GB of RAM for smaller models (7B parameters), around 16 GB for mid-sized models (13B), and 32 GB or more for larger models (30B+), especially if running on CPU. Having a discrete GPU with ample VRAM can greatly enhance performance, but Ollama can work on CPU-only systems for smaller models or quantized versions. Make sure to check the official GitHub or documentation for any updated hardware recommendations (for example, specific instructions for enabling GPU support on your platform).
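Before pulling large models, it can help to check how much RAM and GPU memory your machine actually has. The commands below are a minimal sketch for Linux (free is standard, and nvidia-smi requires an NVIDIA driver); macOS and Windows have their own equivalents such as About This Mac or Task Manager:
# Show total and available system RAM
free -h

# Show GPU model and VRAM (NVIDIA GPUs only)
nvidia-smi --query-gpu=name,memory.total --format=csv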
After installing, it’s a good idea to verify everything is set up correctly. Open a terminal (or command prompt on Windows) and run:
ollama --version
This should output the version of Ollama installed, confirming that the command-line tool is accessible. You can also run ollama with no arguments to see the help text or usage info, which indicates the installation is working.
Starting the Ollama Service: In many cases, once installed, Ollama runs as a background service (daemon) or can be started as one. On macOS, if installed via Homebrew, you might use brew services start ollama to run it as a background service. On other platforms, you can manually start the service by running:
ollama serve
The ollama serve command launches the Ollama server process that powers the API and model loading in the background. You might see a message that the server has started. It’s recommended to run this when you plan to interact with Ollama programmatically or via other applications. For basic CLI usage, you can also directly invoke commands like ollama run (which will automatically start the service if needed).
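To confirm the server is actually up, you can also send a request to its local REST API, which by default listens on http://localhost:11434. The snippet below is a minimal sketch using curl; it assumes you have already pulled the llama2 model (covered in the next section), the prompt text is arbitrary, and "stream": false simply asks for the whole answer in a single JSON response:
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'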
Once installation is complete and the service is running, you’re ready to download your first model and generate text!
First Steps: Downloading and Running a Model
With Ollama installed, let’s walk through the first steps of using it. Typically, your journey will be: pull a model, then run the model.
1. Pulling a Model:
Ollama maintains a library of supported models that you can download on demand. To see available models, you can browse the official model registry on their website or simply try pulling a known model. For instance, to download the Llama 2 model (say a 7B variant) you would run in your terminal:
ollama pull llama2
This command fetches the model weights and necessary files onto your system. The name after pull corresponds to a model identifier in Ollama’s registry (for example, “llama2” might map to a default Llama 2 model, or you could specify llama2:7b if needed). The download can be sizeable (several GBs), so it may take some time on first pull. You only need to do this once per model; subsequent uses won’t re-download the model.
Pro tip: If you’re unsure which model to start with, the Ollama documentation and website list many models with descriptions of their capabilities. For general purposes, you might try a smaller model like Mistral (a capable general-purpose 7B model that also handles coding tasks well) or Llama 2 7B for a ChatGPT-like experience.
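If you want a specific size or quantization rather than the default, append a tag to the model name when pulling. The exact tag names vary by model and are listed on each model’s page in the Ollama library, so treat the 13b tag below as an example:
# Pull a specific size variant of Llama 2 (tags are listed in the model library)
ollama pull llama2:13b

# Show all models and tags currently on disk
ollama list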
2. Running a Model (Interactive Mode):
Once the model is downloaded, you can run it. The simplest way is interactive mode, which drops you into a chat-like interface in the terminal. Execute:
ollama run llama2
(Replace “llama2” with whatever model you pulled, e.g. ollama run mistral.)
When you run this command, Ollama will load the model into memory and give you a prompt (usually indicated by >>> in the terminal) where you can start typing questions or instructions. For example:
$ ollama run llama2
>>> Hello, how do you work?
After you enter a prompt and press Enter, the model will think for a moment and then print out a response. You can then continue the conversation by typing another question. This REPL (read-eval-print loop) style interaction lets you have a back-and-forth conversation with the AI, much like chatting with ChatGPT but running locally in your terminal. The model will remember the context of the conversation within this session, so you can ask follow-ups or clarifications.
Here’s an example of an interactive session (user input preceded by >>>, model response after):
>>> What is the capital of France?
The capital of France is Paris.
>>> Why is it famous?
Paris is famous for its rich history, architecture, art, fashion, and cuisine.
It is home to landmarks like the Eiffel Tower and Louvre Museum, making it a popular tourist destination.
>>> /bye
In this dialog:
• The user first asks a question, and the model answers “Paris.”
• The user asks a follow-up question, and the model gives a detailed answer.
• The session is ended by typing the special command /bye.
In Ollama’s interactive mode, commands that start with a slash / are special instructions (not sent to the model). For example, /bye will exit the session and return you to the normal shell prompt. Another useful command is /help or /?, which shows help on the available commands if your version supports it. Once you exit, the model is unloaded from memory after a short time of inactivity (by default), freeing up resources.
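As a rough sketch (the exact set of slash commands depends on your Ollama version), a few of the more useful ones look like this:
/?                               show the list of available commands
/set parameter temperature 0.3   lower the sampling temperature for this session
/show info                       print details about the currently loaded model
/bye                             exit back to the shell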
3. Running a Model (Single Prompt):
Alternatively, you can run a model with a one-time prompt and get the output immediately, without going into an interactive chat. This is useful for scripting and quick queries. Simply include the prompt in quotes after the model name. For example:
ollama run llama2 "Explain the basics of machine learning."
When you run this, Ollama will load the model, process the prompt, and print the answer to the console, then exit. This non-interactive usage is great for automation or if you want to redirect the output to a file. In fact, you can do things like:
ollama run llama2 "Write a haiku about autumn." > autumn_haiku.txt
This will save the model’s generated haiku into the file autumn_haiku.txt for later use. The CLI makes it easy to incorporate such commands in shell scripts or pipelines.
Example: Summarizing a text file could be as simple as:
ollama run llama2 "Summarize the following text:" < report.txt > summary.txt
Here we use shell redirection to feed in a long file (report.txt) as input after a prompt, and output the summary to summary.txt. These kinds of one-liners highlight how Ollama can be combined with standard shell tools to accomplish useful tasks.
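Taking this a step further, a short shell loop can apply the same prompt to a whole folder of documents. This is a minimal sketch; the reports/ and summaries/ directory names are placeholders for your own paths:
# Summarize every .txt file in reports/ into a matching file under summaries/
mkdir -p summaries
for f in reports/*.txt; do
  ollama run llama2 "Summarize the following text:" < "$f" > "summaries/$(basename "$f" .txt).txt"
done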
4. Verifying Model Operation:
Once you get a response from the model – either via interactive chat or a one-time run – you’ve successfully used Ollama. If the model responds sensibly to your queries, everything is working. If you encounter errors (like out-of-memory or model not found), you may need to check that your system meets requirements or that the model name is correct and fully downloaded. Use ollama list to see what models are installed locally, and ollama pull <model> again if something went wrong with the download.
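A quick troubleshooting pass might look like the sketch below; note that ollama ps, which lists models currently loaded in memory, is only available in recent releases:
# See which models are downloaded locally
ollama list

# See which models are currently loaded in memory (recent versions)
ollama ps

# Re-download a model if its files seem incomplete or corrupted
ollama pull llama2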
Step-by-Step Beginner Examples
To solidify understanding, let’s go through a simple beginner workflow with Ollama, step by step:
Example Scenario: You want to set up a local AI assistant that can answer questions and help with brainstorming, similar to a personal ChatGPT, but running on your own PC.
1. Install Ollama: Follow the installation steps for your OS (as described earlier). After installation, ensure the Ollama service is running. In many setups, simply launching an ollama run command will automatically start the necessary background service. If not, run ollama serve in a terminal to start the daemon that will handle requests.
2. Download a model: Since you need a general-purpose assistant, a good choice is a variant of the LLaMA model or another chat-tuned model. For example, run:
ollama pull llama2
This downloads the LLaMA 2 model (if the default is a chat-tuned version, it’ll work well for Q&A and discussion). Wait for the download to complete (watch the progress in the terminal). Tip: If you have limited RAM, you might opt for a smaller model like mistral (7B) or other 7-13B models rather than a 70B one.
3. Run an interactive session: Now launch the model:
ollama run llama2
You should see a >>> prompt. Start by greeting the model or asking a simple question:
>>> Hi there!
The model will respond with something like “Hello! How can I help you today?” (The exact output may vary). You can then ask it to perform tasks or answer questions. For instance:
>>> Can you tell me a fun fact about Python programming?
The model might respond with a fun fact. You can continue the dialogue as needed. This example shows how you effectively have a local chat assistant at your disposal. The quality of answers will depend on the model’s capabilities (LLaMA 2 7B might give decent answers; larger models or fine-tuned variants will do better).
4. Try a one-shot prompt: Open a new terminal (or exit the interactive session by typing /bye) and run a direct query:
ollama run llama2 "List three potential applications of AI in healthcare."
The model will output a numbered list or a paragraph with some applications (e.g. drug discovery, personalized treatment, medical imaging analysis). This demonstrates usage in non-interactive mode, suitable for getting quick answers or using Ollama within scripts.
5. Saving outputs to a file: Suppose you want the model to generate a markdown file containing an outline for a blog post. You can run:
ollama run llama2 "Generate an outline for a blog post about the benefits of local AI model serving." > outline.md
After a few seconds, check the outline.md file – it should contain the model’s generated outline. This is a simple way to bootstrap content or documentation using AI assistance (a small script that ties the pull, run, and save steps together is sketched just after this list).
6. Managing models: You can download multiple models. For example, ollama pull mistral to get the Mistral model, or other models listed in the Ollama library. Use ollama list to see all models you have on disk. If you need to free space, ollama rm <model> will remove a model. You can always re-pull it later if needed. If a model is currently running (loaded), ollama stop <model> will unload it from memory.
7. Exiting and cleanup: When done with your Ollama session, simply exiting the interactive mode or closing your terminal is fine. If you started the service manually (with ollama serve), you can stop it by pressing Ctrl+C in that terminal; note that ollama stop <model> only unloads a specific model from memory rather than stopping the server, and the server will unload idle models automatically after a period of inactivity anyway. There’s usually no harm in leaving the Ollama service running in the background; it will typically consume minimal resources when no models are loaded.
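To tie steps 2 through 5 together, here is the small wrapper script mentioned above. It is a minimal sketch under a few assumptions: the script name ask.sh is arbitrary, the model defaults to llama2, and output goes to the terminal unless you pass a file name:
#!/usr/bin/env bash
# Usage: ./ask.sh <model> "<prompt>" [output-file]
# Hypothetical helper script for illustration; adjust names and defaults to taste.
set -eu

MODEL="${1:-llama2}"
PROMPT="${2:?Please provide a prompt as the second argument}"
OUT="${3:-/dev/stdout}"

# Pull the model only if it is not already on disk
if ! ollama list | grep -q "^${MODEL}"; then
  ollama pull "$MODEL"
fi

# Run the one-shot prompt and write the response to the chosen destination
ollama run "$MODEL" "$PROMPT" > "$OUT"
For example, ./ask.sh llama2 "List three potential applications of AI in healthcare." healthcare.md reproduces step 4 while saving the answer to a file as in step 5.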
These beginner steps show that with just a few commands, you can set up a local AI model and interact with it. Ollama abstracts away a lot of complexity (like setting up the model environment, managing tokenization, etc.), allowing you to focus on what you want to do with the model. As you become comfortable with these basics, you can start exploring more advanced features and integrations, which we will cover in the next articles.
Final Thoughts
Ollama makes getting started with local AI models remarkably accessible. In this first article, we introduced what Ollama is and walked through installing it and running your first models. The key takeaway is that you don’t need to be a machine learning expert to harness powerful LLMs on your own computer – Ollama handles the heavy lifting of model management for you. The benefits in privacy, cost, and control are significant for anyone concerned about sending data to the cloud or paying per API call.
By experimenting with some basic prompts and interactions, you’ve seen how to use Ollama’s CLI to chat with a model or generate text outputs for various needs. At this point, you have a functional local AI setup. In the next article, we’ll build on this foundation and explore more advanced use cases and integrations. There is a rich ecosystem around Ollama, including ways to use it in GUI interfaces and to integrate it with other tools and workflows. With the basics under your belt, you’re ready to delve deeper into what else Ollama can do to empower your AI projects.
Cohorte Team
March 17, 2025