Engineering · 7 min read

How to Run GPT-Level AI Locally: A No-BS Guide to GPT-OSS 20B & 120B

Tired of rate limits and surprise bills? Run GPT-OSS locally. This 2025 guide shows you how to ditch APIs and ship fast with real use-cases.

Tega Adeyemi

Preview: Two truly open-source GPT models just made “local-first” AI practical. We’ll show you how to run them on your machine, wire them into your automation stack (think n8n, agents, tools), and ship real use-cases—minus API limits, vendor lock-in, and surprise bills.

Why This Is a Big Deal (for devs and AI leaders)

We’ve all been there: an LLM that’s great… until rate limits, rising costs, or privacy constraints say “nope.” The GPT-OSS models change the equation:

Both use a Mixture-of-Experts (MoE) design: only a subset of parameters activates per token (~3.6B active for 20B; ~5.1B active for 120B). Translation: big-model quality with small-model efficiency. Add a 128k token context window, and you can reason over long docs, logs, and codebases without frantic chunking.
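
To make the efficiency claim concrete, here is a quick back-of-the-envelope calculation. It uses the nominal sizes (20B, 120B) and the approximate active-parameter counts quoted above, so treat the percentages as rough.

```python
# Rough share of parameters that do work on each token, using the
# nominal sizes (20B, 120B) and active counts (~3.6B, ~5.1B) quoted above.
MODELS = {
    "gpt-oss:20b": {"total_b": 20.0, "active_b": 3.6},
    "gpt-oss:120b": {"total_b": 120.0, "active_b": 5.1},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of parameters activated per token."""
    return active_b / total_b


for name, p in MODELS.items():
    share = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: ~{share:.1%} of parameters active per token")
```

Per-token compute tracks the active count, but every expert still has to sit in memory, which is why the 120B model needs far more VRAM than its active count alone would suggest.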

What this unlocks

TL;DR Setup Paths

Pick one path and you’ll be testing in minutes.

Path A — Run Locally via Ollama (Recommended)

  1. Install Docker Desktop (Ollama itself doesn't need it; you'll use it for the n8n stack later).
  2. Install Ollama.
  3. Pull a model:
ollama pull gpt-oss:20b
# Optional heavy mode:
# ollama pull gpt-oss:120b
  4. Run it:
ollama run gpt-oss:20b
  5. Quick native REST test (non-OpenAI API):
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Give me 3 bullet points on why local LLMs matter.",
  "stream": false
}'
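
If you'd rather script that check than use curl, here is a minimal Python sketch of the same call using only the standard library. The helper names (`build_generate_request`, `generate`) are made up for this example, and the timeout is an arbitrary choice.

```python
import json
import urllib.request

# Native Ollama endpoint, same as in the curl example.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for Ollama's native /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs Ollama running)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": false the reply is one JSON object; the text is in "response".
    return body["response"]


# Usage, once `ollama run gpt-oss:20b` works on your machine:
#   print(generate("gpt-oss:20b", "Give me 3 bullet points on why local LLMs matter."))
```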

Path B — Hosted, Still “Open”: OpenRouter

If your laptop wheezes at 120B:

  1. Create an OpenRouter account and API key.
  2. Point your app to the OpenRouter endpoint.
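  3. Select the GPT-OSS 20B or 120B model from their catalog. OpenRouter uses namespaced IDs (for example openai/gpt-oss-20b and openai/gpt-oss-120b) rather than the Ollama tags, so copy the exact ID from the catalog.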
    Great for prototyping large-model behavior without buying more GPUs.

Wiring Into Your Automation Stack (n8n Example)

We’ll use n8n (swap with your orchestrator of choice: Temporal, Airflow, LangChain agents, etc.).

1) Bring up the stack

Use Docker Compose to spin up n8n + Postgres:

# docker-compose.yml
services:
  n8n:
    image: n8nio/n8n:latest
    ports: ["5678:5678"]
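    environment:
      - N8N_HOST=localhost
      - N8N_PORT=5678
      # Point n8n at the Postgres service below (it falls back to SQLite otherwise)
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=db
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=n8n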
    depends_on: [db]
    volumes:
      - n8n_data:/home/node/.n8n
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: n8n
      POSTGRES_PASSWORD: n8n
      POSTGRES_DB: n8n
    volumes:
      - n8n_db:/var/lib/postgresql/data
volumes:
  n8n_data:
  n8n_db:

Save that as docker-compose.yml, then start the stack:

docker compose up -d

Open http://localhost:5678.

2) Connect the local model

You have two clean options:

Option A — n8n’s native Ollama nodes

Option B — OpenAI-compatible credentials in n8n

Quick sanity tip: If requests fail, double-check whether you used the native Ollama API (no /v1) or the OpenAI-compatible API (with /v1), and that your model name uses the tag format (gpt-oss:20b), not a dashed name.
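
Both mistakes in that tip can be caught before any request goes out. The sketch below is illustrative only: the regex is a loose check we made up (it insists on an explicit tag, which Ollama itself doesn't require), and the function names are hypothetical.

```python
import re

# Ollama model references look like "name:tag", e.g. "gpt-oss:20b".
# Illustrative pattern, not Ollama's own validation rule.
TAG_PATTERN = re.compile(r"^[a-z0-9._/-]+:[A-Za-z0-9._-]+$")


def is_ollama_tag(model: str) -> bool:
    """True if the name uses the colon-separated tag format."""
    return bool(TAG_PATTERN.match(model))


def endpoint(base_url: str, api: str) -> str:
    """Return the chat URL for the 'native' or 'openai' API flavor."""
    base = base_url.rstrip("/")
    if api == "native":
        return f"{base}/api/chat"  # native Ollama API, no /v1
    if api == "openai":
        return f"{base}/v1/chat/completions"  # OpenAI-compatible API
    raise ValueError(f"unknown api flavor: {api!r}")
```

Calling `is_ollama_tag("gpt-oss-20b")` returns False, which flags the dashed-name mistake early.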

Production-Ready “Hello, Value” Use-Cases

Use-Case 1 — “Research → Draft → Review” Content Chain

Why: High leverage, high frequency, low risk.

Workflow

  1. HTTP Request → fetch docs or URLs
  2. Summarize (20B) → chunk & TL;DR
  3. Synthesize (120B for quality) → draft post/email/report
  4. Policy Check (regex/lint + a second 20B pass)
  5. Publish (GitHub PR, Google Doc, or CMS API)

n8n System Prompt (synthesis stage)

You are a precise technical writer. Combine the provided notes into a clear, factual draft.
- Keep claims grounded in the input.
- Use section headers and bullet lists.
- Include a short "Key Takeaways" box.
Return valid Markdown only.
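
Outside n8n, the summarize and synthesize stages can be sketched in a few lines of Python. Everything here is an assumption for illustration: `llm` is a hypothetical callable you wire to your own client, the chunk size is arbitrary, and the policy-check and publish steps are left out.

```python
from typing import Callable, List

# Hypothetical interface: llm(model, system_prompt, user_text) -> generated text.
LLM = Callable[[str, str, str], str]

SYNTHESIS_PROMPT = (
    "You are a precise technical writer. Combine the provided notes into a "
    "clear, factual draft. Return valid Markdown only."
)


def chunk(text: str, max_chars: int = 4000) -> List[str]:
    """Group paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def research_to_draft(source: str, llm: LLM) -> str:
    """Summarize each chunk with 20B, then synthesize one draft with 120B."""
    summaries = [
        llm("gpt-oss:20b", "Summarize the text in 3 bullet points.", piece)
        for piece in chunk(source)
    ]
    return llm("gpt-oss:120b", SYNTHESIS_PROMPT, "\n\n".join(summaries))
```

Passing the model call in as a function keeps the chain testable with a fake `llm` before you point it at a real endpoint.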

Use-Case 2 — CRM Assistant: Look Up → Reason → Act

Why: Move from “chat” to “ops.”

Sketch (TypeScript-ish pseudo inside an n8n Function node)
(Pseudocode for clarity; in n8n you’ll typically wire Sheets + Gmail nodes rather than call helpers directly.)

const person = await sheets.lookup("Contacts", { name: "Ada Lovelace" });

const draft = await ai.chat({
  // Route heavy reasoning to 120B; use 20B for shorter summarization steps
  model: "gpt-oss:120b",
  system: "You write concise, warm follow-ups. 120 words max.",
  messages: [
    { role: "user", content: `Write a follow-up referencing her last demo on ${person.lastDemoTopic}.` }
  ],
  temperature: 0.2
});

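// Validate content before acting (guardrail); throw so nothing is sent on failure
if (draft.includes("UNSUBSCRIBE") || draft.length >= 1200) {
  throw new Error("Draft failed guardrail checks; not sending.");
}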

await gmail.send({
  to: person.email,
  subject: "Next steps",
  html: draft
});

return { ok: true };

Reality check: In tool-call-heavy flows we’ve seen 20B occasionally produce duplicate “act” steps if you let it plan and execute in one shot. Two mitigations:

Use-Case 3 — Long-Log Diagnostics (128k Context FTW)

Drop big logs in, ask for patterns:

cat /var/log/app/*.log | ollama run gpt-oss:120b \
"Identify top 5 recurring errors, likely root causes, and a prioritized fix plan."
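
One caveat: Ollama's default context length is usually well below the model's 128k maximum, so very long input can be truncated unless you raise it. A minimal sketch of the same request through the native API follows, with `num_ctx` set explicitly. The helper names are hypothetical, and the default of 131072 is an assumption you should size to your memory.

```python
from pathlib import Path


def read_logs(directory: str, pattern: str = "*.log") -> str:
    """Concatenate matching log files in name order."""
    paths = sorted(Path(directory).glob(pattern))
    return "\n".join(p.read_text(errors="replace") for p in paths)


def build_log_request(log_text: str, num_ctx: int = 131072) -> dict:
    """Payload for Ollama's native /api/generate with a larger context window."""
    question = (
        "Identify top 5 recurring errors, likely root causes, "
        "and a prioritized fix plan."
    )
    return {
        "model": "gpt-oss:120b",
        "prompt": f"{question}\n\n--- LOGS ---\n{log_text}",
        "stream": False,
        # num_ctx is Ollama's context-length option; larger values use more memory.
        "options": {"num_ctx": num_ctx},
    }


# Usage: POST the JSON-encoded result of
#   build_log_request(read_logs("/var/log/app"))
# to http://localhost:11434/api/generate.
```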

Practical Client Code (OpenAI SDK → Local Ollama, OpenAI-Compat)

# pip install "openai==1.*"   (quoted so the shell doesn't expand the *)
from openai import OpenAI

# OpenAI-compatible endpoint (experimental in Ollama):
client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")

resp = client.chat.completions.create(
    model="gpt-oss:20b",   # use the Ollama tag
    messages=[
        {"role": "system", "content": "You are a terse senior engineer."},
        {"role": "user", "content": "Give me a 3-step plan to migrate cron jobs to Airflow."}
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)

Hosted fallback (same code path):

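# Same client class, different endpoint and key. Model IDs differ too:
# OpenRouter uses namespaced IDs (e.g. "openai/gpt-oss-120b"), not Ollama tags.
client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key="<OPENROUTER_API_KEY>"
)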
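
To keep one code path across both backends, you could centralize the endpoint, key, and model name in a small helper. This is a sketch under assumptions: `Backend` and `pick_backend` are names invented here, the key is read from an `OPENROUTER_API_KEY` environment variable of our choosing, and the OpenRouter model ID format should be confirmed against their catalog.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    base_url: str
    api_key: str
    model: str


def pick_backend(size: str, hosted: bool) -> Backend:
    """Return connection settings for local Ollama or hosted OpenRouter."""
    if size not in ("20b", "120b"):
        raise ValueError(f"unknown size: {size!r}")
    if hosted:
        return Backend(
            base_url="https://openrouter.ai/api/v1",
            api_key=os.environ.get("OPENROUTER_API_KEY", ""),
            # Assumption: OpenRouter namespaces the model as openai/gpt-oss-<size>.
            model=f"openai/gpt-oss-{size}",
        )
    return Backend(
        base_url="http://localhost:11434/v1/",
        api_key="ollama",  # placeholder value; Ollama ignores it
        model=f"gpt-oss:{size}",
    )
```

Usage would then be `b = pick_backend("120b", hosted=True)` followed by `OpenAI(base_url=b.base_url, api_key=b.api_key)`, passing `b.model` to the chat call.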

Implementation Tips (Hard-Won)

  1. Model Routing (save $ and latency)
    • 20B for: routing, summaries, metadata extraction
    • 120B for: code generation, long-context reasoning, tool-heavy plans
  2. Guardrails First, Not Later
    • Add JSON schema outputs for plan/act separation
    • Validate before acting (email sends, API mutations)
  3. Prompt Contracts
    • Treat prompts like API contracts; version them
    • Log inputs/outputs for replay (and prod bugs)
  4. Context Diet
    • Prefer short system prompts + tight examples to reduce drift
    • Use retrieval; don’t paste the internet
  5. GPU Reality
    • 20B: comfortable on ~16–24GB VRAM with quantization (throughput varies)
    • 120B: plan for serious GPU (e.g., 80GB class) or use OpenRouter for that step
    • Quantization helps; test Q4_K_M vs Q5_K_M trade-offs
  6. Throughput > Single-Query Speed
    • Batch where possible
    • Keep the pipe full; use streaming for UX
  7. Test with “Adversarial” Fixtures
    • Similar contacts with different domains
    • Emails that should never be sent (e.g., unsubscribed)
    • Logs with overlapping error signatures
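
Tips 1 and 2 can be sketched together: route by task, and make the model return a JSON plan that you validate before anything acts on it. The task labels, action names, and plan shape below are all assumptions made for this example, not a fixed schema.

```python
import json

# Task-to-model routing from tip 1; the task labels are illustrative.
LIGHT_TASKS = {"routing", "summary", "metadata_extraction"}
# Hypothetical allow-list of actions the workflow may perform.
ALLOWED_ACTIONS = {"send_email", "update_crm"}


def pick_model(task: str) -> str:
    """Send light tasks to 20B and everything else to 120B."""
    return "gpt-oss:20b" if task in LIGHT_TASKS else "gpt-oss:120b"


def validate_plan(raw: str) -> list:
    """Parse a model-produced plan and reject anything unexpected.

    Assumed shape: {"steps": [{"action": "send_email", "args": {...}}, ...]}
    Raises ValueError on malformed JSON, unknown actions, or duplicate steps.
    """
    plan = json.loads(raw)
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    steps = plan.get("steps")
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must contain a non-empty 'steps' list")
    seen = set()
    for step in steps:
        if not isinstance(step, dict):
            raise ValueError("each step must be an object")
        action = step.get("action")
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"action not allowed: {action!r}")
        key = json.dumps(step, sort_keys=True)
        if key in seen:
            raise ValueError(f"duplicate step: {key}")
        seen.add(key)
    return steps
```

Only after `validate_plan` returns should a separate executor run the steps. Rejecting exact duplicates is one simple way to catch the repeated "act" steps mentioned earlier.
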
Comparisons — What to Use When

| Criterion | GPT-OSS:20B | GPT-OSS:120B | Mixtral-8x7B | Llama-3-8B/70B | vLLM (server) | LM Studio |
| --- | --- | --- | --- | --- | --- | --- |
| License | Open | Open | Open | Open | N/A (serving) | App |
| Strength | Fast routing, summaries | Deep reasoning, coding, long context | Strong MoE generalist | Broad ecosystem & finetunes | High-throughput inference | Zero-dev desktop runner |
| Context | 128k | 128k | 32–64k typical | 8–128k variants | Depends on backend | Depends on model |
| Best fit | Agents glue, RAG extract | Agents that act, code & plans | Middle-ground quality/perf | Ecosystem & staffing ease | Serving fleets | Local testing/demo |
| Trade-off | May stumble on complex tool calls | Heavier infra needs | Balanced behavior | Varies by quant/finetune | Infra to own | Fewer server knobs |

Our rule of thumb: 20B routes and summarizes; 120B thinks and acts. If you’re infra-constrained, Mixtral is a solid middle ground. If you want the biggest ecosystem, Llama-3 variants are easy to staff for.

Security & Privacy Checklist (Pin This)

Troubleshooting (Fast Fixes)

Key Takeaways (Tape these above your desk)

A Parting Nudge

We’re collectively moving from “LLMs as chat” to LLMs as systems. The teams who win won’t just prompt better; they’ll wire models into tools, data, and decisions with discipline.

If you’ve been waiting for the moment to go local, this is it. Spin it up. Point it at something useful. Watch latency drop, bills calm down, and velocity jump.

See you on the other side, where your AI runs on your terms.

Tega Adeyemi · October 13, 2025