I have a private AI running in my closet.

Not a cloud service. Not an API with a credit card attached. A physical machine in my apartment running open-source language models that answer whatever I ask, with zero content filtering and zero data ever leaving my network.

Let me explain why that matters and how I set it up.

Why Self-Hosted AI

Every cloud AI service has guardrails. Some of those guardrails make sense. Some of them don’t. But the bigger issue for me isn’t the content filters… it’s the data.

When you type a question into ChatGPT or Claude or Gemini, that prompt goes to a server you don’t control. It might be used for training. It might be stored. It might be subject to a subpoena. You don’t really know, and you can’t really verify.

I wanted an AI I could ask ANYTHING. Security research questions. Creative writing without content warnings. Technical explorations that trigger refusals in commercial models. And I wanted to know, with certainty, that my conversations stay on my hardware.

So I built one.

The Hardware

My homelab server is a refurbished Lenovo ThinkCentre M710q. Intel i5-7500T, 16GB RAM, fits in a shoebox. I’ve written about this machine before… it runs 24+ Docker containers and barely notices.

The AI models live on a Samsung T7 Shield 2TB SSD plugged into a USB port. The T7 is rugged, bus-powered, and portable. I can unplug it, take it with me, and plug it into any machine with Docker installed.

The Stack

Two containers. That’s it.

Ollama runs the actual language models. It’s an open-source LLM runtime that manages model weights, handles inference, and exposes an API. Think of it as Docker for AI models.

Open WebUI is the browser frontend. Clean chat interface, conversation history, model switching. Looks and feels like ChatGPT but it’s talking to your local Ollama instance.

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama        # so `docker exec ollama ...` works by name
    restart: unless-stopped
    ports:
      - 11434:11434
    volumes:
      - /mnt/t7-intel/ollama-models:/root/.ollama

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - 3080:8080
    volumes:
      - open-webui-data:/app/backend/data   # keeps accounts and chat history
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434

volumes:
  open-webui-data:

docker compose up -d and you’re running.
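Once the containers are up, you can also skip the browser and talk to Ollama's REST API directly; /api/generate is its one-shot generation endpoint. A minimal stdlib-only sketch (the host and port match the compose file above; adjust for your setup):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # adjust if Ollama runs elsewhere

def build_generate_request(model: str, prompt: str) -> dict:
    # Ollama's /api/generate takes a JSON body like this;
    # stream=False returns one complete JSON object instead of chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The completed text comes back in the "response" field.
        return json.loads(resp.read())["response"]

# ask("llama3.2:1b", "Why is the sky blue?")  # requires the stack running
```

Anything on your Tailscale network can hit the same endpoint, which makes the box usable as a local AI backend for scripts, not just chat.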

The Models

Here’s what I pulled:

Model                Size     What It Does
dolphin-llama3:8b    5GB      Uncensored general assistant. Will answer anything.
dolphin-mistral:7b   4GB      Uncensored, faster responses
deepseek-r1:8b       5GB      Deep reasoning, good for study help
qwen2.5-coder:7b     4.7GB    Code assistance
llama3.2:1b          1.3GB    Ultra-fast for quick questions
nomic-embed-text     274MB    Embeddings for search (RAG)

The Dolphin models are the main event. They’re based on Meta’s Llama and Mistral but fine-tuned with the content filters removed. They’ll discuss anything without refusals, warnings, or disclaimers.

Total storage: about 20GB. On a 2TB drive. Plenty of room.
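The ~20GB figure is just the table's model sizes summed (a back-of-envelope check, using the sizes as listed above):

```python
# Model sizes from the table above, in GB (274MB ≈ 0.274GB).
sizes_gb = {
    "dolphin-llama3:8b": 5.0,
    "dolphin-mistral:7b": 4.0,
    "deepseek-r1:8b": 5.0,
    "qwen2.5-coder:7b": 4.7,
    "llama3.2:1b": 1.3,
    "nomic-embed-text": 0.274,
}
total_gb = sum(sizes_gb.values())
print(f"{total_gb:.1f} GB of a 2TB drive: {total_gb / 2000:.1%} used")
```

Six models fit in about one percent of the drive, which is why the T7 can double as a knowledge hub later.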

The Speed Problem (and the Fix)

Here’s the honest part: inference on a CPU-only i5-7500T is slow. We’re talking 2-5 tokens per second for the 8B models. A paragraph takes 30 seconds. It works, but it’s not instant.

The 1B model (llama3.2:1b) runs at maybe 10-15 tokens per second. Good enough for quick questions.
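Those wait times are simple arithmetic: generation time is token count divided by token rate. A quick sketch using the rough numbers above (token counts are illustrative):

```python
# Back-of-envelope generation time: tokens / (tokens per second).
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    return tokens / tokens_per_second

# A ~100-token paragraph at ~3 tok/s on the CPU-only i5-7500T:
cpu_8b = seconds_for(100, 3)    # roughly half a minute
# The same paragraph from the 1B model at ~12 tok/s:
cpu_1b = seconds_for(100, 12)   # under ten seconds
print(f"8B: {cpu_8b:.0f}s, 1B: {cpu_1b:.0f}s")
```

The same formula explains why a GPU at 15-25 tok/s turns that half-minute paragraph into a few seconds.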

The real fix is a GPU. An NVIDIA T600 (4GB GDDR6) goes for about $60-80 used on eBay, fits in the low-profile PCIe slot, and would push inference to 15-25 tokens per second. That’s on the shopping list.

Alternatively, I can run Ollama on my MacBook, where the M4 chip does 30+ tokens per second. But then it’s not always-on, and it’s not dedicated server infrastructure.

For now, slow and private beats fast and cloud. I use the 1B model for quick stuff and save the 8B models for when I’m willing to wait.

Access From Anywhere

My homelab runs Tailscale, a mesh VPN that connects all my devices. So I can access Open WebUI from:

  • My Mac at home: http://100.92.38.114:3080
  • My phone on cellular: same URL via Tailscale
  • Any device on my network

No port forwarding. No exposing anything to the internet. Tailscale handles the encrypted tunnel.

I also added a “Wingman” link to my Grand Central dashboard sidebar, so getting to the AI chat is one click from my main workspace.

What I Use It For

Honestly? Everything I can’t or don’t want to ask a cloud service.

Security research questions that would get flagged. Creative writing without content warnings interrupting the flow. Brainstorming without the feeling that someone’s reading over my shoulder. Processing personal documents through RAG without uploading them to someone else’s server.
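That last use case, RAG, boils down to embedding documents and queries as vectors and ranking them by cosine similarity. A toy sketch with made-up 3-dimensional vectors (real embeddings would come from nomic-embed-text via Ollama's embeddings API and have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mag

# Toy "embeddings" -- in reality these vectors come from the embedding model.
docs = {
    "pasta recipe": [0.9, 0.1, 0.0],
    "firewall notes": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this embeds "how do I cook spaghetti?"
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # pasta recipe
```

The point is that every step of that pipeline, embedding included, runs on the same box, so the documents never leave the network.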

And sometimes just the peace of mind of knowing the conversation is mine. It lives on a drive I can hold in my hand. Nobody else sees it. Ever.

The Bigger Vision

The T7 drive isn’t just an AI server. It’s becoming a portable offline knowledge hub. Offline Wikipedia via Kiwix, survival guides, recipe database, programming references, entertainment. The idea is that if the internet goes down (hurricane season in Tampa is no joke), I can plug this drive into any computer and have a complete knowledge server.

AI that works without internet. Wikipedia that works without internet. Recipes, maps, reference docs. All on one bus-powered SSD.

But that’s a story for another post.

How to Do This Yourself

  1. Get any computer with Docker installed
  2. docker compose up -d with the Ollama + Open WebUI stack
  3. docker exec ollama ollama pull dolphin-llama3:8b
  4. Open http://localhost:3080 in your browser
  5. Create an admin account
  6. Start chatting

Total setup time: about 20 minutes, most of which is waiting for model downloads. Total cost if you already have a computer: $0.

Your own private, uncensored AI. No cloud. No subscriptions. No data leaving your house.


I run Ollama on a homelab I built for under $500. The models are open-source and free. The only ongoing cost is electricity, which is negligible for a 35-watt mini PC.