I Benchmarked 3 Local AI Models on My Laptop. The Results Were Surprising

I built a privacy-first AI assistant that runs entirely offline: no OpenAI, no cloud, no data leaving your machine. Then I wired up a benchmarking suite to measure which local model performs best on my hardware. I tested three models (llama3.2:3b, phi3:mini, mistral:7b) on the same 30 prompts covering factual, reasoning, code generation, and structured output tasks.

What's inside:
✅ Ollama local inference: runs llama3.2:3b, phi3:mini, and mistral:7b entirely offline
✅ FastAPI wrapper: /query, /benchmark, /switch, /models endpoints
✅ JSON schema validation: structured output with a 1-retry correction loop
✅ Benchmarking suite: P50/P95/P99 latency, tokens/sec, memory via psutil
✅ Multi-model comparison: same 30 prompts, 3 models, automated report
✅ Browser UI: self-hosted chat interface at localhost:8000/ui
✅ Docker Compose: Ollama + FastAPI in one command

The benchmark exposes what nobody tells you:
Llama 3.2 3B: 42.3 tok/s, the fastest of the three, but its 3.8 s P95 misses the under-3 s P95 SLA
Phi3 Mini: 4.7 tok/s on CPU, slowest by far
Mistral 7B: best quality, highest memory (14 GB)

Pick the wrong model and you get 29-second latency on a simple question.

🔗 RESOURCES:
GitHub Code: https://github.com/ThinkWithOps/02-lo...

🛠️ Tech Stack:
FastAPI: REST API + browser UI
Ollama: local LLM runtime (no API keys)
llama3.2:3b / phi3:mini / mistral:7b: models tested
Pydantic: JSON schema enforcement + retry
psutil: memory profiling per inference call
NumPy: P50/P95/P99 aggregation
Docker Compose: container orchestration

#LocalAI #Ollama #LLM #AIEngineering #PrivacyFirst
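
For anyone wondering how the P50/P95/P99 and tokens/sec numbers can be computed, here is a minimal sketch using NumPy. It is not the code from the repo: the function names `summarize_latencies` and `tokens_per_second`, the dictionary keys, and the `sla_p95_s` parameter are all made up for illustration, and the 3-second default simply mirrors the "P95 under 3s" SLA mentioned above.

```python
import numpy as np


def summarize_latencies(latencies_s, sla_p95_s=3.0):
    """Aggregate per-request latencies (in seconds) into P50/P95/P99.

    Hypothetical helper: names and the SLA default are assumptions,
    not taken from the benchmarked project.
    """
    arr = np.asarray(latencies_s, dtype=float)
    if arr.size == 0:
        raise ValueError("need at least one latency sample")
    # np.percentile uses linear interpolation between samples by default.
    p50, p95, p99 = np.percentile(arr, [50, 95, 99])
    return {
        "p50_s": float(p50),
        "p95_s": float(p95),
        "p99_s": float(p99),
        # The SLA is met only when P95 is strictly under the threshold.
        "meets_sla": bool(p95 < sla_p95_s),
    }


def tokens_per_second(token_count, elapsed_s):
    """Throughput for a single generation call."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return token_count / elapsed_s
```

In a setup like this you would time each of the 30 prompts per model, pass the list of timings to `summarize_latencies`, and compare the resulting dictionaries across models. With only 30 samples, P99 sits very close to the slowest request, so it is best read as a rough tail indicator.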