Qwen 3.6 27B on a 5070 Ti: my full local AI agent build

A complete walkthrough of my personal AI assistant: the model, the agent loop, the chat interface, and the honest benchmark numbers. It runs on a single RTX 5070 Ti (16GB VRAM) on Kubuntu. Nothing leaves my network. No sub, no API, no rate limits.

The stack:
llama.cpp for inference
Qwen 3.6 27B (HauhauCS uncensored fine-tune, Q3_K_P quant)
nanobot for the agent loop
Telegram as the chat channel
whisper-faster for voice transcription
SearXNG for local web search

Hardware: Ryzen 7 7800X3D, 32GB DDR5-6000, RTX 5070 Ti 16GB.

Links:
llama.cpp: https://github.com/ggml-org/llama.cpp
Qwen 3.6 27B (official): https://huggingface.co/Qwen/Qwen3.6-27B
HauhauCS uncensored: https://huggingface.co/HauhauCS/Qwen3...
nanobot: https://github.com/HKUDS/nanobot
SearXNG: https://github.com/searxng/searxng

Benchmark numbers (Q3_K_P, flash attention, KV cache q8_0):
Empty context → 1527 t/s prefill, 42 t/s decode
8K context → 1544 t/s prefill, 43 t/s decode
16K context → 1389 t/s prefill, 41 t/s decode
32K context → 1077 t/s prefill, 30 t/s decode (2 layer offload for bench)

Build: llama.cpp 0adede866 (8925)

Note: I'm on CUDA 13.1 deliberately. CUDA 13.2 has a known bug that produces gibberish output with this model. NVIDIA acknowledged it, but there was no fix at the time of recording. Don't update.

If you're running a similar setup with a better config, drop it in the comments.
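
For a rough feel of what the benchmark numbers mean in day-to-day use, here is a minimal Python sketch that turns prefill and decode rates into an estimated response time. It is not part of my setup: the function name `estimate_seconds` and the `BENCH` table are made up for illustration, the context sizes assume 8K = 8192 tokens and so on, and it assumes the rate measured at a given context size applies to the whole prompt, which is a simplification.

```python
# Rates copied from the benchmark list in this description
# (Q3_K_P, flash attention, KV cache q8_0).
# Keys: context size in tokens (assumption: 8K = 8192, 16K = 16384, 32K = 32768).
# Values: (prefill tokens/s, decode tokens/s).
BENCH = {
    0: (1527, 42),
    8192: (1544, 43),
    16384: (1389, 41),
    32768: (1077, 30),
}


def estimate_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough wall-clock estimate: prompt processing time plus generation time.

    Hypothetical helper. It ignores model load time, sampling overhead,
    and any prompt caching, so treat the result as a ballpark figure.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("rates must be positive")
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Example: a prompt that fills the context, followed by a 500-token reply.
for ctx, (prefill_tps, decode_tps) in BENCH.items():
    seconds = estimate_seconds(ctx, 500, prefill_tps, decode_tps)
    print(f"{ctx:>6} prompt tokens + 500 output tokens: ~{seconds:.1f} s")
```

With the 16K numbers, a full 16384-token prompt plus a 500-token reply comes out at roughly 24 seconds, split about evenly between prefill and decode.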