Filip Makraduli – One GPU, Four Retrieval Modes: Multi-Model Search Serving #bbuzz

More: https://2026.berlinbuzzwords.de/sessi...

Speaker: Filip Makraduli

Competitive search now needs dense embeddings, sparse vectors, ColBERT, and cross-encoder reranking. Most teams run four separate containers. This talk shows how to serve all four from one process, walks through building a hybrid retrieval pipeline with real benchmark data, and covers where each retrieval mode wins and where it wastes compute.

Every production search system in 2026 runs multiple models. A dense embedder handles semantic search. A sparse model provides keyword recall. A multi-vector model like ColBERT enables token-level matching. A cross-encoder reranker improves final precision. These four stages have become table stakes for competitive retrieval quality.

The infrastructure story is less elegant. The industry default is one container per model, typically using HuggingFace TEI, Triton, or a custom Flask wrapper. Four models means four separate deployments, four sets of scaling rules, and four GPU allocations where each model uses a fraction of what it reserves.

When building SIE, an open-source search inference engine, we took a different approach: one server process that handles all four retrieval modes through a unified API with three primitives (encode, score, extract). Models like BGE-M3 return dense, sparse, and multi-vector outputs from a single encode call. Cross-encoder reranking uses the score primitive. Same server, same GPU, same API.

The talk covers four areas.

First, why hybrid retrieval requires multiple model types. We will walk through a real retrieval pipeline: sparse for keyword recall, dense for semantic matching, ColBERT for token-level precision, and a cross-encoder for final reranking. For each stage we will show what it adds to retrieval quality using BEIR benchmark data, and when the added complexity is not worth it.

Second, the adapter architecture that makes multi-model serving possible. SIE wraps PyTorch, FlashAttention, SentenceTransformers, and SGLang behind a common interface. We will walk through the lifecycle of a request: API call, tokenization on CPU, batching, GPU inference, and postprocessing. Different model architectures need different compute backends, and we will explain why a single unified runtime was not the right choice.

Third, building the pipeline end to end. A practical walkthrough of dense + sparse + ColBERT + reranking from a single server instance, including how to combine scores from different retrieval modes and how to tune the balance between recall and precision.

Fourth, tradeoffs and lessons. When does multi-model serving on one GPU work well, and when should a model get its own dedicated container? What happens under concurrent load when multiple models compete for memory? We will share real data from running these workloads on L4 GPUs.

### Follow us on Social Media and join the Community!
Mastodon: https://floss.social/@berlinbuzzwords
LinkedIn: / berlin-buzzwords
Website: https://berlinbuzzwords.de
Mail: [email protected]

Berlin Buzzwords is an event by Plain Schwarz – https://plainschwarz.com
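
The abstract mentions combining scores from different retrieval modes but does not say which fusion method the talk uses. As a rough illustration only, the sketch below uses reciprocal rank fusion (RRF), a common rank-based way to merge dense, sparse, and multi-vector result lists without normalizing their raw scores. RRF itself, the function name, the constant k = 60, and the document IDs are all assumptions for the example and are not taken from SIE or the talk.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(
    rankings: Dict[str, List[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Merge ranked ID lists from several retrieval modes (hypothetical helper).

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Larger k flattens the gap between top ranks.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up candidate lists, one per retrieval mode (best match first).
rankings = {
    "dense": ["d1", "d2", "d3"],
    "sparse": ["d2", "d4", "d1"],
    "colbert": ["d2", "d1", "d5"],
}

fused = reciprocal_rank_fusion(rankings)
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.5f}")
```

In a pipeline like the one described, the top of such a fused list would typically be handed to the cross-encoder for final reranking. How many candidates to pass along is one place where the recall and precision balance can be tuned.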
