Local Multimodal RAG Pipeline End-to-End Tutorial | On DGX Spark
Let's build a multimodal RAG (Retrieval Augmented Generation) pipeline using NVIDIA's Nemotron embedding and rerank vision-language models.

Multimodal means we'll be able to embed images and text in the same feature space. This allows us to search over images and text simultaneously.

We'll learn how to create multimodal embeddings, retrieve them with a query, rerank them if necessary, and generate an output based on the retrieved samples.

This is a scalable workflow you could take to many different use cases. If you've got a dataset of documents you need to search over, multimodal RAG could be part of the solution.

All of this was performed locally on an NVIDIA DGX Spark (see here for more: https://nvda.ws/4iQXZU4).

Businesses: If you're a business that needs help creating its own multimodal RAG pipeline, contact me at: https://www.mrdbourke.com/contact/

Links:
Source code (book version) - https://www.learnhuggingface.com/note...
Source code (GitHub) - https://github.com/mrdbourke/learn-hu...
Source code (Colab) - https://colab.research.google.com/dri...
YouTube playlist of livestreams - • Multimodal RAG (Retrieval Augmented Genera...

Resources:
Nemotron RAG models - https://huggingface.co/collections/nv...
A Realistic RAG System by Martin Fowler - https://martinfowler.com/articles/gen...

Timestamps:
0:00 - Intro and overview
1:42 - What is RAG?
2:29 - RAG vs Fine-tuning
3:25 - A realistic RAG setup
4:15 - What we're going to build
8:35 - Ingredients and tools
9:15 - What are embeddings? (Part 1)
12:07 - What are embeddings? (Part 2 - a helpful resource)
12:39 - Step: Creating the embeddings
15:08 - Step: Retrieving results given a query
21:17 - Step: Reranking retrieved results
23:46 - Code Starts
25:05 - Viewing samples in our dataset
26:29 - Loading models from a specific checkpoint on Hugging Face
28:12 - Creating/loading embeddings
30:22 - Looking at example embeddings
31:00 - Always embed your query with the same model as your documents
34:07 - Viewing results of matching a query to document embeddings
36:38 - Using an image as a query
39:19 - Step: Reranking outputs
41:01 - Discussing reranking options
45:20 - Visualizing reranked samples versus the original retrieved results
47:54 - Step: Loading a generation model
49:52 - Generating a summary of input recipes
50:28 - Creating a demo (locally)
1:00:34 - Uploading our demo to Hugging Face
1:01:58 - Discussing tidbits, notes and extensions
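
To make the retrieval step a little more concrete, here is a minimal sketch of matching a query embedding against document embeddings with cosine similarity. It is not the code from the video: the tiny 3-dimensional vectors, the file names, and the `cosine_similarity` and `retrieve` helpers are all hypothetical stand-ins. In a real pipeline the vectors would come from the embedding model, with the query embedded by the same model as the documents.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, document_embeddings, top_k=2):
    """Return the top_k (document id, score) pairs, highest score first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in document_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings. Because images and text share one feature
# space, an image and a text file can sit side by side in the same index.
document_embeddings = {
    "pasta_recipe.jpg": [0.9, 0.1, 0.0],
    "pasta_recipe.txt": [0.8, 0.2, 0.1],
    "cake_recipe.jpg": [0.1, 0.9, 0.2],
}

# Hypothetical query embedding, assumed to come from the same model.
query_embedding = [1.0, 0.0, 0.0]

results = retrieve(query_embedding, document_embeddings, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A reranking step would then take these top results and rescore them with a separate rerank model before passing the best ones to the generation model.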


