Inference Optimization with NVIDIA TensorRT

Many applications of deep learning models benefit from reduced latency (the time taken for inference). This tutorial introduces NVIDIA TensorRT, an SDK for high-performance deep learning inference, and walks through the steps needed to convert a trained deep learning model into an inference-optimized model on HAL.
Speakers: Nikil Ravi and Pranshu Chaturvedi, UIUC
Webinar Date: April 13, 2022

Quantization vs Pruning vs Distillation: Optimizing NNs for Inference

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

AI Inference: The Secret to AI's Superpowers

Understanding the LLM Inference Workload - Mark Moyou, NVIDIA

Intuition behind Mamba and State Space Models | Enhancing LLMs!

Crazy Fast YOLO11 Inference with Deepstream and TensorRT on NVIDIA Jetson Orin

Lightning Talk: Triton Compiler - Thomas Raoux, OpenAI

Keynote: After the AI Hype – What’s Real, and What’s Next - Richard Campbell - 2026

ONNX and ONNX Runtime

From model weights to API endpoint with TensorRT LLM: Philip Kiely and Pankaj Gupta

Demo: Optimizing Gemma inference on NVIDIA GPUs with TensorRT-LLM

Yann LeCun: World Models: Enabling the next AI revolution

The World's Most Important Machine

Horace He: Building Machine Learning Systems for a Trillion Trillion Floating Point Operations

How the VLLM inference engine works?

NVIDIA Triton Inference Server and its use in Netflix's Model Scoring Service

Scaling Inference Deployments with NVIDIA Triton Inference Server and Ray Serve | Ray Summit 2024

Inference, Diffusion, World Models, and More | YC Paper Club

Google DeepMind Distinguished Eng (L9): How To Land a Job at a Frontier Lab | Vlad Feinberg

AI Agent Inference Performance Optimizations + vLLM vs. SGLang vs. TensorRT w/ Charles Frye (Modal)