Inference Optimization with NVIDIA TensorRT

Many applications of deep learning models benefit from reduced latency (the time taken for inference). This tutorial introduces NVIDIA TensorRT, an SDK for high-performance deep learning inference, and walks through the steps needed to convert a trained deep learning model into an inference-optimized model on HAL.

Speakers: Nikil Ravi and Pranshu Chaturvedi, UIUC
Webinar Date: April 13, 2022

Quantization vs Pruning vs Distillation: Optimizing NNs for Inference

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

AI Inference: The Secret to AI's Superpowers

Understanding the LLM Inference Workload - Mark Moyou, NVIDIA

Intuition behind Mamba and State Space Models | Enhancing LLMs!

Crazy Fast YOLO11 Inference with Deepstream and TensorRT on NVIDIA Jetson Orin

Lightning Talk: Triton Compiler - Thomas Raoux, OpenAI

Keynote: After the AI Hype – What’s Real, and What’s Next - Richard Campbell - 2026

ONNX and ONNX Runtime

From model weights to API endpoint with TensorRT LLM: Philip Kiely and Pankaj Gupta

Demo: Optimizing Gemma inference on NVIDIA GPUs with TensorRT-LLM

Yann LeCun: World Models: Enabling the next AI revolution

The World's Most Important Machine

Horace He: Building Machine Learning Systems for a Trillion Trillion Floating Point Operations

How the VLLM inference engine works?

NVIDIA Triton Inference Server and its use in Netflix's Model Scoring Service

Scaling Inference Deployments with NVIDIA Triton Inference Server and Ray Serve | Ray Summit 2024

Inference, Diffusion, World Models, and More | YC Paper Club

Google DeepMind Distinguished Eng (L9): How To Land a Job at a Frontier Lab | Vlad Feinberg

AI Agent Inference Performance Optimizations + vLLM vs. SGLang vs. TensorRT w/ Charles Frye (Modal)
