Inference Optimization with NVIDIA TensorRT
Many deep learning applications benefit from lower latency, that is, less time spent on each inference. This tutorial introduces NVIDIA TensorRT, an SDK for high-performance deep learning inference, and walks through all the steps needed to convert a trained deep learning model into an inference-optimized model on HAL.

Speakers: Nikil Ravi and Pranshu Chaturvedi, UIUC
Webinar Date: April 13, 2022
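
Since the abstract frames the goal as reducing latency, it helps to be able to measure it the same way before and after optimization. The following is a minimal sketch using only the Python standard library, not material from the webinar itself: `measure_latency` and `dummy_inference` are hypothetical names, and the dummy workload merely stands in for a real model call.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(infer: Callable[[], object],
                    warmup: int = 10,
                    runs: int = 100) -> Dict[str, float]:
    """Time repeated calls to `infer` and return latency stats in milliseconds."""
    if runs < 2:
        raise ValueError("runs must be at least 2 to compute percentiles")

    # Warm-up calls are not timed: early calls often include one-off costs
    # (lazy initialization, caches) that would skew the numbers.
    for _ in range(warmup):
        infer()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        # Assumption: `infer` returns only once the result is ready. GPU work
        # is typically asynchronous, so a real model call would need to
        # synchronize inside `infer` for the timing to be meaningful.
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    # n=100 yields 99 cut points; index 98 is the 99th percentile.
    # method="inclusive" keeps the result within the observed min and max.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p99_ms": cuts[98],
        "min_ms": samples_ms[0],
        "max_ms": samples_ms[-1],
    }


def dummy_inference() -> int:
    # Hypothetical stand-in for a model's forward pass.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    stats = measure_latency(dummy_inference, warmup=5, runs=50)
    for name, value in stats.items():
        print(f"{name}: {value:.3f}")
```

Reporting a tail percentile alongside the mean is a deliberate choice here: occasional slow calls can dominate user-visible latency even when the average looks good, so comparing an original and an optimized model on both figures gives a fairer picture.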

Quantization vs Pruning vs Distillation: Optimizing NNs for Inference

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

AI Inference: The Secret to AI's Superpowers

Understanding the LLM Inference Workload - Mark Moyou, NVIDIA

Intuition behind Mamba and State Space Models | Enhancing LLMs!

Crazy Fast YOLO11 Inference with Deepstream and TensorRT on NVIDIA Jetson Orin

Lightning Talk: Triton Compiler - Thomas Raoux, OpenAI

Keynote: After the AI Hype – What’s Real, and What’s Next - Richard Campbell - 2026

ONNX and ONNX Runtime

From model weights to API endpoint with TensorRT LLM: Philip Kiely and Pankaj Gupta

Demo: Optimizing Gemma inference on NVIDIA GPUs with TensorRT-LLM

Yann LeCun: World Models: Enabling the next AI revolution

The World's Most Important Machine

Horace He: Building Machine Learning Systems for a Trillion Trillion Floating Point Operations

How the VLLM inference engine works?

NVIDIA Triton Inference Server and its use in Netflix's Model Scoring Service

Scaling Inference Deployments with NVIDIA Triton Inference Server and Ray Serve | Ray Summit 2024

Inference, Diffusion, World Models, and More | YC Paper Club

Google DeepMind Distinguished Eng (L9): How To Land a Job at a Frontier Lab | Vlad Feinberg
