[Scheduling seminar] Zijie Zhou (IEDA, HKUST) | Efficient and Robust LLM Scheduling

Keywords: Scheduling, Optimization for LLM inference, Approximation online algorithms

We study the problem of scheduling Large Language Model (LLM) inference to minimize total completion time. LLM inference is an online, multi-task service process in which a pre-trained LLM processes input requests and generates output tokens sequentially, and it consumes a great deal of energy. It is therefore vital to improve scheduling efficiency and reduce power consumption when large numbers of prompt requests arrive. There are two key challenges.

(i) Each request has heterogeneous prefill and decode lengths. In LLM serving, the prefill length corresponds to the input prompt length, which determines the initial memory usage in the KV cache. The decode length is the number of output tokens generated sequentially, with each additional token increasing the KV cache memory usage by one unit. We show that minimizing total completion time is NP-hard due to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. We then analyze scheduling strategies commonly used in practice, such as First-Come-First-Serve (FCFS) and Shortest-First (SF), and prove that their competitive ratios are unbounded. To address this, we propose a novel algorithm based on a new selection metric that efficiently forms batches over time. We prove that this algorithm achieves a constant competitive ratio.

(ii) The output length, which critically affects memory usage and processing time, is unknown in advance. We first design a conservative algorithm, Amax, which schedules requests based on the upper bound of the predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly because output lengths may be overestimated. To overcome this limitation, we propose Amin, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inference. We prove that Amin achieves a log-scale competitive ratio.

Organized by Zdenek Hanzalek (CTU in Prague), Michael Pinedo (New York University), and Guohua Wan (Shanghai Jiao Tong).
Seminar's webpage: https://schedulingseminar.com/
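The memory model in the abstract (prefill sets the initial KV cache usage, and each generated token adds one unit) can be illustrated with a small Python sketch. This is not the talk's algorithm: the names `Request`, `memory_after`, `peak_memory`, and `admit_greedily` are hypothetical, and the sketch assumes that all requests in a batch start together and that a request's memory is released as soon as it finishes. It only shows why planning with upper bounds on output length, in the spirit of the conservative approach described above, can admit fewer requests than the true lengths would allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prefill: int  # prompt tokens: initial KV cache usage
    decode: int   # output tokens: each one adds one unit of KV cache


def memory_after(batch, t):
    """KV cache units in use once t tokens have been generated per active request.

    Assumption: a request with decode < t has finished and freed its memory.
    """
    return sum(r.prefill + t for r in batch if r.decode >= t)


def peak_memory(batch):
    """Largest KV cache usage over the lifetime of a batch started together."""
    if not batch:
        return 0
    horizon = max(r.decode for r in batch)
    return max(memory_after(batch, t) for t in range(horizon + 1))


def admit_greedily(candidates, assumed_decode, capacity):
    """Admit requests in arrival order while the planned peak fits in capacity.

    assumed_decode holds the output length the planner assumes for each
    request (for example, a predicted upper bound). Illustrative only.
    """
    admitted, planned = [], []
    for request, length in zip(candidates, assumed_decode):
        trial = planned + [Request(request.prefill, length)]
        if peak_memory(trial) <= capacity:
            planned = trial
            admitted.append(request)
    return admitted


requests = [Request(4, 2), Request(3, 5), Request(2, 3)]
capacity = 20

with_true_lengths = admit_greedily(requests, [r.decode for r in requests], capacity)
with_upper_bounds = admit_greedily(requests, [6, 8, 6], capacity)

print(len(with_true_lengths), len(with_upper_bounds))
```

With the made-up numbers above, planning with the true lengths fits all three requests, while planning with the looser upper bounds leaves the third request out even though it would have fit.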
