The KV Cache: Memory Usage in Transformers

Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

The KV cache is what takes up the bulk of the GPU memory during inference for large language models like GPT-4. Learn about how the KV cache works in this video!

0:00 - Introduction
1:15 - Review of self-attention
4:07 - How the KV cache works
5:55 - Memory usage and example

Further reading:
Speeding up the GPT - KV cache (https://www.dipkumar.dev/becoming-the...)
Transformer Inference Arithmetic (https://kipp.ly/transformer-inference...)
Efficiently Scaling Transformer Inference (https://arxiv.org/pdf/2211.05102.pdf)
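
For a rough sense of the memory usage discussed in the video, here is a minimal sketch of the standard back-of-the-envelope estimate: one key vector and one value vector are cached per token, per layer, per attention head. The function name `kv_cache_bytes` and the model dimensions below are illustrative assumptions, not figures taken from the video or from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 counts the key tensor and the value tensor.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration chosen only for illustration.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

The estimate grows linearly with both sequence length and batch size, which is why long contexts and large batches make the cache dominate memory.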