      FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness

      Tri Dao, Chief Scientist, Together.AI
      Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. A missing principle is making attention algorithms IO-aware, i.e., accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention trains transformers faster than existing baselines, with a 2-4x speedup on the attention kernel. FlashAttention enables longer context in transformers (4-16x longer than before), yielding higher-quality models. We'll also describe recent improvements to FlashAttention: exploiting new hardware features on A100 and H100 GPUs (another 2x speedup), as well as optimizations for long-context LLM inference (2-4x faster end-to-end inference time).
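
      The tiling idea in the abstract can be illustrated with a short, self-contained sketch. The following is a minimal NumPy illustration, not the actual CUDA kernel: it computes exact softmax attention block-by-block with a running (online) softmax, so the full N-by-N score matrix is never materialized. The function name tiled_attention and the block_q/block_k parameters are illustrative assumptions, not part of the FlashAttention API.

import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_k=64):
    """Exact softmax attention computed tile-by-tile with an online softmax.

    Illustrative sketch only: in FlashAttention the key/value tiles live in
    on-chip SRAM and the running statistics stay in registers; here everything
    is plain NumPy, but the arithmetic is the same.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)

    for qs in range(0, N, block_q):
        q = Q[qs:qs + block_q] * scale       # query tile
        m = np.full(q.shape[0], -np.inf)     # running row-wise max
        l = np.zeros(q.shape[0])             # running sum of exponentials
        acc = np.zeros((q.shape[0], d))      # unnormalized output accumulator

        for ks in range(0, N, block_k):
            k = K[ks:ks + block_k]           # key tile
            v = V[ks:ks + block_k]           # value tile
            s = q @ k.T                      # one tile of attention scores

            m_new = np.maximum(m, s.max(axis=1))
            correction = np.exp(m - m_new)   # rescale old stats to the new max
            p = np.exp(s - m_new[:, None])
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ v
            m = m_new

        O[qs:qs + block_q] = acc / l[:, None]
    return O

# Sanity check against the naive quadratic formulation.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 256, 32
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    scores = (Q @ K.T) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    ref = (weights / weights.sum(axis=1, keepdims=True)) @ V
    assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)

      In the real kernel, staging the key/value tiles in SRAM and keeping the running max and sum in registers is what reduces the HBM reads/writes; the result is bit-for-bit the same attention output as the naive computation.
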
      Event: GTC 24
      Date: March 2024
      Level: Advanced Technical
      Topic: AI Inference
      NVIDIA technology: Cloud / Data Center GPU, CUDA, Hopper
      Industry: HPC / Scientific Computing
      Language: English
      Location: