    FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness

    Tri Dao, Chief Scientist, Together.AI
    Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. A missing principle is making attention algorithms IO-aware: accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention trains transformers faster than existing baselines, with a 2-4x speedup on the attention kernel. FlashAttention enables longer context in transformers (4-16x longer than previously possible), yielding higher-quality models. We'll also describe recent improvements to FlashAttention: making use of new hardware features on A100 and H100 GPUs (another 2x speedup), as well as optimizations for long-context LLM inference (2-4x faster end-to-end inference time).
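
    The tiling idea in the abstract can be illustrated with a minimal, pure-PyTorch sketch: keys and values are processed one block at a time with an online softmax, so the full (seq_len x seq_len) score matrix is never materialized. The function name and block_size below are illustrative assumptions; the actual FlashAttention kernel fuses this loop on-chip in CUDA rather than looping in Python.

    import torch

    def tiled_attention(q, k, v, block_size=128):
        """Block-wise exact attention with an online softmax (illustrative sketch only)."""
        seq_len, d = q.shape
        scale = d ** -0.5
        out = torch.zeros_like(q)                          # running (unnormalized) output
        row_max = torch.full((seq_len, 1), float("-inf"))  # running row-wise max of scores
        row_sum = torch.zeros(seq_len, 1)                  # running softmax denominator

        for start in range(0, seq_len, block_size):
            k_blk = k[start:start + block_size]            # one tile of keys
            v_blk = v[start:start + block_size]            # one tile of values
            scores = (q @ k_blk.T) * scale                 # partial scores for this tile only

            new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(row_max - new_max)      # rescale previous accumulators
            p = torch.exp(scores - new_max)                # numerically stable partial softmax
            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            out = out * correction + p @ v_blk
            row_max = new_max

        return out / row_sum

    # Sanity check against the naive quadratic-memory implementation.
    q, k, v = (torch.randn(512, 64) for _ in range(3))
    naive = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
    assert torch.allclose(tiled_attention(q, k, v), naive, atol=1e-4)

    In practice one would call a fused implementation (for example torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention backend, or the flash-attn package) rather than a Python loop like this.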
    Event: GTC 24
    Date: March 2024
    Level: Advanced Technical
    Topic: AI Inference
    NVIDIA technology: Cloud / Data Center GPU, CUDA, Hopper
    Industry: HPC / Scientific Computing
    Language: English
    Location: