      FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness

      Tri Dao, Chief Scientist, Together.AI
      Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. A missing principle is making attention algorithms IO-aware, i.e., accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention trains transformers faster than existing baselines, with a 2-4x speedup on the attention kernel. FlashAttention enables longer context in transformers (4-16x longer than before), yielding higher-quality models. We'll also describe recent improvements to FlashAttention: exploiting new hardware features on A100 and H100 GPUs (another 2x speedup), as well as optimizations for long-context LLM inference (2-4x faster end-to-end inference time).
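
      The tiling idea in the abstract can be illustrated with a short, self-contained sketch. The following is a minimal NumPy illustration, not the actual CUDA kernel: it computes exact softmax attention block-by-block with a running (online) softmax, so the full N-by-N score matrix is never materialized. The function name tiled_attention and the block_q/block_k parameters are illustrative assumptions, not part of the FlashAttention API.

import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_k=64):
    """Exact softmax attention computed tile-by-tile with an online softmax.

    Illustrative sketch only: in FlashAttention the key/value tiles live in
    on-chip SRAM and the running statistics stay in registers; here everything
    is plain NumPy, but the arithmetic is the same.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)

    for qs in range(0, N, block_q):
        q = Q[qs:qs + block_q] * scale       # query tile
        m = np.full(q.shape[0], -np.inf)     # running row-wise max
        l = np.zeros(q.shape[0])             # running sum of exponentials
        acc = np.zeros((q.shape[0], d))      # unnormalized output accumulator

        for ks in range(0, N, block_k):
            k = K[ks:ks + block_k]           # key tile
            v = V[ks:ks + block_k]           # value tile
            s = q @ k.T                      # one tile of attention scores

            m_new = np.maximum(m, s.max(axis=1))
            correction = np.exp(m - m_new)   # rescale old stats to the new max
            p = np.exp(s - m_new[:, None])
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ v
            m = m_new

        O[qs:qs + block_q] = acc / l[:, None]
    return O

# Sanity check against the naive quadratic formulation.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 256, 32
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    scores = (Q @ K.T) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    ref = (weights / weights.sum(axis=1, keepdims=True)) @ V
    assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)

      In the real kernel, staging the key/value tiles in SRAM and keeping the running max and sum in registers is what reduces the HBM reads/writes; the result is bit-for-bit the same attention output as the naive computation.
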
      Event: GTC 24
      Date: March 2024
      Level: Advanced Technical
      Topic: AI Inference
      NVIDIA technology: Cloud / Data Center GPU, CUDA, Hopper
      Industry: HPC / Scientific Computing
      Language: English
      Location: