      Optimize Generative AI inference with Quantization in TensorRT-LLM and TensorRT

      , Senior Deep Learning Engineer, NVIDIA
      , Tech Lead Manager, NVIDIA
Because running inference for AI models at large scale is computationally costly, optimization techniques are crucial to lowering inference cost. Our tutorial presents the TensorRT Model Optimization toolkit — NVIDIA's gateway for algorithmic model optimization. The toolkit provides a set of state-of-the-art quantization methods, including FP8, Int8, Int4, and mixed precisions, as well as hardware-accelerated sparsity, and bridges those methods with the most advanced NVIDIA deployment solutions, such as TensorRT-LLM. This tutorial includes an end-to-end optimization-to-deployment demo for language models with TensorRT-LLM and for Stable Diffusion models with TensorRT. You can download the notebooks here: nvidia_ammo-0.9.0.tar.gz.
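To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor Int8 quantization — the general technique the abstract refers to, not the toolkit's actual API (the function names below are illustrative, not from nvidia_ammo):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor Int8 quantization (illustrative, not the toolkit API):
    # map the range [-max|w|, max|w|] onto the integer range [-127, 127].
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights from the Int8 codes.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error per element is at most half a quantization step (scale / 2).
print(np.max(np.abs(w - w_hat)) <= scale / 2)
```

Storing `q` instead of `w` cuts weight memory by 4x versus FP32; real deployments (as in TensorRT-LLM) also calibrate scales on activation data and may use per-channel scales for better accuracy.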
Event: GTC 24
Date: March 2024
Topic: AI Inference
Industry: All Industries
Level: Intermediate Technical
NVIDIA technology: TensorRT
Language: English
Location: