Optimize Generative AI inference with Quantization in TensorRT-LLM and TensorRT
, Senior Deep Learning Engineer, NVIDIA
, Tech Lead Manager, NVIDIA
Because running inference on AI models at large scale is computationally costly, optimization techniques are crucial for lowering inference cost. Our tutorial presents the TensorRT Model Optimization toolkit, NVIDIA's gateway to algorithmic model optimization. The toolkit provides a set of state-of-the-art quantization methods, including FP8, INT8, INT4, and mixed precision, as well as hardware-accelerated sparsity, and bridges those methods with the most advanced NVIDIA deployment solutions such as TensorRT-LLM. This tutorial includes an end-to-end optimization-to-deployment demo for language models with TensorRT-LLM and Stable Diffusion models with TensorRT. You can download the notebooks here: nvidia_ammo-0.9.0.tar.gz.
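To give a sense of what quantization does, the sketch below shows symmetric per-tensor INT8 quantization: floats are mapped to 8-bit integers via a single scale factor, then dequantized back. This is a minimal illustration of the underlying idea, not the toolkit's actual API (function names here are hypothetical; the toolkit additionally supports calibration, FP8/INT4 formats, and per-channel scales).

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

weights = np.array([0.5, -1.2, 3.4, -0.01], dtype=np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# per-element rounding error is bounded by scale / 2
```

Storing `q` instead of `weights` cuts memory traffic by 4x versus FP32, which is one source of the inference speedups that quantized kernels in TensorRT-LLM exploit.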