Optimize Generative AI inference with Quantization in TensorRT-LLM and TensorRT
, Senior Deep Learning Engineer, NVIDIA
, Tech Lead Manager, NVIDIA
Because running inference on AI models at large scale is computationally costly, optimization techniques are crucial for lowering inference cost. Our tutorial presents the TensorRT Model Optimization toolkit, NVIDIA's gateway to algorithmic model optimization. The toolkit provides a set of state-of-the-art quantization methods, including FP8, INT8, INT4, and mixed precision, as well as hardware-accelerated sparsity, and bridges those methods with the most advanced NVIDIA deployment solutions such as TensorRT-LLM. This tutorial includes an end-to-end optimization-to-deployment demo for language models with TensorRT-LLM and Stable Diffusion models with TensorRT. You can download the notebooks here: nvidia_ammo-0.9.0.tar.gz.
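To give a sense of what quantization does, the sketch below shows symmetric per-tensor INT8 quantization: floats are mapped to 8-bit integers via a single scale factor, then dequantized back. This is a minimal illustration of the underlying idea, not the toolkit's actual API (function names here are hypothetical; the toolkit additionally supports calibration, FP8/INT4 formats, and per-channel scales).

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

weights = np.array([0.5, -1.2, 3.4, -0.01], dtype=np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# per-element rounding error is bounded by scale / 2
```

Storing `q` instead of `weights` cuts memory traffic by 4x versus FP32, which is one source of the inference speedups that quantized kernels in TensorRT-LLM exploit.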