Deploying, Optimizing, and Benchmarking Large Language Models With Triton Inference Server
, Senior Software Engineer, NVIDIA
, Principal Software Architect, NVIDIA
, Senior Software Engineer, NVIDIA
Learn how to serve large language models (LLMs) efficiently using Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving solution that simplifies the production deployment of AI models at scale. With a uniform interface and a standard set of metrics, developers can deploy deep learning and machine learning models across many frameworks (TensorRT, TensorRT-LLM, vLLM, TensorFlow, PyTorch, OpenVINO, and more) on multiple types of hardware (CPU and GPU). We’ll review the challenges of serving LLMs and demonstrate how Triton Inference Server’s latest features help overcome them. We’ll cover how to deploy an LLM across multiple backends and compare their performance, as well as how to fine-tune deployment configurations for optimal performance. We’ll provide step-by-step instructions that anyone can follow using publicly available collateral, and we’ll answer questions along the way.
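As a concrete taste of the workflow the session walks through, here is a minimal client sketch that queries an LLM already deployed behind Triton, using the server's HTTP generate extension (POST /v2/models/<model_name>/generate). The model name "llm", the text_input/text_output field names, and the sampling parameters are assumptions that follow the conventions of Triton's vLLM backend; adjust them to match your own model repository.

```python
# A minimal sketch, assuming an LLM has already been placed in a Triton model
# repository (model_repository/<model_name>/config.pbtxt plus a version
# directory) and the server is running with its HTTP port on 8000.
# The model name "llm" and the sampling parameters below are assumptions based
# on the vLLM backend's conventions -- adjust to match your deployment.
import requests

TRITON_URL = "http://localhost:8000"  # Triton's default HTTP endpoint
MODEL_NAME = "llm"                    # hypothetical model name in the repository

# The generate extension maps top-level JSON fields to the model's input
# tensors and forwards "parameters" to the backend as request parameters.
payload = {
    "text_input": "What is the Triton Inference Server?",
    "parameters": {"stream": False, "temperature": 0},
}

response = requests.post(
    f"{TRITON_URL}/v2/models/{MODEL_NAME}/generate",
    json=payload,
    timeout=60,
)
response.raise_for_status()

# The response includes the model name/version and the generated text.
print(response.json()["text_output"])
```

Field names and required parameters vary by backend (the TensorRT-LLM ensemble, for example, takes max_tokens as a top-level field), so check the deployed model's configuration for the exact inputs and outputs it expects.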