Deploying, Optimizing, and Benchmarking Large Language Models With Triton Inference Server
, Senior Software Engineer, NVIDIA
, Principal Software Architect, NVIDIA
, Senior Software Engineer, NVIDIA
Learn how to serve large language models (LLMs) efficiently using Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving solution that simplifies the production deployment of AI models at scale. With a uniform interface and a standard set of metrics, developers can deploy deep learning and machine learning models across many frameworks (TensorRT, TensorRT-LLM, vLLM, TensorFlow, PyTorch, OpenVINO, and more) on multiple types of hardware (CPU and GPU). We’ll review the challenges of serving LLMs and demonstrate how Triton Inference Server’s latest features help overcome them. We’ll cover how to deploy an LLM across multiple backends and compare their performance, as well as how to fine-tune deployment configurations for optimal performance. We’ll provide step-by-step instructions that anyone can follow using publicly available collateral, and we’ll answer questions along the way.
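As a concrete taste of the workflow the session walks through, here is a minimal client sketch that queries an LLM already deployed behind Triton, using the server's HTTP generate extension (POST /v2/models/<model_name>/generate). The model name "llm", the text_input/text_output field names, and the sampling parameters are assumptions that follow the conventions of Triton's vLLM backend; adjust them to match your own model repository.

```python
# A minimal sketch, assuming an LLM has already been placed in a Triton model
# repository (model_repository/<model_name>/config.pbtxt plus a version
# directory) and the server is running with its HTTP port on 8000.
# The model name "llm" and the sampling parameters below are assumptions based
# on the vLLM backend's conventions -- adjust to match your deployment.
import requests

TRITON_URL = "http://localhost:8000"  # Triton's default HTTP endpoint
MODEL_NAME = "llm"                    # hypothetical model name in the repository

# The generate extension maps top-level JSON fields to the model's input
# tensors and forwards "parameters" to the backend as request parameters.
payload = {
    "text_input": "What is the Triton Inference Server?",
    "parameters": {"stream": False, "temperature": 0},
}

response = requests.post(
    f"{TRITON_URL}/v2/models/{MODEL_NAME}/generate",
    json=payload,
    timeout=60,
)
response.raise_for_status()

# The response includes the model name/version and the generated text.
print(response.json()["text_output"])
```

Field names and required parameters vary by backend (the TensorRT-LLM ensemble, for example, takes max_tokens as a top-level field), so check the deployed model's configuration for the exact inputs and outputs it expects.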