Deploying, Optimizing, and Benchmarking Large Language Models With Triton Inference Server

, Senior Software Engineer, NVIDIA
, Principal Software Architect, NVIDIA
, Senior Software Engineer, NVIDIA
Learn how to serve large language models (LLMs) efficiently using Triton Inference Server with step-by-step instructions. NVIDIA Triton Inference Server is an open-source inference serving solution that simplifies the production deployment of AI models at scale. With a uniform interface and standard set of metrics, developers can easily deploy deep learning and machine learning models across many different frameworks (TensorRT, TensorRT-LLM, vLLM, TensorFlow, PyTorch, OpenVINO, and more) on multiple types of hardware (CPU and GPU). We’ll review the challenges of serving LLMs and demonstrate how Triton Inference Server’s latest features help overcome them. We’ll cover how to easily deploy an LLM across multiple backends and compare their performance, as well as how to fine-tune deployment configurations for optimal performance. We’ll provide step-by-step instructions for anyone to follow using publicly available collateral and answer questions along the way.
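As a rough illustration of the kind of deployment the session walks through, the sketch below queries an LLM already served by Triton over its HTTP generate endpoint. The model name "llm" and the "text_input" / "max_tokens" / "text_output" field names are assumptions; the actual names depend on the backend (TensorRT-LLM, vLLM, etc.) and the model's configuration.

# Minimal sketch: send a prompt to a Triton-served LLM via the HTTP generate endpoint.
# Assumes Triton is listening on its default HTTP port (8000) and serves a model
# named "llm" whose configuration exposes "text_input"/"max_tokens" inputs and a
# "text_output" output, as in the TensorRT-LLM and vLLM backend tutorials.
import requests

TRITON_URL = "http://localhost:8000"   # default Triton HTTP port
MODEL_NAME = "llm"                     # hypothetical model name

payload = {"text_input": "What is Triton Inference Server?", "max_tokens": 64}
resp = requests.post(f"{TRITON_URL}/v2/models/{MODEL_NAME}/generate", json=payload)
resp.raise_for_status()
print(resp.json()["text_output"])      # generated completion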
Event: GTC 24
Date: March 2024
Topic: AI Inference
Industry: All Industries
Level: Intermediate Technical
NVIDIA technology: TensorRT, Triton
Language: English
Location: