      Universal Model Serving via Triton and TensorRT

      Machine Learning Engineer, Snap, Inc.
      Serving models trained with multiple deep learning frameworks (TensorFlow, PyTorch, etc.) is tedious, as it requires maintaining multiple serving stacks such as TF Serving and TorchServe. The Triton Inference Server provides a unified serving backend for models trained with different frameworks, and using a single universal serving platform greatly reduces the hassle of model deployment. We'll show you how to deploy a TensorFlow SavedModel, as well as an ensemble containing a Python/PyTorch preprocessing model and an ONNX inference model. We'll also demonstrate how to use the Triton Model Analyzer to find optimal serving parameters. Finally, we'll show how to incorporate TensorRT into serving to significantly reduce serving cost without compromising quality of service.
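      To illustrate the "single client, many backends" idea the talk describes, here is a minimal sketch using the tritonclient Python package to send an inference request to a running Triton server. The model name (my_savedmodel) and tensor names (input_1, predictions) are hypothetical placeholders; they must match whatever the model's config.pbtxt defines in your Triton model repository, and the same client code works whether the backend is a TensorFlow SavedModel, an ONNX model, or an ensemble.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server running locally on the default HTTP port (8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input tensor: name, shape, and dtype must match config.pbtxt.
input_tensor = httpclient.InferInput("input_1", [1, 224, 224, 3], "FP32")
input_tensor.set_data_from_numpy(
    np.random.rand(1, 224, 224, 3).astype(np.float32)
)

# Hypothetical output tensor name, also taken from config.pbtxt.
requested_output = httpclient.InferRequestedOutput("predictions")

# The request looks identical regardless of which framework backend
# (TensorFlow, ONNX Runtime, Python/PyTorch ensemble, TensorRT) serves the model.
response = client.infer(
    model_name="my_savedmodel",
    inputs=[input_tensor],
    outputs=[requested_output],
)

print(response.as_numpy("predictions").shape)
```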
      Event: GTC 24
      Date: March 2024
      Topic: AI Inference
      NVIDIA technology: Cloud / Data Center GPU, TensorRT, Triton
      Level: Intermediate Technical
      Industry: Media & Entertainment
      Language: English
      Location: