      Universal Model Serving via Triton and TensorRT

      Machine Learning Engineer, Snap, Inc.
      Serving models trained with multiple deep learning frameworks (TensorFlow, PyTorch, etc.) is tedious, as it requires maintaining multiple serving stacks such as TF Serving and TorchServe. The Triton Inference Server provides a unified serving backend for models trained with different frameworks, and using a single universal serving platform greatly reduces the hassle of model deployment. We'll show you how to deploy a TensorFlow SavedModel, as well as an ensemble containing a Python/PyTorch preprocessing model and an ONNX inference model. We'll also demonstrate how to use the Triton Model Analyzer to find optimal serving parameters. Finally, we'll show how to incorporate TensorRT into serving to significantly reduce serving cost without compromising quality of service.
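      To illustrate the "single client, many backends" idea the talk describes, here is a minimal sketch using the tritonclient Python package to send an inference request to a running Triton server. The model name (my_savedmodel) and tensor names (input_1, predictions) are hypothetical placeholders; they must match whatever the model's config.pbtxt defines in your Triton model repository, and the same client code works whether the backend is a TensorFlow SavedModel, an ONNX model, or an ensemble.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server running locally on the default HTTP port (8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input tensor: name, shape, and dtype must match config.pbtxt.
input_tensor = httpclient.InferInput("input_1", [1, 224, 224, 3], "FP32")
input_tensor.set_data_from_numpy(
    np.random.rand(1, 224, 224, 3).astype(np.float32)
)

# Hypothetical output tensor name, also taken from config.pbtxt.
requested_output = httpclient.InferRequestedOutput("predictions")

# The request looks identical regardless of which framework backend
# (TensorFlow, ONNX Runtime, Python/PyTorch ensemble, TensorRT) serves the model.
response = client.infer(
    model_name="my_savedmodel",
    inputs=[input_tensor],
    outputs=[requested_output],
)

print(response.as_numpy("predictions").shape)
```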
      Event: GTC 24
      Date: March 2024
      Topic: AI Inference
      NVIDIA technology: Cloud / Data Center GPU, TensorRT, Triton
      Level: Intermediate Technical
      Industry: Media & Entertainment
      Language: English
      Location: