You can Google all of them + nvidia triton, but here you go...
Mistral[0]:
"Acknowledgement
We are grateful to NVIDIA for supporting us in integrating TensorRT-LLM and Triton and working alongside us to make a sparse mixture of experts compatible with TRT-LLM."
Cloudflare[1]:
"It will also feature NVIDIA’s full stack inference software —including NVIDIA TensorRT-LLM and NVIDIA Triton Inference server — to further accelerate performance of AI applications, including large language models."
Amazon[2]:
"Amazon uses the Text-To-Text Transfer Transformer (T5) natural language processing (NLP) model for spelling correction. To accelerate text correction, they leverage NVIDIA AI inference software, including NVIDIA Triton™ Inference Server, and NVIDIA® TensorRT™, an SDK for high performance deep learning inference."
There are many, many more results for AWS (internally and for customers), with plenty of "case studies," "customer success stories," etc. describing deployments. You can also find large enterprises like Siemens using Triton internally and embedded in shipped products. Triton also runs on the embedded Jetson series of hardware, and there are all kinds of large entities doing edge/hybrid inference with this approach.
You can also add at least Phind, Perplexity, and Databricks to the list. These are just the public ones; look at a high-scale production deployment of ML/AI in any use case and there is a very good chance Triton is in there.
I encourage you to do your own research because the advantages/differences are too many to list. Triton can do everything you listed and often better (especially quantization), but off the top of my head:
- Support for the KServe API for model management. Triton can load/reload/unload models dynamically while running, with model versioning and config params that let clients specify a model version, require a version to be specified, or default to the latest.
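A minimal sketch of what that looks like over the HTTP API, assuming a Triton server on localhost:8000 started with explicit model control; the model name "resnet50" is just a placeholder:

```python
import json
import urllib.request

TRITON = "http://localhost:8000"

def repo_request(path, payload=None):
    """Build a POST against Triton's model-repository extension endpoints."""
    return urllib.request.Request(
        f"{TRITON}/v2/repository{path}",
        data=json.dumps(payload or {}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Load (or hot-reload) a model; Triton fetches it from the model repository.
load_req = repo_request("/models/resnet50/load")

# Unload it again, all without restarting the server.
unload_req = repo_request("/models/resnet50/unload")

# Index of every model the repository knows about, loaded or not.
index_req = repo_request("/index")

# urllib.request.urlopen(load_req) would fire the actual call.
```

The same endpoints exist over gRPC; a client can flip models in and out of serving without ever touching the server process.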
- Built-in integration with S3 and other object stores for model storage, which in conjunction with the KServe API means you can hit the Triton API, tell it to grab model X version Y, and have it running in seconds. Think of what this means when you have thousands of Triton instances across core, edge, K8s, etc... like Cloudflare.
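The object-store side is just a launch flag; a hedged sketch (bucket path is a placeholder, and credentials come from the usual AWS environment variables/credential chain):

```
# Point the model repository at S3 and enable on-demand load/unload
# via the repository API instead of loading everything at startup.
tritonserver \
  --model-repository=s3://my-bucket/models \
  --model-control-mode=explicit
```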
- Multiple backend support covering practically any model: TF, Torch, ONNX, etc., with dynamic runtime compilation for TensorRT (with caching and int8 calibration if you want it), OpenVINO acceleration, and more. You can run any LLM (or several), Whisper, Stable Diffusion, sentence embeddings, image classification, and practically any other model on the same instance (or wherever), because at the fundamental level Triton was designed for multiple backends, multiple models, and multiple versions. It operates on an in/out concept with tensors or arbitrary data, which can be combined with the Python and model-ensemble support to do just about anything...
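For example, a model's config.pbtxt can ask Triton to accelerate an ONNX model with TensorRT behind the scenes; a sketch with placeholder names and illustrative values:

```
name: "my_onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 32
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
      }
    ]
  }
}
```

Swap the platform/backend line and the same server happily runs a Torch, TF, or Python model alongside it.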
- Python backend. Triton can do pre/post-processing in the framework for things like tokenizers and decoders. With ensembles you can arbitrarily chain together inputs/outputs from any number of models, encoders, decoders, and custom pre/post-processing steps. You can also, of course, build your own backends for anything that can't be done with the included backends, or when performance is critical.
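An ensemble is itself just a "model" whose config wires outputs to inputs. A minimal sketch (model names, tensor names, and dims are all made up) chaining a Python-backend tokenizer into a classifier:

```
name: "text_pipeline"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_TEXT", data_type: TYPE_STRING, dims: [ 1 ] } ]
output [ { name: "SCORES", data_type: TYPE_FP32, dims: [ 2 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"      # Python backend doing pre-processing
      model_version: -1
      input_map { key: "TEXT", value: "RAW_TEXT" }
      output_map { key: "IDS", value: "token_ids" }
    },
    {
      model_name: "classifier"     # e.g. an ONNX or TensorRT model
      model_version: -1
      input_map { key: "INPUT_IDS", value: "token_ids" }
      output_map { key: "LOGITS", value: "SCORES" }
    }
  ]
}
```

Clients call "text_pipeline" like any other model; the intermediate tensors never leave the server.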
- Extremely fine-grained control over dispatching, memory management, scheduling, etc. For example, the dynamic batcher can be configured with latency guarantees (specified in microseconds) to balance request latency against optimal max batch size, while taking into account node GPU+CPU availability across any number of GPUs and/or CPU threads on a per-model basis. It also supports loading arbitrary models on the CPU, which comes in handy for models that run well on otherwise-idle CPU resources, things like image classification and object detection; with ONNX and OpenVINO it's surprisingly useful. All of this is configurable per model, with a variety of scheduling/threading options.
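The batching and placement knobs live in the same config.pbtxt; a sketch with illustrative values, not recommendations:

```
# Batch opportunistically up to 64, but never hold a request
# more than 100 microseconds waiting for a fuller batch.
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 16, 32 ]
  max_queue_delay_microseconds: 100
}

# Two copies on GPU 0, plus a CPU instance for spillover traffic.
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] },
  { count: 1, kind: KIND_CPU }
]
```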
- OpenTelemetry tracing (not that special) and Prometheus metrics. The Prometheus endpoint drills down to an absurd level of detail, covering not only request stats but the hardware itself (temperature, power, etc.).
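A few example lines of the kind of thing the metrics endpoint exposes, per model and per GPU (from memory, so treat the exact metric names and values as approximate):

```
nv_inference_request_success{model="classifier",version="1"} 104
nv_inference_queue_duration_us{model="classifier",version="1"} 18056
nv_gpu_utilization{gpu_uuid="GPU-..."} 0.82
nv_gpu_power_usage{gpu_uuid="GPU-..."} 214.0
```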
- Support for Model Navigator[3] and Performance Analyzer[4]. These tools are on a completely different level... They take an arbitrary model, export it to a package, and let you define any number of metrics to target a runtime format and model configuration, so you can say things like:
- p95 of time to first token: X
- While achieving X RPS
- While keeping power utilization below X
They then deploy the exported package to a Triton instance running on your actual inference-serving hardware and generate requests against it to find the optimal model configuration that meets your SLAs. You even get exported metrics and pretty reports for every configuration attempted. Take the same exported package, change the SLA params, and it will automatically regenerate the configuration for you.
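Performance Analyzer is driven from the CLI; a hedged example sweep (model name and endpoint are placeholders):

```
# Sweep request concurrency 1..16 against a running Triton,
# reporting p95 latency and dumping per-configuration results.
perf_analyzer \
  -m my_model \
  -u localhost:8001 -i grpc \
  --concurrency-range 1:16 \
  --percentile=95 \
  -f results.csv
```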
- Performance on a completely different level. TensorRT-LLM in particular is extremely new and very early, but at high scale you can already see > 10k RPS on a single node.
- gRPC support. Especially when using pre/post-processing, ensembles, etc., you can configure clients programmatically to use individual models or the ensemble chain (as one example). This opens up a very wide range of powerful architecture options that simply aren't available elsewhere. The gRPC interface could be thought of as AsyncLLMEngine on steroids: it can abstract actual input/output or expose raw in/out, so models, tokenizers, decoders, clients, etc. can send/receive raw data, numpy arrays, or tensors.
- DALI support[5]. Combined with everything above, you can add DALI to the processing chain to take input images/audio/whatever, copy it to the GPU once, GPU-accelerate the scaling/conversion/resampling, pipe it through whatever you want (all on GPU), and get output back to the network with a single CPU copy in each direction.
vLLM and HF TGI are very cool and I use them in certain cases. The fact that you can point them at an HF model and they fire up with a single command and offer good performance is very impressive, but there are untold reasons these providers use Triton. It's in a class of its own.
[0] - https://mistral.ai/news/la-plateforme/
[1] - https://www.cloudflare.com/press-releases/2023/cloudflare-po...
[2] - https://www.nvidia.com/en-us/case-studies/amazon-accelerates...
[3] - https://github.com/triton-inference-server/model_navigator
[4] - https://github.com/triton-inference-server/client/blob/main/...
[5] - https://github.com/triton-inference-server/dali_backend