Is LMDeploy the Ultimate Solution? Why It Outshines VLLM, TRT-LLM, TGI, and MLC (bentoml.com)
16 points by helloericsf on June 20, 2024 | 8 comments
ssheng on June 20, 2024:
How does Exllama rank among these? Heard good things about it.
helloericsf on June 20, 2024 (in reply to ssheng):
Seems interesting!
https://github.com/turboderp/exllama
"A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights."
helloericsf on June 20, 2024 (in reply to ssheng):
4-bit quantization tends to come at a cost in output quality.
https://github.com/ggerganov/llama.cpp/issues/9
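For context, here is a minimal sketch (not from the thread) of naive symmetric round-to-nearest 4-bit quantization; the function names and the single per-tensor scale are illustrative assumptions, and real schemes such as GPTQ or llama.cpp's Q4 formats use per-group scales and error compensation precisely to shrink the loss being discussed:

    import numpy as np

    def quantize_4bit(w):
        # One scale for the whole tensor (coarse); the int range is [-8, 7].
        scale = np.abs(w).max() / 7.0
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # toy weight row
    q, scale = quantize_4bit(w)
    w_hat = dequantize(q, scale)

    # The rounding residual below is the "quality loss" in question.
    err = np.abs(w - w_hat)
    print(f"mean abs error {err.mean():.6f}, max abs error {err.max():.6f}")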
ssheng on June 20, 2024 (in reply to helloericsf):
Quality loss with quantization is expected. With GPTQ, the loss seems to be within an acceptable range based on the perplexity score shown.
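For reference, perplexity is just the exponential of the average negative log-likelihood the model assigns to the true next tokens, so a small rise after quantization means next-token prediction degraded only slightly. A minimal sketch (not from the thread; the toy numbers are made up):

    import math

    def perplexity(token_log_probs):
        # token_log_probs: natural-log probability the model assigned
        # to each true next token in the evaluation text.
        nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(nll)

    # Toy numbers: a model assigning probability 0.25 to every true token
    # has perplexity 4, i.e. it is as uncertain as a uniform 4-way choice.
    print(perplexity([math.log(0.25)] * 8))  # -> 4.0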
ShawnBasquiat on June 20, 2024:
Why aren't there more of these benchmark studies? How did TGI make the cut?
timliu9 on June 20, 2024:
Why was ONNX not part of the tested runtimes? Seems like an oversight.
helloericsf on June 20, 2024 (in reply to timliu9):
Personally, I've never seen ONNX used for LLMs.
chaoyu on June 20, 2024 (in reply to timliu9):
ONNX is not a good option for LLM-style autoregressive generation.