How does PyTorch 2.0 perform in inference? A benchmark with TensorRT and ONNX Runtime

PyTorch 2.0 was launched in early December 2022 at NeurIPS 2022 and generated a lot of buzz around its headline torch.compile feature, which is expected to bring a significant compute speed-up over previous versions of PyTorch.
This is great news for the world of artificial intelligence, and the early results on training time improvements are impressive. What the PyTorch team did not mention in the launch press release or on the PyTorch GitHub page, however, was PyTorch 2.0's inference performance.
Let's dig into this topic and find out how PyTorch 2.0 performs against other inference accelerators such as Nvidia TensorRT and ONNX Runtime.
We ran some inference tests with Speedster, Nebuly's open-source library for applying SOTA optimization techniques and achieving maximum inference speed-up on your hardware. For this use case, Speedster allowed us to run TensorRT and ONNX Runtime, and to combine them with 16-bit and 8-bit dynamic and static quantization, in just two lines of code. During testing, we also used Speedster to gather performance information on the best strategy for reducing inference latency.
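As a reference, here is a minimal sketch of what those two lines look like, based on Speedster's documented optimize_model entry point. The model, input shapes, and accuracy threshold below are illustrative, not our exact benchmark configuration:

```python
import torch
import torchvision.models as models
from speedster import optimize_model

model = models.resnet50().eval()

# 100 random samples in the ((inputs,), label) format Speedster expects
input_data = [((torch.randn(1, 3, 224, 224),), torch.tensor([0])) for _ in range(100)]

# Line 1: try the installed backends (TensorRT, ONNX Runtime, quantization, ...)
# and keep the fastest one whose accuracy drop stays within the threshold
optimized_model = optimize_model(model, input_data=input_data, metric_drop_ths=0.05)

# Line 2: run inference with the optimized model as a drop-in replacement
output = optimized_model(torch.randn(1, 3, 224, 224))
```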
We ran the tests on an Nvidia 3090 Ti GPU with a ResNet, the same model used in the examples of the PyTorch 2.0 press release.
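For context, a timing loop along these lines can reproduce the eager-vs-compiled comparison. This is a simplified sketch under our own assumptions (ResNet-50 from torchvision, warm-up iterations, CUDA synchronization), not the exact benchmark code:

```python
import time
import torch
import torchvision.models as models

def benchmark(model, batch_size, n_iters=100):
    """Return the average latency in milliseconds per batch."""
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    with torch.no_grad():
        for _ in range(10):  # warm-up; also triggers compilation for compiled models
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters * 1e3

eager = models.resnet50().eval().cuda()
compiled = torch.compile(eager)  # the PyTorch 2.0 entry point

for bs in (1, 8, 32, 128):
    print(f"batch {bs}: eager {benchmark(eager, bs):.2f} ms, "
          f"compiled {benchmark(compiled, bs):.2f} ms")
```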
Testing the inference performance of PyTorch 2.0 with Speedster

Here are the four main insights from the tests:
- PyTorch 2.0 becomes increasingly effective compared to previous versions as the batch size grows, and fp16 precision becomes much more efficient than the fp32 compiled version at higher batch sizes. This is easily explained by the fact that PyTorch 2.0's compilation was designed primarily for training, where batch sizes are usually larger than in inference. The focus on fp16 also makes sense, since training has recently shifted from full precision to half precision, particularly for large models.
- ONNX Runtime performs much better than PyTorch 2.0 at smaller batch sizes, while the result is the opposite at larger batch sizes (see the ONNX Runtime sketch after this list). Again, this is because ONNX Runtime was designed mainly for inference, where smaller batch sizes are usually used, while as stated before PyTorch 2.0's main goal is training.
- PyTorch eager mode and PyTorch 2.0 (compiled) show roughly the same running time at both batch size 1 and batch size 8. This suggests that neither runtime uses the full computing capacity at batch size one, whereas inference-oriented optimizers such as ONNX Runtime manage the available compute better. Again, this is probably because the PyTorch compiler was designed mainly for training and ignores situations where the batch size is too small for its kernels to saturate the GPU.
- On the Nvidia GPU tested, TensorRT far outperforms the competition at both small and large batch sizes, and its relative speed-up even grows as the batch size increases. This suggests that Nvidia's engineers were able to make better use of hardware caches at inference time: the memory occupied by activations grows linearly with batch size, and proper memory management can greatly improve performance.
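For the ONNX Runtime path referenced in the list above, the setup looks roughly like the following sketch. The export settings and file name are assumptions for illustration, not our exact benchmark configuration:

```python
import numpy as np
import torch
import torchvision.models as models
import onnxruntime as ort

# Export the ResNet to ONNX with a dynamic batch dimension
model = models.resnet50().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50.onnx",
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})

# Run the exported model on the GPU with ONNX Runtime
session = ort.InferenceSession("resnet50.onnx",
                               providers=["CUDAExecutionProvider"])
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})
```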
Be mindful that benchmarks are highly dependent on the data, model, hardware, and optimization techniques used. To achieve the best inference performance, it is always recommended to test all optimizers before deploying a model to production.