PyTorch 2.0 was launched in early December 2022 at NeurIPS 2022 and generated a lot of buzz for its main new component, torch.compile, which is expected to deliver a substantial speed-up over previous versions of PyTorch.
This is great news for the world of artificial intelligence, and the early results on training-time improvements are impressive. What the PyTorch team did not mention in the launch press release or on the PyTorch GitHub was PyTorch 2.0's inference performance.
Let's investigate this topic further and see how PyTorch 2.0 performs against other inference accelerators such as NVIDIA TensorRT and ONNX Runtime.
We ran some inference tests with Speedster, Nebuly's open-source library for applying SOTA optimization techniques and achieving the maximum inference speed-up on your hardware. For this use case, Speedster allowed us to run TensorRT and ONNX Runtime, and to combine them with 16-bit and 8-bit dynamic and static quantization, in just two lines of code. During testing, we also used Speedster to gather performance information on the best strategy for reducing inference latency.
We ran the tests on an NVIDIA 3090 Ti GPU with a ResNet, the same model used in the examples in the PyTorch 2.0 press release.
Here are the four main insights from the tests:
Be mindful that benchmarks are highly dependent on the data, model, hardware, and optimization techniques used. To achieve the best inference performance, it is always recommended to test all optimizers before deploying a model to production.