tl;dr
PyTorch 2.0 was launched in early December 2022 at NeurIPS 2022, and its headline new features are performance improvements. Let's discover how PyTorch 2.0 performs against other inference accelerators.
PyTorch 2.0 was launched in early December 2022 at NeurIPS 2022 and generated a lot of buzz for its main torch.compile component, which is expected to bring a significant speed-up over previous versions of PyTorch.
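For reference, enabling the new compiler is a one-line change. Here is a minimal sketch; the ResNet variant, input shape, and device are illustrative choices, not taken from the press release:

```python
import torch
import torchvision.models as models

# Load a standard ResNet (the model family used in the PyTorch 2.0 examples)
model = models.resnet50().eval().cuda()

# The one-line change PyTorch 2.0 introduces: torch.compile wraps the model
# with the new TorchDynamo/TorchInductor stack. The first forward pass
# triggers compilation; later calls reuse the compiled graph.
compiled_model = torch.compile(model)

x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    out = compiled_model(x)
```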
This is amazing news for the world of artificial intelligence, and the early results on training time improvements are impressive. What the PyTorch team did not mention in the launch press release or on the PyTorch GitHub, however, was PyTorch 2.0's inference performance.
Let's investigate this topic further and discover how PyTorch 2.0 performs against other inference accelerators such as Nvidia TensorRT and ONNX Runtime.
We ran some inference tests with Speedster, Nebuly's open-source library for applying SOTA optimization techniques and achieving the maximum inference speed-up on your hardware. For this use case, Speedster allowed us to run TensorRT and ONNX Runtime, and to combine them with 16-bit and 8-bit dynamic and static quantization, in just 2 lines of code. During testing, we also used Speedster to gather performance information on the best strategy for reducing inference latency.
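As a rough illustration of what those 2 lines look like, here is a sketch based on Speedster's documented optimize_model API; the sample inputs, accuracy threshold, and iteration count are illustrative values, not the exact configuration we used:

```python
import torch
import torchvision.models as models
from speedster import optimize_model

model = models.resnet50().eval()

# 100 random samples in Speedster's expected format: ((inputs,), label).
# Static quantization uses them for calibration.
input_data = [((torch.randn(1, 3, 224, 224),), torch.tensor([0])) for _ in range(100)]

# Speedster tries the available backends (TensorRT, ONNX Runtime, ...) together
# with fp16/int8 quantization and returns the fastest model whose accuracy
# drop stays within the given threshold.
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.05,  # allow a small metric drop to unlock quantization
)
```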
We ran the tests on an Nvidia 3090Ti GPU with a ResNet, the same model used in the examples in the PyTorch 2.0 press release.
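The latency measurements follow the usual pattern for GPU benchmarking: warm-up runs first, then a synchronized timing loop. A minimal sketch of that pattern (warm-up and iteration counts are illustrative, and this is not Speedster's internal harness):

```python
import time
import torch

def benchmark(model, x, warmup=50, iters=100):
    """Return average inference latency in milliseconds."""
    with torch.no_grad():
        # Warm-up: triggers compilation / engine building and stabilizes clocks
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        # Wait for all queued GPU work before stopping the clock
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000
```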
Here are the 4 main insights from the tests:
Be mindful that benchmarks are highly dependent on the data, model, hardware, and optimization techniques used. To achieve the best inference performance, it is always recommended to test all optimizers before deploying a model into production.