Congratulations to Ultralytics on the release of its new model, YOLOv8! I was excited to try it out and see how it performs, and I was particularly interested in figuring out how to optimize it for even faster inference. I built a notebook to experiment with different optimization techniques using Speedster. Don’t forget to leave a star on GitHub to support the work ⭐
Key takeaways on YOLOv8 inference optimization
Here are some insights that I think are worth sharing:
- The compilers with the lowest latency are Nvidia’s TensorRT on GPUs and Intel’s OpenVINO on CPUs. This is not surprising, as both hardware manufacturers have heavily invested in optimizing performance for standard computer vision operations.
- YOLOv8 is primarily composed of convolutional layers (and bottleneck layers, which are also made of convolutions). Both Intel and Nvidia have developed highly optimized kernels for these types of operations, which can be leveraged to achieve optimal performance on their hardware.
- ONNXRuntime, however, seems to perform poorly on CPUs, particularly when using int8 and fp16 precision. With int8 conversion of the weights, there is extra computational overhead because the weights must be converted back to fp32 before the output of each layer is computed. With half precision, the poor performance arises because ONNXRuntime on CPUs, like PyTorch, lacks fp16 kernels for many operations, so activations and outputs must be converted from fp16 to fp32 before any operation that does not support half precision can run.
- On Nvidia GPUs, ONNXRuntime with dynamic int8 precision (only weights are quantized) also suffers from overhead due to the conversion from int8 to fp32 when computing different operations. In contrast, TensorRT has highly optimized kernels for static quantization (int8 weights and int8 activations), which provide the fastest performance on Nvidia RTX 3090 Ti GPUs.
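To make the difference between the two int8 paths more concrete, here is a minimal NumPy sketch of a single linear layer. It is a toy illustration under my own assumptions (symmetric per-tensor quantization, hypothetical helper names such as `quantize_symmetric`), not the actual kernels used by ONNXRuntime or TensorRT. The weight-only path has to dequantize the weights to fp32 before the matrix multiply, while the static path keeps both operands in int8, accumulates in int32, and rescales once at the end.

```python
import numpy as np


def quantize_symmetric(x, num_bits=8):
    """Symmetric per-tensor quantization (assumed scheme for this sketch).

    Returns the int8 values and the fp32 scale needed to map them back.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, np.float32(scale)


def weight_only_linear(x_fp32, w_q, w_scale):
    """Weight-only int8: weights are stored as int8 but the math runs in fp32."""
    # Extra step on every call: convert the int8 weights back to fp32.
    w_fp32 = w_q.astype(np.float32) * w_scale
    return x_fp32 @ w_fp32


def static_int8_linear(x_q, x_scale, w_q, w_scale):
    """Static int8: int8 activations and weights, int32 accumulation."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    # A single rescale at the end maps the integer result back to fp32.
    return acc.astype(np.float32) * (x_scale * w_scale)


def rel_error(approx, reference):
    """Relative L2 error of an approximation against the fp32 reference."""
    return float(np.linalg.norm(approx - reference) / np.linalg.norm(reference))


# Random data standing in for one layer's activations and weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 8)).astype(np.float32)
reference = x @ w

w_q, w_scale = quantize_symmetric(w)
x_q, x_scale = quantize_symmetric(x)

y_weight_only = weight_only_linear(x, w_q, w_scale)
y_static = static_int8_linear(x_q, x_scale, w_q, w_scale)

print("weight-only relative error:", rel_error(y_weight_only, reference))
print("static int8 relative error:", rel_error(y_static, reference))
```

Both paths stay close to the fp32 reference here. The point of the sketch is where the conversions happen: the weight-only path pays for a dequantization before every matrix multiply, while the static path can hand the whole multiply to integer kernels when the hardware and runtime provide them.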