Playing with ToMe and benchmarking it against other inference optimization strategies
Image taken from Meta’s blog: “Token Merging: Your ViT but faster”.
The goal of this blog is to explore Meta Research’s new Token Merging (ToMe) optimization strategy, perform some practical experiments with it, and benchmark ToMe against other state-of-the-art inference optimization techniques using the open-source library Speedster.
We will try to answer a few questions:
Let’s first explore ToMe.
Token Merging (ToMe) is a technique recently introduced by Meta AI to reduce the latency of existing Vision Transformer (ViT) models without the need for additional training. ToMe gradually combines similar tokens inside a transformer, using an algorithm as lightweight as pruning while maintaining better accuracy.
ToMe introduces a module for token merging into an existing ViT, merging redundant tokens to improve both inference and training throughput.
ToMe accuracy vs inference speed performance. Image from Meta’s blog: “Token Merging: Your ViT but faster”.
ViT converts image patches into “tokens”. Then, in each layer, it applies an attention mechanism that allows these tokens to collect information from one another proportionally to their similarity. To improve the speed of ViT while maintaining its accuracy, ToMe builds on two observations:
In each transformer block, tokens are combined (and thus reduced in number) by r tokens per block. Over the L blocks in the network, rL tokens are merged in total. By varying the parameter r, we get a speed-accuracy trade-off: fewer tokens means lower accuracy but higher throughput.
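As a quick sanity check on this arithmetic, here is a tiny pure-Python sketch of how the token count shrinks layer by layer. The numbers used below (197 tokens, 12 blocks for ViT-B/16, r=16) come from the paper; the function itself is just illustrative bookkeeping, not part of the ToMe code:

```python
def tokens_per_layer(n_tokens, r, n_blocks):
    """Number of tokens entering each transformer block when
    r tokens are merged away per block."""
    counts = []
    for _ in range(n_blocks):
        counts.append(n_tokens)
        n_tokens = max(n_tokens - r, 1)  # never drop below one token
    return counts

# ViT-B/16: 196 patch tokens + 1 class token, 12 blocks, r = 16
counts = tokens_per_layer(197, 16, 12)
print(counts[0], counts[-1])  # 197 tokens enter the first block, 21 enter the last
```

With r=16, each block sees 16 fewer tokens than the previous one, which is where most of the throughput gain comes from.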
The image below shows how the token merging step is applied between the attention and MLP branches of each transformer block. Step by step, the dog’s fur is merged into a single token.
Image taken from Meta’s paper: “Token Merging: Your ViT but faster”.
ToMe reduces the number of tokens by combining similar ones. The similarity between tokens is defined using the self-attention queries, keys, and values (QKV). Specifically, the keys (K) already summarize the information contained in each token. A dot-product similarity metric (e.g. cosine similarity) between the keys of each pair of tokens is therefore used to measure how similar the tokens are, i.e. whether they contain similar information.
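To make the idea concrete, here is a toy NumPy sketch of the merge step: it finds the two tokens whose keys are most similar under cosine similarity and averages them. This is a deliberate simplification for illustration; the actual ToMe algorithm uses bipartite soft matching to merge r pairs per block in parallel:

```python
import numpy as np

def merge_most_similar(keys, tokens):
    """Toy merge step: find the pair of tokens whose attention keys
    are most similar (cosine similarity) and average them.
    ToMe itself uses bipartite soft matching to merge r pairs at once."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = k @ k.T                      # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)     # ignore self-similarity
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    merged = (tokens[i] + tokens[j]) / 2
    rest = np.delete(tokens, [i, j], axis=0)
    return np.vstack([rest, merged])

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))   # 8 tokens, embedding dim 4
keys = rng.normal(size=(8, 4))
print(merge_most_similar(keys, tokens).shape)  # (7, 4): two tokens became one
```

Note that averaging two tokens is a lossy operation, which is exactly why larger values of r trade accuracy for speed.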
So, what’s the accuracy-latency trade-off with ToMe?
Let’s have a look at the results reported in the paper, which were obtained on a V100 GPU. I plotted the accuracy of ViT as a function of the hyperparameter r, where the original ViT corresponds to r=0. We can see that smaller values of r yield a slower model whose accuracy stays closer to the original, while large values of r result in a considerable acceleration of the model at the cost of accuracy. For instance, to achieve a 2x speed-up over the original model, the model loses 4 points of accuracy.
Results in inference for the ViT-B/16 model.
All results shown are for the ViT-B/16 model, which is also the model used for the various experiments in this notebook.
Let’s see if we can reproduce Meta’s results.
I ran the experiments on the ViT-B/16 model on a V100 GPU with batch size 64 and the recommended hyperparameter value r=16, as in the paper. Then, I tested ToMe on smaller batch sizes down to batch size 1, and replicated the same experiment on an Intel Xeon E5-2686 CPU. An AWS p3.2xlarge instance was used for all experiments.
The results are very interesting:
Throughput graph for the original model and the model to which ToMe was applied, as the batch size varies.
Using ToMe is straightforward. Moreover, thanks to Nebullvm’s benchmark function, we can quickly evaluate the performance on both GPU and CPU of the original model and the optimized model:
As a next step, I compared the performance of ToMe with what can be achieved by other optimization techniques. I used the Speedster library to run the optimizations and see the performance on CPU and GPU.
Speedster is an open-source module designed to speed up AI inference in just a few lines of code. The library automatically applies the best set of state-of-the-art optimization techniques to achieve the maximum inference speed-up (in latency and throughput, while also compressing the model size) physically possible on the available hardware.
The optimization workflow consists of 3 steps: select, search, and serve.
📚 Select step: in this step, users input their model in their preferred deep learning framework and express their preferences regarding the maximum acceptable accuracy loss and optimization time. This information is used to guide the optimization process and ensure that the resulting model meets the user’s needs.
🔍 Search step: the library automatically tests every combination of optimization techniques across the software-to-hardware stack, such as sparsity, quantization, and compilers, that is compatible with the user’s preferences and local hardware. This allows the library to find the optimal configuration of techniques for accelerating the model.
🍵 Serve step: in this final step, the library returns an accelerated version of the user’s model in the DL framework of choice, providing a significant boost in performance.
The model is optimized by the 4 Speedster blocks shown in the image below. How they work is presented in the library documentation.
Image taken from Speedster documentation.
I performed two experiments with Speedster. First, I applied only optimization techniques that have no impact on model accuracy. This is achieved by setting the parameter metric_drop_ths=0.
Next, I increased the metric_drop_ths threshold to 0.05 so that Speedster could also apply techniques that slightly change the accuracy, such as quantization and compression, to provide a better speed-up. A value of 0.05 is very low, which means that we expect the accuracy to remain essentially unchanged, as explained in the documentation.
Let’s analyze the results:
Throughput graph for the original model and the model to which Speedster was applied, as the batch size varies.
Using Speedster is very simple, and again performance is measured using the benchmark function. Optimization performance is also automatically displayed in the logs when you run the optimization.
In the notebook you can find a section where you can test ToMe on your images, with the possibility of changing the hyperparameter r that adjusts the level of optimization. I did some testing and I felt 100% in the Infinity War movie:
ToMe test on a picture of me.
This experiment can be done by preprocessing your image and using ToMe’s make_visualization method:
ToMe makes it possible to accelerate Vision Transformer models, both on GPU and CPU. One interesting thing to notice is that ToMe improves the model’s speed on CPU inference, but reduces it on GPU when the batch size is low. This can be explained by the fact that the CPU is already using its full compute power even at small batch sizes, while the GPU still has spare capacity for parallel computation. Therefore, on CPU the overhead of the merging step is immediately offset by the reduced number of tokens, while on GPU the benefit only appears once the batch size is large enough to saturate the device. This can also be seen from the graphs below:
Results obtained with different optimization techniques, with various values for batch size.
Here we can see that on GPU, Speedster with a 5% accepted accuracy drop is significantly faster than the original model; since ToMe also trades some accuracy for speed, the comparison between the techniques can be considered fair. On CPU, instead, ToMe appears to be the fastest technique, so it might be interesting to implement its automatic use within Speedster. I opened an issue on the Speedster GitHub so that anyone can contribute.
And that’s it! If you are interested in AI optimization or if you liked this notebook please leave a star at our repo Speedster 💕🌟!
Do you also want to play with ToMe? I have prepared a notebook very similar to this blog, where you can also test ToMe by yourself… And get beautiful pictures with ToMe 💁