Image created with DALL-E.

The article surveys various approaches for efficient Transformer inference.

In Part 1 we analyzed the bottlenecks and hardware implications of Transformers. In this second part of the survey, we will discuss about Transformer optimization, and neural architecture search. We will also overview a case study using Gemmini, an open-source deep neural network accelerator generator, and demonstrate that a full-stack co-design approach with the surveyed methods can yield significant speedups with minimal performance degradation for Transformer inference.

There are several methods to improve the efficiency of a pre-designed and trained DNN model on a target hardware platform. The strategies discussed include quantization and sparsity, as well as Transformer-specific optimization methods for features such as attentions and nonlinear operations.

Deep neural networks (DNN) are often trained using high-precision floating-point computations, but this high precision is often unnecessary for inference. Quantization is a procedure that compresses DNN models by representing parameters and/or activations with lower-bit, typically fixed-point representations such as 8-bit integers (INT8), instead of 32-bit or 16-bit floating point (FP32 or FP16). Quantization offers several advantages for efficient DNN inference, including reduced memory consumption, reduced size, latency, and energy consumption of arithmetic logic units (ALUs) and corresponding processing elements (PEs), and the ability to deploy models on integer-only hardware. For example, as illustrated in Fig. 9, performing INT8 addition can be ∼30× more energy efficient and ∼120× more area efficient, as compared to the FP32 counterpart.

Quantization methods can be broadly categorized into uniform and non-uniform quantization depending on how they map the values, with uniform quantization currently being the de-facto method for its simplicity and efficient mapping to hardware. Uniform quantization divides the floating-point domain into evenly spaced intervals and assigns each interval to a fixed point value. Non-uniform quantization, however, doesn’t require the intervals to be evenly spaced. By assigning more quantization bins to important regions, non-uniform quantization can more accurately capture the original data distribution in the floating-point domain than uniform quantization, resulting in generally improved compression rates.

Lower bit quantization can improve compression rates, but reducing precision too aggressively can significantly reduce model accuracy, mixed-precision quantization is a promising strategy for achieving a balance between performance gains and maintaining model accuracy.

Another challenge with quantizing pre-trained Transformer models is the presence of outliers in activations, outlier issues in activations can be solved by non-uniform quantization or uniform quantization schemes that assign larger bit precisions to activations containing outliers.

Fig 9. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. (Left) Comparison between peak throughput for different bit-precision logic on Titan RTX and A100 GPU. (Right) Comparison of the corresponding energy cost and relative area cost for different precision for 45nm technology. As one can see, lower precision provides exponentially better energy efficiency and higher throughput.

Sparsity or pruning is a common technique used to reduce the overall computations required for deep learning inference. This involves removing redundant/insensitive parameters from DNN models, making them sparse. Although having a densely parameterized model may be crucial for successful training, it’s been noticed that it’s possible to eliminate many of the parameters post-training without compromising quality. Training large models and then compressing them using pruning has been found to achieve higher accuracy compared to training a compressed model from the beginning.

Pruning can be categorized into unstructured and structured pruning.

Unstructured pruning can save significant computational resources without accuracy loss, but it can be difficult to implement effectively in hardware due to the need for compressed memory and specialized computation units. This can limit its efficiency on commodity deep neural network accelerators. Structured pruning circumvents these limitations by removing structured sets of parameters, such as rows and columns in linear layers or entire layers in Transformers. It can be more efficient than unstructured pruning because it immediately leads to smaller dense matrix multiplications, but the compression rate may not be as good. State-of-the-art unstructured pruning methods can remove up to 90% of the parameters in BERT without affecting performance, while structured pruning methods can only achieve the same performance by removing up to 70% of the parameters.

Although the pruning methods mentioned earlier fall under weight pruning, there is another pruning technique known as activation pruning, which is also referred to as dynamic pruning, that it can be used to detect and zero out unimportant activations during runtime, reducing the sequence length that Transformers need to process. Token pruning is a popular branch of activation pruning that drops less important tokens in each Transformer layer, resulting in a reduction of up to 30–50% in computations required without noticeable accuracy loss. However, dynamically sparsifying neural networks can be challenging, requiring the design of algorithm and hardware together. Regardless of the pruning methodology used, the primary concern is to determine which weights to preserve and which to remove to improve network efficiency without sacrificing performance.

Common methodologies for pruning Transformers include magnitude pruning, which prunes weights with the smallest magnitudes during the training process; movement pruning, which assigns a larger importance score to weights that move further away from zero during fine-tuning; first-order pruning, which uses gradients with respect to the loss to evaluate the importance of weights; and second-order pruning, which uses the Hessian matrix with respect to the loss as a proxy for importance. While second-order information is more accurate, the Hessian matrix’s large size requires an appropriate and scalable approximation, often with algorithms from randomized numerical linear algebra.

Off-the-shelf optimization techniques like quantization and pruning can improve Transformer performance significantly. However, Transformer-specific optimization techniques that take advantage of the architecture’s unique features can further enhance inference. This article provides a review of some significant Transformer-specific optimization techniques.

This passage discusses various approaches to optimize the attention mechanism in the MHA module, which is the dominant portion of the overall runtime for long sequences. One approach is token pruning, which involves removing unimportant tokens to reduce sequence length. Another approach is leveraging the dynamic sparsity patterns of attention score activations, where specialized hardware logic is required to detect and accelerate those patterns on-the-fly. Several hardware-software co-design approaches are discussed, including top-k hardware engines, angle approximation, clustering, and multi-precision computation. Specialized hardware support is essential for accelerating attention mechanisms and taking advantage of dynamic sparsity patterns for skipping computations.

The Transformer architecture contains complex nonlinear functions that pose challenges for efficient hardware design. One solution is to use function approximation to obtain a good yet computationally efficient approximation. Lookup tables are also widely used to accelerate the nonlinear function by storing pre-calculated output values for a given range of inputs. Recent approaches aim to reduce the size of the lookup table to save area and latency.

The article discusses the challenge of long inference latency during generative tasks using Transformer decoding, and the need to optimize this process. One approach is to use early exiting, where the depth of the decoder is dynamically adjusted for each token generation.

Decoding tasks pose a significant challenge as opposed to encoding tasks because of the attention mechanism, which requires the activation of all previous tokens to generate one token. When a previous token is terminated early, it results in no attention for the skipped layers. To overcome this issue, CALM introduces “state propagation” that duplicates the activations of the final layer before exiting to all the skipped layers.

Another approach is to collaboratively use multiple models with different sizes, which allows the execution of the large model to be carried out less frequently, while utilizing hardware more efficiently. Big Little Decoder has shown to reduce inference latency without compromising generation quality.

The article discusses the importance of taking into account both hardware and software characteristics when selecting optimization techniques for Transformer architecture. Accelerators with a unified datapath tend to pursue more general optimizations, while more exotic optimizations can be pursued separately for the MHA and FFN modules if they are computed in separate datapaths or if the PEs can be reconfigured. However, using separate datapaths or reconfigurable PEs may incur additional overhead. Therefore, there is a tradeoff between area overhead and performance gain when selecting optimization techniques.

The passage discusses the challenge of mapping a Transformer block to hardware instructions to optimize performance, covering key decisions, mapping techniques, performance modeling, and Transformer-specific considerations.

The concept of mapping or scheduling involves sequencing hardware instructions to perform operations on a specific hardware architecture. Different mapping decisions can result in multiple valid mappings, and the space of all possible mappings is referred to as the *mapspace*. The goal of mapping is to find Pareto-optimal, valid mappings that meet desired performance metrics.

In certain cases, it may not be necessary to find a suitable mapping for certain operations. This could be due to the problem being straightforward with a small mapspace, or the operation itself not being a significant performance limitation that requires careful scheduling. However, for fundamental computational operators in Deep Neural Networks (DNNs) such as Transformers, the task of finding a mapping can be difficult due to the vast mapspace, but also highly beneficial in terms of achieving faster overall model execution.

Examples of key operators and their potential mappings for CNNs and Transformers are illustrated in Fig 10 and 11, respectively. These mappings demonstrate that each nested loop level must be assigned to execute either with data from DRAM or local accelerator memory, executed either spatially or temporally (if parallelized compute resources are available), and executed as one loop or tiled into multiple subloops with specific tiling factors. For the Gemmini accelerator, the spatial mapping decision involves determining which loop levels should be executed on the N-by-N systolic array mesh of processing elements (PEs).

Fig 10. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. Visualization of mapping for the convolution operation in CNNs onto a typical DNN accelerator. The convolution is represented by a six-nested loop without the batch dimension. The loop permutation, spatio-temporal mapping, and tiling factors determine the execution order of each loop level, the usage of accelerator hardware resources, and the loop bounds, respectively. As illustrated in the example, the input channel size dimension is tiled with a tiling factor of 8, resulting in two sub-loops with loop variables 𝑐0 and 𝑐1.

Fig 11. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. Visualization of mapping for the matmul operation in Transformer encoder/decoders onto a DNN accelerator. The matrix multiplication is represented as a three-nested loop. The loop permutation, spatio-temporal mapping, and tiling factors determine the execution order of each loop level, the usage of accelerator hardware resources, and the loop bounds, respectively. As illustrated in the example, the output column dimension is tiled with a tiling factor of 4, resulting in two sub-loops with loop variables 𝑛0 and 𝑛1. Despite having only three nested loops compared to six in convolutions, finding an optimal mapping for matmuls can still be challenging.

Mapping involves two steps: first, the graph is transformed into a set of tensor operations, and second, each tensor operation is scheduled to transform it into hardware instructions. The first step may involve fusion, sparsification, and quantization strategies.

Graph-level scheduling pertains to decisions that alter the computational graph’s structure rather than only the execution schedule of its individual nodes representing tensor operations. Common changes include:

*Layer fusion*or*operation fusion*involves combining multiple layers into a single tensor operation to be executed on the accelerator, reducing interlayer communication. Layer fusion may not be as effective for reducing latency in Transformers as it is in CNNs because it’s more difficult to find static fusion opportunities. While it’s possible to combine operations in the same kernel for Transformers, it may increase intralayer communication too much, making it infeasible, and the success of this approach can also depend on the target hardware platform.*Dynamic sparsification*of tensors is a method of reducing the number of arithmetic operations required in an operation by using pruning decisions based on activation maps. Common methods include locality-sensitive hashing. However, it heavily depends on data and cannot always be estimated a priori. There are few results on sparsity-aware mapping, largely covering operation-level mappings for a given amount of sparsity.*Static sparsification*of tensors involves pruning decisions that are independent of activations and are determined statically. Structured sparsity results in high speedup but often results in accuracy degradation, while unstructured sparsity retains accuracy even with extreme sparsity levels but is hard to accelerate. Unstructured sparsity is becoming more important due to its ability to reduce memory traffic, which is becoming a major bottleneck for power consumption.

The step of operation-level scheduling breaks down tensor operations into tasks that can be executed on a specific architecture, and it involves multiple decision-making steps for the programmer, including:

- Dividing the operation into tiles that can fit onto different layers of the memory hierarchy
- Determining the dataflow of the computation involves deciding on the order in which the tiles are executed and which tensors are stationary or moved across the processor. This can be formulated as a loop ordering problem, where the innermost loops correspond to the axes of the tensors that are held stationary
- Deciding which axes to parallelize, and which to run serially, which we refer to as spatio-temporal mapping
- During the operation-level scheduling step, deciding how to overlap communication and computation to minimize latency is important. One technique is to use double-buffering where the scratchpad is split into two halves, with one half used for computation while the other is loaded with data from memory
- Mapping arithmetic instructions to hardware instructions, which could be as simple as replacing a matrix multiplication operation of the appropriate size with a call to the appropriate instruction set architecture (ISA) instruction, achieved by tiling. For some architectures, it may involve selecting between different vector instructions that may impact the decision of which axes to vectorize, and the resulting spatio-temporal mapping

The choice of points in the mapspace affects performance greatly, and the goal of a hardware mapper is to select a point to minimize a cost such as energy, latency, etc. However, the mapspace’s size makes exploration difficult, and the optimal mapping can vary based on the hardware architecture. This increases the difficulty of mapping within a codesign context, as a separate mapping must be computed for each neural network and hardware architecture pair.

To handle the large search space, several techniques for mapping accelerators have been developed, along with complete compiler frameworks. These are outlined below.

Mapping algorithms address the challenge of the extensive search space by narrowing down their focus to a subspace of the mapspace. They only make decisions about a subset of the required steps for mapping the network onto the architecture.

**Graph-level Schedulers**.**Operation-level Mappers**. The article discusses operation-level mappers that help in selecting a good point in the search space to achieve high efficiency and utilization in ML accelerators.

Mappers can be divided into three general categories, based on how they make their decisions: brute-force search, feedback-driven approaches, and constrained-optimization approaches. Brute-force methods exhaustively explore or randomly sample a large number of points from the mapspace, while feedback-driven approaches use ML algorithms to learn the scheduling space. Constrained-optimization approaches formulate scheduling problems as a numerical optimization problem to determine variable assignments subject to given constraints and objective functions. Popular techniques include Mixed Integer Programming (MIP) and polyhedral optimizations.

The passage discusses the challenge of mapping a Transformer block to hardware instructions to optimize performance, covering key decisions, mapping techniques, performance modeling, and Transformer-specific considerations.

Tab 2. Table from the paper “Full Stack Optimization of Transformer Inference: a Survey”. State-of-the-art DNN schedulers for heterogeneous accelerators.

Performance models can provide feedback on different mappings without executing them on real hardware or running simulations on accelerators. There are different types of performance models, including domain-specific polynomial and analytical models, and data-driven ML models. However, the generated mappings may not perform optimally on actual hardware, so cycle-exact software models and FPGA emulation are used for higher fidelity. Code generation tools are then used to implement mappings onto hardware or simulators, but they are difficult to adapt for a codesign framework where both the mapspace and hardware target can vary. User-schedulable languages such as Halide, TVM, Rise/Elevate, and Exo have been developed to address this problem, allowing users to customize the hardware instruction set and convert schedules into executable code.

Previous research has primarily focused on mapping convolutional neural networks (CNNs) onto hardware, but the same mapping strategies can be applied to Transformers since they also rely heavily on matrix multiplications. However, Transformers include LayerNorm and Softmax operations, which can make scheduling more complex and impose constraints on the order of matrix multiplications. This creates a more challenging problem for optimizing scheduling in Transformers, and this paper explores the mapspace of Transformer blocks and the scheduling complexity in more depth.

The study empirically characterizes the search space of legal mappings for Transformer and CNN architectures, using BERT and ResNet50 as representative models, respectively. The Timeloop mapper was used to search for 100K random valid mappings, and the latency and energy were estimated using the Timeloop model. The target hardware architecture is the Gemmini systolic generator, and both BERT and ResNet50 models are assumed to have been 8-bit integer quantized, with an input sequence length of 512.

The study empirically characterizes the search space of legal mappings for BERT and ResNet50 architectures. The comparison between the mapspaces for BERT and ResNet50 is presented in Fig 12. The study observes that both BERT and ResNet50 mapspace have a similar range of potential energy-delay product values from randomly sampled mappings. The empirical cumulative distribution functions (CDFs) of the same set of 100K random mappings are compared in Fig 13. The analysis of EDP distribution from randomly sampled valid mappings indicates that BERT matmuls are as challenging to schedule as CNNs, and appropriate scheduling also applies to Transformer matrix operations.

Fig 12. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. Comparison of the mapspace of (Left) BERT and (Right) ResNet50. Distributions of 100K randomly sampled valid mappings are shown. The study found that the mapspaces for BERT matmuls and ResNet50 convolutions are of similar vast size and shape, with no significant differences in their distributions of EDP. Therefore, brute-force or random search for BERT matmul scheduling is equally as challenging as it is for ResNet50 operators.

Fig 13. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. Comparison of empirical cumulative distribution functions (CDFs) for Transformer matmuls and ResNet50 convolutional operations in the regime where near-optimal mappings are found. The results showed that the percentage of near-optimal mappings for BERT matmuls is similar to, if not smaller than, that of ResNet convolution kernels, indicating that finding optimal mappings for BERT matmuls can be as challenging as for ResNet convolution kernels. The 10th percentile values for each relative EDP distribution were also reported.

In an additional analysis of mapspace characterization, the study fixed the total number of MACs to enable a fairer comparison between the distributions of mapping results for Transformer and ResNet50 operators. They assumed the Transformer input sequence length to be 512 and the feed-forward network expansion ratio to be four times the hidden dimension size. Even after forcing equivalent numbers of MACs, the study found that the range of relative EDP values is similar between BERT matmuls and ResNet50 convolutions, which further highlights the complexity of the scheduling problem for matmuls in Transformer models. The corresponding pairs mapping distributions were plotted in Fig. 14 to clarify the comparison between synthetic BERT layers and actual ResNet50 convolutions.

Fig 14. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. Distributions of random, valid mappings for ResNet50 operators against distributions for Transformer MHA and FFN matrix multiplications where the size of matmuls was tuned to ensure that the total MACs were equivalent. The MHA matmul dimensions were calibrated to 240 × 240 × 512, and similarly for the FFN 𝑊1 projection matmul, set to 480 × 120 × 512 for 𝑑FFN = 4 × 𝑑. Even after accounting for differences in the total number of MACs, the study found that the EDP distributions were still similar for ResNet convolution and BERT matmuls, implying that the mapspace of BERT matmuls exhibit as vast a range of relative EDPs as the mapspace of CNN convolution kernels.

The presence of non-linear operations such as LayerNorm and Softmax between matrix operations in Transformers makes the optimization of execution schedules challenging. Fusion-optimized scheduling, which fuses high-arithmetic-intensity matmuls with low-arithmetic-intensity normalization operations, can be an effective strategy, especially for quantized workloads. However, this approach may be counterproductive in memory-constrained edge devices. The skewed tile shapes required for fused matmuls and normalization operations reduce arithmetic intensity, reducing performance and increasing memory traffic. Fusion-optimized scheduling is evaluated for BERT matmuls on a weight-stationary systolic array Gemmini, with opportunities for overlapping computations identified.

The article discusses two scheduling strategies for Gemmini’s MHA block, non-fused scheduling and fusion-optimized scheduling. The left plot in Fig. 15 shows that fusing query × key matmuls with Softmax for each attention head reduces latency, but fusion-optimized scheduling significantly harms the execution latency of the 𝑊out projection matmul after fusing with the following LayerNorm. Doubling the accumulator SRAM size to 256kB alleviates this performance hit. The right plot in Fig. 15 investigates the impact of sequence length on fusion-optimized scheduling for the MHA block. The fusion of query × key matmul with Softmax reduces overall latency by 22%. In Fig. 16, the results on matmul and LayerNorm overlapping in the FFN 𝑊2 projection show that fusion-optimized scheduling consistently worsens total latency by 27%. Overall, scheduling for Transformer matmuls becomes more complex when targeting different styles of custom hardware designs, including the Gemmini accelerator.

Fig 15. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. Impact of fusion-optimized scheduling for BERT’s Multi-Head Attention (MHA) on the Gemmini accelerator. Results show that hiding the Softmax latency improves combined matmul and Softmax latency by 78%. However, overlapping 𝑊out projection with LayerNorm can either hurt or improve total latency, depending on the accumulator size. Overall, fusion-optimized scheduling for both matmuls in MHA yields 23% and 52% latency improvements for accumulator sizes 128kB and 256kB, respectively. Additionally, overlapping the query × key matmul with Softmax improves latency by 22%. Overall, fusion of both MHA matmuls with nonlinear operation yields a 21% latency improvement.

Fig 16. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. The impact of fusion-optimized scheduling for BERT FFN matmul on latency hiding of LayerNorm operation is evaluated. The study varies input sequence length from 512 to 4096 and shows that fusion-optimized scheduling hurts total latency by 27% in both cases. This highlights the importance of carefully evaluating the impact of chaining matmul and LayerNorm execution on systolic arrays, as the mapping constraints may outweigh the gains from latency hiding of nonlinear operations.

This section discusses the optimization of DNN architecture for a specific hardware platform using automated neural architecture search (NAS) techniques. It includes an overview of NAS, explores hardware-aware NAS methods for CNNs, discusses NAS methods specific to Transformer architectures, and provides a case study of applying NAS method for optimizing Transformer inference on a target hardware architecture.

The article discusses the importance of optimizing DNN architecture for hardware performance, as different variations of the same architecture can result in better performance. Various techniques have been proposed to optimize DNN architecture for specific hardware platforms, including MobileBERT, Lite Transformer, and SqueezeBERT. However, finding these optimized architectures is challenging due to the large search space, and automated neural architecture search (NAS) methods have been proposed.

A NAS framework has three main components: search space, search method, and evaluation method. The search space contains valid operations and their connectivity to define valid DNN architectures. The search method explores the search space and samples candidate architectures, while the evaluation method assesses how well the candidate architectures perform on unseen data. Fig 17. schematically shows these different components.

Exhaustive search is impractical, so efficient methods are used for exploring the search space and estimating performance. The section aims to provide a practitioner’s perspective on different methodologies for improving NAS.

Fig 17. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. Illustration of the general structure of NAS frameworks. Candidate DNN architectures are generated from the search space using the search method. These architectures are then evaluated, and the results are used by the search method to guide the exploration of the search space in order to generate better architectures.

The search space for Neural Architecture Search (NAS) defines the valid architectures that the NAS framework can search for. The layer-wise search method, where each layer or operation can be searched independently, suffers from a large search space size that grows exponentially with the depth of candidate architectures, leading to degraded search efficiency and final performance. The cell-wise search method can alleviate this shortcoming by searching cells, which are blocks or modules that consist of multiple layers, rather than entire architectures. This significantly reduces the search time and search space, and has been widely adopted in follow-up works.

Efficient search methods are necessary for NAS)since the search space is usually too large for exhaustive search. RL-based methods and evolutionary search are two strategies used for NAS. RL-based NAS frameworks use a controller that takes actions of sampling DNN architectures, whose evaluation accuracy after training is used as a reward signal to refine the sampling policy. Evolutionary search initializes a population of different DNN architectures, which are then mutated, evaluated, and selected based on validation accuracy. DARTS proposes continuous relaxation of the search space, allowing efficient exploration and optimization of the search space through gradient-based optimization methods. Due to its search efficiency, gradient-based search has become a popular choice for many NAS frameworks.

Fig 18. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. Comparison of different NAS search methods. (a) In RL-based methods, a controller is used to select architectures based on a policy that is strengthened by the evaluation results of the chosen architecture, which serves as a reward. (b) Evolutionary search-based methods begin by creating a population of architectures, which are then evaluated and selected based on their performance, and then mutated to form the next generation of architectures. (c) Gradient-based methods, such as continuous relaxation, train adjustable weights in combination with model parameters that are multiplied to each operation choice. The weights are trained during the training process and eventually converge to favor one operation over the others, approximating the architecture that was sampled.

One of the main challenges in NAS is the high training cost. To solve this, ENAS proposed weight sharing, which uses a supernet that is over-parameterized to reduce the cost of searching for good sub-networks. Single Path One-Shot NAS constructs a supernet by stacking choice blocks, and each sub-network is obtained by uniformly sampling one operation for each block. Once-For-All and BigNAS propose methods to train a supernet in a way that its sub-networks can achieve good accuracy without additional training. These methods are useful for quickly deploying sub-networks without expensive from-scratch training.

To evaluate candidate architectures, a metric is needed to rank their “goodness” on a validation dataset. Early NAS algorithms fully trained sampled architectures until convergence, which is not feasible for large datasets. A common approach is to discover an accurate cell architecture on a smaller dataset and then apply it to build a larger model for a larger dataset. However, some recent NAS work challenges this approach. Supernet-based NAS algorithms can be a good alternative because they require only a single iteration of supernet training, which can be performed directly on large-scale datasets without prohibitive compute requirements, and avoid the use of proxy tasks.

Hardware-aware NAS aims to optimize both accuracy and various performance metrics on target hardware platforms, such as latency, energy consumption, or memory usage. Most works consider FLOPs or the total number of parameters, but direct measurement of hardware performance is challenging. Hardware-aware NAS frameworks have been introduced to directly consider latency or use approximate metrics. The most straightforward way is to directly measure hardware performance and bring it as an additional optimization objective. Some methods incorporate operation-wise lookup tables, while others rely on lightweight prediction models that can quickly predict hardware performance numbers for a given DNN configuration.

Early works on NAS mainly focused on CNN models for computer vision tasks, but after the introduction of the Transformer architecture, several works started to explore NAS methods for more efficient alternatives in various tasks. The earliest NAS works for Transformers were primarily in the NLP domain, with Evolved Transformer being one of the earliest attempts to apply NAS for searching better Transformer architectures using an evolutionary search algorithm. Evolved Transformer adopts the cell-wise search space to search two cell structures, each containing a stack of multiple blocks with their own hyperparameters.

The challenge of NAS for NLP tasks is the longer training and evaluation time, and the lack of good proxy tasks. Evolved Transformer addresses this by dynamically allocating resources to more promising architectures and early stopping those that fail to achieve a hurdle fitness. Due to the large computational cost, weight sharing and supernet based NAS have become popular options, and HAT extends the Once-for-All scheme to train a single supernet for Transformer architectures. HAT is hardware-aware and directly optimizes for latency along with accuracy, allowing sub-networks to be sampled through evolutionary search and deployed immediately to target hardware devices without retraining.

The article discusses several NAS methods for Transformer architectures used in NLP and CV applications. These methods use supernet-based and weight-shared methodologies due to the high computational cost associated with training Transformer architectures. NAS-BERT proposes a NAS method for pre-training stage encoder-only BERT, while Primer searches for a more efficient decoder-only Transformer for auto-regressive language modeling. ViT architectures are also explored, and methods such as Autoformer, ViT-ResNAS, and Burgerformer extend Single Path One-Shot to the ViT search space. The article notes the limitations of supernet-based NAS and the need for better ways to balance flexibility and efficiency in NAS techniques.

The authors conduct a case study to demonstrate the benefits of applying NAS to Transformer inference on Gemmini, optimizing both accuracy and hardware costs such as latency and energy.

In this particular study, a 6-layer Transformer architecture is used as a baseline to train a language modeling model on the WikiText-2 benchmark. They then use NAS to optimize the model’s performance in terms of accuracy, latency, and energy consumption. They adopt the BigNAS-style strategy to train a supernet, and then an evolutionary algorithm to search for sub-networks. The search space includes different combinations of the number of layers, number of heads, hidden dimension, and FFN dimension. The authors use the energy-delay-product (EDP) as a single hardware cost metric, which combines latency and energy into one metric. They use a lookup table-based method to assess the latency and energy consumption of each sub-network on the target hardware, Gemmini, with a scratchpad size of 64 kB and an accumulator size of 256 kB. After the evolutionary search, the Pareto-optimal sub-networks are evaluated with an RTL simulator to obtain a more precise estimation of latency.

The study presents the results of the NAS Pareto-frontier for latency and energy in Fig 19, where each point represents a different Transformer architecture found from the evolutionary search algorithm. The baseline model is also shown for comparison, which is the largest Transformer architecture in the search space. The NAS framework allows for multiple Transformer architectures with better hardware cost to perplexity trade-offs. The authors select an architecture with the lowest EDP while having less than +0.1 perplexity loss, whose EDP is 3.6 × 109 and perplexity is 22.51. The architecture parameters illustrate the importance of a diverse search space, as the number of attention heads varies from 6 to 12 in each layer, and as the fully connected layer dimensions vary from 768 to 2560. Changing these parameters on a per-layer basis can lead to discovering more Pareto-optimal architectures compared to if they were fixed for every layer.

In Fig 19, it is shown that it is possible to achieve a significant reduction in latency and energy with just a slight increase in perplexity by using their co-design methodology. They achieved a 1.4× reduction in latency and a 1.6× improvement in energy with a 0.1 point perplexity degradation. Furthermore, they demonstrated that it is possible to reduce EDP by 2.2× with only a 0.1 point perplexity degradation and 10.6× with 1 point perplexity degradation. However, the authors note that results may vary depending on the target hardware and optimization goals, and their methodology was run only once on a specific hardware platform.

Fig 19. Image taken from the paper “Full Stack Optimization of Transformer Inference: a Survey”. The plot on the left shows the relationship between latency and perplexity, while the plot on the right shows the relationship between energy and perplexity. These plots depict the Transformer architectures discovered through evolutionary search on our optimal Gemmini hardware setup.

The Transformer architecture has greatly impacted natural language understanding and has been extended to other fields such as computer vision and speech recognition. However, the increasing size and complexity of Transformer models present challenges for efficient inference. While DNN accelerators can help, there is limited understanding of the run-time characteristics and design principles for Transformer workloads compared to CNN architectures.

This paper conducts a thorough analysis of Transformer workloads to understand their run-time characteristics and performance bottlenecks on different hardware and accelerators. It is also present a survey of the current hardware and software solutions and potential optimization opportunities in the full-stack deployment of Transformers, including the design of hardware architectures, optimization strategies such as pruning and quantization, mapping and scheduling of operations, and the use of automated NAS for designing efficient Transformer architectures.

This study analyzed Transformer workloads to understand runtime characteristics and identify performance bottlenecks. The analysis found that nonlinear operations in Transformers can have a significant impact on performance, and hardware design for Transformers may differ from that of CNNs. Matmul scheduling in Transformers is similarly challenging to convolution scheduling in CNNs, and fusing LayerNorms with preceding matmuls requires careful consideration. Case studies showed that co-design and co-optimization techniques across the stack can significantly improve full-stack Transformer inference performance. The result exhibited 88.7× EDP improvement without a noticeable performance drop compared to a naive implementation without full-stack considerations.

The paper discusses hardware design techniques used to avoid communication overhead associated with offloading unsupported operations to the host CPU, including dedicated normalization units to support Softmax and LayerNorm. This led to a 39.6× latency reduction. The paper also describes using NAS to search for Pareto-optimal Transformer architectures, resulting in 2.24× EDP reduction without a noticeable perplexity drop, and 10.56× EDP reduction with 1 perplexity drop. Overall, the paper aims to facilitate advancements in understanding Transformer inference and optimizing its efficiency for wider application.