Language models and generative AI are a hot topic in the AI industry, and researchers around the world are working to improve their efficacy and capability. These systems are typically deep learning models pre-trained on large amounts of text data, and they rely on neural networks with self-attention. They combine several kinds of layers (feedforward, recurrent, embedding, and attention) to process input text and produce relevant outputs.

In most large language models, the feedforward layers hold the majority of the parameters. Studies show that these models use only a fraction of the available neurons to compute the output during inference.

This article introduces UltraFastBERT, a BERT-based framework matching the efficacy of leading BERT models but using just 0.3% of neurons during inference, specifically 12 out of 4095 in each layer. We’ll explore UltraFastBERT’s architecture, functionality, and results. Let’s begin.

Traditionally, a language model relies on several components to generate content, including feedforward layers, recurrent layers, embedding layers, and attention layers. These components learn to recognize patterns during training and ultimately generate accurate output from the input text. Each component has its own parameters, and in language models the bulk of these parameters sits in the feedforward layers. However, the feedforward layers do not use 100% of their neurons to generate the output for every input at inference time. This wastes resources and increases complexity, computation time, and computational cost.

At its core, the UltraFastBERT framework is a variant of the BERT framework that builds on this observation. It replaces the feedforward layers in its architecture with fast feedforward networks. As a result, UltraFastBERT uses only 0.3% of the available neurons while delivering results comparable to BERT models of similar size and training procedure, especially on downstream tasks. By design, the intermediate layers in the UltraFastBERT framework are exponentially faster.

Consider a fast feedforward (FFF) network and a feedforward (FF) network, each with n neurons. The time complexity of a forward pass is O(n) for the feedforward network but O(log₂ n) for the fast feedforward network. The difference arises because a fast feedforward network organizes its neurons into a balanced binary tree, and for a given input it conditionally executes only one branch of that tree. Furthermore, inference on a fast feedforward network amounts to CMM, or Conditional Matrix Multiplication: the input rows are dotted with the weight columns one at a time, and the output of the previous dot product determines which weight column to proceed with. As a result, every neuron is used by only some of the inputs, and no input needs more than a handful of neurons. CMM contrasts with DMM, or Dense Matrix Multiplication, which computes the dot product of all inputs with all the weight columns.
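To make the O(n) versus O(log₂ n) contrast concrete, here is a minimal NumPy sketch of a dense pass next to a conditional pass over a single tree. The heap-style node layout (children of node i at 2i + 1 and 2i + 2), the ReLU activation, the branching direction, and all function names are assumptions made for illustration; they are not taken from the UltraFastBERT code.

```python
import numpy as np

def dense_forward(x, w_in, w_out):
    """Dense feedforward pass: every one of the n neurons is evaluated."""
    hidden = np.maximum(x @ w_in, 0.0)       # DMM: x against all n weight columns
    return hidden @ w_out, w_in.shape[1]     # output, number of neurons touched

def fff_forward(x, w_in, w_out, depth):
    """Conditional pass over a balanced binary tree stored in heap order.

    Only one root-to-leaf path is evaluated, so depth + 1 neurons are touched.
    """
    y = np.zeros(w_out.shape[1])
    node, touched = 0, 0
    for _ in range(depth + 1):
        logit = x @ w_in[:, node]            # CMM: one weight column at a time
        y += max(logit, 0.0) * w_out[node]   # this neuron's contribution
        touched += 1
        # The sign of the previous dot product picks the next column (assumed rule).
        node = 2 * node + 2 if logit > 0 else 2 * node + 1
    return y, touched

rng = np.random.default_rng(0)
depth, width = 11, 16                        # illustrative sizes
n = 2 ** (depth + 1) - 1                     # 4095 neurons in the tree
w_in = rng.standard_normal((width, n))
w_out = rng.standard_normal((n, width))
x = rng.standard_normal(width)

_, dense_touched = dense_forward(x, w_in, w_out)
_, fff_touched = fff_forward(x, w_in, w_out, depth)
print(dense_touched, fff_touched)            # 4095 12
```

The two functions do not compute the same output; the point of the sketch is only the number of weight columns each one reads.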

To sum it up, UltraFastBERT is a BERT-based framework that provides results comparable to state-of-the-art BERT language models and that:

- Utilizes only 0.3% of the available neurons during inference, engaging just 12 neurons out of a total of 4095 for each layer inference.
- Delivers strong performance comparable to state-of-the-art BERT models by applying fine-tuning strategies on downstream tasks.
- Provides a native implementation of CMM, or Conditional Matrix Multiplication, which forms the basis of the fast feedforward network and ultimately leads to a 78x speedup compared to native optimized DMM, or Dense Matrix Multiplication.

**Feed Forward Neural Networks**

A feedforward neural network is one of the most straightforward artificial neural networks: it moves information in only one direction, from the input nodes to the output nodes via hidden nodes. A defining feature of a feedforward neural network is that it contains no loops or cycles, which makes it simpler to construct than an RNN (Recurrent Neural Network) or a CNN (Convolutional Neural Network). Its architecture comprises three components: input layers, hidden layers, and output layers. Every layer consists of units called neurons, and the layers are connected to one another through weights.

The neurons in the input layer receive the inputs and forward them to the next layer. The number of neurons in the input layer is determined by the dimension of the input data. Next come the hidden layers, which are exposed to neither the input nor the output and are responsible for the necessary computations. The neurons in each hidden layer take the weighted sum of the outputs of the previous layer, apply an activation function, and pass the result to the next layer, where the process repeats. Finally, the output layer produces the output for the given inputs. Each neuron in every layer of a feedforward network is connected to every neuron in the next layer, which makes feedforward neural networks fully connected. Weights represent the strength of the connections between neurons, and the network learns patterns by updating these weights on the basis of the error in the output.

There are two key stages in the working of a feedforward neural network: the feedforward phase and the backpropagation phase.

**Feedforward Phase**

In the feedforward phase, the input is fed to the network and propagates forward. The hidden layers compute the weighted sum of their inputs and introduce non-linearity into the model by passing that sum through an activation function such as ReLU, sigmoid, or tanh. The process repeats layer by layer until the signal reaches the output layer and the model makes a prediction.
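The feedforward phase can be sketched in a few lines of NumPy. The layer sizes, the choice of ReLU, and the helper names are illustrative assumptions rather than anything prescribed by the article.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward(x, layers):
    """Propagate x through a list of (weights, bias) pairs."""
    a = x
    for i, (w, b) in enumerate(layers):
        z = a @ w + b                          # weighted sum of the previous layer's outputs
        is_output = i == len(layers) - 1
        a = z if is_output else relu(z)        # non-linearity on hidden layers only
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                           # input, two hidden layers, output
layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

prediction = feedforward(rng.standard_normal(4), layers)
print(prediction.shape)                        # (3,)
```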

**Backpropagation Phase**

Once the model makes a prediction, it computes the error between the generated output and the expected output. The error is then backpropagated through the network, and the network uses a gradient descent optimization algorithm to adjust the weights so as to minimize the error.
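As a rough illustration of this loop, the sketch below trains a one-hidden-layer network on a toy regression task with hand-written gradients. The data, the mean squared error loss, the learning rate, and the step count are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))             # 32 samples, 4 features
y = x.sum(axis=1, keepdims=True)             # toy target: sum of the features

w1 = rng.standard_normal((4, 8)) * 0.5
w2 = rng.standard_normal((8, 1)) * 0.5
lr = 0.01                                    # illustrative learning rate

def loss_and_grads(w1, w2):
    h_pre = x @ w1                           # forward pass
    h = np.maximum(h_pre, 0.0)
    pred = h @ w2
    err = pred - y                           # error against the expected output
    loss = np.mean(err ** 2)
    # Backward pass: chain rule from the output back to the first layer.
    d_pred = 2.0 * err / len(x)
    g_w2 = h.T @ d_pred
    d_h = (d_pred @ w2.T) * (h_pre > 0)
    g_w1 = x.T @ d_h
    return loss, g_w1, g_w2

losses = []
for _ in range(200):
    loss, g_w1, g_w2 = loss_and_grads(w1, w2)
    losses.append(loss)
    w1 -= lr * g_w1                          # gradient descent update
    w2 -= lr * g_w2

print(losses[0], losses[-1])
```

Comparing the first and last entries of `losses` shows whether the updates are reducing the error on this toy problem.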

**UltraFastBERT: Model Architecture and Working**

The UltraFastBERT framework is built on the crammedBERT architecture and employs all of its components except for the nature of the intermediate layers. UltraFastBERT replaces the feedforward networks contained in the intermediate layers of the crammedBERT transformer encoder with fast feedforward networks. It also makes the following changes to the original fast feedforward networks.

- The framework removes the difference between leaf and non-leaf nodes by using the GeLU activation function across all nodes, equipping all nodes with output weights, and removing output biases entirely. It then fixes the leaf size to 1.
- Finally, the framework allows multiple fast feedforward network trees in parallel by jointly computing the intermediate layer output. It does so by summing the outputs of the individual trees and presenting that sum as the intermediate layer output.

For training, the UltraFastBERT framework follows the procedure used by the crammedBERT framework, which includes disabling dropout in pretraining and using the 1-cycle triangular learning rate schedule. The model is then fine-tuned for a total of 5 epochs to maximize its performance on a wide array of tasks, mainly from the GLUE benchmark.
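The changes listed above can be sketched as follows: every visited node applies GeLU and has its own output weights, there are no output biases, and the layer output is the sum over parallel trees. The heap-style node indexing, the branching rule, the tensor shapes, and the function names are assumptions for illustration only.

```python
import math
import numpy as np

def gelu(z):
    """Exact GeLU for a scalar input."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def tree_forward(x, w_in, w_out, depth):
    """One tree: every visited node, leaf or not, applies GeLU and has output weights."""
    y = np.zeros(w_out.shape[1])
    node = 0
    for _ in range(depth + 1):
        logit = float(x @ w_in[:, node])
        y += gelu(logit) * w_out[node]        # no output bias anywhere
        node = 2 * node + 1 + int(logit > 0)  # assumed branching rule
    return y

def multi_tree_forward(x, trees, depth):
    """Intermediate layer output is the sum of the individual tree outputs."""
    return sum(tree_forward(x, w_in, w_out, depth) for w_in, w_out in trees)

rng = np.random.default_rng(0)
depth, width, num_trees = 3, 8, 3             # illustrative sizes
n = 2 ** (depth + 1) - 1                      # neurons per tree
trees = [(rng.standard_normal((width, n)), rng.standard_normal((n, width)))
         for _ in range(num_trees)]
x = rng.standard_normal(width)

out = multi_tree_forward(x, trees, depth)
print(out.shape)                              # (8,)
```

Because there are no biases and GeLU maps 0 to 0, an all-zero input produces an all-zero output in this sketch.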
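For readers unfamiliar with the schedule, a 1-cycle triangular learning rate rises linearly to a peak and then falls linearly back to zero over a single cycle. The sketch below is a generic version; the peak position (`peak_frac`), the peak learning rate, and the step count are hypothetical values, not the ones used by crammedBERT or UltraFastBERT.

```python
def triangular_one_cycle(step, total_steps, peak_lr, peak_frac=0.5):
    """Linear warm-up to peak_lr, then linear decay to zero over one cycle."""
    peak_step = peak_frac * total_steps
    if step <= peak_step:
        return peak_lr * step / peak_step
    return peak_lr * (total_steps - step) / (total_steps - peak_step)

# Hypothetical run: 100 steps with a peak learning rate of 1e-3.
schedule = [triangular_one_cycle(s, 100, 1e-3) for s in range(101)]
print(schedule[0], schedule[25], schedule[50], schedule[100])
```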

**Inference**

Inference is where fast feedforward networks matter most. Feedforward networks form a major chunk of large language models, and they are known for their exceptional acceleration potential. To understand this potential, take the example of one of the most advanced language models, GPT-3, in which the feedforward network in every transformer layer consists of over 49,100 neurons. If trainable, a fast feedforward network with a maximum depth of 15 could replace the original feedforward network. The fast feedforward network would have over 65,000 neurons, but it would use only 16 of them for inference, which amounts to roughly 0.03% of the neurons available to GPT-3.
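The arithmetic behind these figures is easy to reproduce. A balanced binary tree of depth d has 2^(d+1) − 1 nodes, and one root-to-leaf path visits d + 1 of them. In the sketch below, the GPT-3 feedforward width of 49,152 (4 × 12,288) is an assumption consistent with the "over 49,100" figure quoted above.

```python
def fff_neuron_counts(depth):
    """Total neurons in a balanced binary tree and neurons on one root-to-leaf path."""
    total = 2 ** (depth + 1) - 1
    used = depth + 1
    return total, used

# UltraFastBERT configuration from the article: depth 11.
total_bert, used_bert = fff_neuron_counts(11)
print(total_bert, used_bert, f"{100 * used_bert / total_bert:.3f}%")   # 4095 12 0.293%

# Hypothetical GPT-3 replacement: depth 15.
total_gpt, used_gpt = fff_neuron_counts(15)
gpt3_ff_neurons = 49152   # assumed width: 4 x 12288
print(total_gpt, used_gpt, f"{100 * used_gpt / gpt3_ff_neurons:.3f}%")  # 65535 16 0.033%
```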

**Algorithm and Compatibility**

The UltraFastBERT framework uses a recursive algorithm, given as pseudocode, for fast feedforward inference. The algorithm is depicted in the image below.

Here, B represents the batch size, H represents the width of the input layers, and M represents columns. A major concern with the Conditional Matrix Multiplication approach is whether it makes fast feedforward networks incompatible with the processes already in use for Dense Matrix Multiplication and with existing deep learning frameworks. Fortunately, the use of CMM neither affects performance nor introduces incompatibility, although it does increase the caching complexity.

It is worth noting that single-threaded Dense Matrix Multiplication, as part of feedforward inference, relies on executing MAC (Multiplication and Accumulation) instructions. Replacing DMM with CMM therefore benefits CPUs, because fewer MAC instructions are needed to compute the layer output per element. Although CMM employs conditionality, which is usually associated with branching, the “neural branching” acts only as an addition to the memory offset of the relevant pointers. As a result, instruction branch prediction is never fully engaged to facilitate the conditionality of the CMM, and the framework only loads the relevant columns of the weight matrix individually. Furthermore, because the framework performs row-column dot products, SIMD (single instruction, multiple data) vector parallel processing remains a good option for speeding up inference implementations on specific devices.
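The idea that "neural branching" is just an offset computation can be illustrated with a batched NumPy sketch in which the next node index is computed arithmetically from the sign of the logit, with no data-dependent if/else. The heap-style column layout and the function name `cmm_batched` are assumptions; this is not the paper's pseudocode.

```python
import numpy as np

def cmm_batched(x, w_in, depth):
    """Batched conditional matrix multiplication.

    x: (B, H) inputs; w_in: (H, N) weight columns with N = 2**(depth + 1) - 1.
    Returns the logits and the node indices visited per row, both (B, depth + 1).
    """
    batch = x.shape[0]
    node = np.zeros(batch, dtype=np.int64)
    logits = np.empty((batch, depth + 1))
    visited = np.empty((batch, depth + 1), dtype=np.int64)
    for level in range(depth + 1):
        cols = w_in[:, node]                     # (H, B): gather one column per row
        logit = np.einsum("bh,hb->b", x, cols)   # row-column dot products
        logits[:, level] = logit
        visited[:, level] = node
        # Branching as index arithmetic: the comparison result is added to the offset.
        node = 2 * node + 1 + (logit > 0)
    return logits, visited

rng = np.random.default_rng(0)
B, H, depth = 5, 16, 4                           # illustrative sizes
N = 2 ** (depth + 1) - 1
x = rng.standard_normal((B, H))
w_in = rng.standard_normal((H, N))

logits, visited = cmm_batched(x, w_in, depth)
print(logits.shape, visited.shape)               # (5, 5) (5, 5)
```

Each row of `visited` lists the columns that one input actually read, which is a small subset of the N columns a dense product would touch.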

**UltraFastBERT: Performance and Results**

We will look at the performance of the UltraFastBERT framework on fine-tuning as well as on inference tasks to see how it fares against state-of-the-art language models.

**Fine-Tuning Results**

The following figure shows the performance of various models on the GLUE-dev test datasets. Here, N represents the number of neurons available to the frameworks for training, and “Avg” represents the average score across all tasks.

As can be seen, the UltraFastBERT framework, trained on an A6000 GPU for over 24 hours, retains almost 96% of the predictive performance of the original BERT framework on GLUE downstream tasks. Performance also declines as the depth of the fast feedforward networks increases, although most of the degradation occurs on the CoLA task alone. If the CoLA task is set aside, the UltraFastBERT framework returns a predictive performance score of about 98.6%.

**Inference Results**

In this section, we compare the performance of several feedforward and fast feedforward networks across inference implementations, which are spread over three levels.

- In Level 1, the implementation is built using BLAS Level 1 routines, namely scalar-vector products and vector-vector dot products.
- In Level 2, the implementations use BLAS Level 2 routines, namely batched scalar-vector products and batched matrix-vector dot products.
- In Level 3, the implementations use the non-batched BLAS Level 3 matrix-matrix multiplication approach. Although this is the fastest implementation available for feedforward networks, it is not available for fast feedforward networks because the library does not support the vector-level sparsity of Conditional Matrix Multiplication.
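As a loose analogy for the three levels, the same dense layer can be computed with vector-vector, matrix-vector, or matrix-matrix operations. The sketch below uses NumPy as a stand-in; whether a given NumPy call actually dispatches to a BLAS routine depends on the build, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # batch of 4 inputs, width 8
w = rng.standard_normal((8, 16))     # 16 neurons

# Level 1 style: one vector-vector dot product per (input, neuron) pair.
level1 = np.array([[np.dot(row, w[:, j]) for j in range(w.shape[1])] for row in x])

# Level 2 style: one matrix-vector product per input.
level2 = np.stack([w.T @ row for row in x])

# Level 3 style: a single matrix-matrix product for the whole batch.
level3 = x @ w

print(np.allclose(level1, level2), np.allclose(level2, level3))
```

A conditional traversal fits the first two styles, since it selects individual weight columns per input, but not the third, which assumes every input meets every column.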

Additionally, the UltraFastBERT framework deploys GPU implementations by using either custom CUDA or PyTorch kernels.

The above table compares the performance of the UltraFastBERT framework with its predecessors, the BERT-based frameworks, in terms of feedforward and fast feedforward layers. Every column contains the relative inference speedup of the fast feedforward implementation over the feedforward implementation when both use the same linear-algebraic routine primitives.

However, it is worth noting that the speedups reported in the above table are meant as “fair comparisons”, i.e. both the fast feedforward and the feedforward implementations use identical linear-algebraic routine primitives. Furthermore, on Level 1 and Level 2, the fast feedforward implementations perform inference 48x and 78x faster, respectively, than the fastest feedforward implementation.

**Final Thoughts**

In this article, we have talked about UltraFastBERT, a variant of the BERT framework. It builds on the observation that feedforward layers do not use 100% of their neurons to generate the output for every input at inference time, which wastes resources and increases complexity, computation time, and computational cost. UltraFastBERT replaces the feedforward layers in its architecture with fast feedforward networks. As a result, it uses only 0.3% of the available neurons while delivering results comparable to BERT models of similar size and training procedure, especially on downstream tasks.

By design, the intermediate layers in the UltraFastBERT framework are exponentially faster. Furthermore, the strong performance of the UltraFastBERT framework is evidence that LLMs can perform well while engaging only a fraction of their parameters for individual inferences: UltraFastBERT uses only 0.3% of the available neurons during inference and still achieves a 78x speedup in inference time.