Charting a New Path of Neural Networks with Transformers

A “transformer model” is a neural network architecture built from transformer layers that can model long-range sequential dependencies and is well suited to exploiting modern computing hardware to reduce model training time.

State-of-the-art machine learning and artificial intelligence (AI) systems have advanced significantly in recent years, alongside growing interest in and widespread demand for the technology. The general hype around AI fluctuates with media cycles and new product launches, and the buzz of implementing AI for its own sake is fading as companies strive to demonstrate its positive impact on the business, with a focus on AI’s ability to augment rather than replace.

Into this landscape come transformer-based models. There is speculation over whether transformers, which first gained popularity in natural language processing (NLP), are positioned to “take over” AI, leaving many wondering what the approach can accomplish and how it could change the pace and direction of the technology.

See also: How Businesses Can Integrate Natural Language Processing

Understanding Transformer-Based Models

A “transformer model” is a neural network architecture composed of transformer layers that can model long-range dependencies in sequences and is designed to take advantage of modern computing hardware to reduce model training time.

Until recently, when scientists and machine learning developers were tasked with modeling sequences, such as translating a series of English words into Spanish, Recurrent Neural Network (RNN) layers were the “go-to” approach for encoding the input sequence and decoding the output sequence. Although RNNs have proven useful in identifying relationships between elements of a sequence, they have limitations, such as how far back in a sequence they can remember. Thanks to research by Google and the University of Toronto published in the paper “Attention Is All You Need,” the concept of self-attention has become the key ingredient in the transformer layers used in modern neural network models.

Self-attention starts with several projections of the input data. The input is often a sequence of words (represented as numeric vectors), each of which is projected into three new vectors: a ‘key’, a ‘query’ and a ‘value’. For a given word, its ‘query’ is compared against the ‘keys’ of all the other words in the sequence. This tells the model how relevant each other word is to the current word for the task at hand. Repeating this for every word in the sequence produces a heat map scoring the relevance of different word combinations. This heat map is then used to control how much information about each word is passed to the next layer of the network.
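To make the query/key/value idea concrete, here is a minimal sketch of single-head self-attention in NumPy. The dimensions, the random projection matrices, and the four-token “sentence” are invented for illustration; in a real model the projections are learned during training.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                    # four tokens, each an 8-dimensional vector
x = rng.normal(size=(seq_len, d_model))    # stand-in token embeddings

# Projection matrices for queries, keys, and values (random here, learned in practice).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v        # a query, key, and value per token

# Each token's query is compared against every token's key,
# giving a seq_len x seq_len relevance "heat map".
scores = Q @ K.T / np.sqrt(d_model)
scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax

# The heat map controls how much of each token's value flows to the next layer.
output = weights @ V
print(weights.round(2))    # the relevance heat map
print(output.shape)        # (4, 8): one updated vector per token
```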

Let’s take an example of machine translation from German to English. If a German sentence is in the future tense, using the form “ich werde / I will”, the main verb is placed at the end of the sentence, which does not happen in English. The self-attention mechanism in a transformer-based machine translation model is much better at exploiting the dependency between such distant word pairs than RNN-based models are.

Encoder-only and encoder-decoder models

Transformer models are mainly used in two configurations: encoder-only models and encoder-decoder models.

An encoder-only model is best used when a developer simply wants to encode the input data into a compact list of numbers, often called embeddings, that can be fed into downstream models such as a sentiment detector. Today, Bidirectional Encoder Representations from Transformers (BERT) is the best-known encoder-based transformer model in the industry. BERT is trained on large volumes of unlabeled text, with the training goal of predicting words that have been masked in the input sequence. Such encoder-based models can then be combined with additional layers for classification purposes.
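As an illustration, one common way to obtain such embeddings is with the Hugging Face transformers library. The sketch below assumes the transformers and torch packages are installed and the pre-trained bert-base-uncased checkpoint can be downloaded, and it uses the [CLS] token’s vector as a simple sentence embedding; it is an example of the pattern, not the only way to do it.

```python
# Sketch: turning text into a BERT embedding for a downstream classifier.
# Assumes the Hugging Face `transformers` and `torch` packages are available.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The service was quick and the staff were friendly."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token's hidden state as a compact embedding of the sentence.
embedding = outputs.last_hidden_state[:, 0, :]   # shape: (1, 768)
print(embedding.shape)

# This vector could now be fed to a small classifier (e.g., logistic regression)
# trained to predict sentiment labels.
```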

Alternatively, there are encoder-decoder models. These architectures are common in modern machine learning for use cases like machine translation, for example translating from French to English. The French words are fed in and processed through a sequence of transformer layers, which encode the text into a compact numeric form. The decoder part of the model then takes the encoder’s output and passes it through another sequence of transformer layers to generate the English text word by word, until it predicts the end of the sentence.
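For a concrete feel for this encode-then-decode loop, here is a hedged sketch using a publicly available MarianMT French-to-English checkpoint (Helsinki-NLP/opus-mt-fr-en) through the Hugging Face transformers library; it assumes transformers, torch, and sentencepiece are installed, and the checkpoint and sample sentence are just illustrative choices.

```python
# Sketch: French-to-English translation with an encoder-decoder transformer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-fr-en"   # illustrative pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

french = "Je voudrais réserver une table pour deux personnes."
inputs = tokenizer(french, return_tensors="pt")

# The encoder compresses the French sentence into numeric representations;
# generate() then runs the decoder, emitting English tokens one at a time
# until it predicts the end of the sentence.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```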

Will transformer-based models replace modern machine learning?

Transformer-based models have begun to have a significant impact outside of NLP. The so-called Vision Transformer (ViT) has achieved impressive accuracy on image classification tasks by treating small patches of an image as elements of a sequence, much like the words in a sentence. Transformers have also been extended to video understanding with approaches such as TimeSformer, which uses self-attention to model relationships between patches within a frame and patches in other frames of the sequence. Transformer-based models are even beginning to improve the accuracy of speech emotion recognition.
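The “patches as words” idea can be shown in a few lines. The sketch below uses a random array as a stand-in for a 224x224 RGB image and a ViT-style 16x16 patch size; the numbers and the random projection are assumptions for illustration only.

```python
# Sketch of the Vision Transformer idea: cut an image into fixed-size patches
# and treat the flattened patches as a sequence of "words".
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((224, 224, 3))   # stand-in for a real RGB image
patch = 16                          # ViT-style 16x16 patches

# Split the image into non-overlapping 16x16x3 patches and flatten each one.
patches = (
    image.reshape(224 // patch, patch, 224 // patch, patch, 3)
         .transpose(0, 2, 1, 3, 4)
         .reshape(-1, patch * patch * 3)
)
print(patches.shape)   # (196, 768): a sequence of 196 patch "tokens"

# A learned linear projection maps each patch to the model dimension,
# after which the sequence is processed by ordinary transformer layers.
W = rng.normal(size=(patch * patch * 3, 768))   # random stand-in for a learned projection
tokens = patches @ W
print(tokens.shape)    # (196, 768)
```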

Despite their potential, transformer-based models have shortcomings. The self-attention component has computational complexity that is quadratic in the length of the input, which makes processing long sequences extremely expensive and ill-suited to real-time processing. Imagine the challenges this poses when processing arbitrarily long phone conversations, for example.
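To see why the quadratic cost bites for long inputs, the back-of-the-envelope calculation below estimates the size of the attention score matrix alone, assuming 32-bit floats and counting a single attention head in a single layer; real models multiply this by many heads and layers.

```python
# Rough illustration of quadratic growth: the attention score matrix has
# one entry per pair of positions, so it grows with the square of the length.
for seq_len in (512, 4096, 32768):   # e.g., a sentence vs. a long call transcript
    entries = seq_len ** 2
    megabytes = entries * 4 / 1e6     # 4-byte floats, one head, one layer (illustrative)
    print(f"{seq_len:>6} positions -> {entries:>13,} scores (~{megabytes:,.0f} MB per head per layer)")
```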

At the same time, concerns such as computational load and suitability for real-time stream processing are increasingly active areas of research in what is now a very vibrant scientific community.

The field continues to be incredibly dynamic in developing new techniques and processes to overcome such shortcomings. Previously, for example, fully connected feed-forward networks could not effectively capture the important features in images, which led to the widespread adoption of convolutional neural networks. With ongoing research, transformer functionality will continue to evolve as a key ingredient in modern machine learning models.

The future of transformer-based models

Interest in transformers is warranted because of the widespread improvements in accuracy they bring. These advances are less likely to enable entirely new commercial applications than to help existing and emerging applications get better at what they are designed to do, increasing accuracy and adoption. As with many promising technologies today, the best results come from building on the strengths of each, much like finding human-machine symbiosis in the workplace.

As we look to the future, it remains clear that organizations must not get lost in the hype around emerging technologies or overestimate their potential. The most promising path will be for organizations to find synergies between processes, technologies, and business goals to shape a cohesive, connected technology stack.
