New approaches with language models in Machine Learning

Author: Sofia Oliveira - Deeper Insights

Vast amounts of unprocessed text are now available, from online articles to books and social media conversations. Deriving insights from this text requires processing it. However, language is full of ambiguities, which can make it difficult for a machine to understand the meaning of sentences.

In order to improve the performance of systems that process this text, one can first create a model of the language, i.e., a model that can distinguish which sentences are valid (make sense) and which are invalid in a language. This knowledge can then be transferred and applied to other tasks.

There are many ways to create these language models, and depending on the techniques used, they can be classified as statistical or neural. Statistical language models use statistical techniques, like n-grams, to learn the probability distribution over word sequences, while neural language models use neural networks for this purpose.

Currently, a lot of research is directed towards Transformer-based language models. The appearance of the Transformer architecture [1] and, subsequently, BERT [2] and GPT [3], both based on this architecture, marked a shift in the NLP community, as these models resulted in a large increase in predictive power for most language tasks. The main advantage of these models is that they are pre-trained before being applied to tasks. The pre-training is self-supervised, so no manually labelled data is needed, which allows the model to learn the structure of the language without expensive annotation. The pre-trained model can then be fine-tuned on downstream tasks using smaller sets of task-specific data.
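To make the statistical approach concrete, here is a minimal sketch of a bigram (2-gram) language model. It is only an illustration under simple assumptions: the two-sentence corpus, the whitespace tokenisation and the function names are invented for this example, and no smoothing is applied, so any unseen word pair gets probability zero.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next_word | previous_word) from raw counts (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model

def sentence_probability(model, sentence):
    """Multiply the bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob

# Toy corpus, invented for illustration only.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_bigram_model(corpus)

plausible = sentence_probability(model, "the cat sat on the rug")
scrambled = sentence_probability(model, "mat the on sat cat")
print(plausible, scrambled)
```

The sentence that follows word orders seen in the corpus receives a non-zero probability, while the scrambled one receives zero, which is the "valid versus invalid" distinction described earlier in its crudest form.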

Pre-trained language models can be fine-tuned for many tasks, but the tasks a model suits are constrained by its type of pre-training. Autoregressive models, like GPT, learn to predict the next word in a sentence and are therefore useful for Natural Language Generation tasks. Autoencoding models, like BERT, instead learn to predict missing words from noisy input, incorporating bidirectional context. These models are used for text classification, token classification (e.g. Named Entity Recognition) and Natural Language Understanding. Sequence-to-sequence models, like BART [4], learn a representation of a text input and then output another text sequence, and can be used for Machine Translation and Abstractive Text Summarization. Sequence-to-sequence models have also been adapted to other tasks, by recasting the downstream tasks in a text-to-text format and at times using different pre-training objectives [5, 6, 7]. For Semantic Textual Similarity, where the similarity between two texts has to be determined, there are adaptations of these architectures that generate sentence embeddings, which can then be compared using a similarity measure [8, 9, 10].
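The difference between the autoregressive and autoencoding objectives is easiest to see in how training examples are built from the same sentence. The sketch below is illustrative only: the function names are hypothetical, the sentence is made up, and the masked positions are passed in by hand so the result is reproducible (real models choose them at random and work on subword tokens).

```python
def next_token_examples(tokens):
    """Autoregressive objective: predict each token from the tokens to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_token_example(tokens, masked_positions, mask_token="[MASK]"):
    """Autoencoding objective: hide some tokens and predict them from both sides."""
    corrupted = [
        mask_token if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return corrupted, targets

tokens = "language models learn from raw text".split()

# Each example is (left context, word to predict).
causal_examples = next_token_examples(tokens)

# Positions 1 and 4 are an arbitrary choice for this illustration.
corrupted, targets = masked_token_example(tokens, {1, 4})

print(causal_examples[0])
print(corrupted, targets)
```

In the first case the model never sees words to the right of the one it predicts, which matches generation; in the second it sees context on both sides of each gap, which is why such models suit classification and understanding tasks.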

While the pre-train-then-fine-tune paradigm has shown great performance in many tasks, it results in a different model for each task. As such, a recent research trend has been to create ever-larger language models, so as to have a single model that performs all tasks with very few or no examples [11, 12, 13, 14]. These models have pushed the state of the art in many tasks, but they are very large: the largest has over 500 billion parameters (for comparison, BERT large has 340 million), which makes them impractical for many business applications. To scale the number of parameters without a matching increase in computational requirements, Mixture-of-Experts models, like GLaM [17] and Switch Transformers [18], were proposed. Even though such a model can have trillions of parameters, only a subset of them is activated for each prediction.
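The idea of activating only a subset of parameters can be sketched with a toy routing function. This is a simplified illustration, not the implementation used by GLaM or Switch Transformers: the experts are plain Python functions, the gate scores are set by hand instead of being produced by a learned router, and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_scores, experts, k=2):
    """Run only the k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    top_k = ranked[:k]
    # Renormalise the gate scores over the selected experts only.
    weights = softmax([gate_scores[i] for i in top_k])
    output = sum(w * experts[i](x) for w, i in zip(weights, top_k))
    return output, top_k

# Four toy "experts"; in a real model each would be a feed-forward network.
experts = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: -x,
    lambda x: x * x,
]
gate_scores = [0.1, 2.0, -1.0, 2.0]  # hand-set stand-in for a learned router

output, used = moe_forward(3.0, gate_scores, experts, k=2)
print(output, used)
```

Only two of the four experts are evaluated for this input, so the computation per prediction depends on k and not on the total number of experts.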

As models grow, their memory and computation requirements become a challenge in deployment. One way to address this is to compress the models, which can be achieved through weight sharing [19]; knowledge distillation, where a larger model is used to train a smaller one [20, 21]; quantization, which reduces the number of bits used to store each parameter [22, 23]; and pruning, which removes parameters after training [24].
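Of these techniques, quantization is the simplest to show in a few lines. The sketch below uses symmetric 8-bit linear quantization on a made-up list of weights; it is one common scheme chosen for illustration, not necessarily the method of the cited works, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

# Made-up weights, for illustration only.
weights = [0.42, -1.27, 0.05, 0.9]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized, scale, max_error)
```

Each weight is now stored as an 8-bit integer plus one shared scale, instead of a 32-bit float, at the cost of a rounding error of at most half the scale per weight.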

Another research area relates to long text sequences, for example whole documents. A model like BERT accepts a limited number of input tokens (512 in the original model), so long sequences must be split into separate inputs, which prevents the model from accessing all the information needed for prediction. Models like BigBird [25] and LongT5 [26] attempt to solve this issue by sparsifying the attention mechanism, whose cost otherwise grows quadratically with sequence length.
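To give a feel for what sparsifying attention means, the sketch below builds a sliding-window attention mask, in which each token attends only to its near neighbours, and compares the number of attended pairs with full attention. This shows a single local-window pattern under simplified assumptions; BigBird and LongT5 each combine local attention with further components, so this should not be read as either model's actual mechanism. The function names are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (|i - j| <= window)."""
    return [
        [abs(i - j) <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]

def count_attended_pairs(mask):
    """Number of (query, key) pairs for which attention is computed."""
    return sum(sum(row) for row in mask)

seq_len, window = 8, 1
mask = sliding_window_mask(seq_len, window)

sparse_pairs = count_attended_pairs(mask)
full_pairs = seq_len * seq_len

print(sparse_pairs, full_pairs)
```

With a fixed window, the number of attended pairs grows linearly with the sequence length, whereas full attention grows with its square, which is what makes whole-document inputs tractable.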

The language models mentioned so far are all trained on general-domain text and are therefore appropriate for day-to-day language. For a more specific domain, however, these models may lack the vocabulary needed to generate good predictions. To address this issue, models pre-trained in much the same way but on text from a specialised domain have been proposed. Examples are Legal-BERT [27] for legal documents, PubMedBERT [28] for the biomedical domain, FinBERT [29] for the financial domain and ClimateBERT [30] for climate-related text.

Another case where the models described above fall short is non-English and multilingual text. For problems in other languages, there are models trained on data from a single language, e.g. AlephBert [31] (Hebrew), BARTpho [32] (Vietnamese) and CamemBERT [33] (French). Alternatively, one can use multilingual language models, which are pre-trained on unlabelled data from multiple languages. These can then be fine-tuned on a high-resource language (typically English) and perform zero-shot cross-lingual transfer learning, i.e., perform well in another language without seeing task-specific data from that language. Recent examples are XLM-E [34] and mGPT [35].

Finally, two important considerations when using these models are their bias and toxicity, and their environmental impact. Since Transformer-based language models are trained on web text, they can reproduce common human biases (e.g., gender, racial and religious bias). An active line of research therefore explores how to identify and mitigate these biases [36, 37]. Additionally, with growing concern about carbon emissions and climate change, there have been attempts to calculate the carbon emissions of training language models [38, 39].

Deeper Insights is the UK's leading Data Science and AI Consultancy and Top 10 AI company by Forbes in 2022.
