Why Data Matters in Your AI Strategy: GenAI vs. NLP
The emergence of LLM-powered chatbots like ChatGPT, Gemini and Copilot has significantly raised awareness of the potential of the transformer architecture, especially as it becomes increasingly multi-modal, incorporating computer vision and audio/video. This continues the rise of Generative AI (GenAI) that began with Stable Diffusion for image creation. While the user-friendly nature of these tools has driven a surge of interest in AI, it is important to consider whether they align with your use cases or whether there are better alternatives. Although this applies more broadly, here we will focus on the use of LLMs for NLP, particularly for extracting information from text.
Understanding Transformer Variants
It is important to note that generative pre-trained transformer (GPT)-style models, such as ChatGPT, are not the only LLMs. BERT (Bidirectional Encoder Representations from Transformers) is an earlier model from 2018 with many derivatives. As their names suggest, these architectures have very different purposes: GPT-style models are generative (also known as decoders), while BERT-style models are encoders, which extract information from text. This means they understand language in very different ways, even though they are trained on similar data. You might expect decoder models to perform better on generative tasks, while encoder models excel at other tasks. However, the dividing line is not so clear1. There are also encoder-decoder models (the original transformer architecture, similar to seq2seq models), such as T5, which excel at tasks like translation but will not be discussed here. We focus on LLMs for NLP in particular because there the boundary between the use cases for the two styles is blurrier. For example, the distinction is clearer in computer vision, where we don’t expect a generative model such as DALL-E to understand images for us (although caption generation is a good example of the increasingly multimodal nature of these models).
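To make the distinction concrete, here is a minimal sketch using the Hugging Face transformers library: a decoder is loaded with a text-generation head, while an encoder simply returns vectors. The checkpoint names are illustrative, publicly available models, not a recommendation.

```python
# Contrasting the two architectures via Hugging Face transformers.
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Decoder (GPT-style): loaded with a causal language-modelling head,
# so its job is to predict the next token.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder (BERT-style): reads the whole sequence at once and returns
# context-dependent vectors (hidden states), not new text.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = bert_tok("Transformers come in several flavours.", return_tensors="pt")
hidden = bert(**inputs).last_hidden_state  # one vector per token
print(hidden.shape)  # e.g. torch.Size([1, 9, 768])
```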
Generative vs Extractive (Decoding vs Encoding) Models
As previously stated, decoding and encoding models (which I will refer to as decoders and encoders for simplicity) process text in different ways. One way to conceptualise the two is that encoders transform information into a format that computers can understand, while decoders are the inverse, converting the mathematical representations these models use back into something we can read.
Decoders are generative: they take an input (a prompt, which could be a question, with or without context) and essentially predict the next word (technically a token), over and over. These models are very large and trained on huge amounts of data, so they contain a lot of information that helps in these predictions, selecting the statistically most likely word (or deciding to stop). As these models got larger, it became clear that they were capable of reasoning and of taking information from the prompt (in-context learning). This emergent property has allowed for their rapid deployment, as they are flexible and easily adaptable; asking for an answer in different styles is a good example. When interacting with ChatGPT or similar services, it is important to note that they are usually configured to not always choose the most likely word (via the temperature, top_p and top_k settings, sketched below) to allow for more creativity in their responses. However, this also makes hallucinations more likely.
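As a rough sketch of those settings, the snippet below contrasts greedy decoding with sampled decoding in the Hugging Face transformers library, using GPT-2 as a small stand-in for a chat model; the parameter values are illustrative, not tuned.

```python
# Greedy vs sampled decoding with a small GPT-style model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = tok("The capital of France is", return_tensors="pt")

# Greedy decoding: always take the single most likely next token.
greedy = model.generate(**prompt, max_new_tokens=10, do_sample=False)

# Sampled decoding: temperature flattens or sharpens the distribution,
# while top_k and top_p restrict sampling to the likeliest candidates.
torch.manual_seed(0)
sampled = model.generate(
    **prompt,
    max_new_tokens=10,
    do_sample=True,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```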
Encoders can be thought of as extractive: these models are designed to extract information from text. They are still trained on large corpora that give them a lot of knowledge, but instead of generating new words they convert the text into vectors, which are context-dependent. In other words, these models excel at understanding the meaning of words, sentences and so on, and how they relate to each other. As the text has now been “encoded” into a model-ready format, it can be used for many downstream tasks, such as named-entity recognition (NER), for example identifying cities, dates and so on in text, or part-of-speech tagging, as well as classification or sentiment analysis. Although they generally extract information from the text, these models can also be used to extract answers to a question, though in a very different way to a chatbot.
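As an example of such a downstream task, the following sketch runs named-entity recognition with an off-the-shelf fine-tuned encoder; dslim/bert-base-NER is one publicly available checkpoint, and any similar model would do.

```python
# Named-entity recognition with a BERT-family encoder fine-tuned on CoNLL-2003.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dslim/bert-base-NER",        # an off-the-shelf fine-tuned encoder
    aggregation_strategy="simple",      # merge sub-word tokens into whole entities
)

for entity in ner("Ada Lovelace visited Paris in October."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
# Expected output along the lines of:
#   PER Ada Lovelace 0.99
#   LOC Paris 0.99
```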
Question Answering from Text (RAG)
The "internal" knowledge of any model type is limited to its training data. This means that it is unlikely to fit your company as well as it could and is also frozen in time. To overcome this, a technique known as Retrieval Augmented Generation (RAG) is used. Here, we retrieve text to be used by the model instead of - or in addition to - its internal knowledge. There are many retrieval techniques (for example, database or API calls generated by an LLM), but the most common technique is to use a “vector store” where documents are “embedded” using an encoder model (yes, even for GPTs). The question is then also embedded and similar documents found. This is both fast and highly scalable.
The difference comes in how the models then process this retrieved information (often called context) to answer a question. Decoders generate answers in a similar way to someone reading a passage, then putting it down, then answering the question. Although the passage will be fresh in their memory, they may make mistakes or forget information. This can be contrasted with encoders, which act more like someone highlighting relevant parts of the passage.
GPT-style models (decoders) vs BERT-style models (encoders)
Decoder models work by repeatedly predicting the next token (for example, parts of words), based on what is statistically most likely given everything they saw during training (although, as noted above, it is common to not always choose the top prediction, to allow for creativity). What is generated depends highly on the prompt, or instructions. Encoder models, by contrast, process the whole document in one go (hence the “bidirectional” in BERT) and can be trained to predict the start and end of any part of the text that answers a given question.
As decoder models are primarily used as chatbots, they are tuned to be conversational, so their responses will usually be more verbose than those of encoder models. For example, suppose we have this piece of text:
France is located in Western Europe, bordering several countries including Belgium, Germany, and Spain. Its capital, Paris, is famous for its iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral.
And ask “What is the capital of France?”
- A GPT-style model will likely answer with something like “The capital of France is Paris.”, as it prefers complete sentences. Although you can often prevent this with prompt engineering, doing so counters the ease of use that attracts many to these models.
- A BERT-style model, as said above, will identify the start and end of the answer and return only the relevant span, i.e. “Paris” (see the sketch below).
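For illustration, here is a minimal sketch of that extractive behaviour using the Hugging Face question-answering pipeline with a publicly available checkpoint fine-tuned on SQuAD; the choice of checkpoint is an assumption, not a recommendation.

```python
# Extractive QA: the encoder highlights the answer span rather than writing prose.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

context = (
    "France is located in Western Europe, bordering several countries "
    "including Belgium, Germany, and Spain. Its capital, Paris, is famous "
    "for its iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral."
)
result = qa(question="What is the capital of France?", context=context)
print(result["answer"])  # just the highlighted span, e.g. "Paris"
```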
As we can see, the two models work in very different ways, so it is important to consider what you really want. If the goal is something user-facing, then the conversational style of a decoder may be preferable. Alternatively, if you want to pull out certain structured features (keys) in a document, such as names, dates or something more customised (known as key information extraction), then a fine-tuned encoder model may well give better results; since it runs both faster and at lower cost, it would be preferable at scale. Similarly, for categorisation, while in-context learning with a decoder model is possible, it is not the optimal choice: a fine-tuned encoder model is faster and cheaper in the long run. However, it is not always so clear cut, as can be seen below.
Trade-offs between the different models
ChatGPT and similar models have one clear advantage: their sheer size. This, combined with their generative nature, allows them to be very flexible via in-context learning. In contrast, BERT-style models, because of their smaller size, need to be fine-tuned for each task. However, this smaller size also means encoders are cheaper and faster to run (and don’t need a GPU), even when you account for training the custom model. Encoders also have the potential to be more accurate due to their architecture, especially at similar model sizes. Unfortunately, this performance comes at the cost of usability, and encoders lack the apparent reasoning ability of GPT-style models that allows for more complex tasks.
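To give a feel for what “fine-tuning an encoder for each task” involves, here is a hedged sketch of text classification with the transformers Trainer API; the dataset, labels and hyperparameters are placeholders, not a working recipe for any particular problem.

```python
# Fine-tuning a small encoder for classification (toy data for illustration).
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Tiny invented dataset standing in for labelled business data.
data = Dataset.from_dict({
    "text": ["Invoice overdue", "Great service, thanks!"],
    "label": [0, 1],
}).map(lambda row: tok(row["text"], truncation=True,
                       padding="max_length", max_length=32))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()  # the resulting small model then runs cheaply, even on CPU
```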
Even if you decide that an encoder/BERT-type model is the best fit for your use case, you should know that there are still many use cases for GPT-style models. They are ideal for exploratory data analysis, as well as for generating data to train a BERT model (as sketched below), since they are so quick to get started with. However, because they are generalists, they are not always well suited to a specific business problem.
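As a sketch of that data-generation idea, the snippet below asks a GPT-style model for labelled examples via the OpenAI Python client; the prompt, model name and label set are all illustrative assumptions, and an API key is assumed to be configured in the environment.

```python
# Using a GPT-style model to bootstrap training data for a smaller encoder.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

prompt = (
    "Generate 5 short customer-support messages, one per line, each "
    "prefixed with its label ('complaint' or 'praise') and a tab."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

# Parse the generated lines into (label, text) pairs for fine-tuning.
for line in response.choices[0].message.content.strip().splitlines():
    label, text = line.split("\t", maxsplit=1)
    print(label, "->", text)
```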
Final thoughts
As we have shown for NLP, and question answering specifically, it is not wise to assume that GenAI is the answer to everything. In some cases, a smaller or simpler model that isn’t an LLM may give you the results you need (and often in a more explainable way). While an end-user may like to interact with a chatbot, for every step in between it really comes down to your specific use case. In other words, after deciding on the requirements of an AI system, we should satisfy them by putting data first, rather than being driven by the desire to use certain models. In this way, and by defining our metrics for success, we will be able to achieve the goal more efficiently.
1 Technically, BERT can be used for generation by MASKing a whole series of tokens, but this is more of a research exercise.