Understanding Large Language Model Architectures
Introduction/overview
Key ideas
- Large Language Model architectures can be categorized into encoder-decoder, encoder-only, decoder-only, and mixture of experts (MoE) models.
- Each architecture boasts unique strengths but isn't confined strictly to its optimal uses. You can improve and modify these models' capabilities well beyond their original design parameters through fine-tuning.
- Not all LLMs follow the same design. MoE models, for example, break parts of the network into smaller “expert” sub-models with diverse skill sets that are better at handling specialized tasks. This lets the models scale and perform better without hitting computational bottlenecks.
Language models are at the heart of numerous daily applications, ranging from email filtering and web searching to machine translation and voice recognition. Large Language Models (LLMs) have propelled these applications to new heights, reshaping benchmarks across a wide spectrum of Natural Language Processing (NLP) tasks.
Yet, it's crucial to recognize that LLMs vary significantly in their make-up. While their architecture, primarily based on the Transformer model you learned about in the previous lesson, strongly influences their performance, not every LLM is a Transformer. To be classified as an LLM, a model typically embodies the following characteristics:
- Massive scale: Boasting millions or even billions of parameters, LLMs capture intricate language patterns.
- Unsupervised pre-training: These models learn from extensive datasets of text and code through self-supervised objectives, such as masked-token prediction or next-token prediction, building a large repository of language knowledge; the training targets come from the text itself (see the short sketch after this list). This approach may sometimes be supplemented with elements of supervised learning.
- Versatility and multi-task capability: LLMs adeptly handle diverse tasks, such as creative text generation (e.g., GPT-3), contextual question answering (e.g., BERT), language translation, and even generating code or poetry, demonstrating their wide-ranging utility.
- Adaptability and fine-tuning: They can be fine-tuned for specific tasks, improving their performance in particular domains or for specialized objectives.
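To make the self-supervised objectives above concrete, here is a minimal sketch in plain Python. It uses naive whitespace "tokens" purely for illustration (real LLMs use subword tokenizers) and shows how next-token prediction and masked-token prediction both derive their training targets from the raw text itself, with no human labeling:

```python
# A minimal sketch: self-supervised objectives build training pairs from raw text.
# Whitespace "tokens" are used only for illustration; real LLMs use subword tokenizers.
import random

text = "large language models learn patterns from raw text"
tokens = text.split()

# 1) Next-token prediction (used by decoder-only models):
#    every prefix of the sequence is paired with the token that follows it.
next_token_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(next_token_pairs[2])  # (['large', 'language', 'models'], 'learn')

# 2) Masked-token prediction (used by BERT-style encoders):
#    hide a random token and ask the model to recover it from both sides.
masked = tokens.copy()
target_index = random.randrange(len(masked))
target_token = masked[target_index]
masked[target_index] = "[MASK]"
print(masked, "->", target_token)
```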
This lesson will explore how distinct architectures better suit different tasks. To illustrate, we'll discuss four architectures: Encoder-Decoder, Encoder-Only, Decoder-Only, and Mixture of Experts (MoE).
How architectures shape LLMs
Architecture in machine learning (ML) refers to the arrangement of a model's neurons and layers. It's like a blueprint that outlines how the model will learn from the data. Different architectures capture different relationships in the data because they emphasize different components during training. Consequently, the architecture influences the tasks the model is proficient in and the quality of the output it generates.
Encoder-decoder
The encoder-decoder architecture consists of two components:
- Encoder - accepts the input data and converts it into an abstract continuous representation that captures the main characteristics of the input.
- Decoder - translates the continuous representation into intelligible output, generating one token at a time while attending to its own previous outputs.
The encoding and decoding process allows the model to handle complex language tasks through a more efficient representation of the data, which helps the model respond coherently.
This dual-process architecture excels in generative tasks like machine translation (converting a sentence from one language to another) and text summarization (condensing a text to its key points), where comprehending the entire input before generating output is crucial. However, it can be slower at inference because the entire input must be processed first.
LLM examples: T5, BART, and Flan-T5.
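To see the encoder-decoder flow in practice, here is a minimal sketch using Hugging Face's transformers library (assumed installed). It runs T5, an encoder-decoder model, on English-to-German translation, one of the tasks T5 was trained for; "t5-small" is chosen only because it downloads quickly:

```python
# A minimal sketch of an encoder-decoder model: the encoder reads the whole
# input, then the decoder generates the output token by token.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")

result = translator("The encoder reads the whole sentence before the decoder writes.")
print(result[0]["translation_text"])
```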
Encoder-only
Models like the popular BERT ("Pre-training of Deep Bidirectional Transformers for Language Understanding," 2018) and RoBERTa ("A Robustly Optimized BERT Pretraining Approach," 2019) use encoder-only architectures to turn input into rich, contextualized representations without directly generating new sequences.
BERT, for instance, is pre-trained on extensive text corpora using two innovative approaches: masked language modeling (MLM) and next-sentence prediction. MLM works by hiding random tokens in a sentence and training the model to predict these tokens from their context. In this way, the model learns the relationships between words in both the left and right context. This “bidirectional” understanding is crucial for tasks requiring strong language understanding, such as sentence classification (e.g., sentiment analysis) or filling in missing words.
But unlike encoder-decoder models, which can both interpret and generate text, encoder-only models don't natively generate long text sequences; they focus on interpreting the input.
LLM examples: BERT, RoBERTa, and DistilBERT.
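Here is a minimal sketch of BERT's masked-token prediction using Hugging Face's fill-mask pipeline (the transformers library is assumed installed). The model sees context on both sides of the [MASK] token and scores candidate words to fill it:

```python
# A minimal sketch of masked language modeling with an encoder-only model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context on both sides of [MASK] to rank candidate tokens.
for prediction in fill_mask("The movie was absolutely [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```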
Decoder-only
Decoder-only architectures generate the next part of a sequence based on the previous context. Unlike encoder-based models, they read text in one direction with causal attention, so each token can only see what came before it; their strength is predicting the next probable word. As such, decoder-only models are more “creative” and “open-ended” in their output.
This token-by-token output generation is effective for text-generation tasks like creative writing, dialogue generation, and story completion.
LLM examples: the GPT family (e.g., GPT-3, GPT-4), LLaMA, and PaLM.
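Here is a minimal sketch of decoder-only, token-by-token generation using GPT-2 through Hugging Face's text-generation pipeline (library assumed installed):

```python
# A minimal sketch of autoregressive generation with a decoder-only model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts the next most probable token given
# everything it has produced so far.
output = generator("Once upon a time in a quiet village,", max_new_tokens=30)
print(output[0]["generated_text"])
```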
Mixture of Experts (MoE)
MoE, adopted by models like Mixtral 8x7B, diverges from the traditional dense Transformer design and builds on the observation that a single monolithic language model can be decomposed into smaller, specialized sub-models ("experts"). A gating (router) network coordinates these sub-models by deciding which experts process each input token; the experts concentrate on different aspects of the input data.
This approach enables scaling (efficient computation and resource allocation) and varied skills, which makes MoE great at handling complex tasks with varying requirements. The entire purpose of this architecture is to increase the number of LLM parameters without a corresponding increase in computational expense, because only a few experts are active for any given token.
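The sketch below is a toy PyTorch illustration of this routing idea, not Mixtral's actual implementation; the layer name, dimensions, and expert count are made up for demonstration. A small gating network scores the experts for each token, and only the top-k experts run, so most parameters stay idle on any given token:

```python
# A toy sketch of a mixture-of-experts layer with top-k routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, dim=16, num_experts=4, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)  # the router
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, dim)
        scores = self.gate(x)                              # score every expert per token
        weights, chosen = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)               # normalize their weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(5, 16)          # 5 token embeddings
print(ToyMoELayer()(tokens).shape)   # torch.Size([5, 16])
```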
So, is Mixtral 8x7B considered an LLM? Despite its architectural divergence from standard dense Transformer models, it still qualifies as an LLM for several reasons:
- Model size: Its large parameter count, roughly 47 billion parameters in total (of which only about 13 billion are active per token), makes it comparable to other LLMs in complexity and capacity.
- Pretraining: Like other LLMs, Mixtral 8x7B is pre-trained on comprehensive datasets through unsupervised learning techniques, allowing it to understand and mimic human-like language patterns.
- Versatility: It demonstrates proficiency across various tasks, exhibiting LLMs' broad range of capabilities.
- Adaptability: Like other LLMs, Mixtral 8x7B can also be fine-tuned for specific tasks, enhancing performance.
Awesome! You have explored the architectures that shape LLMs and how they determine a model's capabilities. Let's transition to understanding the next crucial aspect of LLMs: the processes of training and fine-tuning these sophisticated models to achieve specific goals and high-performance levels.