Training and Fine-Tuning Large Language Models
Introduction/overview
Key ideas
- The training process of LLMs, such as GPT-3, is computationally intensive, requiring extensive data and powerful hardware. It can sometimes lead to models that are excellent at language generation but may require further fine-tuning to follow specific task instructions accurately.
- Training LLMs starts with preparing extensive, diverse, high-quality data, then initializing a baseline model and running generative pre-training (with Transformers) to create a base LLM.
- Fine-tuning is crucial for turning a general-purpose LLM into a more task-oriented tool, like a virtual assistant capable of following precise instructions.
Large Language Models (LLMs) get their impressive abilities from the extensive amounts of data used in their training process. While training LLMs adheres to the foundational principles of a traditional Machine Learning (ML) workflow, they are distinguished by their significantly larger scale, often involving billions of parameters (ever heard of Llama 2 70B? 😅) and requiring substantial computational resources. This scale introduces unique challenges and additional metrics for evaluation.
In addition, generating a usable model isn’t just about collecting data. A crucial subsequent step is choosing a foundational model, such as BERT or GPT, which is the starting point for further development. You can refine this base model by fine-tuning it to suit specific needs or expanding its training corpus with additional data relevant to the intended application.
Training LLMs might initially appear complex, but fear not! In the upcoming sections, we will detail each aspect, demystifying the process and guiding you through the intricacies of training these powerful models.
The large language model (LLM) training process
Over the past three lessons, we covered introductions to LLMs like GPT and Bard, their building blocks, and their architectural differences. That’s great! You must be wondering how we train these massive models to get such incredible results. I've got you covered. Let's explore this process.
1. Data preparation
Data preparation, a foundational step in LLM training, starts with sourcing data from diverse origins like books, articles, web content, code repositories, social media posts, community forums, etc. Given the myriad available sources, you need effective strategies for capturing and organizing valuable data.
Following data sourcing, preprocessing focuses on the quality of your data. This includes cleaning, tokenizing, de-duplicating, and maintaining format consistency. For instance, in developing an LLM for legal applications, preprocessing might involve breaking down legal documents into standardized phrases or terms.
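To make this concrete, here is a minimal Python sketch of the kind of cleaning and exact de-duplication pass described above. The normalization rules and hashing strategy are illustrative assumptions, not a prescribed pipeline:

```python
import hashlib
import re

def clean_text(text: str) -> str:
    """Collapse whitespace and drop non-printable characters (illustrative rules)."""
    text = re.sub(r"\s+", " ", text)
    text = "".join(ch for ch in text if ch.isprintable())
    return text.strip()

def deduplicate(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each cleaned document, matched by hash."""
    seen, unique_docs = set(), []
    for doc in documents:
        cleaned = clean_text(doc)
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(cleaned)
    return unique_docs

corpus = ["The court held  that...", "The court held that...", "A new statute provides..."]
print(deduplicate(corpus))  # the two near-identical snippets collapse into one
```

Real pipelines typically go further (language filtering, near-duplicate detection, removal of personal data), but the shape is the same: normalize, fingerprint, filter.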
2. Initial training
Let’s take a model like GPT-3 as an example. Training starts from an initial state with a simple baseline model (e.g., a bigram model). At this stage, your model (with random weights) essentially knows nothing about language and would produce gibberish if prompted (initial output).
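For intuition, here is a tiny count-based bigram model in Python. The toy corpus and sampling logic are assumptions purely for illustration; the point is that a baseline of this kind predicts the next token from simple frequency counts and nothing more:

```python
import random
from collections import defaultdict

# Toy corpus; a real baseline would see the full prepared dataset.
corpus = "the model learns the patterns of the language".split()

# Count how often each token follows each preceding token.
bigram_counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def sample_next(token: str) -> str:
    """Sample the next token in proportion to the observed bigram counts."""
    followers = bigram_counts[token]
    if not followers:
        return random.choice(corpus)  # unseen token: fall back to a random word
    tokens, counts = zip(*followers.items())
    return random.choices(tokens, weights=counts, k=1)[0]

token, output = "the", ["the"]
for _ in range(5):
    token = sample_next(token)
    output.append(token)
print(" ".join(output))  # locally plausible, globally incoherent text
```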
3. Generative pre-training
The next step is a resource-intensive pre-training stage. As you learned in previous lessons, most LLMs use Transformer-based architectures because they process sequential data efficiently, and the pre-training phase is where that architecture comes into play.
In this stage, you extend the baseline model with components of the Transformer architecture (encoder, decoder, etc.), including the attention layers that capture context. Tokens from the input data flow through these components during training.
Here, you train the model on the dataset with an unsupervised learning approach: it learns to predict the next token in a sequence, picking up grammar, context, and language patterns without explicit labeling. The model learns by adjusting its internal parameters (weights) based on its prediction errors on the input data.
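As a rough illustration of that objective, here is a minimal PyTorch-style sketch of a single next-token-prediction training step. The tiny model, random batch, and hyperparameters are placeholders, not a real pre-training configuration:

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch_size = 1000, 64, 16, 8

# Toy stand-in for a Transformer language model: embedding -> Transformer layers -> vocab projection.
embedding = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)

params = list(embedding.parameters()) + list(transformer.parameters()) + list(lm_head.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-4)

# Random token ids stand in for one batch of tokenized text.
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # each position predicts the next token

# Causal mask so each position only attends to earlier tokens.
mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))

logits = lm_head(transformer(embedding(inputs), mask=mask))
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()        # compute gradients from the prediction error
optimizer.step()       # adjust the weights
optimizer.zero_grad()
print(f"next-token loss: {loss.item():.3f}")
```

Real pre-training repeats this step over billions or trillions of tokens across many accelerators, but the core loop (predict the next token, measure the error, adjust the weights) is the same.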
The primary outcome here is a model that has developed a broad understanding of natural language. It becomes capable of understanding syntax, grammar, and, to a certain extent, the context of texts.
The pre-trained model is not specialized in any particular task but can understand and generate language. This generalization makes it versatile, but its knowledge may not always be up-to-date or completely accurate, as you have likely seen with many “state-of-the-art” LLMs 🫣.
A few practical caveats before you consider pre-training your own model:
• Only a few companies pre-train their LLMs—the Metas, OpenAIs, Anthropics, Mistrals, and Googles of the world. These companies have copious amounts of data and compute resources for LLMs. The rest of us? Maybe not.
• The ongoing cost of continuing to train a model to keep it up-to-date (especially for very dynamic use cases) will have you questioning its purpose.
• After spending a lot on training, your model will still need to beat the best base model out there.
Post-pre-training, your LLM is ready for more specialized fine-tuning based on specific tasks or domains. Let’s see how to accomplish that next.
Fine-tuning the base large language models (LLMs)
You’ve trained a base model that probably understands general language nuances and can generate text. How do you adapt it to perform specific tasks more effectively? Enter fine-tuning! Why is it essential? If you are not building a domain-specific application (say, a legal virtual assistant), you probably won’t need this step; if you are, there’s a high chance you will.
You can fine-tune your models on a task (say, a question-answering task) or data (a corpus of questions and corresponding answers). What do you need for fine-tuning?
- High-quality data - for the LLM to learn patterns accurately.
- Domain expertise to prepare high-quality data, select the right technique(s), tune the hyperparameters, and validate that the model performance meets functional (responds accurately) and operational (can work for you) requirements.
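To make the data side concrete, here is a tiny, hypothetical question-answering corpus of the kind mentioned above; the pairs are invented purely for illustration:

```python
# A hypothetical fine-tuning corpus for a legal question-answering assistant.
qa_pairs = [
    {
        "question": "What does 'force majeure' mean in a contract?",
        "answer": "A clause that excuses performance when extraordinary, unforeseeable events occur.",
    },
    {
        "question": "What is a non-disclosure agreement?",
        "answer": "A contract in which parties agree not to share specified confidential information.",
    },
]
# A real corpus would contain thousands of such pairs, reviewed by domain experts.
```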
So, how do you fine-tune the LLM? Two techniques:
- Traditional fine-tuning: You are likely familiar with this one - freeze the model weights, add new layers, tweak the learning parameters, and start training.
- Parameter-efficient fine-tuning: Here, you alter a small, specific part of the model's weights through “decomposition matrices.” These matrices provide a compact but effective way to adjust the model's weights by fine-tuning specific parameters without retraining the entire network. This popularly preferred approach maintains the overall structure and knowledge of the pre-trained model. An example is LoRA (low-rank adaptation), sketched below.
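To show what the parameter-efficient route looks like in practice, here is a minimal LoRA sketch assuming the Hugging Face transformers and peft libraries; the model name and hyperparameters are illustrative choices, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-trained base model (the model name is just an example).
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Low-rank decomposition matrices get attached to the attention projections.
lora_config = LoraConfig(
    r=8,                        # rank of the decomposition matrices
    lora_alpha=16,              # scaling factor applied to the LoRA update
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```

From here, you would train `model` on your task data with a standard training loop or trainer; the frozen base weights stay untouched, and only the small adapter matrices are updated.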
That said, fine-tuning comes with its own challenges:
• It may get computationally expensive when you scale up the parameters.
• You may require an extensive amount of domain data to achieve good performance.
• The domain expertise requirement might be a hassle.
Okay, I get it. Fine-tuning seems challenging, but what if you don’t want to go through all that? It’s time to engineer those prompts and use RAG (retrieval-augmented generation) in the next lesson.
If you do decide to fine-tune, there are two broad options when choosing a framework:
• High-level: These frameworks are easy to use and abstract away low-level details. For example, LangChain and Hugging Face Transformers offer simple APIs—and, in the case of Transformers, an extensive library of pre-trained models—making them an excellent starting point (see the short example after this list).
• Low-level: Frameworks like PyTorch and TensorFlow offer deep customization options (tweak models at the architectural and compute levels) if you want to fine-tune model performance at a granular level.
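For instance, a minimal sketch of the high-level route using the Hugging Face Transformers API (the model name and prompt are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-trained model and its tokenizer through the high-level API.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Generate a short continuation from a prompt.
inputs = tokenizer("Fine-tuning adapts a base model to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The low-level route would instead involve writing the model and training loop yourself in PyTorch or TensorFlow, which buys flexibility at the cost of more code.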