Definition and Evolution of LLMs
Introduction/overview
Key ideas
- Large Language Models (LLMs) like GPT and BERT use the Transformer architecture to process and generate text. Extensive training on diverse datasets lets them pick up context and produce coherent responses.
- Beyond the Transformer architecture, other core LLM concepts are parameters (the values the model adjusts during training), tokens (the units of text the model reads and writes), and context length (the total number of tokens a model can process at one time).
- You can classify LLMs as generic (GPT, LLaMA), instruction-tuned (InstructGPT), dialogue-tuned (LaMDA), domain-specific (BloombergGPT), or retrieval-augmented language models.
Large Language Models (LLMs) have rapidly become a major component of modern artificial intelligence that powers various applications, from ChatGPT by OpenAI to Google's BERT and LaMDA, Meta's Llama, and many others. These models excel at producing text that closely mirrors how you write, responding to prompts with insights, stories, explanations, and even humor. But what sets these models apart, and how do they operate with such versatility?
Throughout this module, you'll learn what LLMs are and how to work with them. You'll discover how an artificial neural network called a Transformer powers these models at their core. You'll also look at their architecture, how to train them, how to adapt them to various use cases, and how to evaluate how good they are.
But first off, what are these LLMs?
What are large language models (LLMs)?
LLMs are artificial intelligence (AI) models trained on a massive corpus of text, enabling them to generate human-like language output. Say you want ChatGPT to write a haiku about the peace of morning dew or explain the workings of quantum mechanics; LLMs are so flexible that they can easily handle these requests.
It's crucial to recognize that while LLMs can produce text that is often hard to tell apart from human writing, they do not "understand" in the human sense. They mimic understanding based on patterns learned from their training data. So how do they work? Let's get an overview in the next section.
How large language models work
You train LLMs on enormous datasets that include various text sources, such as literature, scientific articles, websites, and code repositories. This diverse training helps the model grasp a broad spectrum of language patterns and structures.
Below, let’s understand how LLMs function, from receiving a prompt to generating responses, and their practical applications in various industries.
- Receiving a prompt: When an LLM receives a query or prompt, it begins by analyzing the input using the language patterns it acquired during training. This prompt could range from a direct question to a more abstract request for creative text generation.
- Learning patterns: During training, LLMs learn to recognize and predict language patterns, including how words and phrases typically co-occur and how sentences are structured to convey meaning effectively.
- Understanding nuances: The training also enables LLMs to understand the nuances of different subjects to generate contextually relevant responses to the prompt.
- Generating responses: Based on the training, the LLM generates a response that matches the prompt's context. To do this, it chooses words and phrases that it anticipates will most likely follow the input.
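To make the "choose what most likely follows" idea concrete, here is a deliberately tiny sketch. It is not how an LLM is built: it swaps the neural network for simple word-pair counts, and the corpus and function names are made up for illustration. The principle is the same, though: learn from text which words tend to follow which, then generate by repeatedly picking a likely next word.

```python
from collections import Counter, defaultdict

# Tiny hypothetical "training corpus"; real LLMs train on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish ."

def train_bigram_counts(text):
    """Learning patterns: count how often each word follows each other word."""
    words = text.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

def generate(following, start, max_words=5):
    """Generating a response: greedily extend `start` one word at a time."""
    output = [start]
    for _ in range(max_words):
        nxt = predict_next(following, output[-1])
        if nxt is None:
            break
        output.append(nxt)
    return " ".join(output)

counts = train_bigram_counts(corpus)
print(predict_next(counts, "the"))           # cat
print(generate(counts, "on", max_words=2))   # on the cat
```

A real LLM looks at the whole prompt rather than only the previous word, and it predicts tokens rather than whole words, but the generate loop has the same shape.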
The model's ability to generate coherent and contextually appropriate responses directly results from the scale and diversity of its training data, allowing it to cover nearly any topic. In the real world, LLMs are already making an impact. They're streamlining customer service through chatbots, aiding students in learning languages, and even assisting writers in overcoming creative blocks.
How did we get to this point of awesomeness?
Evolution of large language models (LLMs)
The formative years: birth of conversational AI
- [1960s] ELIZA: You can trace the inception of conversational AI back to ELIZA, created by Joseph Weizenbaum. This early chatbot mimicked human conversation by using pattern matching to respond to user inputs, laying the groundwork for future NLP applications.
Advancements in neural networks
- [1990s] LSTM networks: The introduction of Long Short-Term Memory (LSTM) networks marked a significant leap forward. LSTMs were designed to remember information for long periods, a crucial capability for processing sequences of words and enabling more coherent and context-aware text generation.
The data explosion era
- [2010s] Word embeddings: With the explosion of text data on the internet, techniques like Word2Vec and GloVe transformed NLP by representing words in dense vector spaces that enabled machines to grasp subtle language nuances and relationships.
Breakthroughs in model architecture
- [2018] BERT: Google's introduction of BERT, itself built on the Transformer architecture, represented a paradigm shift in understanding sentence context. By analyzing text bi-directionally, BERT achieved state-of-the-art performance at the time in tasks like question answering and language inference, setting a new standard for LLMs.
The age of transformers
- [2020s] The transformer revolution: Attention-based models built on the Transformer architecture, which was introduced in 2017, have been pivotal as they scaled up. Some examples of these models are GPT-3, Jurassic-1, Megatron-Turing NLG, and LLaMA. They use self-attention mechanisms to understand how words relate to each other in a sentence or document.
The current landscape
- Present day: Today, LLMs like GPT-4 and others are pushing the boundaries of what's possible in AI with tools that can write, converse, summarize, translate, and even generate creative content with a degree of sophistication that can resemble human writing.
What has powered this evolution?
The engines of evolution: data, computation, and algorithms
It's important to look at the factors that have shaped LLM evolution:
- Data: Vast amounts of text are available in open-source repositories, digital libraries, and across the internet, covering a wide range of topics.
- Computation: Breakthroughs in GPUs and AI chips have significantly sped up LLM training and allowed teams to scale up generative pre-training (an LLM training step you will learn in lesson 4).
- Algorithms: The Transformer architecture has been pivotal in improving LLMs' contextual understanding, and few-shot learning lets them pick up new tasks from only a handful of examples.
You’ve seen how LLMs have evolved and learned why they evolved. How about we understand what they are made up of?
Large language model (LLM) concepts
Here, we'll take a look at four concepts that underpin LLMs:
Transformer architecture
Transformers analyze the context of words in a sentence, not just their sequence (the order of words in the sentence). This enables the LLMs to generate coherent text (responses) regardless of the type of sentence or prompt. In lesson 3, you will learn more about the components of Transformers and how they work.
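One way to picture "analyzing context" is the self-attention step that Transformers use. The sketch below is a heavily simplified version under stated assumptions: a single attention head, no learned weight matrices (each word's vector serves as its own query, key, and value), and made-up 2-dimensional embeddings. It only shows the core idea that every word's new representation is a weighted mix of all the words in the sentence.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Simplified single-head self-attention.

    Assumption: queries, keys, and values are the input vectors themselves;
    a real Transformer first multiplies them by learned weight matrices.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sentence (itself included).
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all word vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a three-word input.
sentence = ["river", "bank", "money"]
embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]

outputs, weights = self_attention(embeddings)
for word, row in zip(sentence, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how much one word "attends" to every word in the input. Here "river" gives more weight to "bank" than to "money" because their toy vectors are more alike.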
Parameters
These are the backbone of the LLM's learning capability. They adjust through training to better model the complexities of human language. If you have ever heard of "Falcon 40B" or "LLaMA 2 7B," the ending refers to the number of parameters, in billions, learned by the LLM.
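As a rough illustration of what those numbers mean in practice, the sketch below reads the size tag from a model name and estimates how much memory the weights alone would occupy. The helper names are hypothetical, and the estimate assumes 2 bytes per parameter (16-bit floats); actual requirements depend on the number format and on everything else needed to run the model.

```python
def parameter_count(model_name):
    """Read the trailing size tag of a model name, e.g. '40B' -> 40 billion."""
    size_tag = model_name.split()[-1]
    if not size_tag.upper().endswith("B"):
        raise ValueError(f"no billions suffix in {model_name!r}")
    return int(float(size_tag[:-1]) * 1_000_000_000)

def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Rough size of the weights alone, in gigabytes (10**9 bytes).

    Assumption: 2 bytes per parameter (16-bit floats). Activations,
    optimizer state, and caches are ignored.
    """
    return num_parameters * bytes_per_parameter / 1e9

for name in ["Falcon 40B", "LLaMA 2 7B"]:
    n = parameter_count(name)
    print(f"{name}: {n:,} parameters, ~{weight_memory_gb(n):.0f} GB of weights")
# Falcon 40B: 40,000,000,000 parameters, ~80 GB of weights
# LLaMA 2 7B: 7,000,000,000 parameters, ~14 GB of weights
```

Keep in mind that the tag in a model name is a rounded figure, so these counts are approximations.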
Tokens
These are units of text that can range from whole words ("tokenization") to subwords ("token" and "ization"), punctuation marks, or even single characters. Tokenization splits text into these manageable chunks, with each token representing an element of the input text. It is fundamental for LLMs because the model reads and generates text one token at a time.
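The sketch below shows the splitting idea with a toy tokenizer. It is an assumption-laden stand-in: production tokenizers learn their vocabulary from data (for example with byte-pair encoding), whereas this one uses a small hand-written vocabulary and a greedy longest-match rule. The vocabulary and function name are hypothetical.

```python
# Hypothetical vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"token", "ization", "un", "believ", "able", "the", "!", " "}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become their own token."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            tokens.append(text[i])  # fallback: single character
            i += 1
    return tokens

print(tokenize("tokenization"))                            # ['token', 'ization']
print(tokenize("unbelievable!"))                           # ['un', 'believ', 'able', '!']
print(tokenize("tokenization", VOCAB | {"tokenization"}))  # ['tokenization']
```

The last call shows that the same word can be one token or several depending on the vocabulary, which is why token counts differ between models.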
Context length
This determines how much of a conversation or text the LLM can consider when generating responses. Essentially, it's the total number of tokens—from both the input prompt and the model's output—that the model can process at any given time.
For instance, the 32K variant of GPT-4 can handle a maximum context length of roughly 32,000 tokens, allowing it to remember and use a conversation spanning many pages. In contrast, Claude 2 from Anthropic boasts a context length of up to 100,000 tokens, enabling it to analyze large books and provide detailed answers to specific questions. Learn more about the context lengths of different LLMs in this guide from AGI Sphere.
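Because the limit covers the prompt and the output together, a simple budget check is often all you need before sending a request. The sketch below uses the two context lengths quoted above; the function names and the prompt size are made up for illustration.

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_length):
    """True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_length

def output_budget(prompt_tokens, context_length):
    """Tokens left for the model's reply after the prompt is counted."""
    return max(context_length - prompt_tokens, 0)

# Context lengths quoted in the text above; the prompt size is made up.
small_window = 32_000
large_window = 100_000
prompt_tokens = 45_000

print(fits_in_context(prompt_tokens, 1_000, small_window))  # False
print(fits_in_context(prompt_tokens, 1_000, large_window))  # True
print(output_budget(prompt_tokens, large_window))           # 55000
```

When a prompt does not fit, the usual options are to shorten it, split it into chunks, or switch to a model with a longer context length.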
LLMs also differ in how they are distributed:
- Open-source LLMs are pre-trained LLMs that are free to use under a license (which could vary from commercial to research-only).
- Closed-source LLMs do not publish their pre-trained weights; you can interact with them only through an interface.
Next, let’s learn what types of LLMs exist and where open-source and closed-source versions fit in.
Categories of large language models
Generic language models
These are the broad-spectrum models trained on diverse texts to capture a wide understanding of language. Their strength lies in generating text that's coherent and contextually aligned with the given prompt without the need for specialized machine learning knowledge from the user.
Instruction-tuned language models
These models evolve from generic models: you "fine-tune" (tailor) a generic model to follow specific instructions or prompts. They excel in tasks where text output must adhere to particular guidelines or formats. This makes them invaluable for content creation that requires adherence to explicit directives.
Dialogue-tuned language models
These models specialize in simulating conversational exchanges. Trained on dialogue data, they can maintain the flow of conversation, making them ideal for applications in customer service bots or virtual assistants that require a natural conversational experience.
Domain-specific language models
Tailored for expertise in particular fields, these models undergo fine-tuning with domain-specific data. Whether it's legal document analysis, medical text interpretation, or scientific research summarization, domain-specific models provide nuanced understanding and generation capabilities tailored to the unique needs of each sector.
Retrieval-augmented language models
Retrieval-Augmented Language Models (RALMs) combine the generative capabilities of LLMs with the ability to access and incorporate external information in real time. This integration allows RALMs to retrieve relevant data from large databases or the internet at query time and enrich their responses with up-to-date, factual, and detailed information.
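The retrieve-then-generate flow can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the document store is a hard-coded list, retrieval is plain word overlap (real systems typically use a search index or embedding similarity), and no LLM is called. The sketch stops at building the prompt that would be sent to the model. All names here are hypothetical.

```python
import re

# Hypothetical document store; a real system would query a search index or database.
DOCUMENTS = [
    "The library opens at 9 am on weekdays.",
    "Parking is free for visitors after 6 pm.",
    "The cafeteria serves lunch from noon to 2 pm.",
]

def words(text):
    """Lowercased set of the words and numbers in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Put the retrieved text in front of the question for the LLM to use."""
    context = "\n".join(retrieve(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("When does the library open?", DOCUMENTS))
```

The model then answers from the supplied context instead of relying only on what it memorized during training, which is what keeps responses current.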
Phew! We covered a lot of points. You've seen the types of LLMs, and they all fundamentally rely on the same building blocks. In the next lesson, you will learn about these building blocks. Head over there!