What Are Large Language Models Trained On?
When we talk about Large Language Models (LLMs) like GPT or BERT, a common question pops up: What are these models trained on? These models seem to understand everything from casual conversation to technical documentation, but how do they acquire this knowledge? Let’s break it down in simple terms.
The Basics: What Is Training?
Before diving into what these models are trained on, let's briefly explain the concept of training. In AI, "training" involves feeding a model vast amounts of data so it can recognize patterns, relationships, and structures in that data. LLMs learn to predict the next word (more precisely, the next token) in a sequence; the ability to generate coherent responses based on context emerges from doing that prediction well across billions of examples.
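To make that objective concrete, here's a deliberately tiny sketch of next-word prediction in Python. It just counts which word follows which in a toy corpus and predicts the most frequent follower. Real LLMs use neural networks trained on billions of tokens, but the underlying question ("given what came before, what comes next?") is the same. The corpus and function names below are made up purely for illustration.

```python
# Toy illustration of next-word prediction (not a real LLM):
# count which word tends to follow which, then predict the most likely next word.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count how often each word follows each previous word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_word_counts[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the toy corpus."""
    counts = next_word_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))   # "cat" or "dog", whichever was counted more
print(predict_next("sat"))   # "on"
```

An actual model replaces this counting table with learned parameters and scores its guesses with a loss function, but the training signal is still "did you predict the next token correctly?"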
So, What Are LLMs Fed During Training?
The short answer: a LOT of text.
Here’s a rundown of the key types of data used to train LLMs:
1. Publicly Available Web Data:
LLMs ingest huge amounts of text from the web: Wikipedia articles, blog posts, news reports, product reviews, forum threads, and more. Public repositories like Common Crawl offer vast swathes of the internet that AI developers use to train their models (see the short sketch after this list for one way to peek at such a corpus). If it’s text and it's publicly accessible, chances are an LLM has processed it.
2. Books and Research Papers:
Academic papers, books, and research documents also make up a large part of the dataset. This includes everything from classic literature to highly specialized scientific research papers.
3. Coding Repositories:
Many LLMs, particularly those aimed at coding or technical questions, are trained on datasets drawn from public code hosts like GitHub. By learning from actual code, models like Codex (which originally powered GitHub Copilot) can help developers by suggesting code completions or entire blocks of code.
4. Dialogue Data:
Some LLMs are specifically fine-tuned using datasets of human conversations. This includes chat transcripts, customer service interactions, and forum dialogues. The idea is to teach the model how to handle back-and-forth conversations more naturally.
5. Miscellaneous Data (Manual Annotations, Specialized Corpora):
For specialized tasks, LLMs might be trained on highly curated datasets. For instance, sentiment analysis models might use labeled data where each sentence is tagged as "positive" or "negative" in tone.
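As mentioned under item 1, much of the web data above comes from Common Crawl or corpora derived from it. Here is a minimal sketch of peeking at one such corpus, assuming the Hugging Face `datasets` library and the "allenai/c4" dataset (a cleaned, Common Crawl-derived corpus hosted on the Hugging Face Hub); the exact dataset names and fields may differ from whatever a given model was actually trained on.

```python
# Sketch: streaming a slice of web text like the kind used for LLM pre-training.
# Assumes the Hugging Face `datasets` library and the "allenai/c4" dataset
# (a cleaned subset of Common Crawl); names and fields may change over time.
from datasets import load_dataset

# Stream instead of downloading: corpora like C4 run to hundreds of gigabytes.
web_text = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(web_text):
    print(example["text"][:200])  # first 200 characters of one web document
    if i >= 2:                    # just peek at a few documents
        break
```

Streaming matters at this scale: training pipelines read these corpora incrementally rather than pulling everything onto disk first.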
Is All Data Equal?
Nope! Not all data is created equal. High-quality, well-written content (like news articles or research papers) is typically given more weight during training than informal, low-quality posts scraped from random forums, for example by sampling it more often or filtering the rest more aggressively. The idea is to help the model learn patterns that align with how we actually use language in different contexts.
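One simple way to "weight" sources is to sample from higher-quality corpora more often when assembling training batches. The sketch below illustrates the idea with made-up source names and weights; real training mixtures are tuned empirically and are far more involved.

```python
# Sketch: weighting data sources by sampling higher-quality ones more often.
# The source names and weights here are invented purely for illustration.
import random

# Hypothetical corpora and mixture weights (higher = sampled more often).
sources = {
    "curated_books": 0.4,
    "news_articles": 0.3,
    "web_crawl":     0.2,
    "forum_posts":   0.1,
}

def sample_source():
    """Pick a data source in proportion to its mixture weight."""
    names, weights = zip(*sources.items())
    return random.choices(names, weights=weights, k=1)[0]

# Simulate which sources the next 10 training batches would be drawn from.
print([sample_source() for _ in range(10)])
```

Mixture weights like these are one of the main levers model builders use to balance breadth (messy web text) against quality (books and curated articles).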
What About Sensitive or Biased Data?
This is where it gets tricky. Since LLMs are trained on such broad data sources, they can sometimes pick up and replicate biases found in the original data (racism, sexism, etc.). That’s why ongoing work goes into better filtering and more ethical training of these models.
Conclusion: A Text Buffet for AI
In essence, training large language models is like feeding them an all-you-can-eat buffet of text data. The bigger and more diverse the training set, the more well-rounded and capable the model becomes. However, not all sources are treated equally, and model creators have to be careful about what they feed their AI. After all, you are what you eat—or in this case, what you train on.
— Cohorte Team
October 22, 2024