The talk is an overview of building large language models (LLMs), covering the practical components that matter for training and deployment. LLMs are neural networks built on transformers, with five key components: architecture, training loss and training algorithm, data, evaluation, and the systems that run them on hardware. The speaker emphasizes that academia tends to center on architectures and losses, but in practice data, evaluation, and systems are the dominant concerns.
Pretraining and post-training are introduced. Pretraining is the classical language-modeling regime, training a model to capture the distribution of Internet text. Post-training (or alignment) turns these language models into AI assistants, aligning them with user instructions and safe behaviors, a path popularized by ChatGPT. Both stages are discussed, with a focus on the non-architectural aspects.
Language models are probability models over sequences of tokens. Autoregressive language models factor the joint probability of a sequence into a product of next-token conditional probabilities, each conditioned on the preceding context. Generation repeats a simple loop: predict the distribution over the next token, sample from it, append the sampled token to the context, and finally de-tokenize. Training minimizes the cross-entropy loss, which is equivalent to maximizing the log-likelihood of the observed token sequence.
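A minimal PyTorch-style sketch of this loop, assuming a hypothetical `model` that maps a batch of token IDs to next-token logits and a hypothetical `tokenizer` with `encode`/`decode` methods (neither comes from the talk; they stand in for any concrete implementation):

```python
import torch
import torch.nn.functional as F

def generate(model, tokenizer, prompt, max_new_tokens=50):
    # Encode the prompt into token IDs (batch of size 1).
    ids = torch.tensor([tokenizer.encode(prompt)])
    for _ in range(max_new_tokens):
        logits = model(ids)                                # (1, seq_len, vocab_size)
        probs = F.softmax(logits[0, -1], dim=-1)           # next-token distribution
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append and repeat
    return tokenizer.decode(ids[0].tolist())               # de-tokenize

def lm_loss(model, ids):
    # Cross-entropy between the model's predictions and the shifted targets,
    # i.e., the negative log-likelihood of each observed next token.
    logits = model(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))
```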
Tokenization and tokenizers are crucial. Tokens go beyond words, accommodating languages without clear word boundaries and handling typos. Byte Pair Encoding (BPE) is highlighted as a common tokenizer method. Tokenizers are trained on large corpora: you start with characters as the initial tokens and iteratively merge the most common adjacent token pairs into subword tokens, as sketched below. Vocabulary size and tokenization choices affect both the model's performance and its perplexity. Pre-tokenizers handle spaces and punctuation to balance efficiency with robustness. Tokens have unique IDs; a token can be reused across contexts, with its meaning inferred from the surrounding tokens by the transformer.
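A toy sketch of that training procedure, assuming whitespace pre-tokenization and a character-level start (real tokenizers such as GPT-2's operate on bytes and handle many more details):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Minimal BPE sketch: start from characters, repeatedly merge the
    most frequent adjacent token pair into a new subword token."""
    # Represent each word as a tuple of single-character tokens.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # merge into one token
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

print(train_bpe("low lower lowest low low", num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```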
Evaluation basics include perplexity (the exponential of the average per-token cross-entropy loss) and standard benchmarks like HELM, the Hugging Face Open LLM Leaderboard, and MMLU (a large collection of multiple-choice question-answering tasks). Perplexity values have dropped dramatically over the years, but perplexity is less used in academic benchmarking now because it depends on tokenizer choice and evaluation data. Evaluation challenges include inconsistent evaluation methods across organizations, test-train contamination, and the need for robust benchmarks. For many tasks, open-ended evaluation is hard; it is common to constrain the model to pick among multiple choices or to compare the likelihoods of candidate answers.
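A worked example of the perplexity definition, with made-up per-token losses purely for illustration:

```python
import math

# Perplexity = exp(average per-token cross-entropy).
per_token_nll = [2.1, 1.7, 3.0, 2.4]   # -log p(token | context), in nats (toy values)
perplexity = math.exp(sum(per_token_nll) / len(per_token_nll))
print(perplexity)  # ~10: the model is "as confused as" a 10-way uniform choice
```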
Data is a central challenge. "All of the Internet" is vague; Internet data is dirty and not representative. The data pipeline typically involves: web crawling (Common Crawl contains hundreds of billions of pages, about a petabyte of text), text extraction from HTML (removing boilerplate like headers and footers), filtering undesirable content (NSFW material, harmful content, PII), deduplication, heuristic quality filtering, and model-based filtering to bias toward higher-quality sources (e.g., pages referenced by Wikipedia). Domain classification is used to upweight or downweight domains (code and books are often upweighted, entertainment downweighted). The end of training often includes a pass over high-quality data (e.g., Wikipedia) with a small learning rate, effectively overfitting on clean data. Data challenges include balancing domains, processing efficiency, copyright issues, and the scale of workforce and compute required. Training data sizes are vast: early academic benchmarks used tens to hundreds of billions of tokens, while state-of-the-art models train on trillions (Llama 2 used about 2 trillion tokens, Llama 3 about 15 trillion, and GPT-4 is estimated to be in a similar range), with correspondingly large compute demands.
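A toy illustration of two of those stages, exact deduplication and heuristic quality filtering (the thresholds here are invented for illustration; production pipelines use fuzzy deduplication such as MinHash and far richer heuristics):

```python
import hashlib

def clean_corpus(docs):
    """Sketch of two pipeline stages: drop exact duplicates via hashing,
    then apply a couple of crude quality heuristics."""
    seen = set()
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                    # exact duplicate: skip
            continue
        seen.add(digest)
        words = text.split()
        if len(words) < 50:                   # too short to be useful (made-up cutoff)
            continue
        alpha = sum(w.isalpha() for w in words) / len(words)
        if alpha < 0.7:                       # likely markup or boilerplate (made-up cutoff)
            continue
        yield text

# Usage: kept = list(clean_corpus(raw_documents))
```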
Scaling laws are highlighted: larger models, more data, and more compute yield better performance in a predictable way. Plotted on log-log scales, test loss decreases roughly linearly as compute, data, and parameter count grow, allowing extrapolation from small runs to plan resource allocation. Chinchilla-style experiments find the optimal tokens-per-parameter ratio under a fixed compute budget, offering guidance on whether to invest in a bigger model or more data. Important takeaways: data quality, quantity, and efficiency are often more impactful than marginal architectural tweaks; the data step is extremely costly and central to practical success; and optimal resource allocation balances model size, data volume, and compute.
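A back-of-the-envelope sketch of that allocation, using two common rules of thumb not spelled out in the talk: training compute is roughly 6 × parameters × tokens FLOPs, and Chinchilla found roughly 20 training tokens per parameter to be compute-optimal:

```python
def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Given a compute budget C ~ 6 * N * D FLOPs and a target
    tokens-per-parameter ratio D/N, solve for the compute-optimal
    model size N (parameters) and dataset size D (tokens)."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_allocation(1e24)   # a hypothetical 1e24-FLOP budget
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```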
Post-training (alignment) aims to turn LMs into helpful AI assistants. The approach typically starts from a pretrained LM and fine-tunes it on human-provided demonstrations (supervised fine-tuning, SFT) so that it imitates desired responses. SFT data are collected from humans, often as demonstrations of the desired question-answer style. A notable development is Alpaca, where a small set of human-written seed prompts was used to have an existing LLM (OpenAI's text-davinci-003) generate 52,000 instruction-response pairs, on which a base LLaMA model was then fine-tuned. Data-scaling experiments in SFT (e.g., increasing from 2,000 to 32,000 examples) show diminishing returns; SFT primarily teaches formatting rather than expanding factual knowledge. Reinforcement learning from human feedback (RLHF) introduces a reward signal from human preferences to optimize model outputs. The typical RLHF pipeline applies supervised fine-tuning, then trains a reward model on human judgments, and finally optimizes the policy with PPO (proximal policy optimization). PPO, with its practical RL complexities, is contrasted with newer approaches like DPO (direct preference optimization), which directly raises the likelihood of preferred outputs and lowers the likelihood of dispreferred ones, avoiding some of PPO's machinery. DPO is presented as simpler while achieving similar or better results in some settings.
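A minimal sketch of the DPO objective, assuming the per-example sequence log-probabilities have already been computed under the policy and under a frozen reference model (β is the usual DPO temperature hyperparameter; variable names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO sketch: push the policy's likelihood up on preferred responses (w)
    and down on dispreferred ones (l), relative to a frozen reference model,
    with no explicit reward model or PPO loop. Inputs are tensors of
    per-example sequence log-probabilities."""
    policy_margin = logp_w - logp_l        # how much the policy prefers w over l
    ref_margin = ref_logp_w - ref_logp_l   # same margin under the reference
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```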
Human data challenges are discussed: labeling quality, annotator distribution shifts, and ethics; humans agree with each other only around two-thirds of the time on binary preference tasks. Costs are substantial, so practitioners use a mix of human and LM-generated data. A notable development is using LLMs themselves to generate preference labels, reducing labeling costs while maintaining alignment quality. Evaluation of post-trained models relies on human preferences (e.g., Chatbot Arena-style benchmarks) rather than standard validation loss or perplexity, because alignment shifts the objective away from likelihood toward human-preferred outputs. Correlations with human judgments are strong for some benchmarks, but there are concerns about biases (e.g., longer outputs being favored) and calibration issues.
Systems and hardware are essential. The bottleneck is compute, with throughput (not latency) being the critical metric. GPUs excel at throughput via massive parallelism and matrix multiplications; memory and communication bottlenecks constrain scaling. Techniques to improve efficiency include mixed precision (16-bit computation, with a master copy of the weights kept in 32-bit precision) and operator fusion (fusing multiple operations into a single kernel) to reduce data movement. PyTorch optimizations like torch.compile can yield substantial speedups by compiling models into fused kernels. Other topics like tiling, mixture of experts, and deeper system-level optimizations are acknowledged but not detailed.
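A minimal sketch of both techniques in PyTorch 2.x (toy linear model and dummy loss; assumes a CUDA GPU with bfloat16 support):

```python
import torch

# Toy setup: a single linear layer standing in for a real model.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model = torch.compile(model)  # JIT-compiles the model into fused kernels

x = torch.randn(32, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)             # matmuls run in 16-bit inside autocast
    loss = out.pow(2).mean()   # dummy loss for illustration
loss.backward()                # master weights and optimizer state stay fp32
optimizer.step()
optimizer.zero_grad()
```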
The talk closes with pointers to Stanford courses for deeper study: CS 224N (natural language processing foundations), CS 324 (large language models in depth), and CS 336 (building a large language model from scratch). The overarching message is that data, evaluation, and systems are the keys to practical, scalable LLM success, with architectural differences often mattering less in practice than how data and compute are managed.