
  • How Large Language Models (LLMs) Guess the Next Word—And Why That Matters

    “The cat in the …” — you already know the next word is hat. A large language model (LLM) plays the same guessing game, one token after another at machine speed, across dozens of languages. In this article you’ll learn how that guess actually happens and why the humble next‑word game underpins everything from code‑completion to customer‑support chatbots.

    Want a full expert-crafted lesson—complete with diagrams, easy examples, and real-world pros & cons? Download the full PDF for $4.99 »

    Figure 1: How a Large Language Model (LLM) Predicts the Next Word. The input “The cat in the” leads to the prediction “hat” with 95% probability.

    1. From English to Numbers: Tokens Are the Currency of Text

    LLMs can’t read letters; they read numbers. The translation step is called tokenization. Instead of giving every English word its own ID—impossible when new slang pops up daily—modern models use Byte‑Pair Encoding (BPE). BPE starts with single characters and repeatedly merges the most common adjacent pairs until it has a manageable, fixed‑size vocabulary of “chunks.” In GPT‑2 that vocabulary is 50,257 tokens. A chunk can be a whole word (“cat”), part of a word (“tion”), or even punctuation (“,”).
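    You can see tokenization in action with a minimal sketch using the open‑source tiktoken library and its GPT‑2 encoding (our choice of library, not something prescribed by the article; any BPE tokenizer illustrates the same idea). The exact IDs printed depend on the tokenizer’s vocabulary.

        import tiktoken

        enc = tiktoken.get_encoding("gpt2")            # GPT-2's 50,257-token BPE vocabulary
        ids = enc.encode("The cat in the")             # text -> list of integer token IDs
        print(ids)
        print([enc.decode([i]) for i in ids])          # the text chunk each ID stands for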

    Because tokens are mere IDs, the model next looks up each ID’s embedding, a dense vector that captures semantic hints: words for cat‑like things, for instance, cluster near each other in that high‑dimensional space. The embeddings of the sentence

    the ▶ cat ▶ in ▶ the

    become four numeric vectors ready for the next stage.
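    As a rough sketch of that lookup (assuming PyTorch, a randomly initialized table sized like GPT‑2’s, and the illustrative IDs from Section 6 rather than real trained values):

        import torch
        import torch.nn as nn

        # A learned lookup table: row i is the embedding vector for token ID i.
        # Sizes mirror GPT-2 (50,257 tokens, 768 dimensions); these weights are random, not trained.
        embedding = nn.Embedding(num_embeddings=50257, embedding_dim=768)

        token_ids = torch.tensor([464, 1639, 287, 262])   # the example IDs used in Section 6
        vectors = embedding(token_ids)
        print(vectors.shape)                               # torch.Size([4, 768]): four numeric vectors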


    2. Attention: Who Cares About Whom?

    Embeddings flow into a transformer—the architectural core introduced by Google in 2017. The transformer’s superpower is a layer called self‑attention. Attention lets every token ask every other token: “How much should I care about you when deciding my own meaning?”

    Take “The cat in the hat.”

    • “hat” ↔ “cat” gets a high attention score (the model has learned that the two words rhyme and frequently appear together).
    • “in” ↔ “the” gets a lower score; these are function words that say little about each other’s meaning.

    Self‑attention calculates those scores via simple dot products (matrix math that GPUs are highly optimized for). Each token ends up with a context‑aware vector richer than the one it started with.

    External reference: If you want the math, see “Attention Is All You Need” by Vaswani et al. (2017), Section 3.2.
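    Here is a minimal sketch of those dot products (assuming NumPy, and using the token vectors themselves as queries, keys, and values for simplicity; a real transformer first multiplies them by learned projection matrices):

        import numpy as np

        def self_attention(X):
            """X: (n_tokens, d) matrix of token vectors. Returns context-aware vectors."""
            d = X.shape[-1]
            scores = X @ X.T / np.sqrt(d)                              # how much each token "cares" about each other
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax: each row becomes probabilities
            return weights @ X                                         # each token: weighted mix of all tokens

        X = np.random.randn(4, 8)          # four tokens ("the cat in the"), toy dimension 8
        print(self_attention(X).shape)     # (4, 8): same shape, now context-aware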

    Multiple “heads” learn different relations. For example, one head might learn syntax, another might learn common rhymes, and a third head might make factual links. A generative pre‑trained transformer (GPT) contains dozens of these attention blocks, in addition to projection and Multi-Layer Perceptron (MLP) layers.
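    One GPT‑style block might look roughly like the sketch below (assuming PyTorch and its built‑in nn.MultiheadAttention; real GPT blocks also add causal masking, dropout, and positional information). A full GPT simply stacks dozens of such blocks.

        import torch
        import torch.nn as nn

        class TransformerBlock(nn.Module):
            """Multi-head self-attention followed by an MLP, each wrapped in a residual connection."""
            def __init__(self, d_model=768, n_heads=12):               # GPT-2-small-sized dimensions
                super().__init__()
                self.ln1 = nn.LayerNorm(d_model)
                self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.ln2 = nn.LayerNorm(d_model)
                self.mlp = nn.Sequential(                              # position-wise MLP ("projection" layers)
                    nn.Linear(d_model, 4 * d_model),
                    nn.GELU(),
                    nn.Linear(4 * d_model, d_model),
                )

            def forward(self, x):                                      # x: (batch, n_tokens, d_model)
                h = self.ln1(x)
                attn_out, _ = self.attn(h, h, h)                       # heads attend in parallel, then get merged
                x = x + attn_out
                x = x + self.mlp(self.ln2(x))
                return x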


    3. The Guessing Game: Softmax & Temperature

    After all of the layers have processed the input, a GPT’s final step scores every token in the vocabulary, and a softmax turns those scores into probabilities, the model’s “guess” about what comes next:

    Token     Probability
    hat       0.95
    house     0.01
    cap       0.04

    The model samples from that distribution.

    The temperature setting controls how adventurous that sampling is. At temperature = 1.0 the model might pick cap once in a blue moon; at temperature = 0 it always picks the top token, giving safe but sometimes bland prose.
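    A hedged sketch of that sampling step (assuming NumPy; the scores below are made up so that the softmax roughly reproduces the table above):

        import numpy as np

        tokens = ["hat", "cap", "house"]
        logits = np.array([5.0, 1.8, 0.4])                 # made-up pre-softmax scores

        def sample(logits, temperature=1.0):
            if temperature == 0:                           # greedy decoding: always take the top token
                return tokens[int(np.argmax(logits))]
            z = logits / temperature                       # low temperature sharpens, high temperature flattens
            probs = np.exp(z - z.max())
            probs = probs / probs.sum()                    # softmax: probabilities that sum to 1
            return np.random.choice(tokens, p=probs)

        print(sample(logits, temperature=1.0))             # usually "hat", occasionally "cap"
        print(sample(logits, temperature=0))               # always "hat"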

    The chosen token is appended to the prompt, forming

    the cat in the hat

    and the loop begins again to predict the sixth token, the seventh, and so on, until the model emits a stop token or reaches its context‑window limit (GPT‑4 Turbo: 128k tokens).
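    Stripped to its bones, that loop looks something like this sketch (predict_next_token_probs and sample are hypothetical stand‑ins for the transformer stack and the temperature sampling above, not real library calls):

        CONTEXT_LIMIT = 1024              # e.g. GPT-2's window; GPT-4 Turbo allows far more
        STOP_TOKEN = "<|endoftext|>"

        def generate(tokens, predict_next_token_probs, sample):
            while len(tokens) < CONTEXT_LIMIT:
                probs = predict_next_token_probs(tokens)   # one full pass through the model
                next_token = sample(probs)                 # softmax + temperature, as above
                if next_token == STOP_TOKEN:               # the model decided it is finished
                    break
                tokens.append(next_token)                  # the new token becomes part of the context
            return tokens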


    4. Why Context Size Feels Like Memory

    Humans hold roughly 20–30 seconds of linguistic context in working memory. Earlier GPTs mirrored that modest span with 1,024 tokens (roughly 750 English words). Newer models handle book‑chapter‑scale contexts, letting them cite earlier parts of a conversation and sustain long‑form reasoning.

    Yet context is not training‑time knowledge; it’s temporary. Anything older than the window evaporates. That’s why your AI assistant may “forget” details in a very long conversation.
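    A toy illustration of that forgetting, assuming the simplest possible policy of keeping only the newest tokens (real chat products may also summarize or otherwise compress older turns):

        def clip_to_window(tokens, window=1024):
            # Before each prediction, only the most recent `window` tokens are visible;
            # anything pushed out can no longer influence the next guess.
            return tokens[-window:]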


    5. Where Training Fits In

    Guessing accurately relies on gradient‑descent training: the model reads billions of sentences, tries to predict each next token, and nudges its 100‑billion‑plus parameters whenever it guesses incorrectly. Over months of GPU time it internalizes English grammar, trivia, even programming patterns.
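    A single training step might be sketched like this (assuming PyTorch; model is a hypothetical stand‑in for any network that maps token IDs to per‑position vocabulary scores):

        import torch.nn.functional as F

        def training_step(model, optimizer, token_ids):
            """token_ids: (batch, seq) tensor holding a batch of training sentences."""
            inputs, targets = token_ids[:, :-1], token_ids[:, 1:]    # the target is always the *next* token
            logits = model(inputs)                                   # (batch, seq-1, vocab) scores
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))              # penalty for every wrong guess
            optimizer.zero_grad()
            loss.backward()                                          # gradients: how to be less wrong next time
            optimizer.step()                                         # nudge the parameters downhill
            return loss.item()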

    Training needn’t stop there, either: fine‑tuning on niche corpora can teach an LLM medical jargon or legal precedent without rewriting its entire knowledge base. (That process is covered step‑by‑step on pages 16–20 of our premium lesson.)


    6. Putting It All Together

    1. Input: “The cat in the”
    2. Tokenize: [464, 1639, 287, 262] (example vocabulary IDs)
    3. Embed: Four high-dimensional vectors
    4. Self‑attention: Vectors exchange information—“cat” becomes aware of “the”, etc.
    5. Stacked layers: Patterns compound; high‑level “noun phrase” emerges.
    6. Softmax: Probabilities across the ~50k‑token vocabulary; “hat” dominates.
    7. Sample: Model selects “hat” → output token 5023 (the example vocabulary ID for “hat”)
    8. Loop: Append “hat”, shift window, predict again or finish.

    Figure 2: Flowchart showing the eight steps from raw text to the predicted next token in an LLM.
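    To see those eight steps run end to end, here is a hedged sketch using the Hugging Face transformers library and the small GPT‑2 model (our choice; any causal language model would do). Its top prediction should usually be “ hat”, though the exact probability will differ from the illustrative numbers above.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")

        inputs = tokenizer("The cat in the", return_tensors="pt")     # steps 1-2: text -> token IDs
        with torch.no_grad():
            logits = model(**inputs).logits                           # steps 3-5: embed, attend, stacked layers
        probs = torch.softmax(logits[0, -1], dim=-1)                  # step 6: probabilities over the vocabulary
        top_id = int(torch.argmax(probs))                             # step 7: pick the winner (greedy sampling)
        print(repr(tokenizer.decode([top_id])), float(probs[top_id])) # step 8 would append it and loop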

    7. Practical Payoffs of a Simple Trick

    Each of the following is powered by a probabilistic guesser conditioned on context (an LLM!):

    • Chatbots feel conversational because next‑word predictions chain into paragraphs.
    • Auto‑complete in IDEs predicts your code one token at a time.
    • AI‑generated images start with a text prompt encoded token‑by‑token.
    • Search engines now rewrite queries (“people also ask”) via the same mechanism.

    8. Common Misconceptions

    Myth: “LLMs think like humans.”
    Reality: They detect patterns and probabilities; no inner monologue needed.

    Myth: “Bigger models just memorize.”
    Reality: They compress statistical patterns and generalize to new text; verbatim copying is the exception, not the rule.

    Myth: “Temperature = 0 makes answers factual.”
    Reality: Low temperature reduces the randomness of the output, not factual errors. Source grounding and retrieval help with accuracy.

    9. Ready to Go Deeper?

    You’ve seen the bird’s‑eye view. The full Tech In 15 LLM lesson unpacks:

    • byte‑pair‑encoding math with worked examples
    • multi‑head attention visualizations
    • gradient‑descent training loop
    • pros & cons of commercial deployment

    Download the 22‑page PDF (readable in 15 minutes) for just $4.99!

    👉 Grab the lesson here

    (Instant digital delivery, 100% money‑back guarantee if you don’t learn something new.)


    10. Frequently Asked Questions

    Is this the same technology behind ChatGPT?

    Yes. ChatGPT, Microsoft Copilot, Claude, and others are all large language models that operate via next‑token prediction inside a transformer architecture.

    Is a small context window still useful?

    Yes—most tasks like short Q&A or summarizing don’t need more than 8,000 tokens (~6,000 words). You only need a larger context window if your task depends on remembering information from much earlier. But bigger isn’t always better—if the model has too much to sift through, it might miss what matters.

    What makes LLMs “large”?

    “Large” refers to the number of parameters in the model—often billions or trillions—and the size of the dataset used for training. This scale enables the model to generalize across many tasks but also makes training expensive.

    How long does it take to train a model like GPT?

    Training a model like GPT-3 can take weeks using thousands of high-end GPUs. The cost can reach tens of millions of dollars, depending on compute and dataset size.

    Is next‑word prediction the only thing LLMs do?

    At a low level, yes—but chaining predictions together creates surprisingly rich capabilities, including coding, summarizing, and chatting. That’s the power of scale and training.

    Can LLMs really “understand” language?

    They don’t “understand” in the human sense, but they model patterns in language so well that it feels like understanding. They work by recognizing statistical relationships, not consciousness.

    Why does the same prompt sometimes give different results?

    LLMs are probabilistic. Unless temperature is set to 0, they may choose different valid next tokens each time—like rolling weighted dice. This can make them creative, but also less predictable.