Why Do Large Language Models Output Text Word‑by‑Word? Inside the Transformer Mechanics
This article explains the fundamental architecture of large language models, from the dual file nature of parameters and code, through neural network basics, perceptrons, and weight training, to the Transformer’s tokenization, positional encoding, self‑attention, and inference processes, illustrated with diagrams and examples.
Preface
Why does ChatGPT appear to type one word at a time? The seemingly human‑like output is not for show; it is a direct consequence of the model’s underlying implementation.
The Essence of Large Models
Former Tesla AI director Andrej Karpathy describes a large language model as essentially two files: a parameter file (the weights) and a code file that runs those parameters, typically written in Python.
The parameters constitute the neural network’s weights, while the code executes the network.
The next question is where the parameters come from, which leads to model training.
In essence, large‑model training is lossy compression of massive internet data (about 10 TB of text) requiring a huge GPU cluster.
For example, by Karpathy's rough figures, training the 70‑billion‑parameter Llama 2 model takes about 6,000 GPUs running for 12 days and costs roughly $2 million, producing a "compressed file" of about 140 GB.
With this compressed file, the model forms an understanding of the world.
How Large Models Work
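The model predicts the next word in a sequence using the knowledge compressed into its neural network weights.
For example, given the input "中华人民" (the first half of "People's Republic of China"), the model predicts "共和国" ("Republic") with high probability. It then keeps predicting one word after another, producing a sentence such as "中华人民共和国成立于1949年" ("The People's Republic of China was founded in 1949").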
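A single prediction step can be sketched as picking a token from a probability table. The probabilities and the alternative tokens below are invented for illustration and do not come from any real model.

```python
# Hypothetical next-token probabilities for the prompt "中华人民".
# The numbers are made up; a real model computes them from its weights.
next_token_probs = {
    "共和国": 0.92,  # "Republic"
    "银行": 0.05,    # "Bank"
    "大学": 0.03,    # "University"
}

def predict_next(probs):
    """Greedy decoding: return the token with the highest probability."""
    return max(probs, key=probs.get)

prompt = "中华人民"
completion = prompt + predict_next(next_token_probs)
# completion is now "中华人民共和国"
```

Real systems often sample from the distribution instead of always taking the top token, which is why the same prompt can yield different answers.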
Neural Network
Neural networks are not as complex as they seem; they are loosely modeled on the network of neurons in the human brain, which works roughly as follows:
External stimuli are converted to electrical signals that travel to neurons.
Billions of neurons form the central nervous system.
The central nervous system integrates the signals and makes decisions.
The body acts on the central nervous system's commands.
Perceptron
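The perceptron, invented by Frank Rosenblatt in 1957, is the simplest neural network, and its basic idea still underlies the networks in use today.
A perceptron takes multiple binary inputs (0 or 1) and produces a binary output. Example: deciding whether Zhang San should go to a movie based on three factors: whether the weather is good, whether the price is acceptable, and whether his girlfriend will come along.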
Inputs are weighted (e.g., weather = 8, price = 4, girlfriend = 4). The weighted sum is compared to a threshold (e.g., 8) to produce the final decision.
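The movie decision can be sketched in a few lines. The weights and threshold come from the example above; the rule that the sum must be strictly greater than the threshold is an assumption, since the text only says the two are compared.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Assumption: "exceeds" means strictly greater than the threshold.
    return 1 if weighted_sum > threshold else 0

# Input order: weather, price, girlfriend.
weights = [8, 4, 4]
threshold = 8

# Good weather only: the sum is 8, which does not exceed 8, so he stays home.
stay = perceptron([1, 0, 0], weights, threshold)
# Good weather and an acceptable price: the sum is 12, so he goes.
go = perceptron([1, 1, 0], weights, threshold)
```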
During training, weights are adjusted using large datasets to improve prediction accuracy.
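As a rough illustration of weight adjustment, the sketch below uses the classic perceptron learning rule on a tiny hand-made dataset (the logical AND of two inputs). The dataset, the learning rate, and the use of a bias term in place of a threshold are assumptions chosen for brevity; large models are trained with gradient descent and backpropagation instead.

```python
def predict(inputs, weights, bias):
    """Perceptron output: 1 if the weighted sum plus bias is positive."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1):
    """Nudge weights toward the target whenever a prediction is wrong."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(inputs, weights, bias)  # -1, 0, or 1
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Toy dataset (hypothetical): output 1 only when both inputs are 1.
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
trained_weights, trained_bias = train_perceptron(and_samples)
```

After training, `predict` returns the correct label for all four samples, because the AND function is linearly separable.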
Play with Neural Networks
TensorFlow Playground (http://playground.tensorflow.org/) offers an interactive environment to experiment with simple neural networks.
GitHub: https://github.com/tensorflow/playground
Transformer Architecture (Deep Learning Model)
Most modern large models are based on the Transformer architecture, introduced in 2017 by Vaswani et al. Its core innovation is the self‑attention mechanism.
1. Vectors and Matrices
Recall high‑school concepts: vectors, vector addition, scalar multiplication, and matrices.
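For readers who want a quick refresher, these operations look as follows in NumPy. The values are arbitrary examples.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

vector_sum = u + v   # vector addition, element by element
scaled = 2 * u       # scalar multiplication

# A matrix transforms one vector into another.
M = np.array([[1.0, 0.0],
              [0.0, 2.0]])
transformed = M @ v  # matrix-vector product
```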
2. Transformer Diagram
3. Tokenization and Embedding
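Input text is first split into tokens and each token is assigned an integer ID (for example, with OpenAI's tiktoken library). An embedding table learned during training then maps each token ID to a high‑dimensional vector.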
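The two steps can be sketched with a toy vocabulary. Everything here is hypothetical: real tokenizers such as tiktoken split text into subword units with byte-pair encoding, and real embedding tables are learned rather than random.

```python
import random

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of tokens.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
EMBED_DIM = 4  # real models use hundreds or thousands of dimensions

def tokenize(text):
    """Split on whitespace and map each word to its token ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# Embedding table: one vector per token ID. Random here, learned in practice.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
                   for _ in vocab]

def embed(token_ids):
    """Look up the embedding vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

token_ids = tokenize("The cat sat")  # [0, 1, 2]
vectors = embed(token_ids)           # three vectors of length EMBED_DIM
```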
4. Positional Encoding
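Self‑attention by itself ignores token order, so position information is added to each token's embedding. The original Transformer uses sinusoidal functions of the position at different frequencies: sine for the even embedding dimensions and cosine for the odd ones.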
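A minimal sketch of sinusoidal positional encoding, following the formulation in the original Transformer paper, where even embedding dimensions use sine and odd dimensions use cosine. The function name and the tiny sizes are chosen for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            # Each sine/cosine pair (dimensions 2i and 2i+1) shares one frequency.
            pair_index = dim // 2
            angle = pos / (10000 ** (2 * pair_index / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=3, d_model=4)
```

Each row of `pe` is added element by element to the embedding of the token at that position.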
5. Self‑Attention Mechanism
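Self‑attention lets each token take the other tokens in the sequence into account (in GPT‑style models, only the tokens that come before it). Each token's vector is projected into a query, a key, and a value. Attention scores come from comparing queries with keys, and the output for each token is a weighted sum of the values.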
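The computation can be sketched as single-head scaled dot-product attention in NumPy. The matrix names, sizes, and random weights are assumptions for illustration; the sketch also omits the causal mask that GPT-style models apply so that a token cannot attend to later tokens.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no mask)."""
    Q = X @ W_q  # queries: what each token is looking for
    K = X @ W_k  # keys: what each token offers
    V = X @ W_v  # values: the content that gets mixed together
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # one score per (token, token) pair
    # Softmax over each row, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 4, 2  # toy sizes
X = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, attn_weights = self_attention(X, W_q, W_k, W_v)
```

Row `i` of `attn_weights` shows how much token `i` draws from every token in the sequence.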
6. Normalized Attention Scores
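Softmax turns each token's raw attention scores into positive weights that sum to 1, so they can be read as probabilities. Before the softmax, the scores are divided by the square root of the key dimension, which keeps them in a range where training remains stable.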
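The effect of scaling before normalization can be seen in a small sketch. The raw scores and the key dimension of 64 are hypothetical values.

```python
import math

def softmax(scores):
    """Convert raw scores to positive weights that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [8.0, 4.0, 0.0]  # hypothetical attention scores for one token
d_k = 64                      # hypothetical key dimension

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])
```

Without scaling, almost all of the weight lands on the largest score; with scaling, the weights are spread more evenly, which gives the model more useful gradients during training.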
7. Feed‑Forward Neural Network
After self‑attention, each token's vector passes independently through a feed‑forward network, which introduces non‑linearity. In GPT‑3, for example, these vectors have 12,288 dimensions.
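A minimal sketch of such a position-wise feed-forward layer is shown below. The sizes and random weights are toy assumptions, and the ReLU activation follows the original Transformer paper; GPT models use GELU instead.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to every token."""
    hidden = np.maximum(0, X @ W1 + b1)  # expand, then apply the non-linearity
    return hidden @ W2 + b2              # project back to the model width

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 3, 4, 16  # hidden width is commonly 4x d_model
X = rng.normal(size=(seq_len, d_model))  # stand-in for attention outputs
W1 = rng.normal(size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model))
b2 = np.zeros(d_model)

ffn_out = feed_forward(X, W1, b1, W2, b2)
```

Because the same weights are applied to each row separately, running the layer on one token alone gives the same result as that token's row in `ffn_out`.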
8. Training and Inference
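Training compresses massive data into the model's parameters. Inference uses the trained model to generate output for new input one token at a time: each predicted token is appended to the input, and the model runs again to predict the next one. This loop is why the output appears word by word.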
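Both phases can be illustrated with a deliberately tiny stand-in for a language model: a table of word-pair counts. This is not how a Transformer works internally; it is only a sketch of the split between training and inference, with a made-up corpus.

```python
from collections import Counter, defaultdict

def train(corpus):
    """'Training': count which word follows which. The counts are the parameters."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def generate(counts, prompt, max_new_tokens=5):
    """'Inference': repeatedly predict the next word and append it to the text."""
    words = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(words[-1])
        if not followers:
            break  # no known continuation for the last word
        next_word = followers.most_common(1)[0][0]  # greedy choice
        words.append(next_word)  # the output becomes part of the next input
    return " ".join(words)

# Hypothetical two-sentence corpus.
corpus = [
    "the model predicts the next word",
    "the model predicts text",
]
params = train(corpus)
text = generate(params, "the", max_new_tokens=2)  # "the model predicts"
```

The loop in `generate` emits exactly one word per iteration, and each new word depends on what was produced before it.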
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
