
Why Do Large Language Models Output Text Word‑by‑Word? Inside the Transformer Mechanics

This article explains the fundamental architecture of large language models, from the dual file nature of parameters and code, through neural network basics, perceptrons, and weight training, to the Transformer’s tokenization, positional encoding, self‑attention, and inference processes, illustrated with diagrams and examples.

JD Cloud Developers

Jun 25, 2024


Preface

Why does ChatGPT appear to type one word at a time? The seemingly human‑like output is not for show; it is a direct consequence of the model’s underlying implementation.

The Essence of Large Models

Former Tesla AI director Andrej Karpathy describes a large language model as essentially two files: a parameter file (the weights) and a code file that runs those parameters. The code file can be written in C, Python, or any other language.

The parameters constitute the neural network’s weights, while the code executes the network.

The next question is where the parameters come from, which leads to model training.

In essence, large‑model training is a lossy compression of a massive amount of internet text (about 10 TB), and it requires a huge GPU cluster.

For example, training the 70‑billion‑parameter Llama 2 model takes about 6,000 GPUs running for 12 days at a cost of roughly $2 million. The result is a ~140 GB “compressed file”: 70 billion parameters at 2 bytes each.
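The file size quoted above can be checked with simple arithmetic. The sketch below assumes each parameter is stored as a 2‑byte (16‑bit) float and uses decimal units (1 GB = 10^9 bytes); the function and variable names are illustrative only.

```python
# Back-of-the-envelope size of the parameter file.
NUM_PARAMETERS = 70_000_000_000   # 70 billion parameters, from the text
BYTES_PER_PARAMETER = 2           # assumption: 16-bit floats
TRAINING_TEXT_GB = 10_000         # about 10 TB of text, from the text


def parameter_file_size_gb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Return the raw size of the weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 10**9


size_gb = parameter_file_size_gb(NUM_PARAMETERS, BYTES_PER_PARAMETER)
compression_ratio = TRAINING_TEXT_GB / size_gb

print(f"parameter file: {size_gb:.0f} GB")            # parameter file: 140 GB
print(f"compression ratio: about {compression_ratio:.0f}x")
```

Under these assumptions the training text is several dozen times larger than the resulting file, which is why the compression has to be lossy.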

With this compressed file, the model forms an understanding of the world.

How Large Models Work

The model predicts the next word in a sequence using the compressed data encoded in its neural network.

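For example, given the input "中华人民" ("Chinese People's"), the model assigns a high probability to "共和国" ("Republic") as the next piece of text. Repeating this step, it goes on to produce a full sentence such as "中华人民共和国成立于1949年" ("The People's Republic of China was founded in 1949").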
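To make "predicting the next word with a probability" concrete, here is a minimal sketch that estimates next‑word probabilities by counting word pairs in a tiny made‑up corpus. A real large model uses a neural network rather than a count table, and the corpus and function name here are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model is trained on terabytes of text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1


def next_word_probabilities(word: str) -> dict:
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = following[word]
    total = sum(counts.values())
    return {candidate: n / total for candidate, n in counts.items()}


probs = next_word_probabilities("the")
best_guess = max(probs, key=probs.get)
print(probs)       # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(best_guess)  # cat
```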

Neural Network

Neural networks are less complicated than they seem. They are loosely modeled on how the neurons in the human brain process signals:

External stimuli are converted to electrical signals that travel to neurons.

Tens of billions of neurons form the central nervous system.

The central system integrates signals and makes decisions.

The body acts on the central system’s commands.

Perceptron

The perceptron is the simplest neural network. It was invented in 1957 and is still in use today.

A perceptron takes multiple binary inputs (0 or 1) and produces a single binary output. Example: deciding whether Zhang San should go to a movie based on three factors: the weather, the ticket price, and whether his girlfriend will come along.

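Each input is multiplied by a weight that reflects how much it matters (e.g., weather = 8, price = 4, girlfriend = 4). The weighted sum is then compared to a threshold (e.g., 8): if the sum passes the threshold, the output is 1 (go); otherwise it is 0 (stay home).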
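The movie decision can be written out in a few lines. This is a minimal sketch using the weights and threshold from the example; it assumes the perceptron fires only when the weighted sum is strictly greater than the threshold, which the text does not specify.

```python
# Weights and threshold from the example above.
WEIGHTS = {"weather": 8, "price": 4, "girlfriend": 4}
THRESHOLD = 8


def perceptron(inputs: dict, weights: dict, threshold: float) -> int:
    """Return 1 (go) if the weighted sum exceeds the threshold, else 0 (stay home).

    Assumption: a strict "greater than" comparison.
    """
    weighted_sum = sum(inputs[name] * weights[name] for name in weights)
    return 1 if weighted_sum > threshold else 0


# Good weather and an acceptable price, but no girlfriend: 8 + 4 = 12 > 8
print(perceptron({"weather": 1, "price": 1, "girlfriend": 0}, WEIGHTS, THRESHOLD))  # 1

# Bad weather, acceptable price, girlfriend comes along: 4 + 4 = 8, not > 8
print(perceptron({"weather": 0, "price": 1, "girlfriend": 1}, WEIGHTS, THRESHOLD))  # 0
```

Because weather carries a weight equal to the threshold, good weather plus any one other factor is enough to tip the decision, while the other two factors together are not.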

During training, the weights are adjusted step by step over many labeled examples so that the predictions become more accurate.
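As an illustration of weight adjustment, the sketch below applies the classic perceptron learning rule to a toy dataset (the logical AND of two inputs). The dataset, the learning rate, and the use of a bias term in place of a threshold are assumptions chosen to keep the example small; large models are trained with gradient‑based methods on vastly more data.

```python
# Toy labeled dataset: the logical AND of two binary inputs.
training_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(inputs, weights, bias):
    """Fire (1) when the weighted sum plus bias is positive.

    The bias plays the role of a negated threshold.
    """
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


def train(data, epochs=20, learning_rate=1):
    """Perceptron learning rule: nudge weights whenever a prediction is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in data:
            error = target - predict(inputs, weights, bias)  # -1, 0, or +1
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(training_data)
for inputs, target in training_data:
    print(inputs, "->", predict(inputs, weights, bias), "(expected", target, ")")
```

Each wrong prediction moves the weights a little in the direction that would have made it right, and on this linearly separable toy problem the rule settles on weights that classify every example correctly.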

Play with Neural Networks

TensorFlow Playground (http://playground.tensorflow.org/) offers an interactive environment to experiment with simple neural networks.

GitHub: https://github.com/tensorflow/playground

Transformer Architecture (Deep Learning Model)

Most modern large models are based on the Transformer architecture, introduced in 2017 by Vaswani et al. in the paper "Attention Is All You Need". Its core building block is the self‑attention mechanism.

1. Vectors and Matrices

Recall high‑school concepts: vectors, vector addition, scalar multiplication, and matrices.
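As a quick refresher, the operations named above can be sketched with plain Python lists. The helper names are illustrative; real models run the same operations on much larger arrays using optimized libraries.

```python
def add(u, v):
    """Vector addition: add matching components."""
    return [a + b for a, b in zip(u, v)]


def scale(c, v):
    """Scalar multiplication: multiply every component by c."""
    return [c * a for a in v]


def matvec(matrix, v):
    """Matrix-vector product: one dot product per matrix row."""
    return [sum(m * a for m, a in zip(row, v)) for row in matrix]


print(add([1, 2], [3, 4]))               # [4, 6]
print(scale(2, [1, 2]))                  # [2, 4]
print(matvec([[1, 0], [0, 2]], [3, 4]))  # [3, 8]
```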

2. Transformer Diagram

3. Tokenization and Embedding

Input text is first split into tokens by a tokenizer (e.g., OpenAI’s tiktoken library), which assigns each token an integer ID. An embedding table then maps each ID to a high‑dimensional vector.
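The two steps, text to token IDs and token IDs to vectors, can be sketched as follows. This toy version splits on whitespace and uses a six‑entry vocabulary with random 4‑dimensional embeddings, all of which are assumptions for illustration. Production tokenizers split text into subword units, and embedding tables are learned during training.

```python
import random

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of subword tokens.
vocabulary = {"why": 0, "do": 1, "models": 2, "output": 3, "text": 4, "<unk>": 5}
EMBEDDING_DIM = 4  # real models use hundreds or thousands of dimensions

# One vector per vocabulary entry. Random here; learned in a real model.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)] for _ in vocabulary
]


def tokenize(text: str) -> list:
    """Map each whitespace-separated word to its integer ID (unknown words -> <unk>)."""
    return [vocabulary.get(word, vocabulary["<unk>"]) for word in text.lower().split()]


def embed(token_ids: list) -> list:
    """Look up the embedding vector for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


token_ids = tokenize("Why do models output text")
vectors = embed(token_ids)
print(token_ids)                      # [0, 1, 2, 3, 4]
print(len(vectors), len(vectors[0]))  # 5 4
```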

4. Positional Encoding

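Self‑attention on its own ignores token order, so positional information is added to each token’s embedding. The original Transformer uses sinusoidal functions for this: sine for the even‑numbered dimensions of the encoding vector and cosine for the odd‑numbered ones, with a different wavelength for each pair of dimensions.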
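A minimal sketch of the sinusoidal encoding from the original Transformer paper is shown below. The function name and the 4‑dimensional example are illustrative, and many newer models use other positional schemes.

```python
import math


def positional_encoding(position: int, d_model: int) -> list:
    """Sinusoidal encoding for one position.

    Even dimension 2i uses   sin(position / 10000**(2i / d_model)),
    odd dimension 2i+1 uses  cos(position / 10000**(2i / d_model)).
    """
    encoding = []
    for dim in range(d_model):
        pair_index = dim // 2  # dimensions 2i and 2i+1 share one wavelength
        angle = position / (10000 ** (2 * pair_index / d_model))
        encoding.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return encoding


print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))

# The encoding is added component-wise to the token's embedding vector.
token_embedding = [0.5, -0.2, 0.1, 0.3]  # hypothetical embedding
with_position = [e + p for e, p in zip(token_embedding, positional_encoding(1, 4))]
```

Every position gets a distinct pattern of values, so two occurrences of the same token at different places in the text end up with different input vectors.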

5. Self‑Attention Mechanism

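Self‑attention lets each token take the other tokens in the sequence into account. Learned weight matrices project each token’s vector into a Query, a Key, and a Value vector. The dot product of one token’s Query with every token’s Key gives the attention scores, and the output for that token is the score‑weighted sum of the Value vectors.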
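The computation can be sketched in plain Python as scaled dot‑product attention. To keep the example short, it takes the Query, Key, and Value vectors as given rather than producing them with learned projection matrices, and it omits multiple heads and the causal mask that GPT‑style models use to hide future tokens. All names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(queries, keys, values):
    """For each query: score every key, normalize, and mix the values."""
    d_k = len(keys[0])
    outputs = []
    for query in queries:
        scores = [dot(query, key) / math.sqrt(d_k) for key in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Three hypothetical 2-dimensional token vectors, used as Q, K, and V alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens, tokens, tokens):
    print([round(x, 3) for x in row])
```

Each output row is a blend of all the Value vectors, weighted by how well that token's Query matches each Key.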

6. Normalized Attention Scores

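The raw attention scores are divided by the square root of the Key dimension and then passed through softmax, which turns them into positive weights that sum to 1. The scaling keeps the softmax from saturating on large scores, which helps stabilize training.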
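The effect of normalization and scaling can be seen in a small sketch. The raw scores and the Key dimension of 64 below are made‑up values chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print([round(w, 3) for w in softmax([2.0, 1.0, 0.1])])  # [0.659, 0.242, 0.099]

# Hypothetical raw dot products for a Key dimension of 64.
raw_scores = [16.0, 8.0, 0.0]
d_k = 64

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 4) for w in unscaled])  # almost all weight on the first token
print([round(w, 4) for w in scaled])    # weight spread more evenly
```

Without scaling, nearly all of the weight lands on a single token, so small changes in the other scores have almost no effect on the output. That saturation is what the division by the square root of the Key dimension is meant to avoid.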

7. Feed‑Forward Neural Network

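After self‑attention, each token’s vector passes independently through a feed‑forward network, which introduces non‑linearity. In the largest GPT‑3 model, for example, the token vectors are 12,288‑dimensional, and the feed‑forward hidden layer is four times as wide.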
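A feed‑forward block is two linear layers with a non‑linear activation in between. The sketch below uses hand‑picked weights, a 2‑dimensional token vector, and ReLU as the activation (the choice made in the original Transformer paper; GPT models use GELU instead). The weights are contrived so that the block computes the absolute value of each component, something no single linear layer can do.

```python
def relu(vector):
    """Non-linearity: negative values become zero."""
    return [max(0.0, x) for x in vector]


def linear(vector, weights, bias):
    """One linear layer: each row of `weights` produces one output value."""
    return [sum(x * w for x, w in zip(vector, row)) + b for row, b in zip(weights, bias)]


def feed_forward(vector, w1, b1, w2, b2):
    """Expand to the hidden size, apply the non-linearity, project back."""
    return linear(relu(linear(vector, w1, b1)), w2, b2)


# Hand-picked weights (illustrative, not learned): hidden size 4, model size 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
B2 = [0.0, 0.0]

print(feed_forward([3.0, -2.0], W1, B1, W2, B2))  # [3.0, 2.0]
```

The same weights are applied to every token position, so the block transforms each token's vector on its own without mixing information between tokens.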

8. Training and Inference

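Training compresses massive data into model parameters; inference uses the trained model to generate outputs for new inputs. At inference time the model produces only one token per pass: it predicts the next token, appends it to the input, and runs again on the longer sequence. This loop is why the output appears word by word.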
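The generation loop can be sketched as follows. The lookup table stands in for a trained model and is entirely made up. A real model conditions on the whole sequence rather than only the last token, and it usually samples from the distribution instead of always taking the most likely token.

```python
# Hypothetical stand-in for a trained model: last token -> next-token probabilities.
NEXT_TOKEN_PROBABILITIES = {
    "<start>": {"large": 0.9, "small": 0.1},
    "large": {"models": 0.8, "cats": 0.2},
    "models": {"predict": 0.7, "sleep": 0.3},
    "predict": {"tokens": 0.6, "weather": 0.4},
    "tokens": {"<end>": 1.0},
}


def predict_next(tokens):
    """Return the most likely next token (greedy decoding)."""
    distribution = NEXT_TOKEN_PROBABILITIES.get(tokens[-1], {"<end>": 1.0})
    return max(distribution, key=distribution.get)


def generate(prompt, max_new_tokens=10):
    """Predict one token, append it to the input, and repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)
        print(next_token, end=" ", flush=True)  # emitted as soon as it is known
    print()
    return tokens


result = generate(["<start>"])  # prints: large models predict tokens
```

Each token is available the moment it is predicted, and the next one cannot be computed until it has been appended, so showing tokens as they arrive is the natural way to display the output.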

