Why Do Neural Networks Suddenly ‘Grok’ After Long Training? Insights from Google
Google’s recent research reveals that when small neural networks are trained for extended periods on tasks like modular addition, they can abruptly shift from memorizing training data to genuinely generalizing—a sudden “grokking” phenomenon driven by weight decay and the emergence of periodic weight structures.
Grokking Phenomenon in Neural Networks
In 2021, researchers discovered that tiny models, after long training, can abruptly transition from merely memorizing the training set to exhibiting strong generalization. This sudden emergence is called grokking.
Experimental Setup: Modular Addition with a Single‑Layer MLP
Google’s study trained a single‑layer MLP with 24 hidden neurons on the modular addition task, (a + b) mod n. Because the task’s output is periodic, it admits a simple geometric representation. Visualizations of the model’s weights show initially noisy values that become periodic as training progresses, indicating that the network is learning the underlying mathematical structure.
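As a concrete illustration, here is a minimal PyTorch sketch of this kind of setup, not the authors' exact code. The modulus, learning rate, and weight‑decay value are assumptions chosen for illustration; the article itself only specifies a single hidden layer of 24 neurons.

```python
# Minimal sketch of a one-layer MLP on modular addition (not the authors' code).
# MODULUS, lr, and weight_decay are illustrative assumptions.
import torch
import torch.nn as nn

MODULUS = 67   # assumed modulus; the write-up only says "(a + b) mod n"
HIDDEN = 24    # the 24-neuron hidden layer mentioned above

class OneLayerMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Inputs a and b are one-hot encoded and concatenated.
        self.hidden = nn.Linear(2 * MODULUS, HIDDEN)
        self.out = nn.Linear(HIDDEN, MODULUS)

    def forward(self, a, b):
        x = torch.cat([
            nn.functional.one_hot(a, MODULUS).float(),
            nn.functional.one_hot(b, MODULUS).float(),
        ], dim=-1)
        return self.out(torch.relu(self.hidden(x)))

# All (a, b) pairs; a random subset would serve as the training split.
a, b = torch.meshgrid(torch.arange(MODULUS), torch.arange(MODULUS), indexing="ij")
a, b = a.flatten(), b.flatten()
labels = (a + b) % MODULUS

model = OneLayerMLP()
# Weight decay is the ingredient the article highlights as driving grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss = nn.functional.cross_entropy(model(a, b), labels)
loss.backward()
opt.step()
```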
Detecting Generalization vs. Memorization
The researchers used a 01‑sequence task: the model predicts whether the first three bits of a 30‑bit random binary string contain an odd number of ones. A model that truly generalizes relies only on the first three bits; a model that memorizes also uses the later, irrelevant bits. Training on a fixed set of 1,200 sequences shows an initial rise in training accuracy (memorization) followed by a sudden jump in test accuracy, marking the onset of grokking.
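The task data is easy to reproduce; a short sketch of the dataset as described (the random seed is arbitrary):

```python
# 01-sequence task: 30-bit random strings, labeled by whether the
# first three bits contain an odd number of ones.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1200, 30))   # the 1,200 fixed training sequences
y = X[:, :3].sum(axis=1) % 2              # 1 iff the first three bits have odd parity

# A generalizing model depends only on X[:, :3]; a memorizing one also
# picks up spurious correlations in the remaining 27 bits.
```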
Why Does Grokking Occur?
Grokking is highly sensitive to hyper‑parameters such as model size, weight decay, and dataset size. Too little weight decay lets the network over‑fit, while too much prevents it from learning at all. When the decay is appropriately balanced, the training loss first rises slightly as the network trades off fitting the training labels against keeping its weights small. Eventually, a rapid drop in test loss coincides with the pruning of the last weights connecting to the later, irrelevant bits, producing sudden generalization.
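The trade‑off described here can be made explicit by writing weight decay as an L2 penalty added to the data loss. This is a hedged sketch, not the study's exact objective; `lambda_wd` is a hypothetical coefficient:

```python
# Illustrative training objective: cross-entropy plus an explicit L2 penalty.
# lambda_wd is hypothetical; the article only says decay must be balanced.
import torch

def regularized_loss(model, logits, targets, lambda_wd=1e-2):
    data_loss = torch.nn.functional.cross_entropy(logits, targets)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    # Too small a lambda_wd -> over-fitting (pure memorization);
    # too large -> the weights shrink before the task can be learned.
    return data_loss + lambda_wd * l2
```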
When Does Grokking Happen?
The phenomenon appears only for certain combinations of hyper‑parameters. The authors trained more than 1,000 models with varied settings; only two clusters of settings (rendered in blue and yellow in the original visualization) exhibited grokking. This demonstrates that grokking is not guaranteed and disappears when the conditions are not right.
Modular Addition with Five Neurons
Using an embedding matrix built from cosine and sine functions, a five‑neuron MLP solves the modular addition task perfectly. After training, all neurons converge to similar norms and, when visualized via their cosine and sine components, are distributed uniformly around a circle.
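A numpy sketch of the circular representation behind this construction (not the five‑neuron network itself, and the modulus is again an assumption): each residue is embedded as an angle on a circle, so the angle‑addition identities for cosine and sine carry out addition mod n.

```python
# Embedding residues as angles on a circle performs addition mod n.
import numpy as np

n = 67  # assumed modulus

def embed(x):
    theta = 2 * np.pi * x / n
    return np.array([np.cos(theta), np.sin(theta)])

a, b = 40, 50
ca, sa = embed(a)
cb, sb = embed(b)
# Angle-addition identities:
# cos(ta+tb) = ca*cb - sa*sb,  sin(ta+tb) = sa*cb + ca*sb
combined = np.array([ca * cb - sa * sb, sa * cb + ca * sb])
# Decode by finding the residue whose embedding best matches.
decoded = int(np.argmax([combined @ embed(x) for x in range(n)]))
assert decoded == (a + b) % n  # 40 + 50 = 90 ≡ 23 (mod 67)
```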
Frequency Analysis with Discrete Fourier Transform
Applying a discrete Fourier transform (DFT) to the trained weights reveals that only a handful of frequencies dominate the solution, mirroring the sparse structure observed in the 01‑sequence task.
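To show what this analysis looks like, here is a sketch on a synthetic weight vector (a single periodic component plus noise, standing in for a trained neuron's weights):

```python
# DFT of a (synthetic) periodic weight vector: one frequency dominates.
import numpy as np

n = 67
x = np.arange(n)
# Stand-in for a trained weight row: frequency-5 component plus small noise.
weights = np.cos(2 * np.pi * 5 * x / n) \
    + 0.05 * np.random.default_rng(0).normal(size=n)

spectrum = np.abs(np.fft.rfft(weights))
print(np.argmax(spectrum[1:]) + 1)  # -> 5: a single dominant frequency
```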
Open Questions
Although the mechanisms by which small MLPs solve modular addition are now clearer, many open problems remain: predicting which regularization techniques will induce grokking, understanding how larger models behave, and systematically interpreting the sudden transition from memorization to generalization.
Future work may involve training simpler, highly biased models to uncover the principles governing large‑scale neural networks, potentially leading to more automated interpretability methods.
