Do Language Models Learn Language in the Same Stages as Children? An Analysis of GPT‑2 Developmental Trajectories
This article reviews a study that compares the stage‑wise language acquisition of infants with the learning trajectory of GPT‑2, using linguistic probes and statistical tests to determine whether deep language models follow sequential or parallel learning patterns similar to children.
Preface
Note: This article avoids heavy formulas and mathematical derivations, so beginners or readers with weak math backgrounds can read it with confidence.
Paper link: https://arxiv.org/pdf/2306.03586.pdf
1. Overview
During language acquisition, children learn in a fixed sequential order: first phoneme classification, then vocabulary, and finally increasingly complex syntax.
For example, a child learning English first distinguishes vowels from consonants, then more complex phonemes, then basic words like "apple" and "dog", and finally constructs complex sentences.
However, the reasons behind this sequential learning are largely unknown, and it is unclear whether the same pattern can be applied to computers. To investigate, the authors compare deep language models with children’s learning trajectories.
The authors use GPT‑2 (chosen for training convenience and reduced compute cost) to test whether its training exhibits stages comparable to those observed in children aged 18 months to 6 years.
2. Infant Stage‑wise Language Learning
What is "stage‑wise"? In the first year, infants acquire prosodic contours, phoneme categories, and common words.
Beyond these, true language also requires:
Syntax (e.g., "boy sings") – around 12 months;
Questions (e.g., "What sound does a cow make?") – around 30 months;
Nested syntax (e.g., "I see the boy who sings") – around 42 months.
Although the exact ages may vary, the order is consistent, constituting the so‑called "stage‑wise" pattern.
Measuring infants’ language skills is difficult, so researchers often rely on implicit methods such as eye‑gaze and sucking rate, which can be noisy for very young infants.
3. Are Language Models Like Infants?
Interestingly, this noise does not affect modern language models. Deep learning architectures trained to predict words from context have proven highly effective at learning natural language.
Unlike humans, these algorithms can be probed at any training step without interfering with the learning process.
High‑performance deep networks have been shown to implicitly or explicitly learn syntactic structures and to use features such as concreteness and lexical categories.
Crucially, recent deep networks represent vocabulary and syntax in ways similar to adult brains. This suggests that children and language models may share comparable learning trajectories, offering a valuable framework for understanding the computational principles of language acquisition.
The authors pose three questions:
Do these models learn language skills in a systematic order?
Is the trajectory sequential or parallel?
Does the trajectory resemble that of children?
We first outline the authors' methodology, then provide detailed explanations.
The authors trained 48 GPT‑2 instances from scratch and evaluated them on 96 grammatical probes from the BLiMP, Zorro, and BIG‑Bench benchmarks at each training step, comparing a subset of probes with the behavior of 54 children aged 18 months to 6 years.
What are BLiMP, Zorro, and BIG‑Bench? What are grammatical probes?
3.1 Three Benchmark Suites
BLiMP, Zorro, and BIG‑Bench are standardized test datasets used to compare model performance across a variety of linguistic phenomena.
3.2 Grammatical Probes
Zero‑shot linguistic probes are well‑designed sentences or phrases that test whether a model has acquired a specific language skill without additional training.
A probe compares the estimated probability of a grammatically correct sentence with that of a matched incorrect sentence.
Accuracy is defined as the proportion of cases where the correct sentence receives higher probability.
Probabilities are obtained by summing the per‑token log‑probabilities from the model's softmax layer over each sentence in the pair ⟨correct grammar, incorrect grammar⟩, normalized by sentence length:

score(X_g) = (1/n_g) Σ log f(x_i^g)  vs.  score(X_u) = (1/n_u) Σ log f(x_i^u)

where:
f is the model's softmax layer;
X_g and X_u are the correct and incorrect sentences;
n_g and n_u are the token counts of each sentence.
If the model reliably distinguishes the sentences, it is considered to possess the corresponding language ability.
For example, to test subject‑verb agreement, we might use:
The cat is sleeping on the bed.
The cat am sleeping on the bed.
The first sentence is grammatical, the second is not; the model should assign higher probability to the first.
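This probe comparison can be sketched in a few lines of Python. The per‑token log‑probabilities below are made up purely for illustration; a real setup would query a trained model such as GPT‑2 for each token's log‑probability.

```python
# Sketch of zero-shot grammatical probing. The per-token log-probabilities
# are invented stand-ins for what a trained language model might assign.
FAKE_LOGPROBS = {
    "The cat is sleeping on the bed .": [-2.1, -3.0, -1.2, -4.0, -1.5, -2.0, -3.1, -0.5],
    "The cat am sleeping on the bed .": [-2.1, -3.0, -7.8, -4.2, -1.6, -2.0, -3.1, -0.5],
}

def sentence_score(sentence: str) -> float:
    """Length-normalized log-probability: sum of token log-probs / token count."""
    logprobs = FAKE_LOGPROBS[sentence]
    return sum(logprobs) / len(logprobs)

def probe_accuracy(pairs) -> float:
    """Fraction of <grammatical, ungrammatical> pairs where the
    grammatical sentence receives the higher normalized score."""
    correct = sum(1 for good, bad in pairs if sentence_score(good) > sentence_score(bad))
    return correct / len(pairs)

pairs = [("The cat is sleeping on the bed .",
          "The cat am sleeping on the bed .")]
print(probe_accuracy(pairs))  # 1.0 for this toy pair
```

Note the length normalization: without it, longer sentences would be penalized simply for having more tokens, not for being less grammatical.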
4. Diving Deeper
4.1 Sequential vs Parallel Learning
As shown in the figure:
Skill performance (y‑axis) vs. training steps (x‑axis);
Three tasks distinguished by color;
Two agents: Agent 1 (child) and Agent 2 (model), with the model’s curve being smoother due to precise measurement.
Both sequential and parallel learning can reach a given performance threshold at the same training step, so a single threshold crossing cannot distinguish them; the full shape of the learning curves must be compared.
Sequential learning : more complex skills are not attempted until simpler ones are mastered.
Parallel learning : all skills are acquired simultaneously but at different rates.
The authors also propose a null hypothesis that neural networks with different random seeds may learn skills in different orders purely by chance.
They apply a one‑way ANOVA to test whether learning speeds differ across probe groups; significant differences would indicate distinct learning strategies.
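This ANOVA step can be sketched with `scipy.stats.f_oneway`. The "acquisition step" numbers below are made up for illustration, not taken from the paper:

```python
# Sketch of the ANOVA step: do probe groups differ in learning speed?
from scipy.stats import f_oneway

# Hypothetical training step at which each probe was acquired,
# for three difficulty groups.
easy   = [100, 120, 110, 130, 105]
medium = [300, 280, 320, 310, 290]
hard   = [900, 950, 870, 920, 940]

stat, p_value = f_oneway(easy, medium, hard)
print(f"F = {stat:.1f}, p = {p_value:.2e}")
# A small p-value means the groups are learned at genuinely different
# speeds, rejecting the null hypothesis of a chance ordering.
```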
4.2 Evaluating Learning Trajectories
To compare model trajectories across language abilities, the authors define an "acquisition time" metric: the number of steps required for a probe to reach 90 % accuracy. For each model, probes are ranked by acquisition time; rank correlations are then computed between every pair of models and averaged.
Statistical significance is assessed via a permutation test: model rankings are randomly shuffled 1,000 times, and the resulting correlation distribution is compared to the observed average correlation. A p‑value < 0.001 indicates that model trajectories are significantly similar.
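The permutation test can be sketched as follows. The rank matrix here is invented for illustration (the study uses acquisition‑time ranks of 96 probes across 48 seeds); each row is one model seed's ordering of the probes.

```python
# Sketch of the permutation test on acquisition-order rank correlations.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical acquisition-order ranks of 8 probes for 4 model seeds,
# deliberately similar across seeds.
ranks = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 1, 3, 4, 6, 5, 7, 8],
    [1, 3, 2, 4, 5, 6, 8, 7],
    [1, 2, 4, 3, 5, 7, 6, 8],
])

def avg_pairwise_corr(mat):
    """Average Spearman correlation over all pairs of rows (seeds)."""
    n = len(mat)
    corrs = [spearmanr(mat[i], mat[j])[0]
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(corrs))

observed = avg_pairwise_corr(ranks)

# Null distribution: shuffle each seed's ranking independently,
# destroying any shared ordering.
null = []
for _ in range(1000):
    shuffled = np.array([rng.permutation(row) for row in ranks])
    null.append(avg_pairwise_corr(shuffled))

p = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
print(f"observed r = {observed:.2f}, p = {p:.3f}")
```

If the observed average correlation sits far in the tail of the shuffled distribution, the shared trajectory across seeds is unlikely to be chance.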
4.3 Evaluating Children’s Language Skills
Friedmann et al. studied 54 Hebrew‑speaking children (18–71 months) across 11 linguistic phenomena, grouped into three stages: simple SV sentences, questions, and relative clauses.
Children’s spontaneous speech was recorded at home and manually annotated; a phenomenon was considered acquired only if it appeared in the child’s utterances.
5. Results
At the end of training, 64 probes (66 %) exceeded chance‑level (50 %) accuracy across all models. For comparison, the pre‑trained GPT‑2 Large exceeded chance on 93 of 96 probes.
5.1 Models Exhibit Systematic Learning Trajectories
Probes are ordered by "average acquisition time", revealing a clear progression of skill acquisition.
These results indicate systematic learning trajectories across models.
5.2 Cross‑Task Learning Is Parallel
When probes are grouped by difficulty (easy, medium, hard), all three groups show positive derivatives in the first 300 training steps, but with distinct learning rates (p < 10⁻²³), confirming parallel acquisition.
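One way to see "positive derivatives with distinct rates" is to fit a line to each group's early accuracy curve and compare slopes. The curves below are synthetic stand‑ins for the paper's measurements:

```python
# Sketch: compare early learning rates across difficulty groups by
# fitting a line to each group's accuracy over the first 300 steps.
import numpy as np

steps = np.arange(0, 300, 20)

# Synthetic accuracy curves: all rise from the start (parallel learning),
# but at different rates.
easy   = 0.50 + 0.0015 * steps
medium = 0.50 + 0.0008 * steps
hard   = 0.50 + 0.0003 * steps

for name, curve in [("easy", easy), ("medium", medium), ("hard", hard)]:
    slope = np.polyfit(steps, curve, 1)[0]
    print(f"{name}: slope = {slope:.4f} accuracy/step")
# All slopes are positive from step 0 -- the skills improve
# simultaneously -- yet they differ, the signature of parallel learning.
```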
5.3 Comparison with Children
The order in which models acquire the three probe groups roughly matches the order observed in children.
5.4 Models Use Both Grammar and Heuristics
Models achieve high probe accuracy partly through genuine syntactic generalization and partly through surface heuristics, such as agreeing with the noun nearest the verb. In "Li Hua's cat is hungry," the verb correctly agrees with "cat"; but under mismatched conditions, where such cues conflict with the true subject, the heuristics dominate.
6. Discussion and Conclusion
The similarity between model and child learning orders suggests that language models and humans may share underlying factors such as frequency of linguistic phenomena and intrinsic complexity.
Future work should control these factors to better understand the learning process.
Recent studies show that training transformers on child‑directed data can bring probe accuracies close to those of large pre‑trained models, and that GPT‑2 representations grow increasingly similar to adult brain representations over training. This answers the earlier question: models and humans not only converge on adult‑like language competence, but their early stages are also largely alike.
Nevertheless, substantial research is needed to pinpoint the exact similarities and differences between model and child language acquisition.
The findings also revive the debate between innatism (language as an innate faculty) and empiricism (language learned from experience), with modern language‑model research offering insights into both perspectives.
Innatism posits an innate language faculty; empiricism argues language is acquired through statistical learning from environmental input.
The observed parallels hint at possible inherent hierarchical language structures that both humans and machines must acquire via inductive biases or data characteristics.
While these hypotheses remain unproven, studies like this provide a clear roadmap for future investigation.