The Origin of Large Language Models: A Historical Investigation of ULMFiT and Early LLMs
This article examines the historical roots of large language models, highlighting Jeremy Howard’s ULMFiT as pioneering work, tracing its influence on GPT‑1, and surveying the debate over which model qualifies as the first true LLM, supported by citations and expert commentary.
Originally published by 量子位 and edited by 梦晨, the piece explores a heated debate in the tech community about who created the first large language model (LLM), sparked by a claim that Jeremy Howard’s project "llms.txt" had little impact.
Jeremy Howard, an honorary professor at the University of Queensland, former president of Kaggle, and co‑founder of fast.ai, is presented as the central figure; his work on ULMFiT is examined as a potential predecessor of modern LLMs.
In early 2018 Howard, with co‑author Sebastian Ruder, released the ULMFiT paper (published at ACL 2018), introducing Universal Language Model Fine‑tuning, a transfer‑learning method for text classification. The method achieved state‑of‑the‑art results on six benchmark tasks, reducing error rates by 18‑24% on most datasets, and with only 100 labeled examples matched the performance of training from scratch on far more data.
Subsequent researchers, including GPT‑1 author Alec Radford, acknowledged ULMFiT as an inspiration, and a review paper later described ULMFiT as the "last common ancestor" of all modern LLMs.
To determine whether ULMFiT qualifies as the first LLM, the article lists five criteria for a model to be considered a large language model:
1. It is a language model that predicts tokens rather than whole words.
2. It is trained via self‑supervised learning on unlabeled text.
3. Its core task is next‑token prediction.
4. It adapts to new tasks without architectural changes (few‑shot/one‑shot capability).
5. It is generally applicable across many NLP tasks (classification, QA, parsing, etc.).
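The second and third criteria — self‑supervised training on unlabeled text whose core task is next‑token prediction — can be illustrated with a deliberately tiny sketch. The bigram counter below is a toy stand‑in, not ULMFiT’s actual method (which trained an LSTM on Wikipedia text); it only shows how raw text alone, with no human labels, provides the training signal:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus: str) -> dict:
    """Self-supervised 'training': count which token follows which.
    The supervision comes from the raw text itself - no labels needed."""
    tokens = corpus.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(model: dict, token: str) -> str:
    """Next-token prediction: return the most frequent continuation."""
    if token not in model:
        return "<unk>"
    return model[token].most_common(1)[0][0]

# A toy 'unlabeled corpus' (hypothetical example text)
corpus = (
    "the model predicts the next token . "
    "the model learns from unlabeled text . "
    "the model predicts the next token ."
)
lm = train_bigram_lm(corpus)
print(predict_next(lm, "the"))   # most frequent successor of "the"
print(predict_next(lm, "next"))
```

Real language models replace the frequency table with a neural network (an LSTM in ULMFiT, a Transformer in GPT‑1), but the objective — predict the next token from context — is the same.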
The article then reviews related models: the original Transformer (designed for machine translation and not fully general), CoVe (contextual word vectors trained with supervised translation), and ELMo (self‑supervised pre‑training but limited few‑shot ability). These are deemed insufficient to meet the LLM criteria.
ULMFiT itself is an LSTM model pre‑trained on WikiText‑103 with a self‑supervised language‑modeling objective. It can then be fine‑tuned for a wide range of downstream tasks without changing the architecture, achieving state‑of‑the‑art performance at the time and demonstrating strong generality.
Compared with GPT‑1, ULMFiT offered less convenient fine‑tuning than the Transformer architecture and covered a narrower range of tasks, but its underlying principles are closely aligned. Howard later claimed his work created the first "general language model," a term that later evolved into "large language model."
Industry voices, such as Apple engineer Nathan Lawrence, view ULMFiT as a pivotal turning point in the evolution of LLMs. The article concludes that while the exact title of "first LLM" may remain debated, ULMFiT’s influence on subsequent models is undeniable.
References:
- ULMFiT: https://arxiv.org/abs/1801.06146
- GPT‑1 paper: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- Additional discussion: https://x.com/jeremyphoward/status/1905763446840607164, https://thundergolfer.com/blog/the-first-llm
DataFunTalk