Analyzing 1.4 Billion N‑gram Rows with Python, NumPy and PyTubes
This article demonstrates how to download Google’s massive N‑gram dataset, load the 1.4 billion 1‑gram records with Python and the PyTubes library, use NumPy to efficiently compute yearly word frequencies, and reproduce Google Ngram Viewer charts for Python and other programming languages.
The Google Ngram Viewer visualizes word usage over time using a huge corpus of scanned books; the author reproduces the Python‑keyword trend by downloading the public n‑gram dataset (covering books from the 16th century through 2008) and processing it with Python.
Because the 1‑gram files total 27 GB on disk (about 1.43 billion rows across 38 files), the author leverages NumPy's fast array operations and PyTubes, a new data‑loading library, to read the tab‑separated data on a modest 2016 MacBook Pro with 8 GB of RAM.
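The loading step can be sketched without PyTubes using plain Python and NumPy. This is a simplified stand‑in, not the author's actual PyTubes pipeline; the sample rows below are invented, and the four‑column layout (word, year, match count, volume count) follows the published 1‑gram file format.

```python
import io
import numpy as np

# Invented sample of tab-separated 1-gram rows: word, year, match_count, volume_count.
SAMPLE = io.StringIO(
    "Python\t1984\t12\t5\n"
    "Python\t1985\t20\t7\n"
    "python\t1984\t3\t2\n"
)

words, years, counts = [], [], []
for line in SAMPLE:
    # Keep only the three fields the analysis needs; drop the volume count.
    word, year, count, _volumes = line.rstrip("\n").split("\t")
    words.append(word)
    years.append(int(year))
    counts.append(int(count))

# NumPy arrays enable the fast vectorised filtering and aggregation used later.
words = np.array(words)
years = np.array(years, dtype=np.int32)
counts = np.array(counts, dtype=np.int64)
```

In the real pipeline, PyTubes streams the 38 on‑disk files and builds these arrays without holding the raw text in memory.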
After extracting the three fields (word, year, count) and filtering for the capitalised form “Python”, the script builds a NumPy array of the relevant rows, computes yearly total word counts, and derives the percentage of “Python” occurrences per year.
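The per‑year aggregation described above can be sketched with boolean masks in NumPy. The arrays below are invented toy data, not the real corpus, and the loop over years is a readability choice (the author's version may aggregate differently).

```python
import numpy as np

# Toy data: one row per (word, year) record across the whole corpus.
years = np.array([1984, 1984, 1985, 1985])
counts = np.array([100, 12, 200, 20])
# Mask marking rows whose word is the capitalised form "Python".
is_python = np.array([False, True, False, True])

year_range = np.arange(years.min(), years.max() + 1)

# Total word occurrences per year, and "Python" occurrences per year.
totals = np.array([counts[years == y].sum() for y in year_range])
python_counts = np.array(
    [counts[is_python & (years == y)].sum() for y in year_range]
)

# Percentage of all occurrences that are "Python", per year.
pct = 100.0 * python_counts / totals
```

Dividing the filtered counts by the yearly totals reproduces the "% of corpus" metric the Ngram Viewer plots.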
Using these percentages, the author recreates the Google Ngram chart for Python, then extends the analysis to compare three programming languages—Python, Pascal, and Perl—by normalising counts to a common baseline (1800‑1960 average) and plotting their relative trends.
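The cross‑language normalisation can be sketched as dividing each language's yearly percentages by its own mean over the 1800–1960 baseline window, putting all three curves on a common relative scale. The percentages below are random placeholders, and the exact baseline arithmetic is an assumption about the author's approach.

```python
import numpy as np

# Placeholder yearly percentages for one language (e.g. "Python").
years = np.arange(1800, 2009)
rng = np.random.default_rng(0)
pct_python = rng.random(years.size)

# Baseline window: the 1800-1960 average, before any language existed,
# captures each word's background rate in ordinary English text.
baseline = (years >= 1800) & (years <= 1960)
normalised = pct_python / pct_python[baseline].mean()
```

After this scaling, a value of 1.0 means "at the baseline rate", so the post‑1960 rise of each language name is directly comparable across Python, Pascal, and Perl.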
The author notes performance differences (Google’s chart renders in ~1 s versus the script’s ~8 min) and suggests future improvements to PyTubes, such as supporting smaller integer dtypes, richer filtering combinators, and enhanced string‑matching utilities.