Analyzing 1.4 Billion Google Ngram Records with Python, PyTubes, and NumPy
This article demonstrates how to download the 1.4‑billion‑row Google Books Ngram dataset, load it efficiently with the PyTubes library, aggregate it and compute percentages with NumPy, and visualize word‑frequency trends for terms such as "Python", "Pascal", and "Perl" across two centuries of books.
The Google Books Ngram dataset contains 1.43 billion 1‑gram records spanning from the 16th century to 2008, each line storing a word, its publication year, the number of occurrences, and the number of books containing the word. The data can be downloaded for free from the official Google Ngram page.
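Each line of a 1‑gram file is tab‑separated in the order word, year, match count, volume count. A minimal parser for one such line (the sample record below is invented for illustration, not taken from the dataset):

```python
def parse_1gram(line):
    """Split one tab-separated 1-gram record into typed fields."""
    word, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return word, int(year), int(match_count), int(volume_count)

# Hypothetical record in the dataset's format: word, year, occurrences, books.
print(parse_1gram("Python\t2006\t4162\t1520"))
```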
Because the raw files total about 27 GB, loading them directly in Python would be slow and memory‑intensive. The article uses the PyTubes library, which streams tab‑separated files and allows row‑wise filtering and transformation without loading the whole dataset into memory.
Loading the data (timed on a 2016 MacBook Pro with 8 GB of RAM) is performed with the following code:

    import glob
    from os import path

    import numpy as np
    import tubes

    FILES = glob.glob(path.expanduser("~/src/data/ngrams/1gram/googlebooks*"))
    WORD = "Python"

    one_grams_tube = (tubes.Each(FILES)
        .read_files()
        .split()
        .tsv(headers=False)
        .multi(lambda row: (
            row.get(0).equals(WORD.encode('utf-8')),
            row.get(1).to(int),
            row.get(2).to(int)
        ))
    )
    # Materialise the streaming tube as a NumPy structured array.
    one_grams = np.array(one_grams_tube)

After about 170 seconds the conversion yields a NumPy array, one_grams, with roughly 1.4 billion rows, each holding the fields Is_Word, Year, and Count.
To compute the total number of words per year, NumPy's histogram function is used, with one bin per year and the per-row counts as weights:

    last_year = 2008
    YEAR_COL = '1'
    COUNT_COL = '2'

    year_totals, bins = np.histogram(
        one_grams[YEAR_COL],
        density=False,
        range=(0, last_year + 1),
        bins=last_year + 1,
        weights=one_grams[COUNT_COL]
    )

Only records from 1800 onward are kept, to avoid the steep drop‑off in the data before that year:
    one_grams_tube = (tubes.Each(FILES)
        .read_files()
        .split()
        .tsv(headers=False)
        .skip_unless(lambda row: row.get(1).to(int).gt(1799))
        .multi(lambda row: (
            row.get(0).equals(WORD.encode('utf-8')),
            row.get(1).to(int),
            row.get(2).to(int)
        ))
    )

The resulting filtered array contains about 1.3 billion rows (only 3.7 % of the data falls before 1800). The yearly percentage of the target word is then calculated:
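The skip_unless step behaves like a streaming predicate filter: rows that fail the test are discarded as they pass through, without ever being buffered. In plain Python the same shape looks like this (a sketch of the behaviour, not of pytubes internals):

```python
def skip_unless(rows, predicate):
    """Yield only the rows satisfying the predicate, one at a time."""
    for row in rows:
        if predicate(row):
            yield row

# Toy rows shaped like (word, year, count).
rows = [("Python", 1750, 3), ("Python", 1800, 5), ("Python", 1950, 9)]
kept = list(skip_unless(rows, lambda r: r[1] > 1799))
print(len(kept))  # 2
```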
    IS_WORD_COL = '0'
    word_rows = one_grams[IS_WORD_COL]
    word_counts = np.zeros(last_year + 1)
    for _, year, count in one_grams[word_rows]:
        word_counts[year] += (100 * count) / year_totals[year]

Plotting word_counts yields a curve similar to Google's own visualization, showing the rise of "Python" mentions over time. The article also compares the mentions of "Python", "Pascal", and "Perl" after normalising each language's counts to percentages over 1800‑1960, and presents side‑by‑side images of the author's results and Google's original charts.
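The Python-level loop over matching rows can also be expressed with vectorised NumPy calls, summing the word's counts per year with np.bincount and dividing by the yearly totals in one shot. A sketch on synthetic data (the totals and rows below are invented, and the zero-total guard is an addition not present in the original code):

```python
import numpy as np

last_year = 1805
# Synthetic per-year totals for all words, and the target word's rows.
year_totals = np.zeros(last_year + 1)
year_totals[1800:1806] = [100, 200, 0, 400, 100, 50]

word_years = np.array([1800, 1800, 1803, 1805])
word_counts_raw = np.array([1, 2, 4, 5])

# Sum the word's counts per year in one call, then convert to percentages,
# leaving years with a zero total at 0 instead of dividing by zero.
per_year = np.bincount(word_years, weights=word_counts_raw,
                       minlength=last_year + 1)
with np.errstate(divide='ignore', invalid='ignore'):
    word_pct = np.where(year_totals > 0,
                        100 * per_year / year_totals, 0.0)

print(word_pct[1800])  # (1 + 2) / 100 * 100 = 3.0
```

On 1.3 billion rows this avoids a billion-iteration Python loop, which is where most of the runtime of the loop version goes.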
Performance notes: the Google service generates the chart in about one second, while the Python script takes roughly eight minutes on the same hardware. Pre‑computing yearly totals or indexing the first column in a database could dramatically reduce processing time.
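One way to realise the pre-computation suggestion is to aggregate the yearly totals once, save the array to disk, and reload it on later runs, skipping the full 27 GB scan. A minimal sketch (the cache path and stand-in computation are illustrative):

```python
import os
import tempfile

import numpy as np

def load_year_totals(cache_path, compute):
    """Load cached per-year totals, computing and saving them on first use."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    totals = compute()          # the expensive full-dataset aggregation
    np.save(cache_path, totals)
    return totals

# Demo with a cheap stand-in computation instead of the 27 GB scan.
cache = os.path.join(tempfile.mkdtemp(), "year_totals.npy")
first = load_year_totals(cache, lambda: np.arange(5, dtype=float))
second = load_year_totals(cache, lambda: None)  # served from the cache file
print(np.array_equal(first, second))  # True
```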
Overall, the piece demonstrates that with NumPy, PyTubes, and commodity hardware, loading, processing, and analysing a multi‑billion‑row dataset in Python is practical.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.