Analyzing 1.4 Billion Google Ngram Records with Python, PyTubes, and NumPy
This article demonstrates how to download the 1.4‑billion‑row Google Books Ngram dataset, load it efficiently with the PyTubes library, aggregate it and compute percentages with NumPy, and visualize word‑frequency trends for terms such as "Python", "Pascal", and "Perl" across two centuries of books.
The Google Books Ngram dataset contains 1.43 billion 1‑gram records spanning from the 16th century to 2008, each line storing a word, its publication year, the number of occurrences, and the number of books containing the word. The data can be downloaded for free from the official Google Ngram page.
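Each line of a 1‑gram file is tab‑separated in the order word, year, match count, volume count. A minimal parser for one such line (the sample record below is invented for illustration, not taken from the dataset):

```python
def parse_1gram(line):
    """Split one tab-separated 1-gram record into typed fields."""
    word, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return word, int(year), int(match_count), int(volume_count)

# Hypothetical record in the dataset's format: word, year, occurrences, books.
print(parse_1gram("Python\t2006\t4162\t1520"))
```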
Because the raw files total about 27 GB, loading them directly in Python would be slow and memory‑intensive. The article uses the PyTubes library, which streams tab‑separated files and allows row‑wise filtering and transformation without loading the whole dataset into memory.
Loading the data (timed on a 2016 MacBook Pro with 8 GB of RAM) is performed with the following code:

    import glob
    from os import path

    import numpy as np
    import tubes

    FILES = glob.glob(path.expanduser("~/src/data/ngrams/1gram/googlebooks*"))
    WORD = "Python"

    one_grams_tube = (tubes.Each(FILES)
        .read_files()
        .split()
        .tsv(headers=False)
        .multi(lambda row: (
            row.get(0).equals(WORD.encode('utf-8')),
            row.get(1).to(int),
            row.get(2).to(int)
        ))
    )
    # Materialise the streaming tube as a NumPy structured array.
    one_grams = np.array(one_grams_tube)

After about 170 seconds the conversion yields a NumPy array, one_grams, with roughly 1.4 billion rows, each holding the fields Is_Word, Year, and Count.
To compute the total number of words per year, NumPy's histogram function is used, with one bin per year and the per-row counts as weights:

    last_year = 2008
    YEAR_COL = '1'
    COUNT_COL = '2'

    year_totals, bins = np.histogram(
        one_grams[YEAR_COL],
        density=False,
        range=(0, last_year + 1),
        bins=last_year + 1,
        weights=one_grams[COUNT_COL]
    )

Only records from 1800 onward are kept, to avoid the steep drop‑off in the data before that year:
    one_grams_tube = (tubes.Each(FILES)
        .read_files()
        .split()
        .tsv(headers=False)
        .skip_unless(lambda row: row.get(1).to(int).gt(1799))
        .multi(lambda row: (
            row.get(0).equals(WORD.encode('utf-8')),
            row.get(1).to(int),
            row.get(2).to(int)
        ))
    )

The resulting filtered array contains about 1.3 billion rows (only 3.7 % of the data falls before 1800). The yearly percentage of the target word is then calculated:
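The skip_unless step behaves like a streaming predicate filter: rows that fail the test are discarded as they pass through, without ever being buffered. In plain Python the same shape looks like this (a sketch of the behaviour, not of pytubes internals):

```python
def skip_unless(rows, predicate):
    """Yield only the rows satisfying the predicate, one at a time."""
    for row in rows:
        if predicate(row):
            yield row

# Toy rows shaped like (word, year, count).
rows = [("Python", 1750, 3), ("Python", 1800, 5), ("Python", 1950, 9)]
kept = list(skip_unless(rows, lambda r: r[1] > 1799))
print(len(kept))  # 2
```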
    IS_WORD_COL = '0'
    word_rows = one_grams[IS_WORD_COL]
    word_counts = np.zeros(last_year + 1)
    for _, year, count in one_grams[word_rows]:
        word_counts[year] += (100 * count) / year_totals[year]

Plotting word_counts yields a curve similar to Google's own visualization, showing the rise of "Python" mentions over time. The article also compares the mentions of "Python", "Pascal", and "Perl" after normalising each language's counts to percentages over 1800‑1960, and presents side‑by‑side images of the author's results and Google's original charts.
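The Python-level loop over matching rows can also be expressed with vectorised NumPy calls, summing the word's counts per year with np.bincount and dividing by the yearly totals in one shot. A sketch on synthetic data (the totals and rows below are invented, and the zero-total guard is an addition not present in the original code):

```python
import numpy as np

last_year = 1805
# Synthetic per-year totals for all words, and the target word's rows.
year_totals = np.zeros(last_year + 1)
year_totals[1800:1806] = [100, 200, 0, 400, 100, 50]

word_years = np.array([1800, 1800, 1803, 1805])
word_counts_raw = np.array([1, 2, 4, 5])

# Sum the word's counts per year in one call, then convert to percentages,
# leaving years with a zero total at 0 instead of dividing by zero.
per_year = np.bincount(word_years, weights=word_counts_raw,
                       minlength=last_year + 1)
with np.errstate(divide='ignore', invalid='ignore'):
    word_pct = np.where(year_totals > 0,
                        100 * per_year / year_totals, 0.0)

print(word_pct[1800])  # (1 + 2) / 100 * 100 = 3.0
```

On 1.3 billion rows this avoids a billion-iteration Python loop, which is where most of the runtime of the loop version goes.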
Performance notes: the Google service generates the chart in about one second, while the Python script takes roughly eight minutes on the same hardware. Pre‑computing yearly totals or indexing the first column in a database could dramatically reduce processing time.
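One way to realise the pre-computation suggestion is to aggregate the yearly totals once, save the array to disk, and reload it on later runs, skipping the full 27 GB scan. A minimal sketch (the cache path and stand-in computation are illustrative):

```python
import os
import tempfile

import numpy as np

def load_year_totals(cache_path, compute):
    """Load cached per-year totals, computing and saving them on first use."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    totals = compute()          # the expensive full-dataset aggregation
    np.save(cache_path, totals)
    return totals

# Demo with a cheap stand-in computation instead of the 27 GB scan.
cache = os.path.join(tempfile.mkdtemp(), "year_totals.npy")
first = load_year_totals(cache, lambda: np.arange(5, dtype=float))
second = load_year_totals(cache, lambda: None)  # served from the cache file
print(np.array_equal(first, second))  # True
```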
Overall, the piece demonstrates that with NumPy, PyTubes, and commodity hardware, loading, processing, and analysing a multi‑billion‑row dataset in Python is practical.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.