Recreating Google Ngram Trends with Python, PyTubes, and NumPy

This article demonstrates how to download the Google 1‑gram dataset, load and filter billions of rows with the PyTubes library, compute yearly word frequencies using NumPy, and reproduce the classic Python usage trend chart while discussing performance considerations and future improvements.

MaGe Linux Operations

May 28, 2019

Google Ngram Viewer is a useful tool that visualizes word usage over time based on a massive corpus of scanned books. The author uses the public 1-gram dataset (covering the 16th century to 2008) to recreate the popularity curve of the word Python.

Challenge

The 1‑gram data expands to about 27 GB on disk, containing 1.43 billion rows across 38 files and roughly 24 million distinct words. Processing such a volume with plain Python is slow and memory‑inefficient, but NumPy can handle large numeric arrays efficiently.

Loading the data

All examples run on a 2016 MacBook Pro with 8 GB RAM; better hardware will improve performance.

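The dataset is tab-separated with four fields: word, year, count, and number of books. Every row with a year after 1799 is kept, because the yearly totals need all words; for each row the pipeline records whether the word equals Python, along with the year and the count.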

import glob
from os import path

import tubes

# All downloaded 1-gram files
FILES = glob.glob(path.expanduser("~/src/data/ngrams/1gram/googlebooks*"))
WORD = "Python"
one_grams_tube = (tubes.Each(FILES)
    .read_files()
    .split()
    .tsv(headers=False)
    # Drop rows from before 1800
    .skip_unless(lambda row: row.get(1).to(int).gt(1799))
    .multi(lambda row: (
        row.get(0).equals(WORD.encode('utf-8')),  # is this row the target word?
        row.get(1).to(int),                       # year
        row.get(2).to(int)                        # count
    )))

After about 170 seconds of loading, the tube's output is available as a NumPy array (called one_grams in the code below) with roughly 1.4 billion rows.
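To make the shape of that array concrete, here is a minimal NumPy-only sketch of what the pipeline computes, run on a few made-up rows instead of the real files. The sample lines, the helper name rows_to_array, and the field names '0', '1', '2' (chosen to match the column constants used later) are assumptions for illustration; this is not how PyTubes works internally.

```python
import numpy as np

WORD = "Python"

# Made-up sample rows in the 1-gram layout: word, year, count, book count.
SAMPLE_LINES = [
    "Python\t1795\t3\t2",
    "Python\t1900\t10\t4",
    "Pascal\t1900\t7\t3",
    "the\t1900\t1000\t50",
    "Python\t2008\t40\t12",
]

def rows_to_array(lines, word, min_year=1800):
    """Parse TSV lines into a structured array of (is_word, year, count)."""
    records = []
    for line in lines:
        token, year, count, _books = line.split("\t")
        year = int(year)
        if year < min_year:  # same effect as keeping only year > 1799
            continue
        records.append((token == word, year, int(count)))
    dtype = [("0", "?"), ("1", "i8"), ("2", "i8")]
    return np.array(records, dtype=dtype)

one_grams = rows_to_array(SAMPLE_LINES, WORD)
print(one_grams)
```

Keeping the non-matching rows, with only a boolean flag for the target word, is what allows the same array to serve both the yearly totals and the per-word counts.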

Yearly total word counts

To compute the percentage of Python usage each year, the total number of words per year is required. NumPy's histogram function, with one bin per year and the count column as weights, makes this straightforward.

import numpy as np

last_year = 2008
YEAR_COL = '1'
COUNT_COL = '2'
# One unit-width bin per year, so year_totals[year] is the total word count
year_totals, bins = np.histogram(
    one_grams[YEAR_COL],
    density=False,
    range=(0, last_year+1),
    bins=last_year+1,
    weights=one_grams[COUNT_COL]
)
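As a sanity check on how the weighted histogram behaves, here is a small sketch on invented numbers. With one bin per year and weights set to the counts, each bin ends up holding the sum of the counts for that year. The toy years and counts arrays are assumptions, not values from the dataset.

```python
import numpy as np

last_year = 2008

# Invented (year, count) pairs standing in for the real columns.
years = np.array([1900, 1900, 1900, 2008, 2008])
counts = np.array([10, 7, 1000, 40, 60])

year_totals, bins = np.histogram(
    years,
    density=False,
    range=(0, last_year + 1),
    bins=last_year + 1,  # one unit-width bin per year, so index == year
    weights=counts,
)

print(year_totals[1900], year_totals[2008])
```

Because the bin index equals the year, the result can be indexed directly with a year value in later steps.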

Computing yearly percentages

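IS_WORD_COL = '0'  # boolean flag column: first value produced by .multi()
word_rows = one_grams[IS_WORD_COL]
word_counts = np.zeros(last_year+1)
# Accumulate the word's share of all words, as a percentage, per year
for _, year, count in one_grams[word_rows]:
    word_counts[year] += (100 * count) / year_totals[year]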
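The Python-level loop above is fine when only a few hundred rows match. As an alternative sketch, the same per-year percentages can be computed without a loop using np.bincount. The toy structured array below is invented for illustration, and years with no words at all are left at zero to avoid dividing by zero.

```python
import numpy as np

last_year = 2008
IS_WORD_COL, YEAR_COL, COUNT_COL = "0", "1", "2"

# Invented rows: (is_word, year, count).
one_grams = np.array(
    [(True, 1900, 10), (False, 1900, 990), (True, 2008, 40), (False, 2008, 160)],
    dtype=[("0", "?"), ("1", "i8"), ("2", "i8")],
)

# Total words per year (index == year).
year_totals = np.bincount(
    one_grams[YEAR_COL], weights=one_grams[COUNT_COL], minlength=last_year + 1
)

# Counts for the target word only, again indexed by year.
word_rows = one_grams[one_grams[IS_WORD_COL]]
word_totals = np.bincount(
    word_rows[YEAR_COL], weights=word_rows[COUNT_COL], minlength=last_year + 1
)

# Percentage per year; years without any data stay at 0.
word_counts = np.zeros(last_year + 1)
np.divide(100 * word_totals, year_totals, out=word_counts, where=year_totals > 0)

print(word_counts[1900], word_counts[2008])
```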

Plotting word_counts yields a curve similar to Google's original chart, though the absolute percentages differ because the downloadable dataset does not match what Google counts exactly (for example, it includes tagged entries such as Python_VERB).

Performance notes

Google's chart renders in about one second, whereas the Python script takes roughly eight minutes on the laptop described above. Pre-computing yearly totals or storing intermediate results in a database can dramatically reduce runtime.
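One way to act on the pre-computation idea is to save the yearly totals once and reload them on later runs. The sketch below uses np.save and np.load with a hypothetical cache file name and a hypothetical compute_year_totals stand-in for the expensive pass; the article does not prescribe a particular storage format.

```python
import os
import tempfile

import numpy as np

def compute_year_totals(years, counts, last_year=2008):
    """Stand-in for the expensive full-dataset pass (hypothetical helper)."""
    return np.bincount(years, weights=counts, minlength=last_year + 1)

def load_or_compute_totals(cache_path, years, counts):
    """Reuse cached totals when present, otherwise compute and store them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    totals = compute_year_totals(years, counts)
    np.save(cache_path, totals)
    return totals

# Invented data; the cache lives in a temporary directory for this demo.
years = np.array([1900, 1900, 2008])
counts = np.array([10, 990, 200])
cache_path = os.path.join(tempfile.mkdtemp(), "year_totals.npy")

first = load_or_compute_totals(cache_path, years, counts)   # computes and saves
second = load_or_compute_totals(cache_path, years, counts)  # loads from disk
print(np.array_equal(first, second))
```

The yearly totals do not depend on the word being searched, so caching them means only the per-word counts need recomputing for each new query.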

Language‑war comparison

To illustrate the method's flexibility, the author compares mentions of three programming languages (Python, Pascal, and Perl) using only capitalized forms (e.g., Python, not python). Each word's percentages are normalized against its average over 1800-1960, a sensible baseline given that the Pascal language first appeared in 1970.
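The normalization step can be sketched as dividing each word's yearly percentage series by its own mean over the baseline years. The normalize_to_baseline helper, the inclusive treatment of the end year, and the toy series are assumptions for illustration; the article does not show the author's code for this step.

```python
import numpy as np

def normalize_to_baseline(percentages, start=1800, end=1960):
    """Scale a year-indexed series by its mean over start..end inclusive."""
    baseline = percentages[start:end + 1].mean()
    if baseline == 0:
        raise ValueError("baseline mean is zero; cannot normalize")
    return percentages / baseline

# Invented series indexed by year: flat at 0.5 until 1990, then doubled.
series = np.full(2009, 0.5)
series[1991:] = 1.0

normalized = normalize_to_baseline(series)
print(normalized[1900], normalized[2008])
```

After this step a value of 1.0 means "as common as during the baseline period", which puts words with very different absolute frequencies on one comparable scale.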

Future improvements to PyTubes

PyTubes currently stores all integers as 64-bit values, which is wasteful for many datasets. Adding support for 1-, 2-, and 4-byte integer types could cut memory usage by up to 60 %. Additional features such as combined filter conditions (AND/OR/NOT) and richer string-matching utilities (startswith, endswith, contains, is_one_of) are also planned.
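The potential saving can be estimated from NumPy dtype sizes alone. The sketch below compares the two integer columns (year and count) stored as 8-byte values with a hypothetical narrower layout using a 2-byte year and a 4-byte count. The narrower dtypes are an assumption about what would fit this data, not something PyTubes currently produces, and the row count is the approximate figure quoted earlier.

```python
import numpy as np

ROWS = 1_430_000_000  # approximate row count quoted earlier

# Year and count both stored as 8-byte integers.
wide = np.dtype([("1", "i8"), ("2", "i8")])

# Hypothetical narrower layout: uint16 year, uint32 count.
narrow = np.dtype([("1", "u2"), ("2", "u4")])

def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30

saving = 1 - narrow.itemsize / wide.itemsize
print(f"wide:   {wide.itemsize} bytes/row, {gib(ROWS * wide.itemsize):.1f} GiB")
print(f"narrow: {narrow.itemsize} bytes/row, {gib(ROWS * narrow.itemsize):.1f} GiB")
print(f"saving: {saving:.1%}")
```

If a column needs more range than assumed here, for example a count too large for 4 bytes, the saving shrinks accordingly.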

Contributions and patches are welcome.
