Recreating Google Ngram Trends with Python: Analyzing 1.4 Billion Rows Efficiently

This article demonstrates how to use Python, the PyTubes library, and NumPy to load, process, and visualize the massive Google Ngram 1‑gram dataset—over 1.4 billion records—showing performance considerations, data‑cleaning steps, and comparative language trends for Python, Pascal, and Perl.

MaGe Linux Operations

Apr 5, 2022

Google Ngram Viewer is a useful tool that visualizes word usage over time using a massive corpus scanned from books. As an example, the word Python (case‑sensitive) is plotted.

The chart originates from books.google.com/ngrams and shows the frequency of the word ‘Python’ across centuries.

The chart is driven by Google’s n‑gram dataset, which records how often each word or phrase appears in print for each year from the 16th century to 2008. Although the dataset does not contain every published book, it includes millions of titles and can be downloaded for free.

To reproduce the chart, the author used Python together with a new data‑loading library called PyTubes.

Challenges

The 1‑gram dataset expands to about 27 GB on disk, which is a large volume for Python to ingest. Python can read gigabytes of data at a time without trouble, but once that data is unpacked into Python objects and processed, speed and memory efficiency suffer.

In total, the dataset contains 1.43 billion rows spread across 38 source files, representing roughly 24 million unique words and part‑of‑speech tags from 1505 to 2008. Plain Python slows down considerably at a billion rows because it has no optimizations for such workloads. Fortunately, NumPy excels at handling large numeric arrays, making the analysis feasible with a few simple tricks.

String handling in Python is memory‑intensive, and NumPy only supports fixed‑length strings, which is suboptimal for words of varying lengths.
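The cost of fixed-length strings can be seen on a tiny scale. The sketch below uses three made-up words (not taken from the dataset) and shows that NumPy reserves the width of the longest word for every element.

```python
import numpy as np

# Hypothetical sample words; the real dataset has about 24 million distinct ones.
words = ["a", "Python", "internationalization"]

# A NumPy byte-string array has a single fixed width, set by the longest item.
fixed = np.array([w.encode("utf-8") for w in words])

bytes_needed = sum(len(w) for w in words)  # what the text itself occupies
bytes_reserved = fixed.nbytes              # what the array actually allocates

print(fixed.dtype, fixed.itemsize)
print(bytes_needed, bytes_reserved)
```

Every short word pays for the longest one, which is why storing the word column directly is unattractive at this scale.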

Loading the Data

All examples were run on a 2016 MacBook Pro with 8 GB RAM; better hardware would improve performance.

The 1‑gram files are tab‑delimited, with one record per line. Each line has four fields: the word, the year of publication, the number of times the word was seen that year, and the number of books that contained it. For chart generation we only need the word, year, and count columns.

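Using PyTubes, only the relevant fields are extracted. To avoid the overhead of variable‑length strings, the word itself is not stored; each row instead carries a flag saying whether the word matches the one of interest (the Is_Word column below).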

After loading, the 1‑gram data becomes a NumPy array with about 1.4 billion rows, displayed as:

Is_Word  Year  Count
0        1799   2
0        1804   1
0        1805   1
...

From here, the analysis reduces to NumPy operations.
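The same three-column layout can be imitated on a tiny scale with plain Python and NumPy. This is a minimal sketch and not the PyTubes code the author ran: the sample rows, the TARGET constant, and the load_rows helper are all hypothetical, and a per-line Python loop like this would be far too slow for the full dataset.

```python
import numpy as np

# Hypothetical rows in the 1-gram layout: word, year, match count, volume count.
SAMPLE = "Python\t1799\t2\t2\nother\t1804\t1\t1\nPython\t1805\t1\t1\n"
TARGET = "Python"


def load_rows(text, target):
    """Parse tab-delimited rows into an (is_word, year, count) structured array."""
    rows = []
    for line in text.splitlines():
        word, year, count, _volumes = line.split("\t")
        # Store a flag instead of the word to avoid variable-length strings.
        rows.append((word == target, int(year), int(count)))
    dtype = [("is_word", np.bool_), ("year", np.int64), ("count", np.int64)]
    return np.array(rows, dtype=dtype)


one_grams = load_rows(SAMPLE, TARGET)
print(one_grams)
```

Each column can then be pulled out by name, for example one_grams["year"], and handed to ordinary NumPy functions.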

Computing Yearly Word Totals

Google reports each word’s percentage of total word usage per year (word count / total words that year). To compute this, we first need the total word count per year, which NumPy makes straightforward.
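One way to get the per-year totals is a weighted bincount, sketched below. The article does not say which NumPy call the author used, so treat np.bincount as an assumption; the years and counts are invented for illustration.

```python
import numpy as np

LAST_YEAR = 2008

# Hypothetical (year, count) pairs standing in for the full 1.4-billion-row array.
years = np.array([1799, 1804, 1804, 2008])
counts = np.array([2, 1, 5, 7])

# One slot per year from 0 through LAST_YEAR; each count is added to its year's slot.
year_totals = np.bincount(years, weights=counts, minlength=LAST_YEAR + 1)

print(year_totals[1804])
```

The result is indexed directly by year, so year_totals[1804] holds the combined count of every row dated 1804.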

Plotting the totals shows that word counts fall away sharply for years before 1800, where the thin data can distort trends. Therefore, only data from 1800 onward is retained, leaving about 1.3 billion rows (only 3.7% of rows fall before 1800).
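Dropping the early years is a boolean-mask operation in NumPy. The sketch below uses invented rows; the author may instead have filtered during loading, which the article does not specify.

```python
import numpy as np

FIRST_YEAR = 1800

# Hypothetical rows; two of the four fall before the cutoff.
years = np.array([1650, 1799, 1800, 1995])
counts = np.array([3, 2, 4, 10])

# True for rows to keep; the same mask is applied to every column.
keep = years >= FIRST_YEAR
years_kept = years[keep]
counts_kept = counts[keep]

# Share of rows that were discarded.
dropped_fraction = 1 - keep.mean()
print(years_kept, counts_kept, dropped_fraction)
```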

The percentage of Python per year is then easily calculated using a simple NumPy trick: an array of length 2009 (indices 0 through 2008) in which the index is the year, allowing direct access like year_array[1995].
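The year-indexed trick can be sketched as follows, with invented rows. The arrays are sized LAST_YEAR + 1 so that index 2008 is valid, and the variable names are illustrative rather than the author's.

```python
import numpy as np

LAST_YEAR = 2008

# Hypothetical rows: is_word flags the target word, as in the loaded array.
is_word = np.array([False, True, False, True])
years = np.array([1995, 1995, 2008, 2008])
counts = np.array([90, 10, 150, 50])

# Totals for all words and for the target word, both indexed by year.
year_totals = np.bincount(years, weights=counts, minlength=LAST_YEAR + 1)
word_totals = np.bincount(
    years[is_word], weights=counts[is_word], minlength=LAST_YEAR + 1
)

# Divide only where a year has data; years without data stay at 0.
percent = np.zeros(LAST_YEAR + 1)
np.divide(100 * word_totals, year_totals, out=percent, where=year_totals > 0)

print(percent[1995], percent[2008])
```

Looking up a year is then a plain index operation, with no searching or dictionary involved.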

Performance

Google’s own chart renders in about one second, whereas the Python script takes roughly eight minutes. Pre‑computing yearly totals and storing them in a lookup table or separate database would dramatically reduce runtime.
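A simple form of pre-computation is to save the yearly totals to disk once and reload them on later runs. The sketch below uses np.save and np.load; the load_or_compute_totals helper, the file name, and the stand-in slow_totals function are hypothetical, and a real lookup table or database would be organized differently.

```python
import os
import tempfile

import numpy as np


def load_or_compute_totals(cache_path, compute):
    """Return cached totals if present; otherwise compute and cache them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    totals = compute()
    np.save(cache_path, totals)
    return totals


calls = []


def slow_totals():
    # Stand-in for the expensive pass over the full dataset.
    calls.append(1)
    return np.bincount(
        np.array([1995, 2008]), weights=np.array([100, 200]), minlength=2009
    )


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "year_totals.npy")
    first = load_or_compute_totals(cache, slow_totals)   # computes and saves
    second = load_or_compute_totals(cache, slow_totals)  # reads from disk

print(len(calls))
```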

This exploration shows that with NumPy, the emerging PyTubes library, and commodity hardware, loading, processing, and extracting statistics from a billion‑row dataset is practical.

Language War

To illustrate the approach, the author compared three programming languages (Python, Pascal, and Perl) by extracting their case‑sensitive mentions (e.g., “Python”, not “python”) and expressing each word’s yearly percentage relative to a baseline taken from 1800 to 1960. Pascal, the oldest of the three languages, first appeared in 1970, so mentions in the baseline period cannot refer to the languages themselves.
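One possible reading of this normalization, assumed here rather than confirmed by the article, is to subtract each word's average percentage over 1800 to 1960 so that earlier, non-programming uses of the word are discounted. The baseline_adjust helper and the numbers below are hypothetical.

```python
import numpy as np


def baseline_adjust(percent_by_year, start=1800, end=1960):
    """Subtract the mean of the baseline years (inclusive) from every year."""
    baseline = percent_by_year[start:end + 1].mean()
    return percent_by_year - baseline


# Hypothetical year-indexed percentages for a single word.
percent = np.zeros(2009)
percent[1800:1961] = 0.002  # steady background use before the languages existed
percent[1995] = 0.012       # later use, including the programming language

adjusted = baseline_adjust(percent)
print(adjusted[1900], adjusted[1995])
```

After the adjustment, baseline years sit near zero and later years show only the increase over that background.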

Comparisons with Google’s unadjusted data are also shown.

The entire process took just over ten minutes.

Future PyTubes Improvements

PyTubes currently has a single integer type, 64‑bit, which is wasteful for n‑gram data. Adding support for 1, 2, and 4‑byte integer types could reduce memory usage by up to 60% (the full ndarray is ~38 GB). Additional filtering logic (e.g., Tube.skip_unless()) and richer string‑matching utilities (startswith, endswith, contains, is_one_of) are also planned.
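The effect of narrower integers can be estimated with NumPy alone. In this sketch the per-column maxima are assumptions chosen for illustration, and np.min_scalar_type picks the smallest type that can hold each one; it does not reflect how PyTubes would lay out its data.

```python
import numpy as np

# Hypothetical largest value expected in each column.
column_max = {"is_word": 1, "year": 2008, "count": 3_000_000_000}

# Every column as a 64-bit integer, as with a single integer type.
wide = np.dtype([(name, np.int64) for name in column_max])

# Each column in the smallest unsigned type that fits its maximum.
narrow = np.dtype(
    [(name, np.min_scalar_type(top)) for name, top in column_max.items()]
)

print(wide.itemsize, narrow.itemsize)  # bytes per row in each layout
```

The actual saving depends on the real value ranges in the data and on which integer widths a library supports.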
