
Analyzing Google Ngram Data with Python, NumPy, and PyTubes

This article demonstrates how to download the Google 1‑gram dataset, load its 1.4 billion rows using NumPy and the PyTubes library, compute yearly word‑frequency percentages for terms like Python, and visualize the results while discussing performance challenges and future improvements.

Python Programming Learning Circle

Google Ngram Viewer is a useful tool that visualizes the frequency of words over time by scanning massive corpora from printed books.

The author uses the word Python as an example and notes that the 1‑gram dataset occupies about 27 GB on disk, containing roughly 1.43 billion rows spread across 38 source files and 24 million distinct words with part‑of‑speech tags.

Processing such a volume in native Python is slow and memory‑intensive, so the author leverages NumPy together with a new data‑loading library called PyTubes to efficiently read and filter the tab‑separated files.
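The article's PyTubes pipeline is not reproduced here, but the same parse-and-filter step can be illustrated in pure Python and NumPy. In this sketch the input lines, the `load_year_counts` helper, and the sample data are all hypothetical; the real script would stream the 38 gzipped source files through PyTubes instead of a Python loop, which is where the speedup comes from.

```python
import numpy as np

def load_year_counts(lines, target=b"Python"):
    """Parse tab-separated 1-gram lines into (is_match, year, count) rows.

    `lines` is any iterable of bytes records. This loop stands in for the
    PyTubes pipeline, which does the same splitting and conversion in
    native code across all 38 source files.
    """
    rows = []
    for line in lines:
        word, year, count = line.rstrip(b"\n").split(b"\t")[:3]
        rows.append((word == target, int(year), int(count)))
    return np.array(rows, dtype=np.int64)

# Tiny illustrative sample -- not real ngram data:
sample = [b"Python\t1980\t4", b"snake\t1980\t9", b"Python\t1981\t6"]
arr = load_year_counts(sample)
```

Collapsing each row to a match flag, a year, and a count keeps the working array small enough to hold all 1.4 billion records in memory as fixed-width integers.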

Once parsed, each record reduces to a match flag, a year, and a count:

╒═════════╤══════╤═══════╕
│ Is_Word │ Year │ Count │
╞═════════╪══════╪═══════╡
│       0 │ 1799 │     2 │
│       0 │ 1804 │     1 │
│       0 │ 1805 │     1 │
│       0 │ 1811 │     1 │
│       0 │ 1820 │   ... │
╘═════════╧══════╧═══════╛

After loading, NumPy makes it straightforward to compute the total number of words per year and the percentage that each target word represents, enabling the recreation of Google’s original plots.
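With the data in a single array, per-year totals and the target word's share reduce to vectorized operations. A minimal sketch with fabricated numbers, assuming the three-column layout above (`is_match`, `year`, `count`):

```python
import numpy as np

# Columns: is_match, year, count -- toy data, not the real corpus
data = np.array([
    [1, 1800, 2], [0, 1800, 10],
    [1, 1801, 5], [0, 1801, 20],
])
years = data[:, 1] - 1800          # offset so bincount stays compact
counts = data[:, 2]

# Total words printed per year (all rows)
year_totals = np.bincount(years, weights=counts)

# Occurrences of the target word per year (matching rows only)
match = data[:, 0] == 1
word_totals = np.bincount(years[match], weights=counts[match],
                          minlength=len(year_totals))

percent = 100.0 * word_totals / year_totals
```

`np.bincount` with a `weights` argument is effectively a group-by-sum over the year column, which is why no explicit Python loop over 1.4 billion rows is needed.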

To avoid distortions caused by sparse early‑century data, the analysis discards records before 1800, reducing the dataset to about 1.3 billion rows (the pre‑1800 portion accounts for only 3.7 % of the total).
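Discarding the sparse early years is a single boolean mask over the year column; a sketch, again assuming the same three-column layout and toy values:

```python
import numpy as np

# Columns: is_match, year, count -- toy data
data = np.array([[0, 1750, 3], [1, 1800, 2], [0, 1822, 7]])

# Keep only records from 1800 onward
data = data[data[:, 1] >= 1800]
```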

The script then calculates yearly percentages for Python and visualizes them, noting that Google’s own rendering takes about one second while the Python script requires several minutes, highlighting opportunities for optimization such as pre‑computing yearly totals or indexing the data.
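One of the optimizations mentioned, pre-computing the yearly totals, amounts to doing the expensive aggregation once and reloading the result on later runs. A sketch using NumPy's native serialization; the file name and stand-in values are illustrative:

```python
import os
import tempfile
import numpy as np

# Stand-in for the real per-year sums computed from the full dataset
year_totals = np.array([12.0, 25.0])

path = os.path.join(tempfile.gettempdir(), "year_totals.npy")
np.save(path, year_totals)   # one-off pre-computation
cached = np.load(path)       # later runs skip the 1.4-billion-row scan
```

Since the totals only change when the source corpus does, caching them removes the dominant cost of every subsequent per-word query.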

As a more complex case study, the author compares mentions of three programming languages—Python, Pascal, and Perl—by filtering for capitalized forms and normalizing counts to percentages from 1800 onward, presenting the resulting trends alongside Google’s baseline.
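The three-language comparison follows the same recipe applied once per word: filter matching rows, sum counts per year, and divide by the yearly totals. A toy sketch with fabricated counts (not the real trends), assuming per-word count arrays already aligned on a shared year axis:

```python
import numpy as np

years = np.arange(1800, 1805)
year_totals = np.array([100, 120, 150, 160, 200], dtype=float)

# Fabricated per-year counts for each capitalized language name
lang_counts = {
    "Python": np.array([0, 0, 1, 2, 4], dtype=float),
    "Pascal": np.array([1, 2, 2, 3, 3], dtype=float),
    "Perl":   np.array([0, 1, 1, 2, 2], dtype=float),
}

# Normalize each word's counts to a percentage of all words that year
percentages = {w: 100.0 * c / year_totals for w, c in lang_counts.items()}
```

Normalizing to percentages rather than raw counts is what makes the curves comparable across years, since the total volume of printed text grows over time.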

Finally, the article outlines planned enhancements for PyTubes, including support for smaller integer widths (1-, 2-, and 4-byte integers), richer filtering combinators, and improved string-matching utilities to further speed up large-scale text processing.

Tags: big data, Python, data analysis, Visualization, NumPy, ngram
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
