
Large‑Scale Twitter Data Collection and Analysis: From Crawling to Sentiment and Market Correlation

The article describes a two‑year project that crawled roughly 40 billion tweets, the statistical and sentiment analyses that linked sleep patterns, weekdays, holidays, and market indices, and the low‑cost technical infrastructure built to store and query the massive dataset.


During a summer internship at Google in 2011, the author began a personal project to crawl Twitter, eventually collecting roughly 40 billion tweets from more than ten million users. The motivation was a research paper that used aggregate Twitter mood to predict stock‑market movements.

Analysis section

Word‑frequency statistics show that many users tweet the word "sleep" just before going to bed, and that mentions of "Thursday" spike on Thursdays, with an especially large spike on February 2nd (itself a Thursday that year). Unsupervised learning was used to assign each tweet a happiness score between 0 and 1; daily averages reveal clear weekly cycles, with peaks on weekends and pronounced spikes on New Year's Day and Valentine's Day.
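The article does not say which unsupervised method produced the scores, so the following is a minimal stand‑in: the simplest lexicon‑style scorer, which counts positive and negative words and returns the positive fraction. The word lists and class name are illustrative, not the project's.

```java
import java.util.Set;

// Illustrative lexicon-based scorer; the article does not disclose the
// actual model, so treat this as a stand-in for whatever unsupervised
// method produced the 0-1 happiness scores.
public class HappinessScorer {
    // Tiny hypothetical lexicons; a real system would learn or load
    // thousands of weighted terms.
    private static final Set<String> POSITIVE = Set.of("happy", "love", "great", "win");
    private static final Set<String> NEGATIVE = Set.of("sad", "hate", "awful", "lose");

    /** Returns a score in [0, 1]; 0.5 means neutral or no signal. */
    public static double score(String tweet) {
        int pos = 0, neg = 0;
        for (String token : tweet.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) pos++;
            if (NEGATIVE.contains(token)) neg++;
        }
        if (pos + neg == 0) return 0.5;        // no sentiment words found
        return (double) pos / (pos + neg);     // fraction of positive hits
    }

    public static void main(String[] args) {
        System.out.println(score("I love Fridays, great day")); // 1.0
        System.out.println(score("awful Monday, I hate it"));   // 0.0
    }
}
```

Averaging the score over every tweet posted on a given day yields the daily series whose weekly cycles and holiday spikes the article describes.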

Filtering the dataset down to users identified as investors or traders shows that their average daily sentiment tracks the S&P 500 index, suggesting this group is measurably happier when the market performs well.

During the 2012 U.S. election year, the proportion of Obama‑related tweets that also mention the economy correlates positively with the unemployment rate and negatively with the S&P 500, suggesting that economic hardship drives political criticism.
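Both findings are, at bottom, correlations between pairs of daily time series. The article does not publish its computation; the sketch below uses a plain Pearson coefficient, the standard tool for this, with made‑up numbers.

```java
// Plain Pearson correlation between two equal-length daily series,
// e.g. daily sentiment vs. daily S&P 500 close. Illustrative only.
public class Pearson {
    public static double correlate(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0)
            throw new IllegalArgumentException("series must be same non-zero length");
        double n = x.length, sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < x.length; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        double vx  = sxx - sx * sx / n;
        double vy  = syy - sy * sy / n;
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] sentiment = {0.52, 0.55, 0.51, 0.58, 0.60};  // invented values
        double[] sp500     = {1310, 1325, 1308, 1340, 1355};  // invented closes
        System.out.printf("r = %.3f%n", correlate(sentiment, sp500));
    }
}
```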

The author notes that these analyses are statistically informal, suffer from demographic bias, and should be viewed as exploratory.

Technical section

At the time, Twitter's API allowed each IP address roughly 150 requests per hour, so crawling the full user base of hundreds of millions of accounts from a single machine was out of the question. By filtering out inactive, non‑English, and low‑follower accounts, the author cut the target set to about 10 million users; even then, at one request per user per pass, a single IP would need about 67,000 hours (10,000,000 / 150), close to a decade, to cover them all.
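A sketch of the kind of pre‑filter this implies, in Java as the article says the project was written: the field names and thresholds (six months of inactivity, 50 followers) are invented for illustration, since the article gives no actual cutoffs.

```java
// Hypothetical user pre-filter; fields and thresholds are illustrative,
// not the article's actual criteria.
public record TwitterUser(long id, String lang, int followers,
                          long lastTweetEpochMs) {

    private static final long SIX_MONTHS_MS = 180L * 24 * 60 * 60 * 1000;

    /** Keep active, English-language accounts with a minimal audience. */
    public boolean worthCrawling(long nowMs) {
        boolean active  = nowMs - lastTweetEpochMs < SIX_MONTHS_MS;
        boolean english = "en".equals(lang);
        boolean visible = followers >= 50;   // assumed cutoff
        return active && english && visible;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        TwitterUser u = new TwitterUser(42L, "en", 120, now);
        System.out.println(u.worthCrawling(now)); // true
    }
}
```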

To accelerate the process, the author gathered free public proxy servers and managed them with a custom proxy‑management system. This yielded a few hundred usable proxies on a typical day; a few hundred proxies at 150 requests per hour each comes to tens of thousands of requests per hour, enough for a full pass over all ten million users in roughly two weeks.
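The proxy manager was never open‑sourced, so the sketch below is only a plausible shape for it: probe candidate proxies, keep the responsive ones, and rotate through them. The probe endpoint (a v1‑era Twitter test URL), the timeouts, and the candidate source are all assumptions.

```java
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical proxy pool: probe candidate proxies, keep the live ones,
// and hand them out round-robin.
public class ProxyPool {
    private final ConcurrentLinkedQueue<Proxy> live = new ConcurrentLinkedQueue<>();

    /** Probe each candidate host:port and keep those that answer quickly. */
    public void refresh(List<InetSocketAddress> candidates) {
        for (InetSocketAddress addr : candidates) {
            Proxy proxy = new Proxy(Proxy.Type.HTTP, addr);
            if (isAlive(proxy)) live.add(proxy);
        }
    }

    private boolean isAlive(Proxy proxy) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                new URL("https://api.twitter.com/1/help/test.json") // v1-era endpoint
                    .openConnection(proxy);
            conn.setConnectTimeout(3000);
            conn.setReadTimeout(3000);
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;   // dead or too slow: drop it
        }
    }

    /** Rotate: take a proxy from the head, requeue it at the tail. */
    public Proxy next() {
        Proxy p = live.poll();
        if (p != null) live.add(p);
        return p;
    }

    public static void main(String[] args) {
        ProxyPool pool = new ProxyPool();
        // Candidates would come from free proxy-list sites; placeholder here.
        pool.refresh(List.of(new InetSocketAddress("127.0.0.1", 8080)));
        System.out.println("next proxy: " + pool.next());
    }
}
```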

Crawl frequency was also throttled dynamically by follower count, so that high‑profile accounts were revisited more often and their tweets were captured closer to real time.
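The article states only the idea, not the formula. One plausible reading, with invented constants, is a revisit interval that shrinks as the logarithm of the follower count grows:

```java
import java.time.Duration;

// Hypothetical scheduling rule: more followers -> shorter revisit interval.
// The constants are invented; the article only states the idea.
public class CrawlScheduler {
    private static final Duration MIN_INTERVAL = Duration.ofMinutes(10);
    private static final Duration MAX_INTERVAL = Duration.ofDays(14);

    /** Interval between visits shrinks as log10(followers) grows. */
    public static Duration revisitInterval(int followers) {
        double scale = Math.log10(Math.max(followers, 10));  // 1.0 .. ~8.0
        long minutes = (long) (MAX_INTERVAL.toMinutes() / Math.pow(2, scale));
        return Duration.ofMinutes(Math.max(minutes, MIN_INTERVAL.toMinutes()));
    }

    public static void main(String[] args) {
        System.out.println(revisitInterval(100));        // ordinary account: every few days
        System.out.println(revisitInterval(5_000_000));  // high-profile account: every few hours
    }
}
```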

Over roughly a year and a half, the system harvested about 40 billion tweets (~10 TB), by the author's estimate around half of all U.S. tweets posted in that period. Storage was a self‑built 12 TB disk array running MySQL, heavily tuned with partitioning, ordering, and indexing so that a linear scan could cover 10–20 billion rows per day and any specific day or tweet could be retrieved quickly.
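The schema is not described beyond "partitioning, ordering, and indexing," so the JDBC sketch below is merely consistent with that description: range‑partition by day so a date lookup touches a single partition, and order the primary key so full scans read sequentially. The connection URL, credentials, and column sizes are placeholders, and the MySQL Connector/J driver is assumed on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical MySQL layout consistent with the tuning described:
// range-partitioned by day so a date lookup touches one partition,
// primary key ordered so a full scan is one sequential read.
public class SchemaSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/tweets", "user", "pass");
             Statement st = conn.createStatement()) {
            st.execute("""
                CREATE TABLE tweet (
                  day      DATE    NOT NULL,   -- partition key
                  user_id  BIGINT  NOT NULL,
                  tweet_id BIGINT  NOT NULL,
                  body     VARBINARY(560),     -- 140 chars of UTF-8, worst case
                  PRIMARY KEY (day, user_id, tweet_id)
                )
                PARTITION BY RANGE (TO_DAYS(day)) (
                  PARTITION p2011 VALUES LESS THAN (TO_DAYS('2012-01-01')),
                  PARTITION p2012 VALUES LESS THAN (TO_DAYS('2013-01-01')),
                  PARTITION pmax  VALUES LESS THAN MAXVALUE
                )""");
        }
    }
}
```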

The server was later donated to MIT for research use. The project ended in 2013, when Twitter's API changed and interest in social‑media analysis waned. The data was kept for academic use only and never sold, and the code (written in Java) was never open‑sourced because the author considered its quality too rough.

Tags: Big Data, Data Mining, Sentiment Analysis, Twitter, Web Crawling, Market Correlation
Written by Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
