How Twitter Evolved Its Search Engine: From MySQL to Earlybird and Beyond

This article explains the fundamentals of search engine architecture, covering text collection, indexing, ranking and evaluation, and then traces Twitter's internal search evolution from MySQL full‑text search to the Earlybird index server, Blender aggregation, and smart memory‑SSD strategies.

21CTO

Introduction

Search has progressed from single‑table queries to multi‑table and multi‑database queries, then to site‑internal full‑text search, and finally to online search engines. The same techniques now underpin many recommendation systems, such as those for products, videos, and articles.

Search Engine Principles

Search engines can be classified by usage scenario: internal search, public online search, and meta‑search. Architecturally they consist of two core components: indexing and querying.

1) Text Collection

Web crawlers (spiders, robots) fetch documents from the web, scanning and reading new content. Crawlers can be custom‑built or use standard implementations to collect news, articles, blogs, forums, products, videos, files, etc.

2) Index Creation

The index is essentially a large vocabulary in which each term is linked to the list of web pages that contain it. For example, when a user searches for "python", the engine looks up the term in its index and returns the matching pages. Index creation involves text transformation, tokenization, hyperlink analysis, and building inverted indexes.

Text transformation: convert raw text into indexable terms.

Index building: generate inverted lists and compute term weights.
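
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any production engine is written: the names tokenize, build_inverted_index, and search_and are hypothetical, the tokenizer is deliberately naive, and term weights are left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Text transformation: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Index building: map each term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, query):
    """Boolean AND query: return the documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Toy corpus, invented for illustration only.
docs = {
    1: "Python tutorial for beginners",
    2: "Search engines build inverted indexes",
    3: "Indexing tweets with Python",
}
index = build_inverted_index(docs)
print(index["python"])                        # documents containing "python"
print(search_and(index, "python indexing"))   # documents containing both terms
```

Looking up a term is a single dictionary access, which is why the inverted list is the central data structure of the indexing component.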

3) Query Processing

The query component provides a search box, handles user input, performs ranking, and returns results via an API.

4) Ranking

Ranking is the core of a search system, ordering documents based on relevance scores derived from the query and index model.
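
As a rough sketch of what a relevance score can look like, the example below ranks documents with a simple TF‑IDF weighting. TF‑IDF is chosen here only because it is a well‑known textbook model; the article does not say which ranking model any particular engine uses, and the function name rank and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def rank(query, docs):
    """Return (doc_id, score) pairs ordered by descending TF-IDF score."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                # Rare terms get a larger weight than common ones.
                idf = math.log(1 + n_docs / doc_freq[term])
                score += counts[term] * idf
        if score > 0:
            scores[doc_id] = score

    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy corpus, invented for illustration only.
docs = {
    1: "python python search",
    2: "python tutorial",
    3: "search engine ranking",
}
for doc_id, score in rank("python", docs):
    print(doc_id, round(score, 3))
```

Document 1 outranks document 2 for the query "python" because the term occurs twice in it; document 3 does not match and is omitted.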

5) Evaluation

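Evaluation measures both the quality of search results (how relevant they are to the query) and the efficiency with which they are produced (how quickly results are returned).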
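
Two standard quality measures are precision (the fraction of returned documents that are relevant) and recall (the fraction of relevant documents that were returned). The sketch below computes both for a single query; the function name precision_recall and the sample document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: document IDs the engine returned.
    relevant:  document IDs judged relevant by a human assessor.
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# The engine returned four documents; three are actually relevant, two of them were found.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall of about 0.67).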

Twitter's Internal Search Evolution

Initially, Twitter used MySQL full‑text search. New tweets were ingested by an "Ingester" service and stored in time‑sharded MySQL tables. Queries were limited to the most recent three days due to scale constraints.
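
To make the idea of time sharding concrete, the sketch below routes a tweet to a table by date and lists the tables a query over the most recent three days would need to touch. Everything here is an assumption for illustration: the article does not give the shard granularity or table names, so the one‑table‑per‑day layout, the tweets_YYYYMMDD naming, and the functions shard_table and tables_for_query are hypothetical.

```python
from datetime import date, timedelta


def shard_table(day):
    """Hypothetical naming scheme: one table per calendar day."""
    return f"tweets_{day:%Y%m%d}"


def tables_for_query(today, window_days=3):
    """Tables a query must scan when search is limited to the last `window_days` days."""
    return [shard_table(today - timedelta(days=offset)) for offset in range(window_days)]


print(shard_table(date(2024, 3, 1)))
print(tables_for_query(date(2024, 3, 1)))
```

Under this layout, restricting search to a short time window bounds the number of tables each query scans, which is one plausible reason a scale‑constrained system would impose such a limit.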

MySQL full‑text search was limiting: inserting data was difficult, the query syntax was restricted, there was no relevance scoring, and the system was hard to extend.

To meet more complex needs, Twitter adopted the open‑source Lucene library and built "Earlybird", a real‑time inverted‑index server supporting Boolean queries. Earlybird handled indexing while MySQL remained the storage layer.

Twitter deployed many Earlybird instances, each responsible for one partition of the data, and exposed a unified search API. The front end queried every Earlybird instance and aggregated the results itself, which increased its load.

Twitter later created "Blender" to merge results from multiple Earlybird clusters. As requirements grew (e.g., searching protected or grouped tweets), a separate "protectedEarlybird" service was introduced.
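
The aggregation pattern described above is often called scatter‑gather: send the query to every partition, then merge the partial results into one ranked list. The sketch below shows the general idea only; it is not Blender's actual implementation, and the functions search_partition and blend, the in‑memory partition format, and the sample scores are all hypothetical.

```python
import heapq
from itertools import islice


def search_partition(partition, term, limit):
    """Return up to `limit` (score, tweet_id) hits from one partition, best first."""
    hits = [
        (score, tweet_id)
        for tweet_id, (text, score) in partition.items()
        if term in text.lower().split()
    ]
    return sorted(hits, reverse=True)[:limit]


def blend(partitions, term, limit=3):
    """Scatter the query to every partition, then merge the sorted partial results."""
    per_partition = [search_partition(p, term, limit) for p in partitions]
    # Each partial list is already sorted best-first, so a k-way merge suffices.
    merged = heapq.merge(*per_partition, reverse=True)
    return [tweet_id for _score, tweet_id in islice(merged, limit)]


# Two toy partitions: tweet_id -> (text, relevance score). Invented for illustration.
partitions = [
    {101: ("earlybird indexes tweets", 0.9), 102: ("lucene search", 0.4)},
    {201: ("tweets about search", 0.7), 202: ("more tweets", 0.95)},
]
print(blend(partitions, "tweets", limit=2))
```

Because each partition returns only its own top hits, the merging service handles at most the partition count times the limit, rather than every matching tweet.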

To support full‑history search over billions of tweets, Twitter stored the top 2% highest‑quality tweets in memory and about 16% on SSDs, enabling efficient historical queries.

Over time, Earlybird evolved into a platform serving multiple customers, and Blender became a commercializable framework for building search‑related products.

Conclusion

The underlying architecture of search technology is a rich area for study and practical implementation.

References:
Twitter Search API: https://api.twitter.com/
Dynamic Memory Allocation Policies for Postings in Real‑Time Twitter Search: http://www.umiacs.umd.edu/~jimmylin/publications/Asadi_etal_KDD2013.pdf