
Rebuilding QQ Mail Full-Text Search with Elasticsearch: Architecture, Implementation, and Optimization

To overcome aging hardware and code limitations, QQ Mail rebuilt its full‑text search using Tencent Cloud Elasticsearch, adding an esproxy layer, MQ‑driven updates, HTML‑to‑text extraction, protobuf‑JSON conversion, index sorting, two‑stage precise/fuzzy queries, and custom tokenizers, delivering scalable, low‑latency email search.

Tencent Cloud Developer

With the rapid increase of user email volume, full‑text search has become a core function of any mailbox. QQ Mail’s self‑developed search engine, launched in 2008, is now constrained by aging storage hardware, data loss risks, complex and hard‑to‑maintain code, a custom KV store that is no longer serviced, lack of original‑text storage (preventing native highlighting), and no indexing of oversized attachment names.

1. Reconstruction Background – The aging machines and the above limitations provide an opportunity to rebuild the full‑text search backend and migrate the stored data.

2. New Full‑Text Search Architecture – The solution adopts Elasticsearch, a distributed search engine built on Lucene, for its scalability, stability, and maintainability. The QQ Mail fullsearch module communicates with Tencent Cloud ES via HTTP REST APIs, sending JSON payloads converted from the existing protobuf structures. An esproxy layer provides a curl connection pool, while a message queue (MQ) smooths traffic spikes for add/delete/update operations and ensures reliable delivery.
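The shape of such a REST indexing call can be sketched as follows. This is a minimal illustration, not the actual fullsearch code: the index name `mail_index`, the document ID scheme, and the field names are all assumptions for the example.

```python
import json

# Hypothetical sketch of the kind of REST request the fullsearch module
# would send to ES to index one email. The index name "mail_index", the
# "{uin}_{docid}" ID scheme, and the field names are illustrative only.
def build_index_request(uin, docid, subject, content):
    """Return the HTTP method, path, and JSON body for indexing one email."""
    path = f"/mail_index/_doc/{uin}_{docid}"
    body = json.dumps({
        "uin": uin,          # mailbox owner, used for per-user filtering
        "subject": subject,
        "content": content,  # plain text already stripped from the HTML body
    })
    return "PUT", path, body

method, path, body = build_index_request("12345", "m001", "hello", "body text")
```

In the real system this request would travel through the esproxy connection pool, and write operations would arrive asynchronously via the MQ rather than being issued inline.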

3. Email Search Characteristics – Unlike internet search, email search operates on a single user’s mailbox, requires exact results, and sorts primarily by time (with optional filters by sender, read status, etc.). This write‑heavy, read‑light pattern influences the design of indexing and query handling.

4. Backend Architecture Details

The overall flow of the fullsearch module is:

Upstream operations (add, delete, modify) trigger asynchronous updates to ES documents via MQ.

Search requests (ordinary and advanced) are processed synchronously, using two query types: match_phrase for precise matches and match (operator=and) for fuzzy matches.

Results are filtered to exclude deleted emails and to provide a fallback mechanism when ES is unavailable.
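The precise-then-fuzzy flow above can be sketched as a small fallback function. The `run_query` callable stands in for the actual ES client call, and the field name `content` is an assumption:

```python
def search(text, run_query):
    """Two-stage lookup: try a precise match_phrase query first; if it
    returns no hits, fall back to a fuzzy match query with operator=and.
    `run_query` stands in for the ES client call (an assumption here)."""
    precise = {"query": {"match_phrase": {"content": text}}}
    hits = run_query(precise)
    if hits:
        return hits, "precise"
    fuzzy = {"query": {"match": {"content": {"query": text,
                                             "operator": "and"}}}}
    return run_query(fuzzy), "fuzzy"

# Fake ES call for illustration: only the fuzzy query finds anything.
def fake_run(q):
    return ["doc1"] if "match_phrase" not in q["query"] else []

hits, stage = search("order 618", fake_run)
```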

5. Implementation Details – HTML Extraction

Emails store their body as HTML. To avoid wasteful storage and to enable accurate highlighting, the HTML is stripped to plain text. The extraction logic keeps only text nodes, extracts large‑attachment names, and removes nodes with display:none. Example HTML snippet:

<body class="global">
  <div class="container">
    <div class="head content">
      <h3>Hello!</h3>
    </div>
  </div>
</body>

Nodes such as <span style="display:none;">…</span> are also filtered:

<span style="display:none;">:http://wx.mail.qq.com/ftn/download?...</span>

The project uses pugixml for fast XML/HTML parsing, with ekhtml as a fallback for non‑well‑formed HTML.
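The extraction idea can be sketched in Python using the standard-library HTML parser (the actual backend uses pugixml/ekhtml in C++; this is only an illustration of the same logic):

```python
from html.parser import HTMLParser

class MailTextExtractor(HTMLParser):
    """Illustrative HTML-to-text extractor: keep text nodes, drop
    everything inside an element styled display:none."""
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # >0 while inside a display:none subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "") or ""
        if self.hidden_depth or "display:none" in style.replace(" ", ""):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.parts.append(data.strip())

extractor = MailTextExtractor()
extractor.feed('<body><h3>Hello!</h3>'
               '<span style="display:none;">hidden-link</span></body>')
text = " ".join(extractor.parts)
```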

6. Protobuf ↔ JSON Conversion

Since the mailbox backend heavily relies on protobuf, conversion to JSON for ES interaction is performed via Google's utilities MessageToJsonString and JsonStringToMessage, simplifying serialization and deserialization.
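The round trip can be illustrated with protobuf's Python bindings and the well-known `Struct` type (the mailbox backend uses its own message definitions, not shown in the article, and calls the C++ equivalents MessageToJsonString / JsonStringToMessage):

```python
from google.protobuf.struct_pb2 import Struct
from google.protobuf.json_format import MessageToJson, Parse

# Struct is used here only as a stand-in message type for illustration.
msg = Struct()
msg.update({"subject": "quarterly report", "uin": "12345"})

json_payload = MessageToJson(msg)           # protobuf -> JSON for the ES request
round_trip = Parse(json_payload, Struct())  # JSON response -> protobuf message
```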

7. Search Optimization

Index sorting (ES index.sort) stores documents in uin order, dramatically reducing the search range for a given user.

Two‑stage search: first attempt a match_phrase (precise) query; if no hits, fall back to a match (fuzzy) query. This covers 90% of requests with precise results while keeping latency low.

Result pruning for fuzzy searches removes low‑score hits, and field‑level boosting (e.g., higher weight for subject) improves relevance.

Adjusting match_phrase slop (e.g., slop=4) tolerates token distance variations without sacrificing precision.
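The optimizations above can be sketched as request bodies. The index and field names are illustrative, not taken from the article:

```python
# Hypothetical index settings: sort stored documents by uin so one
# user's emails are physically co-located.
index_settings = {
    "settings": {
        "index": {
            "sort.field": "uin",
            "sort.order": "asc",
        }
    }
}

def precise_query(text):
    # match_phrase with slop=4 tolerates small token-distance variations
    return {"query": {"match_phrase": {"content": {"query": text, "slop": 4}}}}

def fuzzy_query(text):
    # fuzzy stage with operator=and; subject weighted above body content
    return {"query": {"multi_match": {
        "query": text,
        "operator": "and",
        "fields": ["subject^3", "content"],  # ^3 boosts the subject field
    }}}
```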

8. Tokenizer Enhancements

The default IK tokenizer struggled with alphanumeric order numbers (e.g., AL0927_618). Two solutions were explored:

Pre‑process the query to insert a space between letters and numbers.

Develop a custom tokenizer xm_ik_max_word that filters out LETTER type tokens, preserving only the numeric part.

For cases where trailing letters are stop‑words (e.g., 20X07131A), a whitespace tokenizer is used, allowing users to control tokenization and achieve exact matches.
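The first option, pre-processing the query, can be sketched with a regular expression that inserts a space at every letter/digit boundary (the custom xm_ik_max_word tokenizer itself is not shown in the article):

```python
import re

# Insert a space wherever a letter run meets a digit run, so the
# tokenizer sees them as separate tokens. Illustrative sketch only.
def split_alnum(query):
    return re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", query)

result = split_alnum("AL0927_618")  # → "AL 0927_618"
```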

9. Conclusion

Leveraging Tencent Cloud ES as a PaaS solution enabled rapid construction of a scalable full‑text search service, resolved the legacy issues of the original system, and provided a flexible foundation for future enhancements such as custom query syntax and further tokenizer refinements.

Tags: Cloud Services, Backend Architecture, Indexing, Elasticsearch, Protobuf, Full-Text Search, Search Optimization
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
