How to Build a High‑Precision Exam Question Search Engine with BM25 and LTR
This article explains the architecture and key algorithms behind a specialized exam‑question search engine, covering query parsing, BM25‑based relevance scoring, text‑image hybrid retrieval, Learning‑to‑Rank models, and practical optimizations for long queries and large K‑12 datasets.
About the Author
Bin Bin graduated from Peking University’s Department of Mechanics and Engineering. He previously worked at Baidu and other tech firms on products such as OA, e‑publishing, instant messaging, browsers, international maps, autonomous driving, group buying, and web search. Since 2018 he has researched machine‑learning‑accelerated materials science at the University of Michigan (with results published in Nature), and he now leads search algorithm development for the "PaiTi" product.
Overview of Exam Question Search
Like general web search, the system must rank results by relevance, quality, and popularity, placing the best results first. User queries are split: textual parts are processed via OCR and text retrieval, while graphical parts are encoded by deep‑learning models for vector retrieval. The results from both modalities are merged and re‑ranked, with a focus on selecting the highest‑quality matches.
Compared with generic web search, the question‑search domain has a smaller, richer dataset, allowing more targeted accuracy improvements.
Key Challenges
Query Parsing
A typical query contains both text and graphics; the text is extracted via OCR, and the graphics are vectorized for similarity search. Real‑world query photos often suffer from overexposure, incomplete framing of the question, or missing content, which makes parsing significantly harder.
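The article does not spell out how the two candidate sets are combined. Below is a minimal sketch of one plausible merge step, assuming both retrieval paths return (doc_id, score) pairs whose scores have been normalized to a comparable range upstream; the final ordering is left to the later re‑ranking stage:

import operator

def merge_modalities(text_hits, image_hits):
    # Merge BM25 text hits and vector-search image hits; a document
    # recalled by both paths keeps its higher score.
    merged = {}
    for doc_id, score in text_hits + image_hits:
        merged[doc_id] = max(merged.get(doc_id, 0.0), score)
    return sorted(merged.items(), key=operator.itemgetter(1), reverse=True)

# Example: two candidates recalled by text, one of them also by image.
print(merge_modalities([("q1", 0.9), ("q2", 0.4)], [("q2", 0.8)]))
# -> [('q1', 0.9), ('q2', 0.8)]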
Basic Relevance
The core retrieval uses Elasticsearch. BM25 scores are used for initial recall, preparing data for later Learning‑to‑Rank (LTR) sorting. The BM25 implementation follows the formula below:
import math

def BM25(docfreq, N, freq, avgdl, dl):
    n = docfreq    # number of documents containing the term
    # N            # total number of documents with this field
    # avgdl = 58.7 # average length of the field (example value)
    # dl = 264     # length of this field (approximate example value)
    b = 0.75       # length normalization parameter
    k1 = 1.2       # term saturation parameter
    boost = 6.6    # field boost (6.6, or 3 for some fields)
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    score = boost * idf * tf
    return score

Optimizing tokenization, formula parsing, and feature extraction is crucial for recall quality.
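For intuition about the score scale, here is a call using the example values noted in the comments above; the corpus size comes from the conclusion, while the term statistics are assumed illustrative numbers:

# Suppose 1,200 of the 260 million indexed documents contain the term,
# and it appears twice (freq=2, an assumed value) in a field of length
# 264, against an average field length of 58.7.
print(BM25(docfreq=1200, N=260_000_000, freq=2, avgdl=58.7, dl=264))  # ≈ 25.5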
Text Similarity
Simple BM25 scores are insufficient for ranking because OCR errors and paraphrased questions are common. Multiple text‑similarity scoring functions are used as features for the LTR model to balance robustness to OCR noise and sensitivity to subtle textual differences.
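The exact scoring functions are not listed, but two common, complementary choices illustrate the balance being struck: token‑set overlap tolerates OCR noise and word reordering, while a character‑level ratio catches subtle differences such as a single changed digit. A minimal sketch:

import difflib

def jaccard_similarity(a_tokens, b_tokens):
    # Set overlap: robust to reordering, tolerant of a few OCR errors.
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def char_ratio(a_text, b_text):
    # Character-level ratio: sensitive to small edits, e.g., a changed number.
    return difflib.SequenceMatcher(None, a_text, b_text).ratio()

query = "solve the equation 2x + 3 = 7"
candidate = "solve the equation 2x + 3 = 9"   # differs by one digit only
print(jaccard_similarity(query.split(), candidate.split()))  # high overlap
print(char_ratio(query, candidate))  # close to, but below, 1.0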
Learning to Rank and Learning for Match
After recall, results are re‑ranked using an LTR model (LambdaRank) implemented with LightGBM and optimized for NDCG. Because NDCG is non‑differentiable, LambdaRank sidesteps direct optimization by weighting pairwise gradients ("lambdas") by the change in NDCG that swapping two results would produce. The LTR model uses over 80 features capturing relevance, quality, and popularity signals, trained on more than 40,000 labeled K‑12 questions, achieving a top‑1 accuracy of about 96%.
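A minimal LightGBM sketch of this setup, with synthetic data standing in for the production configuration of 80+ features and 40,000+ labeled questions:

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_queries, docs_per_query, n_features = 100, 10, 8   # toy sizes for illustration

# Each row is one candidate question; columns are relevance/quality/popularity features.
X = rng.normal(size=(n_queries * docs_per_query, n_features))
y = rng.integers(0, 4, size=n_queries * docs_per_query)  # graded relevance labels (0-3)
group = [docs_per_query] * n_queries                     # candidates grouped by query

ranker = lgb.LGBMRanker(
    objective="lambdarank",  # pairwise gradients weighted by NDCG deltas
    metric="ndcg",
    n_estimators=200,
)
ranker.fit(X, y, group=group)

# At serving time, candidates recalled by BM25 are re-ranked by predicted score.
scores = ranker.predict(X[:docs_per_query])
print(np.argsort(-scores))  # best candidate first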
The system also includes a pointwise “Learning for Match” model that refines LTR results by matching queries to exact questions using both image‑based and text‑based features.
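As a hedged illustration of the pointwise stage, a binary classifier over (query, candidate) pair features can decide whether a candidate is the exact question; the two features below (an image‑embedding similarity and a text‑similarity score) are assumed stand‑ins, not the production feature set:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: features of one (query, candidate) pair; labels mark exact matches.
pairs = np.array([[0.95, 0.98], [0.90, 0.60], [0.30, 0.40]])
labels = np.array([1, 0, 0])

clf = LogisticRegression().fit(pairs, labels)
# A pair is declared an "exact match" only above a confidence threshold.
match_prob = clf.predict_proba([[0.93, 0.97]])[0, 1]
print(match_prob > 0.5)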
System Implementation
While a full inverted‑index implementation would require dense hash maps, tries, and skip‑list compression, the team leverages Elasticsearch + Lucene to handle most low‑level details. Specific adaptations include:
Common Term Query: handling high‑frequency words and formula‑heavy queries by generating abstract feature tags that boost BM25 scores.
Field‑Specific Strategies: assigning higher weight to sub‑question text than to shared passage text to improve discrimination; a query sketch follows this list.
Long Query Reduction: extracting semantic fingerprints from queries averaging 60 words to improve recall efficiency and reduce latency.
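For the field‑weighting strategy above, a boosted multi‑field query along these lines could be issued against Elasticsearch (a sketch only; the index name `questions`, the field names `sub_question` and `shared_passage`, and the boost values are all hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Weight the sub-question field higher than the shared passage so that
# BM25 scores discriminate between sub-questions under the same stem.
query = {
    "multi_match": {
        "query": "ocr-extracted question text",
        "fields": ["sub_question^3", "shared_passage^1"],
    }
}
response = es.search(index="questions", query=query, size=100)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])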
Conclusion
The "PaiTi" system indexes over 260 million documents covering K‑12 subjects in multiple languages and formats. Although built on generic web‑search infrastructure, the unique data characteristics and query patterns demand specialized algorithms and engineering solutions. The authors acknowledge possible errors and invite feedback.
TiPaiPai Technical Team
At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.