Evaluation Framework for Search Retrieval Systems: Speed, Relevance, Recall, and Freshness
The article introduces a four‑dimensional evaluation framework for retrieval systems—Fast, Accurate, Complete, and New—explaining how each metric is measured, why it matters to users, and how crowdsourced testing across devices and networks can provide objective quality assessments.
A retrieval system, also known as a search engine, uses dedicated programs and crawling strategies to collect information from the Internet, organizes and processes it, and then returns results that match the user’s query.
The core task of a retrieval system is to return relevant information based on user input, and the quality of retrieval depends on how well this task is performed. Measuring retrieval quality is a fundamental problem in the field; only after understanding system performance can we improve it.
To ensure objectivity, we evaluate retrieval quality from the user’s perspective using a four‑dimensional framework: Fast, Accurate, Complete, and New.
1. Fast reflects the system’s response time after a user submits a query. Response time consists of network communication time plus server processing time. To obtain realistic speed figures, we recruit crowdsourced devices in various regions (both PCs and mobile devices) to periodically issue queries and record timing data, then compute a weighted average across devices and network conditions (see the latency sketch after this list).
2. Accurate reflects how relevant the results are to the user’s query and how reasonable their ranking is. Relevance is traditionally judged by human assessors (e.g., the Cranfield methodology, rank-based metrics such as MRR), but because manual evaluation is costly we supplement it with automated behavioral sub-metrics such as click-through rate, dwell time, page-turn rate, satisfied-click rate, and inverse click order, all derived from user behavior on result pages (see the relevance sketch after this list).
3. Complete reflects recall: how many of the relevant results the system actually returns. Recall is assessed in two parts: coverage (whether the system has indexed enough relevant pages) and return (whether indexed pages are actually retrieved for a query). Sampling, competitive benchmarking, and query-driven retrieval tests are used to measure both aspects (see the recall sketch after this list).
4. New reflects freshness: whether the returned content is up to date. Evaluation distinguishes breaking-news, periodic, generic, and non-time-sensitive queries and applies different algorithms to judge timeliness for each. Freshness is measured by checking whether news-type results are present and by extracting publication timestamps from pages via parsing and content analysis (see the freshness sketch after this list).
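The latency aggregation described in item 1 boils down to a weighted mean over crowdsourced timing samples. Below is a minimal sketch of that computation; the sample fields (`network_ms`, `server_ms`) and the weights are illustrative assumptions, not the production schema.

```python
# Minimal sketch: aggregating crowdsourced latency samples into a weighted
# average. Field names, weights, and the sample format are assumptions.
from dataclasses import dataclass

@dataclass
class LatencySample:
    network_ms: float   # time spent on network communication
    server_ms: float    # time spent on server-side processing
    weight: float       # weight for this device/network combination

def weighted_response_time(samples: list[LatencySample]) -> float:
    """Weighted average of total response time (network + server) across samples."""
    total_weight = sum(s.weight for s in samples)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum((s.network_ms + s.server_ms) * s.weight for s in samples)
    return weighted_sum / total_weight

# Example: PC on broadband vs. mobile on 4G, weighted by traffic share.
samples = [
    LatencySample(network_ms=40, server_ms=120, weight=0.6),
    LatencySample(network_ms=90, server_ms=130, weight=0.4),
]
print(f"{weighted_response_time(samples):.1f} ms")
```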
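For item 2, here is a minimal sketch of two of the relevance signals mentioned above: MRR over a set of labeled queries and click-through rate from result-page logs. The input formats are assumptions chosen for illustration.

```python
# Minimal sketch of two relevance signals: Mean Reciprocal Rank (MRR) over
# labeled queries, and click-through rate from result-page logs.
def mean_reciprocal_rank(first_relevant_ranks: list[int]) -> float:
    """first_relevant_ranks[i] is the 1-based rank of the first relevant
    result for query i, or 0 if no relevant result was returned."""
    if not first_relevant_ranks:
        return 0.0
    return sum(1.0 / r for r in first_relevant_ranks if r > 0) / len(first_relevant_ranks)

def click_through_rate(impressions: int, clicks: int) -> float:
    """Fraction of result impressions that received at least one click."""
    return clicks / impressions if impressions else 0.0

print(mean_reciprocal_rank([1, 3, 0, 2]))  # (1 + 1/3 + 0 + 1/2) / 4
print(click_through_rate(impressions=10_000, clicks=3_200))
```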
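For item 3, the two recall components can be expressed as simple set ratios. The URL sets below are assumed inputs; in practice they would come from the sampling, competitive benchmarking, and query-driven tests described above.

```python
# Minimal sketch of the two recall components: coverage (how many known
# relevant pages are in the index) and return (how many indexed relevant
# pages actually appear in results). The URL sets are assumed inputs.
def coverage(relevant: set[str], indexed: set[str]) -> float:
    """Share of known relevant pages that the index contains."""
    return len(relevant & indexed) / len(relevant) if relevant else 0.0

def return_rate(relevant_indexed: set[str], returned: set[str]) -> float:
    """Share of indexed relevant pages that show up in the result list."""
    return len(relevant_indexed & returned) / len(relevant_indexed) if relevant_indexed else 0.0

relevant = {"a", "b", "c", "d"}
indexed = {"a", "b", "c", "x"}
returned = {"a", "c"}
print(coverage(relevant, indexed))                # 0.75
print(return_rate(relevant & indexed, returned))  # ~0.67
```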
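For item 4, a rough sketch of a freshness check for a breaking-news query: extract a publication timestamp from page text and compare it to the query time. The regex and the one-day threshold are assumptions; production systems rely on much richer page parsing and content analysis.

```python
# Minimal sketch of a freshness check: pull a YYYY-MM-DD date out of page
# text and compare it to the query time. Pattern and threshold are assumed.
import re
from datetime import datetime, timedelta

DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def extract_timestamp(page_text: str) -> datetime | None:
    """Return the first YYYY-MM-DD date found in the page text, if any."""
    match = DATE_PATTERN.search(page_text)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    return datetime(year, month, day)

def is_fresh(page_text: str, query_time: datetime, max_age_days: int = 1) -> bool:
    """A breaking-news result counts as fresh if published within max_age_days."""
    published = extract_timestamp(page_text)
    return published is not None and query_time - published <= timedelta(days=max_age_days)

print(is_fresh("Published 2024-05-01 08:00", datetime(2024, 5, 1, 20, 0)))  # True
```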
This concludes the introduction; readers are invited to discuss further.
Follow the "Baidu Quality Department" subscription for more in‑depth articles.