Design and Evolution of Incremental Indexing for Advertising Retrieval Systems
The article describes how an advertising retrieval system evolved from serial full builds, to parallel full builds, and finally to a hybrid full-plus-incremental indexing approach. By recording direct entity relationships during assembly and storing them as inverted indexes, the system can quickly reverse-look-up which units a change affects, reducing database load, latency, and rebuild overhead.
This article discusses the design and evolution of incremental (real‑time) indexing in an advertising retrieval system, focusing on how to construct a complete, reliable, and maintainable ad update data flow.
Business background: Online advertising relies on an ad retrieval system in which an "index building service" creates material indexes for units (ads) and a search engine loads them for real-time queries. The data model is complex, spanning more than 100 relational tables (unit, account, plan, creative, video, up (uploader), etc.), so database load and latency grow as the system scales.
Solution evolution: Three stages are described: 1) full-serial, where all data is queried and built serially; 2) batch-parallel, where unit IDs are split into batches and processed in parallel while respecting data dependencies; 3) incremental + full, where periodic full builds are supplemented by incremental builds for changed units.
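The batch-parallel stage can be sketched as below: unit IDs are partitioned into fixed-size batches and each batch is built on a worker thread. This is a minimal illustration, not the original implementation; the batch size, pool size, and the buildBatch placeholder are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the batch-parallel build: split the full unit ID list into
// batches and build each batch on its own worker thread. Each batch still
// respects the unit -> creative -> video -> up dependency order internally.
public class BatchParallelBuilder {

    static final int BATCH_SIZE = 3; // illustrative; the real size is tuned

    // Split the full unit ID list into batches of at most BATCH_SIZE.
    static List<List<Long>> partition(List<Long> unitIds) {
        List<List<Long>> batches = new ArrayList<>();
        for (int i = 0; i < unitIds.size(); i += BATCH_SIZE) {
            batches.add(unitIds.subList(i, Math.min(i + BATCH_SIZE, unitIds.size())));
        }
        return batches;
    }

    // Build all batches in parallel; wait for every batch before the
    // full index can be published. Returns the number of units built.
    static int buildAll(List<Long> unitIds) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<Long> batch : partition(unitIds)) {
                futures.add(pool.submit(() -> buildBatch(batch)));
            }
            int built = 0;
            for (Future<Integer> f : futures) {
                built += f.get();
            }
            return built;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Placeholder for the per-batch assembly (queries and joins across
    // the ~100 relational tables); here it just reports the batch size.
    static int buildBatch(List<Long> batch) {
        return batch.size();
    }
}
```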
Incremental index: Two types of ad material indexes are defined: a full ad index (periodic full build) and an incremental ad index (built only for units whose data changed). The article shows workflow diagrams for both.
Challenge – reverse lookup of changed unit IDs: Determining which unit IDs are affected when a dependent entity (e.g., up, video) is updated is difficult due to missing indexes, complex field types, and multi-hop relationships.
Proposed solution – record direct relationships: By decorating data-access services during the build process, the system automatically records the association between a unit ID and every other entity ID it accesses (account_id, plan_id, video_id, up_mid). Example interface definitions:
interface UnitBaseService {
// Get basic unit info (needs further enrichment)
UnitBase getUnitById(long unitId);
}
interface CreativeService {
List<Creative> getCreativeByUnitId(long unitId);
}
interface VideoService {
Map<Long, Video> getVideoByVideoIds(Iterable<Long> videoIds);
}
interface UpService {
Map<Long, Up> getUpByUpMids(Iterable<Long> upMids);
}The assembly process uses these services:
class Assembler {
UnitBaseService unitBaseService;
CreativeService creativeService;
VideoService videoService;
UpService upService;
Unit assembleUnit(long unitId) {
UnitBase unitBase = unitBaseService.getUnitById(unitId);
List<Creative> creativeList = creativeService.getCreativeByUnitId(unitId);
// extract videoIds from creativeList
Iterable<Long> videoIds = extractVideoIds(creativeList);
Map<Long, Video> videoMap = videoService.getVideoByVideoIds(videoIds);
// extract upMids from videoMap
Iterable<Long> upMids = extractUpMids(videoMap);
Map<Long, Up> upMap = upService.getUpByUpMids(upMids);
// assemble all data into a usable ad unit
Unit unit = doAssemble(unitBase, creativeList, videoMap, upMap);
return unit;
}
}
After building, the recorded direct relations are stored as an inverted index (e.g., up_mid → [unit1, unit2]), enabling fast reverse lookup when an associated entity changes.
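The recording-and-inversion step can be sketched as follows. RelationRecorder and its method names are illustrative assumptions: the decorated services would call record() with the unit currently being assembled, and the accumulated pairs are then inverted into entityId → [unitIds].

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of relation recording during assembly: every (unitId, entityId)
// pair served by a decorated data-access call is noted, then inverted into
// the entityId -> [unitIds] table used for reverse lookup.
public class RelationRecorder {

    // Forward relations observed during the build: unitId -> entity IDs.
    private final Map<Long, Set<Long>> forward = new HashMap<>();

    // Called by decorated services (e.g. inside getVideoByVideoIds) with
    // the unit being assembled and the entity IDs it just fetched.
    public void record(long unitId, Collection<Long> entityIds) {
        forward.computeIfAbsent(unitId, k -> new HashSet<>()).addAll(entityIds);
    }

    // Invert to entityId -> unit IDs, e.g. up_mid -> [unit1, unit2].
    public Map<Long, Set<Long>> invert() {
        Map<Long, Set<Long>> inverted = new HashMap<>();
        forward.forEach((unitId, entityIds) ->
            entityIds.forEach(e ->
                inverted.computeIfAbsent(e, k -> new TreeSet<>()).add(unitId)));
        return inverted;
    }
}
```

In practice one recorder table would exist per entity type (video, up, etc.), and the inverted tables would be persisted alongside the index rather than held in memory.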
Change detection mechanisms:
Binlog trigger: listens to MySQL binlog, extracts changed entity IDs, and uses the inverted index to find affected unit IDs. Provides high timeliness and old‑value information.
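A minimal sketch of resolving one binlog event through the inverted table, assuming a parsed event carries the table name, primary key, and old column values; the BinlogEvent shape is an assumption, not the article's actual event format.

```java
import java.util.Map;
import java.util.Set;

// Sketch of the binlog-driven trigger: a row-change event names a table and
// an entity ID; the inverted relation table maps that entity back to the
// unit IDs that must be rebuilt.
public class BinlogTrigger {

    // Illustrative parsed binlog event: table, primary key, old values.
    public record BinlogEvent(String table, long entityId, Map<String, String> oldValues) {}

    // Inverted relation tables keyed by entity type,
    // e.g. "video" -> {videoId -> unit IDs}.
    private final Map<String, Map<Long, Set<Long>>> inverted;

    public BinlogTrigger(Map<String, Map<Long, Set<Long>>> inverted) {
        this.inverted = inverted;
    }

    // Resolve one event to the set of unit IDs needing an incremental build.
    public Set<Long> affectedUnits(BinlogEvent event) {
        return inverted.getOrDefault(event.table(), Map.of())
                       .getOrDefault(event.entityId(), Set.of());
    }
}
```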
Recent‑scan trigger: relies on a mandatory mtime column to detect recent modifications. Simpler but less timely and cannot detect hard deletes.
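The recent-scan trigger amounts to a periodic `WHERE mtime > ?` query against each table. The sketch below simulates that with in-memory rows; note that hard-deleted rows simply never appear in the result, which is the limitation mentioned above.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the recent-scan trigger: entities carry a mandatory mtime
// column, and a periodic scan picks up everything modified since the last
// watermark. The in-memory rows stand in for the database query.
public class RecentScanTrigger {

    public record Row(long entityId, Instant mtime) {}

    // Return IDs of rows modified strictly after the watermark.
    static List<Long> scan(List<Row> rows, Instant watermark) {
        return rows.stream()
                   .filter(r -> r.mtime().isAfter(watermark))
                   .map(Row::entityId)
                   .collect(Collectors.toList());
    }
}
```

After each scan the watermark would advance to the scan's start time, so a slow write committed with an older mtime is the main correctness hazard to guard against.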
Frequency reduction (de‑duplication) strategies are introduced to avoid unnecessary rebuilds, including field‑level filtering, hierarchical throttling, state‑based filtering, and low‑priority batching.
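Two of the strategies above, field-level filtering and throttling, can be sketched together. The column names and the throttle window below are illustrative assumptions, not values from the original system.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of two frequency-reduction strategies: field-level filtering
// (ignore changes to columns the index never reads) and a simplified
// per-unit throttle (at most one rebuild per time window).
public class RebuildFilter {

    // Columns that actually feed the index; changes elsewhere are ignored.
    static final Set<String> INDEXED_COLUMNS = Set.of("title", "status", "bid");

    static final long WINDOW_MILLIS = 5_000; // illustrative throttle window
    private final Map<Long, Long> lastRebuildAt = new HashMap<>();

    // Field-level filtering: rebuild only if an indexed column changed.
    static boolean fieldsRelevant(Set<String> changedColumns) {
        return changedColumns.stream().anyMatch(INDEXED_COLUMNS::contains);
    }

    // Throttling (simplified): drop rebuild requests arriving within
    // WINDOW_MILLIS of the previous accepted one for the same unit.
    boolean shouldRebuild(long unitId, Set<String> changedColumns, long nowMillis) {
        if (!fieldsRelevant(changedColumns)) return false;
        Long last = lastRebuildAt.get(unitId);
        if (last != null && nowMillis - last < WINDOW_MILLIS) return false;
        lastRebuildAt.put(unitId, nowMillis);
        return true;
    }
}
```

A dropped rebuild would typically be re-queued at low priority rather than discarded outright, matching the low-priority batching strategy mentioned above.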
Integrated workflow: The article presents a combined pipeline in which the inverted relationship table is built during full indexing, updated during incremental indexing, and consulted by both binlog and scan triggers to drive selective unit rebuilds.
Future outlook: Incremental construction dramatically reduces database load, network traffic, and latency, but challenges remain, such as simplifying the consumption of mixed full-plus-incremental material and further automating relationship capture.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
