How We Built a Scalable Offline‑Online Sequence Modeling System for Community Search
This article details the design of a community-search pipeline that leverages long-term user interaction sequences for CTR/CVR prediction. It describes the global, online, and offline architectures, enumerates the major performance and consistency challenges encountered, and presents the practical optimizations and future directions adopted to achieve reliable, high-throughput sequence modeling.
Project Background
In community scenarios we have accumulated rich user interaction data, which is crucial for CTR (click-through rate) and CVR (conversion rate) prediction. Longer interaction sequences capture richer preference signals, but they also introduce significant technical challenges.
Architecture Design
Global Architecture
The overall pipeline consists of three layers: a global framework, an online flow, and an offline flow. The global diagram is shown below.
Online Architecture
The online flow builds a real-time user portrait of up to 10k interactions from both full-table (batch) and streaming data, passes the portrait to the SIM engine for hard and soft search, extracts top-k features, and stores the final sequence for the ranking stage.
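The hard/soft search split can be sketched as a two-stage retrieval: a hard search filters the portrait down to behaviors that match the target item's category, and a soft search ranks the survivors by embedding similarity to produce the top-k. This is a minimal illustration, not the production SIM engine; the function names, the 8-dim embeddings, and the dict fields are all hypothetical.

```python
import random

def hard_search(behaviors, target_category, limit=1000):
    """Hard search: keep behaviors whose category matches the target item."""
    return [b for b in behaviors if b["category"] == target_category][:limit]

def dot(u, v):
    """Plain dot product standing in for a real similarity kernel."""
    return sum(a * b for a, b in zip(u, v))

def soft_search(behaviors, target_emb, k=100):
    """Soft search: rank hard-matched behaviors by embedding similarity."""
    return sorted(behaviors, key=lambda b: dot(b["emb"], target_emb), reverse=True)[:k]

# Two-stage retrieval over a synthetic ~10k-event real-time portrait
random.seed(42)
portrait = [{"category": i % 5, "emb": [random.random() for _ in range(8)]}
            for i in range(10_000)]
target_emb = [random.random() for _ in range(8)]
candidates = hard_search(portrait, target_category=3)   # 10k -> at most 1k
top100 = soft_search(candidates, target_emb, k=100)     # -> top-100 for ranking
```

The key property this mirrors is that the cheap categorical filter runs first, so the expensive similarity scoring only touches a small candidate set.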
Offline Architecture
The offline flow simulates the online processing logic: it combines request-PV tables with offline warehouse data (raw sequences of up to roughly 100k events) and funnels each sequence through 10k → 1k → 100 stages so that the output mirrors the online result.
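The 100k → 10k → 1k → 100 funnel can be sketched as three truncation stages; the stage sizes come from the article, while the selection logic (recency for the portrait, category match for hard search, a pluggable score for soft search) is an assumed simplification of the real pipeline.

```python
def offline_funnel(raw_seq, target_category, score_fn):
    """Mirror the online funnel offline: ~100k raw events -> 10k -> 1k -> 100."""
    # Stage 1: keep the most recent 10k events as the simulated user portrait
    portrait = sorted(raw_seq, key=lambda e: e["ts"])[-10_000:]
    # Stage 2: hard search -- at most 1k events whose category matches the target
    matched = [e for e in portrait if e["category"] == target_category][:1_000]
    # Stage 3: soft search -- score and keep the top-100 for the ranking model
    return sorted(matched, key=score_fn, reverse=True)[:100]

# Synthetic example: 100k events, 7 categories, recency as the relevance score
raw = [{"ts": t, "category": t % 7} for t in range(100_000)]
top100 = offline_funnel(raw, target_category=2, score_fn=lambda e: e["ts"])
```

Running the same deterministic funnel offline is what makes the later online/offline diff comparison meaningful: any divergence must come from data or ordering, not from the funnel logic itself.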
Problems and Challenges
Offline tasks often run 3-8 hours and frequently fail with out-of-memory (OOM) errors caused by long-tail users with extremely long sequences.
Consistency verification across the online and offline pipelines surfaces more than 25 distinct diff types, making root-cause analysis difficult.
Initial business-scenario analysis was insufficient, so hidden issues only surfaced during full-link diff checks.
Repair cycles can take 1‑2 weeks, delaying project progress.
Platform infrastructure limitations such as single‑field sorting, limited timestamp filtering, and heavy index‑building workloads (up to 3 TB) hinder performance.
From Pitfalls to Solutions
We addressed the above challenges with several concrete measures:
Trimmed long-tail sequences by pre-processing raw 100k-event sequences into shorter, month-based windows, reducing data skew and OOM risk.
Adjusted the CPU-to-memory ratio to 1:4 and tuned split size and parallelism (xxx.split.size, xxx.num) to improve task throughput.
Minimized custom UDF usage, preferring built‑in functions to avoid costly serialization overhead.
Built a diff‑attribution classification system (e.g., sorting instability, feature leakage) and introduced repeat‑rate statistics to prioritize high‑impact fixes.
Implemented automated diff‑rate clustering to locate common root causes quickly.
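The month-based trimming step can be sketched as follows: bucket each user's events by calendar month, keep only the newest few months, and apply a hard length cap as a safety net. The function name, the three-month default, and the event schema are illustrative assumptions, not the production job.

```python
from collections import defaultdict
from datetime import datetime

def trim_long_tail(events, months_kept=3, max_len=10_000):
    """Pre-trim a raw long-tail sequence into recent month-based windows.

    A handful of users have extremely long sequences that skew offline
    jobs and trigger OOMs; keeping only the last few months bounds every
    user's sequence length before the heavy processing starts.
    """
    buckets = defaultdict(list)
    for e in events:
        month = datetime.fromtimestamp(e["ts"]).strftime("%Y-%m")
        buckets[month].append(e)
    recent_months = sorted(buckets)[-months_kept:]     # newest month keys
    trimmed = [e for m in recent_months for e in buckets[m]]
    trimmed.sort(key=lambda e: e["ts"])
    return trimmed[-max_len:]                          # hard length cap

# Six months of synthetic events, 28 per month -> keep only the last three
events = [{"ts": datetime(2024, m, d + 1).timestamp()}
          for m in range(1, 7) for d in range(28)]
recent = trim_long_tail(events, months_kept=3)
```

Because trimming happens before the expensive join and index-building stages, it attacks the skew at its source rather than asking the compute layer to absorb it.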
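The repeat-rate idea behind the diff-attribution system can be illustrated with a small ranking helper: count each attributed diff type across a comparison run and sort by its share of the total, so the highest-impact classes get fixed first. The type names and numbers below are a toy sample, not real diff data.

```python
from collections import Counter

def diff_repeat_rates(diff_records):
    """Rank attributed diff types by repeat rate (share of all diffs).

    Fixing the highest-share classes first (e.g. sorting instability,
    feature leakage) clears the bulk of full-link inconsistencies quickly.
    """
    counts = Counter(d["type"] for d in diff_records)
    total = sum(counts.values())
    return [(t, c, c / total) for t, c in counts.most_common()]

# Toy sample of attributed diffs from one full-link comparison run
diffs = ([{"type": "sorting_instability"}] * 60
         + [{"type": "feature_leakage"}] * 30
         + [{"type": "timestamp_boundary"}] * 10)
ranked = diff_repeat_rates(diffs)
```

With 25+ diff types in play, this kind of Pareto ordering is what turns an undifferentiated wall of mismatches into a short, prioritized fix list.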
Outlook and Summary
Future work will focus on unified offline-online data consistency (sharing the same raw portrait), tighter integration of the SIM engine, and tool upgrades for automated boundary-based validation. Further infrastructure enhancements aim to reduce index-building time and strengthen feature-platform capabilities.
DeWu Technology