How We Built a Scalable Offline‑Online Sequence Modeling System for Community Search
This article details the design of a community-search pipeline that leverages long-term user interaction sequences for CTR/CVR prediction. It describes the global, online, and offline architectures, enumerates the major performance and consistency challenges encountered, and presents the practical optimizations and future directions adopted to achieve reliable, high-throughput sequence modeling.
Project Background
In community scenarios we have accumulated rich user interaction data, which is crucial for CTR (click-through rate) and CVR (conversion rate) prediction. Longer interaction sequences capture richer preference signals, but they also introduce significant technical challenges.
Architecture Design
Global Architecture
The overall pipeline consists of three layers: a global framework, an online flow, and an offline flow. The global diagram is shown below.
Online Architecture
The online flow builds a real-time user portrait of up to 10k interactions from both full-table (batch) and streaming data, passes the portrait to the SIM engine for hard and soft search, extracts top-k features, and stores the final sequence for the ranking stage.
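The hard/soft search split can be sketched as a two-stage retrieval: a hard search filters the portrait down to behaviors that match the target item's category, and a soft search ranks the survivors by embedding similarity to produce the top-k. This is a minimal illustration, not the production SIM engine; the function names, the 8-dim embeddings, and the dict fields are all hypothetical.

```python
import random

def hard_search(behaviors, target_category, limit=1000):
    """Hard search: keep behaviors whose category matches the target item."""
    return [b for b in behaviors if b["category"] == target_category][:limit]

def dot(u, v):
    """Plain dot product standing in for a real similarity kernel."""
    return sum(a * b for a, b in zip(u, v))

def soft_search(behaviors, target_emb, k=100):
    """Soft search: rank hard-matched behaviors by embedding similarity."""
    return sorted(behaviors, key=lambda b: dot(b["emb"], target_emb), reverse=True)[:k]

# Two-stage retrieval over a synthetic ~10k-event real-time portrait
random.seed(42)
portrait = [{"category": i % 5, "emb": [random.random() for _ in range(8)]}
            for i in range(10_000)]
target_emb = [random.random() for _ in range(8)]
candidates = hard_search(portrait, target_category=3)   # 10k -> at most 1k
top100 = soft_search(candidates, target_emb, k=100)     # -> top-100 for ranking
```

The key property this mirrors is that the cheap categorical filter runs first, so the expensive similarity scoring only touches a small candidate set.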
Offline Architecture
The offline flow simulates the online processing logic: it combines request-PV tables with offline warehouse data (raw sequences of up to roughly 100k events) and funnels each sequence through 10k → 1k → 100 stages so that the output mirrors the online result.
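The 100k → 10k → 1k → 100 funnel can be sketched as three truncation stages; the stage sizes come from the article, while the selection logic (recency for the portrait, category match for hard search, a pluggable score for soft search) is an assumed simplification of the real pipeline.

```python
def offline_funnel(raw_seq, target_category, score_fn):
    """Mirror the online funnel offline: ~100k raw events -> 10k -> 1k -> 100."""
    # Stage 1: keep the most recent 10k events as the simulated user portrait
    portrait = sorted(raw_seq, key=lambda e: e["ts"])[-10_000:]
    # Stage 2: hard search -- at most 1k events whose category matches the target
    matched = [e for e in portrait if e["category"] == target_category][:1_000]
    # Stage 3: soft search -- score and keep the top-100 for the ranking model
    return sorted(matched, key=score_fn, reverse=True)[:100]

# Synthetic example: 100k events, 7 categories, recency as the relevance score
raw = [{"ts": t, "category": t % 7} for t in range(100_000)]
top100 = offline_funnel(raw, target_category=2, score_fn=lambda e: e["ts"])
```

Running the same deterministic funnel offline is what makes the later online/offline diff comparison meaningful: any divergence must come from data or ordering, not from the funnel logic itself.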
Problems and Challenges
Offline tasks often run 3-8 hours and frequently fail with out-of-memory (OOM) errors caused by long-tail users with extremely long sequences.
Consistency verification across the online and offline pipelines surfaces more than 25 distinct diff types, making root-cause analysis difficult.
Initial business-scenario analysis was insufficient, so hidden issues only surfaced during full-link diff checks.
Repair cycles can take 1‑2 weeks, delaying project progress.
Platform infrastructure limitations such as single‑field sorting, limited timestamp filtering, and heavy index‑building workloads (up to 3 TB) hinder performance.
From Pitfalls to Solutions
We addressed the above challenges with several concrete measures:
Trimmed long-tail sequences by pre-processing raw 100k-event sequences into shorter, month-based windows, reducing data skew and OOM risk.
Adjusted the CPU-to-memory ratio to 1:4 and tuned split size and parallelism (xxx.split.size, xxx.num) to improve task throughput.
Minimized custom UDF usage, preferring built‑in functions to avoid costly serialization overhead.
Built a diff‑attribution classification system (e.g., sorting instability, feature leakage) and introduced repeat‑rate statistics to prioritize high‑impact fixes.
Implemented automated diff‑rate clustering to locate common root causes quickly.
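The month-based trimming step can be sketched as follows: bucket each user's events by calendar month, keep only the newest few months, and apply a hard length cap as a safety net. The function name, the three-month default, and the event schema are illustrative assumptions, not the production job.

```python
from collections import defaultdict
from datetime import datetime

def trim_long_tail(events, months_kept=3, max_len=10_000):
    """Pre-trim a raw long-tail sequence into recent month-based windows.

    A handful of users have extremely long sequences that skew offline
    jobs and trigger OOMs; keeping only the last few months bounds every
    user's sequence length before the heavy processing starts.
    """
    buckets = defaultdict(list)
    for e in events:
        month = datetime.fromtimestamp(e["ts"]).strftime("%Y-%m")
        buckets[month].append(e)
    recent_months = sorted(buckets)[-months_kept:]     # newest month keys
    trimmed = [e for m in recent_months for e in buckets[m]]
    trimmed.sort(key=lambda e: e["ts"])
    return trimmed[-max_len:]                          # hard length cap

# Six months of synthetic events, 28 per month -> keep only the last three
events = [{"ts": datetime(2024, m, d + 1).timestamp()}
          for m in range(1, 7) for d in range(28)]
recent = trim_long_tail(events, months_kept=3)
```

Because trimming happens before the expensive join and index-building stages, it attacks the skew at its source rather than asking the compute layer to absorb it.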
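The repeat-rate idea behind the diff-attribution system can be illustrated with a small ranking helper: count each attributed diff type across a comparison run and sort by its share of the total, so the highest-impact classes get fixed first. The type names and numbers below are a toy sample, not real diff data.

```python
from collections import Counter

def diff_repeat_rates(diff_records):
    """Rank attributed diff types by repeat rate (share of all diffs).

    Fixing the highest-share classes first (e.g. sorting instability,
    feature leakage) clears the bulk of full-link inconsistencies quickly.
    """
    counts = Counter(d["type"] for d in diff_records)
    total = sum(counts.values())
    return [(t, c, c / total) for t, c in counts.most_common()]

# Toy sample of attributed diffs from one full-link comparison run
diffs = ([{"type": "sorting_instability"}] * 60
         + [{"type": "feature_leakage"}] * 30
         + [{"type": "timestamp_boundary"}] * 10)
ranked = diff_repeat_rates(diffs)
```

With 25+ diff types in play, this kind of Pareto ordering is what turns an undifferentiated wall of mismatches into a short, prioritized fix list.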
Outlook and Summary
Future work will focus on unified offline-online data consistency (sharing the same raw portrait), tighter integration of the SIM engine, and tool upgrades for automated boundary-based validation. Further infrastructure enhancements aim to reduce index-building time and strengthen feature-platform capabilities.
DeWu Technology