Lightweight Algorithm Service Architecture Based on Offline Tag Knowledge Base and Real‑time Data Warehouse
This article presents a lightweight algorithm‑service solution that pairs an offline, pre‑computed tag knowledge base with a real‑time data warehouse. Built on Flink, Doris, Hive SQL, and Python, the architecture delivers short development cycles, agile iteration, low cost, and scalable deployment for classification and clustering tasks.
Background – Spark Thinking is an online education company that, after China's "double reduction" policy, needed fast, cost‑effective algorithm services for user segmentation and targeting. Traditional online model deployment was too costly and slow, leaving a gap between business demand and engineering supply.
Technical Framework – The solution adopts a dual‑module design: an offline module that pre‑computes model predictions into a hash‑based tag knowledge base stored in Doris, and a real‑time module in which Flink assembles each user's current tags and looks up the pre‑computed prediction in the knowledge base. This decouples model training from online inference, cuts resource consumption, and shortens deployment time.
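The offline/online split can be sketched as follows. This is a minimal, hypothetical illustration (the function names, toy model, and in‑memory dict standing in for Doris are all assumptions, not the original system's code): offline, every discretized feature combination is enumerated and scored once; online, serving reduces to a hash‑key lookup.

```python
import hashlib
from itertools import product

# Hypothetical sketch of the dual-module design; names and data are
# illustrative only. A plain dict stands in for the Doris-backed store.

def combo_key(tags: dict) -> str:
    """Deterministically hash a discretized tag combination into a lookup key."""
    canonical = "|".join(f"{k}={tags[k]}" for k in sorted(tags))
    return hashlib.md5(canonical.encode()).hexdigest()

def build_knowledge_base(model, feature_bins: dict) -> dict:
    """Offline: enumerate every bin combination and pre-compute the model score."""
    kb = {}
    names = sorted(feature_bins)
    for combo in product(*(feature_bins[n] for n in names)):
        tags = dict(zip(names, combo))
        kb[combo_key(tags)] = model(tags)  # the model never runs online
    return kb

def serve(kb: dict, user_tags: dict, default=0.0) -> float:
    """Online: a key lookup replaces live inference."""
    return kb.get(combo_key(user_tags), default)
```

Because the online path is a pure lookup, the serving layer needs no model runtime at all, which is what makes the deployment lightweight.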
Key Technical Nodes
Feature Discretization – Continuous features are binned (e.g., equal‑frequency, equal‑width, clustering, decision‑tree) and the binning logic is applied consistently in both offline Hive SQL and real‑time Doris.
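As a sketch of the consistency requirement, the cut points learned offline can be treated as the single source of truth and the same assignment logic reused on both paths. The helper names below are hypothetical; this shows equal‑frequency binning only, one of the methods listed above.

```python
from bisect import bisect_right
import statistics

# Illustrative equal-frequency binning. The edges fitted offline would be
# exported to both the Hive SQL pipeline and the real-time lookup path,
# so a raw value always lands in the same bin in both systems.

def fit_equal_frequency_edges(values, n_bins=4):
    """Learn interior cut points (quantile boundaries) from offline data."""
    return statistics.quantiles(values, n=n_bins)  # n_bins - 1 edges

def assign_bin(value, edges):
    """Map a raw value to its bin index; identical logic offline and online."""
    return bisect_right(edges, value)
```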
Label Selection – Low‑value features are removed via RFE or SHAP; rare labels are merged into an "other" category. An automated workflow periodically updates label selections based on sample size and contribution.
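The rare‑label merge step can be sketched in a few lines (the threshold value and helper name are assumptions for illustration, not taken from the talk):

```python
from collections import Counter

# Hypothetical helper: labels whose sample share falls below a threshold
# are folded into a single "other" bucket before the hash table is built,
# which keeps rare combinations from inflating the key space.

def merge_rare_labels(labels, min_share=0.05, other="other"):
    counts = Counter(labels)
    total = len(labels)
    keep = {lab for lab, c in counts.items() if c / total >= min_share}
    return [lab if lab in keep else other for lab in labels]
```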
Hierarchical "Printing" of Hash Tables – The model is split into layers (5‑6 features per layer). Each layer trains a sub‑model, generates a sub‑hash table, and passes its predictions to the next layer. This keeps each layer's hash table small instead of enumerating every combination of all features at once, while preserving most of the predictive power.
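The size benefit of layering can be sketched as follows (a toy illustration under assumed names: each layer enumerates only its own feature bins plus the previous layer's discretized score, so table size grows per layer rather than multiplicatively across all features):

```python
from itertools import product

# Illustrative "printing" of one layer's hash table. With two layers of two
# binary features each, the tables hold 4 and |prev_scores| * 4 entries,
# versus 2**4 = 16 for a single table over all four features at once.

def print_layer(sub_model, feature_bins, prev_scores=(None,)):
    """Materialize one layer's table: (prev_score, *feature_combo) -> score."""
    table = {}
    names = sorted(feature_bins)
    for prev in prev_scores:
        for combo in product(*(feature_bins[n] for n in names)):
            key = (prev,) + combo
            table[key] = sub_model(prev, dict(zip(names, combo)))
    return table
```

The distinct scores of layer N, once discretized, become the `prev_scores` input when printing layer N+1, which is how predictions are "passed to the next layer" without re‑enumerating earlier features.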
Historical "Dictionary" vs. Hash "Dictionary" – For scenarios where most feature combinations have appeared historically, a compact historical dictionary replaces the exhaustive hash table, reducing storage and simplifying deployment, though at the risk of missing unseen combinations.
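A minimal sketch of the historical‑dictionary variant, with hypothetical names: only combinations that actually occurred in history are scored, and an unseen combination falls through to a default, which is exactly the risk noted above.

```python
# Hypothetical sketch: score only feature combinations seen historically,
# trading exhaustive coverage for a much smaller dictionary.

def build_historical_dict(history, model):
    """Offline: score each unique tag combination observed in historical data."""
    seen = {tuple(sorted(tags.items())) for tags in history}
    return {key: model(dict(key)) for key in seen}

def lookup(hist_dict, user_tags, default=None):
    """Online: unseen combinations miss the dictionary and get the default."""
    return hist_dict.get(tuple(sorted(user_tags.items())), default)
```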
Implementation Case – Using this architecture, Spark Thinking built several algorithm services for new‑customer acquisition, achieving AUC scores of 0.7‑0.8, annualized GMV gains close to ten million RMB, and a total cost of about 600,000 RMB with ROI >10. The system demonstrated high reliability (fewer than five bugs per year) and low maintenance overhead.
Summary and Outlook – The offline‑precompute approach lowers deployment barriers, improves iteration agility, and ensures long‑term model effectiveness through periodic retraining. While the hash‑based solution handles 20‑30 features and the historical dictionary up to ~40, scalability remains limited for very large feature sets, positioning the solution as a practical, cost‑effective option for medium‑size enterprises and short‑term operational strategies.
DataFunSummit