Global Feature Pool Architecture and Workflow for Data‑Driven Growth

The article describes a unified global feature pool architecture that standardizes offline and real‑time feature production, management, and service layers using Hive, Spark, Flink, Kafka, MySQL, and Hologres to break data silos, improve algorithm development efficiency, and boost growth business performance.

TAL Education Technology

Apr 15, 2021

Background: User growth modeling is fragmented across business units, which creates data silos and leaves real‑time feature production capabilities inconsistent from one unit to the next. To address this, the central platform proposes a global feature pool that standardizes how feature data is produced, managed, and served, which in turn makes algorithm development more efficient.

Overall Project Architecture: The architecture consists of three layers:

Feature Production Layer: There are two sources. Offline data stored in Hive is processed with Hive or Spark to generate user, item, and interaction feature tables. Real‑time data arrives through Kafka, is processed with Flink, and is written to KV stores such as Hologres.
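As a rough illustration of what an offline user feature table contains, the sketch below aggregates raw events into one row per user in plain Python. The event schema, the feature names (`view_cnt`, `purchase_cnt`, `last_active`), and the function `build_user_features` are all hypothetical; a production job would express the same aggregation in Hive SQL or Spark.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw events: (user_id, event_type, event_date).
EVENTS = [
    ("u1", "view", date(2021, 4, 13)),
    ("u1", "purchase", date(2021, 4, 14)),
    ("u2", "view", date(2021, 4, 14)),
    ("u1", "view", date(2021, 4, 14)),
]


def build_user_features(events, as_of):
    """Aggregate events dated on or before `as_of` into one row per user."""
    rows = defaultdict(
        lambda: {"view_cnt": 0, "purchase_cnt": 0, "last_active": None}
    )
    for user_id, event_type, event_date in events:
        if event_date > as_of:
            continue  # ignore events from after the snapshot date
        row = rows[user_id]
        counter = f"{event_type}_cnt"
        if counter in row:
            row[counter] += 1
        if row["last_active"] is None or event_date > row["last_active"]:
            row["last_active"] = event_date
    return dict(rows)


user_features = build_user_features(EVENTS, as_of=date(2021, 4, 14))
```

Filtering on `as_of` keeps each snapshot reproducible: rerunning the job for an earlier date yields the features as they stood on that date.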

Feature Management Layer: Feature metadata, tags, and logs are cataloged in MySQL. Feature tables are registered with the platform and are read‑only for all users by default; any modification requires re‑registration and a notification to downstream users. The platform also handles scheduling and quality monitoring, and it takes offline features out of service when anomalies occur.
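The registration rules above can be sketched as a small in-memory registry. This is only an illustration: the class and method names, the owner-only modification rule, and the null-ratio threshold used as the anomaly check are assumptions, and a dictionary stands in for the MySQL catalog.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureTable:
    name: str
    owner: str
    version: int = 1
    status: str = "online"  # "online" or "offline"
    subscribers: list = field(default_factory=list)


class FeatureRegistry:
    """In-memory stand-in for the MySQL-backed catalog (hypothetical API)."""

    def __init__(self):
        self._tables = {}
        self.notifications = []  # (subscriber, message) pairs

    def register(self, name, owner):
        """Register a new table, or re-register an existing one after a change."""
        existing = self._tables.get(name)
        if existing is None:
            self._tables[name] = FeatureTable(name=name, owner=owner)
        else:
            # Assumption: only the owner may modify; everyone else is read-only.
            if owner != existing.owner:
                raise PermissionError(f"{name} is read-only for {owner}")
            existing.version += 1
            self._notify(existing, f"{name} re-registered as v{existing.version}")
        return self._tables[name]

    def subscribe(self, name, user):
        """Record a downstream user who should hear about changes."""
        self._tables[name].subscribers.append(user)

    def report_quality(self, name, null_ratio, max_null_ratio=0.2):
        """Take the table offline when the (assumed) quality check fails."""
        table = self._tables[name]
        if null_ratio > max_null_ratio:
            table.status = "offline"
            self._notify(table, f"{name} taken offline: null ratio {null_ratio:.0%}")
        return table.status

    def _notify(self, table, message):
        for user in table.subscribers:
            self.notifications.append((user, message))


registry = FeatureRegistry()
registry.register("user_profile_daily", owner="data_platform")
registry.subscribe("user_profile_daily", "growth_team")
registry.register("user_profile_daily", owner="data_platform")  # after a schema change
status = registry.report_quality("user_profile_daily", null_ratio=0.35)
```

Bumping a version on re-registration, rather than editing in place, gives downstream users a concrete identifier to pin against when a table changes.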

Feature Service Layer: Offline features can be accessed directly through the T‑mining platform for sample stitching and feature selection. Real‑time feature requests are served through a unified data interface.

Feature Usage Process: There are two usage modes. In offline training and prediction, the algorithm platform stitches samples to features. In real‑time prediction, model services retrieve features at request time. Most business scenarios use offline features, supported by the T‑data and T‑mining platforms.
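A minimal sketch of the two modes, using dictionaries as stand-ins for the offline feature table and the online store. The function names, the feature names, and the choice to fill missing values with `0.0` are assumptions made for illustration, not the platform's actual interface.

```python
def stitch_samples(samples, feature_table, feature_names, fill_value=0.0):
    """Offline mode: left-join labeled samples to a feature table by user_id."""
    stitched = []
    for sample in samples:
        features = feature_table.get(sample["user_id"], {})
        row = dict(sample)
        for name in feature_names:
            # Users with no feature row keep their sample and get the fill value.
            row[name] = features.get(name, fill_value)
        stitched.append(row)
    return stitched


def get_online_features(online_store, user_id, feature_names, fill_value=0.0):
    """Real-time mode: fetch one user's features at request time."""
    row = online_store.get(user_id, {})
    return {name: row.get(name, fill_value) for name in feature_names}


offline_table = {"u1": {"lessons_7d": 5.0, "orders_30d": 1.0}}
samples = [{"user_id": "u1", "label": 1}, {"user_id": "u9", "label": 0}]
training_rows = stitch_samples(samples, offline_table, ["lessons_7d", "orders_30d"])

online_row = get_online_features({"u1": {"clicks_5m": 3.0}}, "u1", ["clicks_5m"])
```

The join is a left join on purpose: dropping samples that lack features would silently change the label distribution of the training set.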

Workflow Template: The process has Input, Transform, and Output stages, built from four components. Label data preprocessing uses Hive or Spark to extract consistency, custom, or historical samples. Feature sample association joins new experimental features to those samples. Feature engineering covers statistical analysis, feature selection, feature importance, and data export to Hive, HDFS, or OSS. Data output is the final component.
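One way to picture the template is as a loader, a list of transform steps, and a writer chained together. The sketch below is a plain-Python illustration under that assumption; the stage functions, the sample data, and the constant-column filter used as a stand-in for feature selection are all hypothetical.

```python
def run_workflow(load, transforms, write):
    """Run Input -> Transform -> Output and return the final rows."""
    rows = load()
    for transform in transforms:
        rows = transform(rows)
    write(rows)
    return rows


def load_labels():
    # Input: stands in for label samples extracted with Hive or Spark.
    return [{"user_id": "u1", "label": 1}, {"user_id": "u2", "label": 0}]


# Hypothetical experimental features to associate with the samples.
EXPERIMENTAL_FEATURES = {
    "u1": {"lessons_7d": 5, "region_code": 1},
    "u2": {"lessons_7d": 2, "region_code": 1},
}


def associate_features(rows):
    # Transform: join experimental features onto each sample.
    return [{**row, **EXPERIMENTAL_FEATURES.get(row["user_id"], {})} for row in rows]


def drop_constant_features(rows):
    # Transform: a very simple selection rule that drops columns with one value.
    if not rows:
        return rows
    keep = [
        key
        for key in rows[0]
        if key in ("user_id", "label") or len({row.get(key) for row in rows}) > 1
    ]
    return [{key: row.get(key) for key in keep} for row in rows]


exported = []  # Output: stands in for a Hive, HDFS, or OSS sink.
result = run_workflow(
    load_labels, [associate_features, drop_constant_features], exported.extend
)
```

Because each step takes rows and returns rows, steps can be reordered or swapped without touching the runner.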

Current Feature Situation: More than 1,000 offline user‑level feature dimensions have been built from business data warehouses, the CDP, event logs, chat text, and profile tags, and they are refreshed daily on a T+1 schedule. Real‑time feature tables are under joint development.
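T+1 means that a job running on a given day processes the previous day's data. A minimal sketch of that date arithmetic follows; the function name and the `yyyymmdd` partition format are assumptions, not details from the platform.

```python
from datetime import date, timedelta


def t_plus_1_partition(run_date: date) -> str:
    """Return the data partition a T+1 job processes: the day before it runs."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")


partition = t_plus_1_partition(date(2021, 4, 15))
```

Here `partition` is `"20210414"`, so features computed on April 15 describe user behavior through April 14.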

Results: Models built on the global feature pool improved conversion rates over manual strategies, with gains of more than 15% in recommendation projects and more than 27% in banner ad projects.

Future Outlook: The platform is still at an early stage. Planned improvements include a better feature selection UI, clearer views of feature attributes and distributions, visualization tools, and expanded functionality for registration, service, real‑time production, and automatic feature selection. The central data warehouse will continue to add dimensions, and business units will be able to contribute their own features for shared reuse.

