
Revolutionizing Feature Engineering with Distributed Tech & Configurable Services

Facing PB‑scale user behavior data and millions of feature dimensions, the platform transformed its search, advertising, and recommendation pipelines by adopting a distributed, configurable‑service architecture that delivers high‑throughput streaming, elastic storage, rapid feature iteration, and robust fault‑tolerance for AI‑driven personalization.

Instant Consumer Technology Team

Introduction

In today’s digital wave, search, advertising, and recommendation (the "three pillars") form the core engine of the internet ecosystem. Feature engineering, the bridge between data and models, directly determines the upper bound of algorithm performance.

Background

The platform, a fintech pioneer, accumulated petabyte‑scale user‑behavior data and millions of feature dimensions. Three major bottlenecks emerged:

Data throughput: The shared Doris cluster topped out at roughly 30,000 records per second, far below what PB-scale streaming requires.

Iteration efficiency: Feature development cycles stretched to days, lagging behind rapid business changes.

Service stability: The centralized architecture suffered single-point failures, with a mean time to recovery (MTTR) above 10 minutes.

To break these limits, a new distributed‑plus‑configurable architecture was required.

Architecture Evolution

Before optimization, the data flow was Hive → Doris → Elasticsearch, running on a handful of servers with limited parallelism and tight coupling to the search, ads, and recommendation platform.

After optimization, the pipeline became Hive → HBase, with elastic compute nodes, PB-per-day throughput, real-time Top-N feature computation, and task orchestration moved to a DataOps platform, decoupling platform releases from feature jobs.

Key improvements:

Distributed computing: Flink provides stateful stream processing with millisecond latency; HBase offers high‑performance random reads/writes.

Distributed storage: HBase stores massive features, while Redis handles low‑latency lookups.

Configurable service middle-platform: Enables data lineage queries, automatic feature-field mapping, and one-click task generation, cutting feature development cycles to hours.

Resilience: Automatic retries, DataOps-based distributed scheduling, and HBase multi-active disaster recovery raise system availability from 90% to 99%; a minimal checkpoint-and-restart sketch follows below.
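To make the resilience mechanics concrete, here is a minimal sketch of the kind of Flink fault-tolerance setup this implies: periodic checkpoints plus a fixed-delay restart strategy. The checkpoint interval, retry count, and job name are illustrative assumptions, not the platform's actual settings.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ResilientJobSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state every 60 s with exactly-once semantics so a
        // failed job resumes from the last checkpoint (interval is an assumption).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Automatic retry: restart up to 3 times, 10 s apart, before the
        // failure escalates to the external scheduler (values are assumptions).
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        // Placeholder pipeline so the sketch runs; the real job graph goes here.
        env.fromElements(1, 2, 3).print();
        env.execute("resilient-feature-job");
    }
}
```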

Core Design

The "distributed feature + configurable service" system is built for low latency and high throughput. It consists of a layered data model:

ODS (source layer): Raw logs and business DB data with light cleaning.

DWD (detail layer): Cleaned, joined, and dimension-reduced data.

APP (data-mart layer): Aggregated wide tables for long- and short-term user behavior.

ADS (application layer): Highly aggregated metrics for search, ads, and recommendation, powered by Flink + HBase.

DIM (dimension layer): Slowly changing business dimensions.

RowKey design follows hash pre‑partitioning, short length, and uniqueness to avoid hotspotting and reduce storage overhead. Example:

concat_ws(':', SUBSTRING(md5(unique_id), 1, 4), feature_name, unique_id)
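When the same key must also be built on the write path, a Java equivalent of this expression might look like the sketch below. The class and method names and the sample feature name are hypothetical; the 4-character MD5 prefix and the prefix:feature:id layout mirror the SQL above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class RowKeys {
    /** Builds "md5Prefix:featureName:uniqueId", matching the SQL expression above. */
    public static String featureRowKey(String uniqueId, String featureName) {
        // The 4-char MD5 prefix salts the key, spreading writes across
        // pre-split regions; the trailing uniqueId preserves uniqueness.
        return md5Hex(uniqueId).substring(0, 4) + ":" + featureName + ":" + uniqueId;
    }

    private static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e); // not expected on a standard JVM
        }
    }

    public static void main(String[] args) {
        // Prints "827c:click_cnt_7d:12345" (the prefix depends on the id's MD5).
        System.out.println(featureRowKey("12345", "click_cnt_7d"));
    }
}
```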

Feature lifecycle management uses HBase TTL to automatically purge expired data, cutting storage cost and simplifying business logic.
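As an illustration of both ideas, hash pre-splitting and TTL-based expiry, a feature table could be created through the HBase 2.x Admin API roughly as follows. The table name, column-family name, 30-day TTL, and 16-region split are all hypothetical values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateFeatureTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Column family with a 30-day TTL: HBase drops expired cells during
            // compactions, so no application-side purge job is needed.
            ColumnFamilyDescriptor family = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("f"))
                    .setTimeToLive(30 * 24 * 60 * 60) // seconds; value is an assumption
                    .build();

            TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("user_features"))
                    .setColumnFamily(family)
                    .build();

            // Pre-split on the first hex character of the MD5 prefix so writes
            // spread across 16 regions from day one, avoiding hotspots.
            byte[][] splits = new byte[15][];
            for (int i = 1; i <= 15; i++) {
                splits[i - 1] = Bytes.toBytes(Integer.toHexString(i));
            }
            admin.createTable(table, splits);
        }
    }
}
```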

Flink + HBase Feature Aggregation

Flink handles real-time stateful aggregation; HBase stores the resulting features for millisecond-level queries. Benefits include the following (a minimal end-to-end sketch appears after this list):

Real‑time, low‑latency processing.

Efficient large‑state handling with exactly‑once guarantees.

Horizontal scalability and fault tolerance via Flink checkpoints and HBase persistence.

Decoupled compute and storage, enabling feature reuse across downstream applications.
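Under the assumption of a simple (userId, clickCount) event stream and the hypothetical user_features table and salted-key scheme sketched above, the end-to-end pattern looks roughly like this; it is a shape sketch, not the platform's actual job. The MD5 helper comes from Apache commons-codec, which typically ships with the HBase client dependencies.

```java
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // aggregation state survives failures via checkpoints

        // Stand-in source of (userId, clickCount) events; in production this
        // would be a Kafka source feeding real behavior logs.
        env.fromElements(Tuple2.of("u1", 1L), Tuple2.of("u2", 1L), Tuple2.of("u1", 1L))
                // Stateful keyed aggregation: Flink keeps a running per-user sum
                // in managed state, restored exactly-once after a failure.
                .keyBy(t -> t.f0)
                .sum(1)
                // Compute/storage decoupling: aggregates land in HBase, where any
                // downstream service can read them with millisecond lookups.
                .addSink(new HBaseFeatureSink());

        env.execute("feature-aggregation");
    }

    /** Writes each aggregate as a Put; table, family, and feature names are assumptions. */
    static class HBaseFeatureSink extends RichSinkFunction<Tuple2<String, Long>> {
        private transient Connection conn;
        private transient Table table;

        @Override
        public void open(Configuration parameters) throws Exception {
            conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
            table = conn.getTable(TableName.valueOf("user_features"));
        }

        @Override
        public void invoke(Tuple2<String, Long> value, Context context) throws Exception {
            // Salted RowKey with the same layout as the SQL expression shown earlier.
            String rowKey = DigestUtils.md5Hex(value.f0).substring(0, 4)
                    + ":click_cnt:" + value.f0;
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("click_cnt"), Bytes.toBytes(value.f1));
            table.put(put);
        }

        @Override
        public void close() throws Exception {
            if (table != null) table.close();
            if (conn != null) conn.close();
        }
    }
}
```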

Practical Cases

Throughput jumped from 15,000 TPS to 550,000 TPS, and batch latency improved by more than 90%.

Business continuity was ensured through automated failover and multi‑active HBase deployment.

Future Outlook

The platform will evolve toward a unified stream‑batch architecture, eliminating the traditional Lambda split, and aim for end‑to‑end real‑time data pipelines.

AI‑enhanced data operations will automatically optimize layout, indexing, and compression, while intelligent governance will auto‑label and detect data quality issues.

Feature platforms will become a central "feature factory" tightly integrated with the data warehouse, supporting attribution analysis and self‑service data exploration.

Ultimately, the system will become a data‑intelligent operating system offering low‑code tools, natural‑language queries, and a seamless experience for analysts and data scientists.

Appendix – Basic Concepts

Data Sources: Kafka, Doris, Elasticsearch, Redis, Hive, HBase.

Data Ingestion: Defines schemas and imports external data into the platform.

Wide Tables, Indexes, Features: Logical schemas that map to physical storage (ES, Redis, HBase) for efficient querying and model training.

Transformation Management: Supports business-table→wide-table, wide-table→feature, and wide-table→index conversions.

Tasks: Smallest executable units for data integration, cleaning, transformation, modeling, and aggregation, orchestrated by workflow tools with dependency management and retry mechanisms.

Tags: distributed systems, big data, real-time processing, feature engineering, data architecture