Online Sample Generation with Flink: Architecture and Implementation

This article explains why Flink was chosen for online sample generation, describes the end‑to‑end implementation (stream union, state and timer processing, output formatting), and covers state backend choices, monitoring, validation, fault handling, and platformization for scalable real‑time machine‑learning pipelines.

DataFunTalk

Sep 13, 2020

Compared with offline approaches, online machine learning offers better timeliness, shorter iteration cycles, and more effective experiments, so migrating to online learning is an effective way to improve business metrics. This article details how Weibo uses Apache Flink to implement online sample generation.

After evaluating popular real‑time computation frameworks (Storm 0.10, Spark 2.11, Flink 1.10), Flink was selected for its stability, fault‑tolerance, and flexibility, especially for the strict latency and accuracy requirements of online sample generation.

The implementation processes exposure and click logs in real time, joins them, and writes the resulting samples to Kafka for downstream online training jobs. The time window for joining is first determined offline (e.g., a 20‑minute window achieving an 85% join ratio) and then applied online.
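
The offline window selection described above can be sketched as follows: given historical delays between an exposure and its click, pick the smallest candidate window whose join ratio reaches a target such as 85%. This is a minimal sketch, not Weibo's tooling; the class name JoinWindowEstimator, its methods, and the delay values are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: picks a join window from historical exposure-to-click delays. */
final class JoinWindowEstimator {

    /** Fraction of clicks whose delay after the exposure is within windowMinutes. */
    static double joinRatio(long[] delaysMinutes, long windowMinutes) {
        if (delaysMinutes.length == 0) {
            return 0.0;
        }
        long joined = Arrays.stream(delaysMinutes).filter(d -> d <= windowMinutes).count();
        return (double) joined / delaysMinutes.length;
    }

    /** Smallest candidate window that reaches the target ratio, or -1 if none does. */
    static long smallestWindow(long[] delaysMinutes, long[] candidates, double targetRatio) {
        long[] sorted = candidates.clone();
        Arrays.sort(sorted);
        for (long window : sorted) {
            if (joinRatio(delaysMinutes, window) >= targetRatio) {
                return window;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up delays (in minutes) between an exposure and its click.
        long[] delays = {1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 15, 16, 17, 18, 19, 19, 20, 20, 35, 90};
        long[] candidates = {5, 10, 20, 30};
        System.out.println(smallestWindow(delays, candidates, 0.85));
    }
}
```

A longer window raises the join ratio but also delays every sample and enlarges the state, so the smallest window that meets the target is usually the one to try first.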

Key implementation steps are:

Read input streams (Kafka, Trigger, MQ) in real time.

Format and filter logs, select join keys (e.g., user‑id and content‑id), and output as Tuple2<K,V>.

Union multiple streams using Flink’s union operation and assign distinguishable aliases.

Apply keyBy on the join key; handle data skew by adding a random prefix before aggregation.

Maintain a ValueState and register a processing‑time timer; in processElement create or update the state, and in onTimer emit the sample and clear the state.

After the timer fires, format the joined sample (JSON, CSV) and filter out invalid records.
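
The random-prefix idea from the keyBy step can be illustrated without Flink as a two-stage aggregation: stage one counts per salted key so that a hot key is spread over several buckets, and stage two strips the prefix and merges the partial results. This is only a sketch under assumptions: the class SaltedAggregation, the "#" separator, and the bucket count are hypothetical, and the source does not describe how the prefix interacts with the join key itself.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of two-stage aggregation with a random key prefix. */
final class SaltedAggregation {

    /** Stage 1: count per salted key ("<salt>#<key>"), spreading a hot key over several buckets. */
    static Map<String, Long> partialCounts(List<String> keys, int buckets, Random random) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : keys) {
            String salted = random.nextInt(buckets) + "#" + key;
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    /** Stage 2: strip the prefix and merge the partial counts per original key. */
    static Map<String, Long> finalCounts(Map<String, Long> partial) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : partial.entrySet()) {
            String saltedKey = entry.getKey();
            String key = saltedKey.substring(saltedKey.indexOf('#') + 1);
            result.merge(key, entry.getValue(), Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hot", "hot", "hot", "hot", "hot", "cold");
        Map<String, Long> partial = partialCounts(keys, 4, new Random(42));
        System.out.println(finalCounts(partial));
    }
}
```

Whatever salts are drawn, the merged totals equal the unsalted counts; only the distribution of work in stage one changes.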

Sample code (StateSampleFunction) illustrating the state‑timer logic:

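import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Joins exposure and click records per key: the first record for a key opens the window,
// later records update the state, and the timer emits the joined sample.
// Tuple2<String, String> (join key, log payload) is an assumed element type;
// ReturnSample is the sample class, defined elsewhere.
public class StateSampleFunction extends KeyedProcessFunction<String, Tuple2<String, String>, ReturnSample> {
    private transient ValueState<ReturnSample> state;
    private final long windowMillis; // join window length in milliseconds

    public StateSampleFunction(String time) { windowMillis = Long.parseLong(time); }

    @Override
    public void open(Configuration parameters) throws Exception {
        state = getRuntimeContext().getState(new ValueStateDescriptor<>("state", ReturnSample.class));
    }

    @Override
    public void processElement(Tuple2<String, String> value, Context context, Collector<ReturnSample> out) throws Exception {
        if (value.f0 == null) return; // skip records without a join key
        ReturnSample sample = state.value();
        if (sample == null) {
            // First record for this key: create the sample and schedule its emission.
            sample = new ReturnSample();
            sample.setKey(value.f0);
            sample.setTime(context.timerService().currentProcessingTime());
            context.timerService().registerProcessingTimeTimer(sample.getTime() + windowMillis);
        }
        // update state with click or exposure data
        state.update(sample);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ReturnSample> out) throws Exception {
        // The window has closed: emit whatever was joined and release the state.
        ReturnSample sample = state.value();
        state.clear();
        if (sample != null) {
            out.collect(sample);
        }
    }
}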
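
To make the state and timer behavior easier to follow, here is a plain-Java simulation of the same pattern with no Flink dependency. It is a sketch only: the Sample fields (exposed, clicks), the event type strings, and the 1,000 ms window are hypothetical, and a HashMap and a TreeMap stand in for Flink's keyed state and timer service.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the per-key state plus timer pattern (no Flink dependency). */
final class StateTimerSimulation {

    /** Hypothetical joined sample: exposure flag plus the number of clicks seen in the window. */
    static final class Sample {
        final String key;
        boolean exposed;
        int clicks;
        Sample(String key) { this.key = key; }
        @Override public String toString() { return key + ":exposed=" + exposed + ",clicks=" + clicks; }
    }

    private final long windowMillis;
    private final Map<String, Sample> state = new HashMap<>();          // stands in for ValueState
    private final TreeMap<Long, List<String>> timers = new TreeMap<>(); // fire time -> keys
    private final List<Sample> output = new ArrayList<>();

    StateTimerSimulation(long windowMillis) { this.windowMillis = windowMillis; }

    /** Mirrors processElement: fire due timers, then create or update the state for the key. */
    void onEvent(long now, String key, String type) {
        advanceTo(now);
        Sample sample = state.get(key);
        if (sample == null) {
            sample = new Sample(key);
            state.put(key, sample);
            timers.computeIfAbsent(now + windowMillis, t -> new ArrayList<>()).add(key);
        }
        if ("click".equals(type)) {
            sample.clicks++;
        } else {
            sample.exposed = true;
        }
    }

    /** Mirrors onTimer: emit and clear the state of every key whose timer is due. */
    void advanceTo(long now) {
        while (!timers.isEmpty() && timers.firstKey() <= now) {
            for (String key : timers.pollFirstEntry().getValue()) {
                Sample sample = state.remove(key);
                if (sample != null) {
                    output.add(sample);
                }
            }
        }
    }

    List<Sample> output() { return output; }

    public static void main(String[] args) {
        StateTimerSimulation sim = new StateTimerSimulation(1_000);
        sim.onEvent(0, "u1_c1", "exposure");
        sim.onEvent(400, "u1_c1", "click");    // inside the window: joined
        sim.onEvent(500, "u2_c9", "exposure"); // never clicked: negative sample
        sim.advanceTo(2_000);
        System.out.println(sim.output());
    }
}
```

Note that a record arriving after its key's timer has fired starts a fresh state and a new window, which is also what the Flink function does; such late records are one reason the window length is chosen from the offline join ratio.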

For state storage, RocksDB or Gemini is preferred over in‑memory state; Gemini combined with SSD delivers the best performance for large state sizes.

Comprehensive monitoring is provided, covering Flink job health (failures, failover, checkpoint errors, RocksDB usage, consumer lag, back‑pressure) and sample‑specific metrics (join rate, positive sample count, output format validation, tag value ranges).
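
The sample-specific metrics could be tracked with a small counter like the one below. This is a sketch under assumptions: labels are taken to be 0 (no click) or 1 (click), the join rate is taken to be the share of positive samples among all emitted samples, and the class SampleMonitor and its thresholds are hypothetical.

```java
/** Hypothetical sample-quality counter: join rate, positive count, and label range check. */
final class SampleMonitor {
    private long total;
    private long positives;
    private long outOfRangeLabels;

    /** Records one emitted sample; labels are assumed to be 0 (no click) or 1 (click). */
    void record(int label) {
        total++;
        if (label == 1) {
            positives++;
        } else if (label != 0) {
            outOfRangeLabels++;
        }
    }

    long positiveCount() { return positives; }

    /** Share of positive samples among all recorded samples (0.0 when nothing was recorded). */
    double joinRate() { return total == 0 ? 0.0 : (double) positives / total; }

    /** True when the join rate leaves [minRate, maxRate] or any label was out of range. */
    boolean shouldAlert(double minRate, double maxRate) {
        double rate = joinRate();
        return outOfRangeLabels > 0 || rate < minRate || rate > maxRate;
    }

    public static void main(String[] args) {
        SampleMonitor monitor = new SampleMonitor();
        for (int label : new int[] {0, 0, 0, 1}) {
            monitor.record(label);
        }
        System.out.println(monitor.joinRate() + " alert=" + monitor.shouldAlert(0.1, 0.5));
    }
}
```

Checking both a lower and an upper bound matters: a join rate that suddenly rises can indicate duplicated clicks or dropped exposures just as a falling rate can indicate lost clicks.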

Validation includes online‑offline cross‑checks, offline storage of samples in HDFS for comparison, and whitelist user end‑to‑end verification using Elasticsearch.
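
An online-offline cross-check can be as simple as comparing the sample keys produced by both paths for the same time range. The sketch below assumes each sample is identified by a string key; the class CrossCheck and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical cross-check between online samples and an offline recomputation. */
final class CrossCheck {

    /** Fraction of offline sample keys that also appear in the online output. */
    static double coverage(Set<String> onlineKeys, Set<String> offlineKeys) {
        if (offlineKeys.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(offlineKeys);
        common.retainAll(onlineKeys);
        return (double) common.size() / offlineKeys.size();
    }

    /** Keys present offline but missing online, for manual inspection. */
    static Set<String> missingOnline(Set<String> onlineKeys, Set<String> offlineKeys) {
        Set<String> missing = new HashSet<>(offlineKeys);
        missing.removeAll(onlineKeys);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> online = new HashSet<>(Arrays.asList("u1_c1", "u2_c9", "u3_c4"));
        Set<String> offline = new HashSet<>(Arrays.asList("u1_c1", "u2_c9", "u3_c4", "u4_c2"));
        System.out.println(coverage(online, offline) + " missing=" + missingOnline(online, offline));
    }
}
```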

When sample anomalies are detected, alerts are sent to pause model updates, and faulty data can be discarded; Kafka offsets can be reset to regenerate correct samples.

The platform abstracts most configuration to a UI: users only implement custom data‑cleaning UDFs for input and output, while the platform handles Kafka connections, window settings, aggregation logic, resource allocation, job submission, and automatic monitoring.

Architecture diagrams and UI screenshots (omitted here) illustrate the end‑to‑end flow of the online sample generation platform built on Flink.

Author: Cao Fuqiang, Senior System Engineer at Weibo Machine Learning R&D Center, focusing on real‑time computation with Flink, Kafka, Redis, and offline processing with Hive and Spark.
