Real‑Time Data Warehouse Development with Flink: Architecture, Implementation, and Lessons Learned
This article describes the motivation, technology selection, implementation details, and practical challenges of building a real‑time data warehouse using Flink, covering stream ingestion, data cleaning, dimension‑table joins, state backend choices, and operational lessons for large‑scale streaming pipelines.
Traditional offline data warehouses focus on reusable dimensional and relational models, but they cannot meet the low‑latency requirements of many modern business scenarios; when each real‑time application re‑processes the raw streams on its own, the result is data silos and wasted resources.
To address the growing demand for timely data at the company, a five‑layer real‑time warehouse (ingestion → computation → storage → service → application) was planned, with the computation layer built on Apache Flink after comparing Storm, Spark‑Streaming, and Flink.
Flink was chosen for its high throughput, millisecond‑level latency, exactly‑once semantics, and flexible windowing. The overall pipeline reads Kafka topics, applies a Mapper for filtering and cleaning and a Flater for dimension‑table enrichment and flattening, then writes the results back to Kafka.
private static void proc(String[] args) throws Exception {
    // Obtain the runtime environment
    final StreamExecutionEnvironment env = CommonUtil.getOnlineStreamEnv(TimeCharacteristic.ProcessingTime);
    // Create the Kafka consumer
    FlinkKafkaConsumer010<String> consumer = CommonUtil.genKafkaConsumer(Constants.TOPIC_WEB_APPLOG, args);
    // source -> map (filter and clean) -> flatMap (join dimension tables) -> sink
    DataStream<String> stream = env.addSource(consumer)
            .map(new AppLogEntity.Mapper())
            .flatMap(new AppLogEntity.Flater())
            .returns(String.class);
    stream.process(new ProcessFunction<String, String>() {
        @Override
        public void processElement(String value, Context ctx, Collector<String> out) {
            out.collect(value);
        }
    }).addSink(CommonUtil.genUasKafkaProducer());
    env.execute(JOB_NAME);
}

public static class Mapper implements MapFunction<String, AppLog> {
    @Override
    public AppLog map(String line) {
        // Strip control characters, tabs, and newlines before splitting
        line = line.trim().replace("\007", "").replace("\t", "").replaceAll("\\n", "");
        List<String> tokens = StringUtil.fastSplit(line, Constants.APP_LOG_ORIGINAL_SPERATOR, 40);
        if (tokens.size() < 40) return null;
        // Dispatch on the log-type prefix carried in the first token
        switch (tokens.get(0)) {
            case Constants.APP_ACT_LOG_PREFIX: return new AppUsingLogEntity().parse(tokens);
            case Constants.APP_EVENT_LOG_PREFIX: return new AppEventLogEntity().parse(tokens);
            case Constants.APP_CLT_LOG_PREFIX: return new AppCltEntity().parse(tokens);
            default: return null;
        }
    }
}

public static class Flater extends RichFlatMapFunction<AppLog, String> {
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // Warm up the dimension-table caches once per task instance
        DcDimAppEventParameterCache.getInstance();
        DcDimAppEventSiteMappingCache.getInstance();
        DcDimEventActionTypeCache.getInstance();
        DcDimEventOperTypeRowCache.getInstance();
        DcDimEventBizTypeCache.getInstance();
    }

    @Override
    public void flatMap(AppLog model, Collector<String> collector) {
        if (model == null) return;
        switch (model.getLogType()) {
            case Constants.APP_EVENT_LOG_PREFIX:
                // Event logs are joined against dimension tables, then flattened
                AppEventLogEntity entity = (AppEventLogEntity) model;
                AppEventLogEntity.joinAndFix(entity);
                for (String outLine : entity.getResult(entity)) {
                    if (StringUtils.isBlank(outLine)) continue;
                    collector.collect(outLine);
                }
                return;
            case Constants.APP_ACT_LOG_PREFIX:
            case Constants.APP_CLT_LOG_PREFIX:
                collector.collect(model.getResult());
                return;
        }
    }
}

During development, two major pitfalls were encountered. First, dimension‑table joins produced duplicate records: the state of the dimension stream never expired, so repeated joins degenerated into Cartesian products. This was solved by caching small dimension tables locally in the task and larger ones in Redis with an LRU eviction policy. Second, Flink state size became a problem: the default MemoryStateBackend caused out‑of‑memory failures once state grew large, so the team evaluated FsStateBackend and RocksDBStateBackend, ultimately adopting RocksDB with incremental checkpoints for durability and scalability.
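The local‑cache half of the first fix can be sketched with a bounded LRU map, so that dimension entries cached in the task never grow without limit. This is a minimal illustration using only the JDK, not the original project's cache classes; the `DimLruCache` name and the capacity are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded LRU cache for local dimension-table lookups.
// Not the original pipeline's cache implementation.
class DimLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    DimLruCache(int maxEntries) {
        // accessOrder = true, so get() refreshes an entry's recency
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once the cap is exceeded
        return size() > maxEntries;
    }
}
```

For the second fix, Flink 1.x exposes the state backend directly on the execution environment, e.g. `env.setStateBackend(new RocksDBStateBackend(checkpointPath, true))`, where the boolean flag enables incremental checkpoints; the `checkpointPath` URI is deployment‑specific.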
In conclusion, a real‑time data warehouse effectively complements an offline warehouse by providing low‑latency insights without compromising the overall data architecture. The implemented pipeline now supports user‑behavior analytics, integrates with downstream services, and continues to evolve by adopting industry‑leading streaming solutions.