Real‑Time Data Warehouse Development with Flink: Architecture, Implementation, and Lessons Learned
This article describes the motivation, technology selection, implementation details, and practical challenges of building a real‑time data warehouse using Flink, covering stream ingestion, data cleaning, dimension‑table joins, state backend choices, and operational lessons for large‑scale streaming pipelines.
Traditional offline data warehouses focus on reusable dimensional and relational models, but they cannot meet the low‑latency requirements of many modern business scenarios; when each real‑time application re‑processes the raw streams on its own, the result is data silos and wasted resources.
To address the growing demand for timely data at the company, a five‑layer real‑time warehouse (ingestion → computation → storage → service → application) was planned, with the computation layer built on Apache Flink after comparing Storm, Spark‑Streaming, and Flink.
Flink was chosen for its high throughput, millisecond‑level latency, exactly‑once semantics, and flexible windowing. The overall pipeline reads Kafka topics, applies a Mapper for filtering and cleaning and a Flater for dimension‑table enrichment and flattening, then writes the results back to Kafka.
private static void proc(String[] args) throws Exception {
    // Obtain the runtime environment
    final StreamExecutionEnvironment env = CommonUtil.getOnlineStreamEnv(TimeCharacteristic.ProcessingTime);
    // Create the Kafka consumer
    FlinkKafkaConsumer010<String> consumer = CommonUtil.genKafkaConsumer(Constants.TOPIC_WEB_APPLOG, args);
    // source -> map (filter and clean) -> flatMap (join dimension tables) -> sink
    DataStream<String> stream = env.addSource(consumer)
            .map(new AppLogEntity.Mapper())
            .flatMap(new AppLogEntity.Flater())
            .returns(String.class);
    stream.process(new ProcessFunction<String, String>() {
        @Override
        public void processElement(String value, Context ctx, Collector<String> out) {
            out.collect(value);
        }
    }).addSink(CommonUtil.genUasKafkaProducer());
    env.execute(JOB_NAME);
}

public static class Mapper implements MapFunction<String, AppLog> {
    @Override
    public AppLog map(String line) {
        // Strip control characters, tabs, and newlines before splitting
        line = line.trim().replace("\007", "").replace("\t", "").replaceAll("\\n", "");
        List<String> tokens = StringUtil.fastSplit(line, Constants.APP_LOG_ORIGINAL_SPERATOR, 40);
        if (tokens.size() < 40) return null;
        // Dispatch on the log-type prefix carried in the first token
        switch (tokens.get(0)) {
            case Constants.APP_ACT_LOG_PREFIX: return new AppUsingLogEntity().parse(tokens);
            case Constants.APP_EVENT_LOG_PREFIX: return new AppEventLogEntity().parse(tokens);
            case Constants.APP_CLT_LOG_PREFIX: return new AppCltEntity().parse(tokens);
            default: return null;
        }
    }
}

public static class Flater extends RichFlatMapFunction<AppLog, String> {
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // Warm up the dimension-table caches once per task instance
        DcDimAppEventParameterCache.getInstance();
        DcDimAppEventSiteMappingCache.getInstance();
        DcDimEventActionTypeCache.getInstance();
        DcDimEventOperTypeRowCache.getInstance();
        DcDimEventBizTypeCache.getInstance();
    }

    @Override
    public void flatMap(AppLog model, Collector<String> collector) {
        if (model == null) return;
        switch (model.getLogType()) {
            case Constants.APP_EVENT_LOG_PREFIX:
                // Event logs are joined against dimension tables, then flattened
                AppEventLogEntity entity = (AppEventLogEntity) model;
                AppEventLogEntity.joinAndFix(entity);
                for (String outLine : entity.getResult(entity)) {
                    if (StringUtils.isBlank(outLine)) continue;
                    collector.collect(outLine);
                }
                return;
            case Constants.APP_ACT_LOG_PREFIX:
            case Constants.APP_CLT_LOG_PREFIX:
                collector.collect(model.getResult());
                return;
        }
    }
}

During development, two major pitfalls were encountered. First, dimension‑table joins produced duplicate records: the state of the dimension stream never expired, so repeated joins degenerated into Cartesian products. This was solved by caching small dimension tables locally in the task and larger ones in Redis with an LRU eviction policy. Second, Flink state size became a problem: the default MemoryStateBackend caused out‑of‑memory failures once state grew large, so the team evaluated FsStateBackend and RocksDBStateBackend, ultimately adopting RocksDB with incremental checkpoints for durability and scalability.
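The local‑cache half of the first fix can be sketched with a bounded LRU map, so that dimension entries cached in the task never grow without limit. This is a minimal illustration using only the JDK, not the original project's cache classes; the `DimLruCache` name and the capacity are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded LRU cache for local dimension-table lookups.
// Not the original pipeline's cache implementation.
class DimLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    DimLruCache(int maxEntries) {
        // accessOrder = true, so get() refreshes an entry's recency
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once the cap is exceeded
        return size() > maxEntries;
    }
}
```

For the second fix, Flink 1.x exposes the state backend directly on the execution environment, e.g. `env.setStateBackend(new RocksDBStateBackend(checkpointPath, true))`, where the boolean flag enables incremental checkpoints; the `checkpointPath` URI is deployment‑specific.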
In conclusion, a real‑time data warehouse effectively complements an offline warehouse by providing low‑latency insights without compromising the overall data architecture. The implemented pipeline now supports user‑behavior analytics, integrates with downstream services, and continues to evolve by adopting industry‑leading streaming solutions.