Tagged articles
946 articles
Page 5 of 10
DataFunTalk
Aug 29, 2022 · Big Data

Migrating from Lambda Architecture to an Iceberg‑Based Unified Batch‑Stream Architecture at NetEase Yanxuan

This article details how NetEase Yanxuan upgraded its legacy Lambda data pipeline to a unified batch‑stream architecture built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and DeltaLake, implementation specifics, table‑governance techniques, and future roadmap.

Batch-Stream · Data Lake · Flink
0 likes · 14 min read
DataFunSummit
Aug 25, 2022 · Big Data

Managing the Full Lifecycle of Risk Features: Pitfalls, Solutions, and Future Directions

The talk by Tang Gengyang from Citic Baixin Bank details the challenges faced in risk feature engineering, presents two solution frameworks (1.0 and 2.0) for accelerating deployment, improving reuse, and handling offline/online consistency, and outlines future enhancements for a more efficient, automated feature pipeline.

Flink · asynchronous processing · data pipelines
0 likes · 14 min read
Alibaba Cloud Big Data AI Platform
Aug 25, 2022 · Big Data

How Alibaba Cloud Flink + Hologres Power Real‑Time Data Warehouses

This article explains how Alibaba Cloud Flink and Hologres combine to deliver a one‑stop, cloud‑native real‑time data‑warehouse solution that supports low‑latency ingestion, full‑incremental CDC, automatic schema evolution, high‑performance OLAP and online serving, and simplifies ETL/ELT pipelines for enterprise analytics.

Flink · Hologres · cloud computing
0 likes · 25 min read
Big Data Technology & Architecture
Aug 23, 2022 · Big Data

Using Flink Broadcast State for Dynamic Configuration Updates and Real‑Time Data Enrichment

This article explains how Flink's Broadcast State feature can be used to dynamically update processing rules and enrich streaming events with user information from MySQL, showing configuration, code examples, key considerations, and runtime results that demonstrate real‑time adaptability without restarting the job.

Broadcast State · Dynamic Configuration · Flink
0 likes · 15 min read
Big Data Technology Architecture
Aug 23, 2022 · Big Data

Apache Hudi 0.12.0 Release Highlights: Presto Connector, Archive Beyond Savepoint, File‑System Locks, Deltastreamer Termination, Spark & Flink Support, Performance Improvements, and Configuration Updates

The Apache Hudi 0.12.0 release introduces a native Presto connector, archive‑beyond‑savepoint capability, file‑system based locking, new deltastreamer termination strategies, expanded Spark and Flink support, numerous performance enhancements, and a series of configuration and API updates for better data‑lake management.

Apache Hudi · Flink · Presto
0 likes · 12 min read
Volcano Engine Developer Services
Aug 15, 2022 · Big Data

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

This article explains how ByteDance’s event‑tracking data flow handles billions of events per second using Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster‑recovery to ensure stability, low latency, and cost‑effective processing for diverse downstream services.

Big Data · Flink · Scalability
0 likes · 16 min read
ITPUB
Aug 13, 2022 · Big Data

How Alibaba Uses Flink to Power Massive Real‑Time Risk Control

This article explains how Alibaba leverages Flink to handle over 40 billion events per second across all business units, detailing risk‑control concepts, rule types, architectural stages, resource tuning, dynamic CEP, shared computing, and the FY23 roadmap for large‑scale streaming risk management.

Alibaba · Big Data · CEP
0 likes · 16 min read
DaTaobao Tech
Aug 11, 2022 · Big Data

Unify SQL Engine: Integrating Stream, Batch, and Online Computing for Data Warehousing

The article describes how fragmented real‑time, batch, and online data‑warehouse pipelines suffer from low productivity and inconsistent data quality, and introduces a unified SQL engine built on Apache Calcite that parses, optimizes, and compiles a single SQL statement into executable plans for ODPS, Flink, or Java, leveraging Janino code generation, multi‑backend state storage, and snapshot‑join semantics to boost performance and simplify development.

Batch Processing · Calcite · Flink
0 likes · 16 min read
DataFunTalk
Aug 6, 2022 · Big Data

Exploring Real‑Time Data Lake Practices at Xiaohongshu Using Apache Iceberg

This article details Xiaohongshu's data platform engineering, describing how Apache Iceberg is leveraged for real‑time data lake ingestion, CDC pipelines, multi‑cloud storage, small‑file mitigation, schema evolution, and future plans across storage, compute, and management within a big‑data ecosystem.

Apache Iceberg · CDC · Flink
0 likes · 16 min read
Alibaba Cloud Big Data AI Platform
Aug 4, 2022 · Big Data

Boost Real‑Time Data Warehouses with Integrated Analytics & Service

Alibaba Cloud’s Hologres unifies analytical and service workloads in a real‑time data warehouse, simplifying data exchange, reducing development and operational costs, and delivering high‑performance, low‑latency online services through innovations like row‑column hybrid storage, hot upgrades, and elastic cloud‑native scaling, as demonstrated in a logistics case study.

Flink · Hologres · Real-Time
0 likes · 13 min read
DataFunTalk
Jul 26, 2022 · Big Data

Feature Platform Architecture and Stream‑Batch Integrated Solutions

This talk presents Shuhe Technology’s feature platform, detailing its four‑layer architecture, feature storage services, stream‑batch integrated processing, event‑center design, consistency models, and four model‑strategy invocation schemes, illustrating data flows from MySQL through Sqoop, Kafka, Flink, HBase and ClickHouse.

Big Data · Flink · HBase
0 likes · 17 min read
JavaEdge
Jul 25, 2022 · Big Data

Choosing Between Lambda and Kappa: Real‑Time Data Warehouse Strategies

The article uses an acorn‑moving analogy to highlight latency and traceability challenges in enterprise data warehouses, then explains offline versus real‑time approaches, compares Lambda and Kappa architectures, discusses Iceberg integration, and shares a detailed e‑commerce real‑time warehouse case study with optimization tips.

Big Data · Flink · Iceberg
0 likes · 15 min read
ITPUB
Jul 22, 2022 · Big Data

From Client‑Side to Server‑Side: How NetEase Built StreamflySQL on Flink SQL

This article chronicles NetEase Games' evolution of its real‑time StreamflySQL platform, detailing the transition from a client‑side Flink SQL implementation to a server‑side architecture powered by SQL Gateway, and discusses the motivations, design choices, challenges, and performance improvements achieved.

Big Data · Flink · SQL Gateway
0 likes · 19 min read
HomeTech
Jul 20, 2022 · Big Data

Design and Implementation of a Real-Time Advertising Data Warehouse Using Flink and StarRocks

This article presents a comprehensive case study of building a real‑time advertising data warehouse at Autohome, detailing the evaluation of streaming engines and storage solutions, the layered architecture design, implementation steps with Flink and StarRocks, monitoring practices, encountered issues, and future roadmap, demonstrating how second‑level data freshness and high accuracy were achieved.

Flink · StarRocks · Streaming
0 likes · 10 min read
StarRocks
Jul 18, 2022 · Big Data

How Songguo Mobility Built a Real‑Time OLAP Platform with StarRocks: From 1.0 to 3.0

Songguo Mobility’s data‑center team migrated from a fragmented Impala‑Kudu‑ClickHouse stack to a unified StarRocks‑based real‑time OLAP architecture, iterating through three versions to solve scalability, latency, and maintenance challenges while supporting minute‑level dashboards for orders and vehicle analytics.

Flink · Kafka · Real-time OLAP
0 likes · 19 min read
DataFunSummit
Jul 17, 2022 · Big Data

Elasticsearch and Big Data: Architecture, Use Cases, and Advantages

This article explains what Elasticsearch is, how it solves database acceleration, log observability, and data analysis problems, details its core components and underlying engine features, compares its strengths and weaknesses, and presents classic application scenarios and a real‑world case study integrating Elasticsearch with Flink for large‑scale log analytics.

Big Data · Elasticsearch · Flink
0 likes · 13 min read
Big Data Technology Architecture
Jul 15, 2022 · Big Data

Using and Designing the Apache SeaTunnel Examples Module

This article introduces Apache SeaTunnel's Examples module, compares SeaTunnel with DataX, explains its multi‑engine design, demonstrates Flink and Spark example implementations, and shares the speaker's experiences contributing to the open‑source community, providing practical guidance for big‑data integration projects.

Apache SeaTunnel · Data Integration · Flink
0 likes · 10 min read
DataFunTalk
Jul 14, 2022 · Big Data

Real‑Time Data Lake Practices at ByteDance and Alibaba: Architecture, Challenges, and Solutions

This article presents detailed case studies of ByteDance and Alibaba implementing real‑time data lake solutions with Hudi and Flink, describing the business drivers, architectural challenges, and the specific technical strategies such as unified metadata layers, optimistic locking, scalable hash indexing, and CDC‑based incremental ETL to achieve low‑latency, high‑throughput data processing.

Flink · Hudi · Real-time Data Lake
0 likes · 9 min read
Hulu Beijing
Jul 7, 2022 · Big Data

How Hulu Upgraded Hadoop 2.6 to 3.0: Lessons in Compatibility and Migration

This article details Hulu's five‑year journey from Hadoop 2.6 to 3.3.2, covering major feature evolutions, the original cluster architecture, a comprehensive upgrade plan, compatibility challenges across HDFS, YARN, Hive, Spark and Flink, and the testing and rollout strategies that ensured a smooth migration.

Big Data · Cluster Upgrade · Compatibility
0 likes · 17 min read
HelloTech
Jul 6, 2022 · Big Data

Investigation and Resolution of Elasticsearch Write Timeout Issues in a Real-Time Flink Data Sync Pipeline

The team diagnosed intermittent Elasticsearch write‑timeout failures in their real‑time Flink‑to‑Elasticsearch pipeline as lock contention from frequent duplicate updates to the same document IDs, and eliminated the issue by aggregating binlog events in a 5‑second sliding window to deduplicate writes, adjusting refresh intervals, using async translog durability, and disabling non‑essential fields.

Big Data · Elasticsearch · Flink
0 likes · 7 min read
Alibaba Cloud Developer
Jun 28, 2022 · Big Data

How Kuaishou Guarantees Real‑Time Data Warehouse Reliability During Billion‑Scale Events

This article details Kuaishou’s real‑time data warehouse architecture and its comprehensive assurance framework—including forward lifecycle standards, reverse fault‑injection testing, and Spring Festival event practices—highlighting challenges of massive traffic, high timeliness, accuracy, and stability, and outlining future plans for automation, batch‑stream integration, and cost reduction.

Flink · Real-time Streaming · SLA
0 likes · 23 min read
DataFunTalk
Jun 28, 2022 · Big Data

JD Retail Traffic Data Warehouse Architecture and Processing Practices

This article presents a comprehensive technical overview of JD.com’s retail traffic data processing pipeline, detailing the multi‑layer data warehouse architecture, real‑time and offline data flows, a large‑scale back‑fill case using Iceberg and OLAP, data‑skew detection and mitigation techniques, and future directions involving unified Flink‑Spark streaming‑batch solutions.

Data Skew · Flink · Iceberg
0 likes · 12 min read
Zuoyebang Tech Team
Jun 17, 2022 · Big Data

How FlinkSQL Auto‑Tuning Saves Resources and Guarantees SLA

This article describes the design and implementation of an automated FlinkSQL tuning system that monitors metrics, evaluates task health with rule‑based logic, calculates optimal resource adjustments, and performs fast scaling to reduce cluster waste, lower operational costs, and maintain SLA compliance.

Akka · Auto Scaling · Flink
0 likes · 15 min read
Alibaba Cloud Developer
Jun 14, 2022 · Big Data

Can a Streaming Data Warehouse Balance Freshness, Latency, and Cost?

This article examines the core trade‑offs of data warehouses—freshness, query latency, and cost—compares offline and real‑time architectures, introduces the concept of a streaming data warehouse, and details how Apache Flink Table Store aims to provide a unified, low‑cost solution.

Big Data · Flink · Real-time analytics
0 likes · 19 min read
JD Retail Technology
Jun 10, 2022 · Big Data

Design and Implementation of an International Business Data Platform for JD.com's 618 Promotion

The article details JD International's challenges and solutions in building a unified, real‑time data platform for its multi‑regional 618 promotion, covering business characteristics, data distribution, team organization, dashboard architecture, integration strategies, and short‑ and long‑term technical plans.

Data Integration · Data Platform · Flink
0 likes · 8 min read
Bilibili Tech
Jun 10, 2022 · Big Data

Incremental Data Lake Design and Hudi Core Optimizations with Flink

The article describes how combining Apache Flink with Hudi enables an incremental data lake that delivers near‑real‑time analytics by switching to merge‑on‑read, fixing log handling bugs, improving compaction planning, and refactoring table‑service scheduling, while showcasing use cases such as CDC ingestion, data quality control, and real‑time materialized views, and outlines future enhancements like optimistic concurrency and unified schema evolution.

Apache Hudi · CDC · Compaction Optimization
0 likes · 21 min read
Zuoyebang Tech Team
Jun 7, 2022 · Big Data

How Doris Powered Zuoyebang’s Real‑Time Data Warehouse for Faster Insights

Zuoyebang’s data team replaced fragmented, slow query solutions with Apache Doris, building a unified real‑time data warehouse that dramatically cut query latency from hours to seconds, streamlined data modeling, and improved reliability across diverse business scenarios, while integrating with Flink, Kafka, and ES via a unified API.

Apache Doris · Elasticsearch · Flink
0 likes · 20 min read
dbaplus Community
May 24, 2022 · Big Data

How Vipshop Replaced ELK with ClickHouse for a Scalable, Low‑Cost Log System

Vipshop’s Dragonfly log platform evolved from a costly 260‑node Elasticsearch cluster to a ClickHouse‑based architecture that uses a unified JSON format, vfilebeat ingestion, Flink parsing, and MergeTree storage to achieve high‑throughput writes, fast vectorized queries, flexible TTL management, and dramatically lower operational expenses.

EFK · Flink · Kafka
0 likes · 20 min read
DataFunTalk
May 24, 2022 · Big Data

Integrating Apache Flink with Apache Hudi: From Data Warehouse to Data Lake

This article explains how Apache Flink integrates with Apache Hudi to enable real‑time data lake ingestion, covering the evolution from traditional data warehouses to data lakes, Hudi’s core concepts such as timeline and file grouping, copy‑on‑write vs merge‑on‑read modes, and Flink’s CDC‑based ETL pipeline.

Big Data · CDC · Data Lake
0 likes · 18 min read
DataFunSummit
May 21, 2022 · Big Data

Tencent News Massive Log Processing Architecture and Data Applications

The article presents Tencent News' comprehensive massive log processing solution, covering background, overall architecture, data collection, real-time and offline computation layers, data quality assurance, and practical examples such as Flink CDC for database synchronization, illustrating how large‑scale data is managed and applied.

Flink · Log Processing · Tencent
0 likes · 10 min read
Big Data Technology & Architecture
May 15, 2022 · Big Data

Understanding Flink Window Table-Valued Functions (TVF) and Incremental Optimization

This article explains the concept of window table-valued functions in Flink, compares the old grouped‑window syntax with the new TVF syntax, details the physical and execution plans, introduces sliced windows for state reduction, and presents a small incremental‑output improvement with code examples.

Big Data · Flink · Incremental Aggregation
0 likes · 12 min read
Zuoyebang Tech Team
May 9, 2022 · Big Data

How Flink SQL Powered Real‑Time Learning Analytics at Zuoyebang

Zuoyebang’s big‑data team shares how they evolved from SparkStreaming to a Flink‑SQL‑centric real‑time platform, detailing three development stages, challenges in DAG optimization, Redis‑based table design, and platform features for unified deployment, ease of use, and operational governance.

Flink · Real-Time · Streaming
0 likes · 14 min read
58 Tech
May 5, 2022 · Big Data

Low-Code Real-Time Data Warehouse Construction System Using Flink

This article describes a low‑code, Flink‑based real‑time data‑warehouse construction system that abstracts the warehouse building process into ODS, DWD, DWS, and ADS layers, leverages a domain‑specific language and plugin engine to reduce development effort, and details its architecture, DSL design, plugin extensibility, dimension‑table completion, stream merging, aggregation, and storage strategies.

Big Data · DSL · Flink
0 likes · 11 min read
Big Data Technology & Architecture
May 4, 2022 · Big Data

Apache Hudi 0.11.0 Release Highlights: Multi‑Mode Index, Data Skipping, Async Index, Spark & Flink Integration, and New Utilities

The Apache Hudi 0.11.0 release introduces multi‑mode metadata indexing, enhanced data‑skipping, asynchronous indexing, extensive Spark and Flink integration improvements, new bundle utilities, and expanded metadata synchronization with BigQuery, AWS Glue, and DataHub, while also adding bucket indexing and encryption support.

Apache Hudi · Async Index · Big Data
0 likes · 13 min read
Bilibili Tech
May 3, 2022 · Artificial Intelligence

Bilibili AI Collaboration Platform Based on AIFlow: Architecture, Evolution, and Stream‑Batch Fusion

Bilibili built an AI collaboration platform based on AIFlow to simplify real-time machine-learning workflows, evolving through three stages that added event-driven scheduling, UI-driven parameter management, version snapshots, and a stateless client-server design, while enabling stream-batch fusion for feature back-filling; future work targets high availability, Airflow 2.0 compatibility, and richer streaming ML operators.

AIFlow · Bilibili · Flink
0 likes · 17 min read
HomeTech
Apr 27, 2022 · Big Data

AutoStream Real‑Time Computing Platform: Architecture, Resource Management, Scaling, Lakehouse Integration, and PyFlink Practices

This article details Autohome's AutoStream platform evolution from Storm to Flink‑based versions, covering real‑time application scenarios, strict budget‑controlled resource management, automatic scaling, lake‑house architecture with Iceberg, PyFlink integration, and future plans for resource optimisation and batch‑stream unification.

AutoStream · Flink · Lakehouse
0 likes · 15 min read
DataFunSummit
Apr 22, 2022 · Big Data

Huya Real-Time Computing SLA Practice: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook

The talk details Huya’s real‑time computing platform evolution from chaotic early stages to a unified, containerized system, defines core SLA metrics focused on latency compliance, describes capability enhancements such as demand monitoring, task analysis, and dynamic scaling, and outlines future goals for usability, stability, openness, and unified stream‑batch processing.

Flink · Real‑Time Computing · SLA
0 likes · 12 min read
ITPUB
Apr 19, 2022 · Big Data

Which Real-Time Data Warehouse Architecture Fits Your Needs? A Deep Dive

This article explains why modern enterprises need real‑time data‑warehouse architectures, breaks down traditional layered warehouse concepts, compares Lambda and Kappa models, evaluates five practical real‑time solutions—including Iceberg‑based lakehouse and MPP databases—provides code snippets, and offers selection guidance with real‑world company examples.

Big Data · Flink · Iceberg
0 likes · 19 min read
Big Data Technology & Architecture
Apr 15, 2022 · Big Data

Configuring Flink SQL Client with Iceberg: Catalogs, DDL, Data Insertion and Query

This guide explains how to set up the Flink SQL client to work with Apache Iceberg, covering Scala version requirements, downloading and deploying Iceberg jars, configuring Hive and HDFS catalogs, creating databases and tables, performing insert and overwrite operations, and querying data in both batch and streaming modes.

Big Data · Catalog · Flink
0 likes · 18 min read
DataFunTalk
Apr 15, 2022 · Big Data

Huya Real-Time Computing SLA Practices: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook

This article details Huya's real‑time computing platform evolution and its core SLA definitions focused on latency compliance, describes capability enhancements such as demand management, task analysis, and dynamic resource scaling, and outlines future directions emphasizing usability, stability, openness, and unified batch‑stream processing.

Flink · Real‑Time Computing · SLA
0 likes · 13 min read
Shopee Tech Team
Apr 14, 2022 · Big Data

URL Normalization and Statistical Analysis in MDAP Using Probabilistic and Machine Learning Techniques

MDAP normalizes URLs by automatically learning pattern‑tree rule models using entropy‑based splits, gibberish and numeric detection, and scalable Flink processing, which groups millions of raw URLs into concise patterns for accurate statistical monitoring, dramatically reducing data noise while still facing latency and model‑iteration challenges.

Flink · machine learning · pattern tree
0 likes · 20 min read
dbaplus Community
Apr 13, 2022 · Big Data

How Meituan Built a Scalable Real‑Time Data Warehouse with Flink

This article explains Meituan's real‑time data warehouse architecture, covering typical business scenarios, the evolution of its streaming platform, key design challenges, solutions such as unified data models, SQL‑based development, UDF hosting, operator optimizations, and future plans for incremental processing and unified batch‑stream semantics.

Flink · Meituan · real-time data
0 likes · 18 min read
Big Data Technology & Architecture
Apr 11, 2022 · Big Data

Real-Time Data Warehouse Construction: Background, Objectives, Architecture, and Case Studies

This article explains the growing demand for real‑time data warehouses, outlines their objectives and layered architecture, and presents detailed case studies from Didi, Kuaishou, Tencent, Youzan and others, illustrating design choices, implementation challenges, and best practices for building scalable streaming data platforms.

Flink · Kafka · big-data
0 likes · 48 min read
DataFunSummit
Apr 6, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Challenges and Solutions

This article presents a JD.com case study on applying Flink SQL for real‑time dimension modeling, detailing two complex streaming scenarios—full‑join of multiple streams and full‑group aggregation—along with the associated challenges of historical data handling, state management, and performance optimization, and proposes component‑based architectural solutions.

Big Data · Flink · Real-Time
0 likes · 14 min read
Big Data Technology & Architecture
Apr 5, 2022 · Big Data

Using ElasticsearchSink with Apache Flink: Configuration, Retry Strategies, and Failure Handling

This article introduces the ElasticsearchSink for Apache Flink, explains how to add Maven dependencies, implement the sink with configuration and retry settings, details failure handlers, and highlights important considerations such as exception handling and checkpoint requirements for reliable streaming pipelines.

Big Data · Elasticsearch · Failure Handling
0 likes · 9 min read

Data Lake Construction and Practice at NetEase Yanxuan

NetEase Yanxuan replaced its cumbersome data‑warehouse with a flexible Delta‑Lake/Iceberg data lake, creating a unified metadata layer and real‑time ingestion pipelines that cut latency from nightly batches to seconds, slashed compute and storage costs, supported diverse business scenarios and machine‑learning feature engineering, and set the stage for broader future expansion.

Data Integration · Data Lake · Delta Lake
0 likes · 16 min read
Efficient Ops
Mar 29, 2022 · Big Data

How Tencent Cloud Boosted APM Metric Computation Speed 2‑3× with Flink Optimizations

This article explains how Tencent Cloud's APM metric calculation, which transforms massive Span data into aggregated metrics using Flink, faced performance bottlenecks and was optimized through job splitting, batch merging, and dimension pruning, ultimately achieving a 2‑3× speed increase and cutting resource usage to about 30% of the original.

APM · Big Data · Flink
0 likes · 10 min read
58 Tech
Mar 29, 2022 · Big Data

Design and Implementation of the 58 Group Penalty Data Center

This article presents the design, architecture, and implementation of a unified penalty data center for 58 Group, detailing the challenges of heterogeneous data sources, the selection of Flink for real‑time ETL, the use of a DSL and LRU aggregation, and the adoption of MVEL for feature recognition to achieve standardized, high‑performance penalty data processing.

Big Data · ETL · Flink
0 likes · 13 min read
Big Data Technology & Architecture
Mar 28, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents JD's real-time dimension modeling case using Flink SQL, detailing two complex streaming scenarios, the difficulties of handling historical data and state management, and a component‑based solution that leverages external KV stores and optimized Flink operators to improve performance and scalability.

Big Data · Flink · Real-Time
0 likes · 13 min read
StarRocks
Mar 28, 2022 · Backend Development

Scaling Microservice Tracing with Zipkin and StarRocks: A Practical Guide

This article explains how Sohu Smart Media built a high‑performance tracing system for microservices by integrating Zipkin for data collection with StarRocks for storage and analytics, covering architecture, data models, SQL queries, Flink processing, and real‑world results that boost observability and engineering efficiency.

Flink · Microservices · StarRocks
0 likes · 31 min read
Alibaba Cloud Developer
Mar 24, 2022 · Big Data

How Flink Powers Real‑Time Process Operations in China Construction Bank

This article details how China Construction Bank's fintech subsidiary leveraged Apache Flink to ingest, join, and analyze massive front‑end, request, and response logs in real time, overcoming data silos, latency challenges, and state‑management issues to enable end‑to‑end process visibility and operational optimization.

Banking · Flink · process mining
0 likes · 17 min read
DataFunTalk
Mar 24, 2022 · Big Data

Real‑time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents a JD.com BI engineer's case study on applying Flink SQL to real‑time dimension modeling, detailing two complex streaming scenarios, the technical difficulties of handling historical data and performance, and a component‑based solution architecture with future roadmap considerations.

Big Data · Flink · Real-Time
0 likes · 13 min read
StarRocks
Mar 23, 2022 · Databases

Accelerating Zepp Health’s Analytics with StarRocks: An OLAP Case Study

Facing inflexible point‑lookup limits and slow query times on HBase, Zepp Health redesigned its massive event‑tracking data pipeline, migrating ingestion through Kafka, Flink, and Hudi to a StarRocks‑based OLAP layer, and achieved sub‑100 ms average query latency, 20% storage savings, and dramatically faster multi‑dimensional analytics.

Big Data · Flink · Hudi
0 likes · 9 min read
DataFunTalk
Mar 23, 2022 · Big Data

Iceberg Data Lake Query Optimization Practices and Governance

This talk by Tencent senior engineer Chen Liang covers Iceberg table format fundamentals, data lake ingestion, query processing, hidden partitioning, time‑travel, major features, optimization techniques such as compaction, bin‑packing, sorting and Z‑ordering, and outlines a future roadmap for improving performance and governance in big‑data environments.

Big Data · Data Lake · Flink
0 likes · 12 min read
DeWu Technology
Mar 21, 2022 · Big Data

Real-time Customer Service Dashboard: Architecture and Implementation with Flink and ClickHouse

The article describes a real‑time customer‑service dashboard in which Flink consumes MySQL changes delivered through Kafka, cleans and aggregates about 60 operational metrics, and writes them to ClickHouse MergeTree/ReplacingMergeTree tables, enabling sub‑second queries and exactly‑once guarantees while keeping the offline and live pipelines separate.

Dashboard · Flink · clickhouse
0 likes · 18 min read
Alibaba Cloud Developer
Mar 17, 2022 · Big Data

How AutoStream Scales Real‑Time Data Processing with Flink, Iceberg, and PyFlink

This article details AutoStream's evolution from a Java‑only Storm platform to a Flink‑based, Kubernetes‑native streaming system that integrates budgeting controls, automatic scaling, lakehouse architecture with Iceberg, and PyFlink support, highlighting the technical challenges, solutions, and future roadmap for real‑time analytics.

Flink · Iceberg · Lakehouse
0 likes · 23 min read
Yiche Technology
Mar 9, 2022 · Cloud Native

Design and Implementation of the Yunji Logging System Using Flink and ClickHouse

The article presents the Yunji logging system, a Flink+ClickHouse-based cloud-native platform for real-time ingestion, storage, querying, analysis, and monitoring of massive heterogeneous logs, covering its architecture, configuration center, storage design, processing flow, monitoring features, and future enhancements.

Cloud Native · Flink · Janino
0 likes · 21 min read
Alibaba Cloud Developer
Mar 7, 2022 · Big Data

How China Mobile’s Real‑Time Computing Platform Scales Billions of Events with Flink

This article details China Mobile (Suzhou) Software Technology's evolution from Storm to Flink for real‑time computing, its multi‑version engine and log‑retrieval designs, signal‑business data pipeline optimizations, stability practices around ZooKeeper, and future directions in resource scaling and data‑lake integration.

Flink · Kafka · Real-Time
0 likes · 12 min read
dbaplus Community
Mar 2, 2022 · Big Data

How Real‑Time Data Warehouses Power Modern Business: Architecture, Cases, and Best Practices

This article explores the growing demand for real‑time data warehouses, compares them with traditional offline warehouses, and presents detailed architectures, layer designs, naming conventions, and case studies from companies like Didi, Kuaishou, Tencent, and Youzan, highlighting challenges, solutions, and performance optimizations.

Big Data Architecture · Flink · Iceberg
0 likes · 47 min read
DataFunTalk
Mar 1, 2022 · Cloud Native

Alibaba Cloud Native Data Lake with Apache Iceberg: Architecture, Challenges, and Solutions

The presentation outlines Alibaba Cloud's native data lake solution built on Apache Iceberg, covering data lake fundamentals, cloud migration challenges, Iceberg's architecture and features, real‑time ingestion with Flink, unified metadata management, security guarantees, and testing practices to ensure reliable, scalable big‑data analytics.

Apache Iceberg · Big Data · Data Lake
0 likes · 16 min read
DataFunTalk
Feb 25, 2022 · Big Data

Tencent's Application of Apache Iceberg for Real‑Time Data Lake Ingestion, Governance, and Query Optimization

This article explains how Tencent leverages Apache Iceberg together with Flink to build a real‑time data lake pipeline, covering data ingestion, Iceberg's snapshot‑based read/write model, compaction and governance services, Z‑order based query optimization, performance results, and future roadmap.

Apache Iceberg · Big Data · Data Lake
0 likes · 24 min read
dbaplus Community
Feb 23, 2022 · Big Data

Inside OPPO’s Real‑Time Computing Platform: Architecture, Practices, and Future Roadmap

This article details OPPO’s real‑time computing platform, covering its business scope, big‑data architecture built on Flink, Spark and Trino, the end‑to‑end job development lifecycle, SQL IDE features, diagnostic and monitoring mechanisms, link latency tracking, SLA guarantees, practical use cases, and upcoming lakehouse and cloud‑native evolution.

Flink · Real‑Time Computing · big data platform
0 likes · 23 min read
vivo Internet Technology
Feb 23, 2022 · Big Data

Kafka-based Real-Time Data Warehouse: Architecture and Practice for Search

The article explains how Kafka serves as the core of a real‑time data warehouse for search, detailing its advantages over traditional databases, integration with Flink for low‑latency stream processing, architectural patterns such as Lambda/Kappa, scaling challenges, and comprehensive monitoring using Kafka Eagle.

Apache Kafka · Data Integration · Flink
0 likes · 15 min read
Big Data Technology & Architecture
Feb 23, 2022 · Big Data

Understanding Mini‑Batch Streaming Aggregation in Flink SQL

This article explains Flink SQL’s streaming aggregation Mini‑Batch feature, covering its purpose, configuration parameters, underlying optimizer rules, operator implementations, watermark handling, buffer processing, and the optional Local‑Global two‑phase aggregation optimization for improving throughput and reducing state overhead in large‑scale data pipelines.

Big Data · Flink · Mini-Batch
0 likes · 10 min read
Volcano Engine Developer Services
Feb 16, 2022 · Big Data

ByteDance’s Journey to a Unified Data Lake with Flink and Hudi

This article recounts ByteDance’s evolution from batch‑only Flink pipelines to a unified data‑lake integration platform, detailing the three integration modes, challenges with Spark‑based CDC, the decision to adopt Hudi over Iceberg, and how Hudi’s indexing and Merge‑On‑Read formats enable near‑real‑time analytics at massive scale.

CDC · Flink · Hudi
0 likes · 10 min read
Big Data Technology & Architecture
Feb 16, 2022 · Big Data

Using Flink CDC to Capture MySQL Changes and Sync Them to ClickHouse

This article introduces Change Data Capture (CDC), compares query‑based and log‑based approaches, explains Debezium and ClickHouse, and provides detailed Flink CDC and Flink SQL CDC examples—including Java source code, custom deserialization schema, ClickHouse sink implementation, and required Maven dependencies—to synchronize MySQL data into ClickHouse in real time.

Big Data · CDC · Data Streaming
0 likes · 17 min read
Alibaba Cloud Developer
Feb 14, 2022 · Backend Development

How Kuaishou Boosted Flink SQL Performance with Window Extensions and State Optimizations

Kuaishou dramatically increased Flink SQL adoption, introduced Group Window Aggregate and Window TVF extensions, applied aggregation state reuse and mini‑batch techniques, and enhanced stability through data‑skew mitigation and aggregate‑state compatibility, outlining future plans for streaming and batch SQL improvements.

Flink · State Optimization · sql
0 likes · 19 min read
DataFunTalk
Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg · Arctic · Big Data
0 likes · 10 min read
DataFunTalk
Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data · Data Lake · Flink
0 likes · 13 min read
DataFunSummit
Jan 30, 2022 · Big Data

Real‑time Data Warehouse at Meituan: Architecture, Challenges, and Solutions

This article presents Meituan's real‑time data warehouse platform, describing typical streaming use cases, the evolution of its architecture from Storm and Spark Streaming to Flink, the challenges of development, operations and data quality, and the engineering solutions—including unified SQL, web IDE, UDF hosting, pipeline testing, and operator performance optimizations—implemented to support large‑scale, low‑latency analytics.

Flink · platform architecture · real-time data
0 likes · 17 min read
Baidu Geek Talk
Jan 26, 2022 · Big Data

How a Real‑Time CDP Solves Data Silos: Architecture, Tech Choices & Lessons

This article examines the design and implementation of a tenant‑level real‑time Customer Data Platform, detailing CDP fundamentals, business and technical challenges, key architectural components, technology selections such as graph databases, stream processing, storage engines, and the operational practices that enable high‑throughput, low‑latency data integration and analytics.

CDP · Data Integration · Flink
0 likes · 42 min read
HomeTech
Jan 26, 2022 · Operations

Design and Practice of Autohome's Performance Testing Platform PTS

The article details the architecture, key components, testing types, and operational results of Autohome's PTS platform, which uses Docker Swarm, gRPC, JMeter, Flume‑Kafka, and Flink to conduct large‑scale distributed load testing for the 818 event and outlines future improvements toward Kubernetes and direct Kafka logging.

Docker Swarm · Flink · JMeter
0 likes · 8 min read
Architecture Digest
Jan 21, 2022 · Big Data

Building a Real-Time Data Warehouse with Flink: Architecture, Core Concepts, and Practical Implementation

This article explains how to build a unified stream‑batch real‑time data warehouse using Flink SQL, covering prerequisite knowledge, five core concepts, two implementation approaches, a comparison of traditional versus real‑time architectures, and a comprehensive hands‑on example, illustrated with diagrams.

Batch Processing · Data Architecture · Flink
0 likes · 6 min read
StarRocks
Jan 12, 2022 · Big Data

How Flink + StarRocks Deliver Lightning‑Fast Real‑Time Data Warehousing

This article explains the evolution, challenges, and technical solutions for building an end‑to‑end real‑time data warehouse by combining Apache Flink's stream processing with StarRocks' ultra‑fast OLAP engine, covering architecture, data models, integration methods, best‑practice cases, and future roadmap.

Big Data · Flink · OLAP
0 likes · 21 min read