Tagged articles

946 articles

Mar 30, 2023 · Big Data

Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview

Apache Paimon, newly incubated by the Apache Software Foundation, combines Flink's real‑time streaming capabilities with open lakehouse storage formats, offering high‑throughput, low‑latency data ingestion, partial‑update merges, and seamless integration with engines like Flink, Spark, and Trino for unified batch and streaming analytics.

Apache Paimon · Big Data · Data Lake

0 likes · 7 min read

ITPUB

Mar 28, 2023 · Big Data

How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi

This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.

Apache Hudi · Big Data · Flink

0 likes · 11 min read

DataFunTalk

Mar 28, 2023 · Artificial Intelligence

FeatHub: An Open‑Source Feature Store for Real‑Time and Offline Feature Engineering

This article introduces FeatHub, an open‑source feature‑store project from Alibaba Cloud that provides a Python SDK, flexible architecture, and execution engines such as Flink and Spark to simplify the development, deployment, monitoring, and sharing of real‑time and offline machine‑learning features across multi‑cloud environments.

Feature Store · Flink · Python SDK

0 likes · 21 min read

DataFunTalk

Mar 25, 2023 · Artificial Intelligence

ZhongAn Financial Real‑Time Feature Platform: MLOps Practices, Architecture and Anti‑Fraud Applications

This article presents ZhongAn Financial’s end‑to‑end MLOps workflow and real‑time feature platform architecture, detailing team roles, data pipelines, Flink‑based processing, TableStore storage, anti‑fraud feature design, and answers to common implementation questions, offering a comprehensive guide for building scalable, low‑latency ML services in finance.

Flink · MLOps · Tablestore

0 likes · 25 min read

DeWu Technology

Mar 22, 2023 · Big Data

Analysis of Flink Scheduling Components and Slot Allocation

The article explains Flink’s post‑submission scheduling pipeline—from Dispatcher creating SchedulerNG and building the ExecutionGraph, through pipelined region construction and the PipelinedRegionSchedulingStrategy, to slot sharing allocation—identifying why slot and TaskManager overloads occur and proposing randomization or fine‑grained resource strategies to balance load.

DistributedSystems · ExecutionGraph · Flink

0 likes · 14 min read

ITPUB

Mar 13, 2023 · Big Data

What’s New in Apache Kyuubi 1.6.0? Server, Client, and Engine Enhancements

Apache Kyuubi 1.6.0 introduces major server‑side upgrades such as batch JAR task submission with RESTful APIs and a metadata store for HA, client‑side improvements including a unified JDBC driver and enhanced Beeline, plus mature Spark, Flink, Trino, and Hive engine plugins, while outlining the community’s roadmap.

Big Data · Engine Plugins · Flink

0 likes · 13 min read

DataFunTalk

Mar 12, 2023 · Big Data

Apache Kyuubi 1.6.0 Feature Overview and Enhancements

The article provides a comprehensive walkthrough of Apache Kyuubi 1.6.0, detailing server‑side enhancements such as batch (JAR) task submission, metadata store and unified API/authentication, client‑side improvements to the built‑in JDBC driver and Beeline, as well as engine plugins for Spark, Flink, Trino and Hive, and concludes with the community’s roadmap and statistics.

Apache Kyuubi · Batch Processing · Big Data

0 likes · 12 min read

Big Data Technology & Architecture

Mar 9, 2023 · Big Data

Implementing Exactly-Once Semantics with Flink and Kafka: Utility Classes, Character Count Example, and Transactional Consumer

This article demonstrates how to achieve exactly‑once processing in Flink by providing Kafka I/O utility classes, a character‑count streaming example, and a transactional consumer implementation, while also discussing configuration nuances and common pitfalls.

Big Data · Exactly-Once · Flink

0 likes · 11 min read

DataFunTalk

Mar 9, 2023 · Big Data

Real‑Time Data Platform Architecture and Cloud‑Native Flink Migration at Manbang

This article presents a comprehensive case study of Manbang's real‑time data platform, detailing its business background, cloud‑native Flink + Hologres architecture, migration from self‑built clusters, real‑time product features, decision‑making workflows, and future roadmap, highlighting performance and cost benefits.

Flink · Logistics · Streaming

0 likes · 16 min read

dbaplus Community

Mar 7, 2023 · Operations

How We Rescued a ClickHouse Logging Cluster After Zookeeper‑Induced Read‑Only Failure

Kafka backlog alerts revealed that a production logging system had become unavailable, prompting an investigation that uncovered read‑only ClickHouse tables caused by mismatched Zookeeper metadata after a TTL policy change, leading to a step‑by‑step recovery involving Zookeeper restarts, metadata fixes, and table reconstruction.

Cluster Recovery · Flink · Kafka

0 likes · 9 min read

Big Data Technology & Architecture

Mar 7, 2023 · Big Data

Implementing Exactly-Once Kafka-to-Redis with Flink: Two-Phase Commit Sink and Bug Fixes

This tutorial explains how to achieve exactly‑once semantics when streaming data from Kafka to Redis using Apache Flink's TwoPhaseCommitSinkFunction, covering Redis transaction basics, utility classes, sink implementation, testing steps, and solutions to common connection and transaction bugs.

Big Data · Exactly-Once · Flink

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Mar 3, 2023 · Big Data

How Alibaba Cloud EMR Evolved from Open‑Source Compatibility to Enterprise‑Grade Performance

This article outlines Alibaba Cloud EMR's three‑stage evolution—compatibility, contribution, and beyond open source—detailing its early Hadoop adoption, Flink and Spark innovations, cloud‑native optimizations, and enterprise‑grade features such as Remote Shuffle Service, performance benchmarks, and integrated diagnostics.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 13 min read

DataFunTalk

Mar 1, 2023 · Databases

Evolution and Optimization of Tencent Music Content Library Data Platform: From Architecture 1.0 to 4.0

This article details the evolution of Tencent Music's content library data platform from version 1.0 to 4.0, describing business requirements, architectural redesigns—including migration from ClickHouse to Apache Doris, introduction of a semantic layer, and extensive write, query, and cost optimizations—while sharing practical lessons and future directions.

Apache Doris · Big Data · Flink

0 likes · 21 min read

Alibaba Cloud Big Data AI Platform

Mar 1, 2023 · Big Data

How We Built a Scalable Real‑Time Data Architecture for a Complex Supply Chain

This article describes the challenges of a highly complex supply‑chain system, the evolution from early MySQL‑based reporting to a modern real‑time data platform using Flink, Kafka, ClickHouse, Hologres and other cloud services, and the tools and lessons learned to achieve low‑latency, high‑throughput analytics.

Flink · Kafka · Streaming

0 likes · 11 min read

DataFunSummit

Feb 28, 2023 · Big Data

Iceberg Technology Overview and Its Application at Xiaomi: Practices, Stream‑Batch Integration, and Future Plans

This article introduces the Iceberg table format, explains its core architecture and advantages such as transactionality, implicit partitioning and row‑level updates, details Xiaomi's practical deployments—including CDC pipelines, partition strategies, compaction services, and stream‑batch integration—and outlines future development directions.

Data Lake · Flink · Iceberg

0 likes · 20 min read

Volcano Engine Developer Services

Feb 28, 2023 · Big Data

How to Migrate State in Flink SQL Jobs with DAG Visualization and UID Mapping

This article explains why preserving state during Flink SQL job iterations is crucial, analyzes the challenges caused by DAG changes and serializer incompatibility, and presents a visual‑preview, UID editing, and automatic mapping solution to enable reliable state migration for streaming workloads.

DAG Visualization · Flink · OperatorID

0 likes · 14 min read

Big Data Technology & Architecture

Feb 28, 2023 · Big Data

Comprehensive Guide to Dual‑Stream Join in Flink CDC with Java DataStream API

This article provides a detailed tutorial on implementing various dual‑stream join techniques—including processing‑time, event‑time, and interval joins—using Flink CDC 2.2 and Flink 1.14 with the Java DataStream API, complete with code examples, SQL setup, and execution results.

Big Data · CDC · DataStream

0 likes · 31 min read

macrozheng

Feb 28, 2023 · Big Data

How Tencent Music Scaled Its Content Data Platform with Apache Doris: From ClickHouse to 4.0 Architecture

This article details the evolution of Tencent Music's content data platform from version 1.0 to 4.0, describing the migration from ClickHouse to Apache Doris, the introduction of a semantic layer, optimization of data ingestion, query performance, and cost reduction strategies that dramatically improved data timeliness, operational efficiency, and storage costs.

Apache Doris · Big Data · Data Architecture

0 likes · 23 min read

DeWu Technology

Feb 24, 2023 · Big Data

Real-Time Data Architecture Evolution for a Complex Supply Chain

The article traces Dewu’s supply‑chain data platform from slow MySQL reporting through early CDC‑based wide tables to a Flink‑Kafka‑ClickHouse 1.0 design, then to a more scalable Flink‑Kafka‑Hologres 2.0 architecture that solves upsert and compute‑storage separation, while detailing key operational tricks, code‑generation tools, and future plans for lake‑house integration.

Big Data · Flink · Hologres

0 likes · 10 min read

Big Data Technology & Architecture

Feb 24, 2023 · Big Data

Common Flink Task Submission Issues and Solutions on YARN

This article compiles frequent Flink job submission problems on YARN, including WordCount jar errors, HBase dependency conflicts, MySQL timeouts, checkpoint restoration failures, parallelism limits, and unexpected container termination, and provides root‑cause analysis and step‑by‑step remediation instructions.

Big Data · Checkpoint · Flink

0 likes · 21 min read

Architects Research Society

Feb 21, 2023 · Big Data

Comparing Apache Spark and Apache Flink: Origins, Architecture, and Processing Models

This article examines the evolution, architectural differences, data and processing models, stateful handling, and programming APIs of Apache Spark and Apache Flink, highlighting their strengths, limitations, and the challenges of big‑data development and operations in the modern data‑driven era.

Batch Processing · Big Data · Data Engine

0 likes · 18 min read

dbaplus Community

Feb 15, 2023 · Big Data

How Bilibili Scaled User Behavior Analytics with ClickHouse, Flink, and Iceberg

This article details Bilibili's PolarStar (北极星) user behavior analysis platform, tracing its evolution from early Spark‑Jar models to Flink‑ClickHouse pipelines and Iceberg‑based full aggregation, and explains the technical solutions for event, retention, funnel, path analysis, data ingestion, cluster rebalancing, and performance optimizations that enable massive real‑time analytics on billions of daily events.

Flink · Iceberg · Real-time Processing

0 likes · 32 min read

Big Data Technology & Architecture

Feb 15, 2023 · Big Data

Flink Multi-Stream Union Operations and Event-Time Sorting

This article explains how to use Flink's DataStream.union() to combine multiple streams of the same type, demonstrates Maven project setup and code examples for simple unions and for unions with custom event-time sorting, and shows the resulting ordered output.

Big Data · DataStream · EventTime

0 likes · 15 min read

DataFunSummit

Feb 14, 2023 · Big Data

Real-time Multi-dimensional Analytics at ZhongAn: Practices, Challenges, and Technology Choices

This article presents ZhongAn Insurance's experience building real-time multi-dimensional analytics, covering application scenarios, technical challenges, the evolution of their architecture from offline to Flink‑ClickHouse and finally to StarRocks, and the principles guiding their technology selection.

Flink · OLAP · Real-time analytics

0 likes · 26 min read

Big Data Technology & Architecture

Feb 10, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook

This article presents a curated collection of big‑data learning resources, including interview guides, in‑depth articles on Flink, Spark, Hive, ClickHouse, data governance, and personal growth, offering readers a one‑stop reference to boost their big‑data expertise and interview readiness.

Big Data · Data Governance · Flink

0 likes · 5 min read

Big Data Technology & Architecture

Feb 9, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook and Resource Collection

This article presents a curated collection of the most comprehensive big‑data interview preparation resources, including expert guides, tutorials, and deep‑dive articles on Flink, Spark, Hive, ClickHouse, data governance, and related topics, accompanied by a call to engage with the content.

Big Data · Data Governance · Flink

0 likes · 4 min read

Big Data Technology & Architecture

Feb 8, 2023 · Big Data

Enabling Early‑Fire Window Computation in Flink SQL for Real‑Time Metrics

This article explains how to configure Flink SQL to emit early‑fire results for tumbling windows, allowing real‑time aggregation of metrics like PV and UV, and provides complete example code, execution output, and a discussion of current limitations.

Early Fire · Flink · Kafka

0 likes · 10 min read

ITPUB

Feb 7, 2023 · Big Data

How Kuaigou Built a Scalable Real‑Time Data Warehouse with Spark, Flink, and Cloud

Facing massive, multi‑source traffic and the need for instant analytics, Kuaigou’s real‑time data warehouse evolved from Spark on‑premise to a cloud‑native stack using Alibaba Blink, Flink, and layered OLAP models, streamlining development, cutting costs, and enabling diverse real‑time applications.

Flink · OLAP · Spark

0 likes · 11 min read

Big Data Technology & Architecture

Feb 6, 2023 · Big Data

Real-Time Data Warehouse Solutions with Hudi: Scenarios, Challenges, and Optimizations

This article presents an in‑depth overview of real‑time data‑warehouse scenarios, discusses challenges such as timeliness, update efficiency, and resource consumption, and details practical solutions using Apache Hudi, Flink, Presto, and related optimizations for ingestion, indexing, compaction, and query performance.

Big Data · Data Lake · Flink

0 likes · 17 min read

dbaplus Community

Jan 31, 2023 · Big Data

Building ByteDance’s Real‑Time Data Warehouse with Hudi: Architecture & Solutions

This article explains how ByteDance designed and deployed a real‑time data warehouse on a data lake using Hudi, detailing three business scenarios, the challenges of latency, consistency and resource usage, and the engineering solutions—including upserts, compaction services, indexing, and future unified storage plans.

Data Lake · Flink · Hudi

0 likes · 14 min read

Bilibili Tech

Jan 31, 2023 · Big Data

Design and Optimization of Real-Time Data Quality Control (DQC) Platform on Bilibili's Big Data System

Bilibili redesigned its real-time data-quality control platform by replacing per-rule Flink jobs with a unified, dynamically-configured architecture that classifies Kafka topics, aggregates via InfluxDB full-table and continuous queries, mitigates data inflation, adds a high-performance proxy, and implements robust monitoring and recovery to ensure scalable, reliable data quality for its big-data services.

Big Data · DQC · Flink

0 likes · 22 min read

Big Data Technology & Architecture

Jan 29, 2023 · Big Data

Understanding Retract Streams in Apache Flink: Aggregation and Sink Operators

This article explains the concept of retract streams in Apache Flink, detailing how non‑retract Kafka sources and Group‑By aggregations generate delete/insert messages, provides code examples for aggregation and sink operators, and compares retract mechanisms across aggregation and CDC sink scenarios.

CDC · Flink · Kafka

0 likes · 15 min read

DataFunSummit

Jan 27, 2023 · Databases

StarRocks in Youzu's Multi-Dimensional Analytics: Architecture, Advantages, and Future Plans

This article presents Youzu Network’s adoption of StarRocks for multi-dimensional analytics, detailing the historical OLAP challenges, StarRocks’ features and advantages, its application scenarios, data modeling choices, ingestion methods, performance benchmarks, and future roadmap for unified analytics.

Big Data · Flink · Kafka

0 likes · 18 min read

ITPUB

Jan 26, 2023 · Big Data

How NetEase’s Arctic Unifies Streaming and Batch with Iceberg for Real‑Time Lakehouse

This article explains the challenges of a Lambda‑architecture data pipeline, introduces NetEase’s Arctic lakehouse built on Apache Iceberg, details its table‑store design, optimization cycles, consistency mechanisms, real‑time features, practical use cases, and future roadmap, highlighting its advantages over similar solutions.

Arctic · Data Integration · Flink

0 likes · 14 min read

ITPUB

Jan 22, 2023 · Big Data

How Flink Table Store Powers Real‑Time Financial Data Warehousing

This article details a banking‑focused real‑time data‑warehouse solution that leverages Flink Table Store to handle both incremental fact‑table updates and full‑table dimension calculations, compares three traditional approaches, and explains data ingestion, query modes, export options, and future streaming‑warehouse directions.

Banking · ELT · Flink

0 likes · 20 min read

Sohu Tech Products

Jan 18, 2023 · Big Data

Root Cause Analysis of Flink TaskManager Failover Causing Data Reprocessing and Business Impact

An incident report details how a scheduled machine reboot on Alibaba Cloud triggered a Flink TaskManager failover, leading to excessive data replay, increased ES pressure, and significant business latency, and explains the root cause involving disabled checkpoints and timestamp‑based offset consumption.

Checkpoint · Flink · Kafka

0 likes · 10 min read

Alibaba Cloud Native

Jan 12, 2023 · Cloud Native

How to Prevent Flink Job Restarts by Managing ZooKeeper zxid Overflow and Leader Election

This article explains the cause of unexpected Flink job restarts caused by ZooKeeper zxid overflow, details how the zxid works, why overflow forces a new leader election, and presents practical risk‑management and alerting solutions to avoid business loss.

Flink · ZooKeeper · leader election

0 likes · 6 min read

Ctrip Technology

Jan 12, 2023 · Big Data

Real-Time Data Warehouse Architecture and Practice at Ctrip Hotel

The article explains why enterprises need real-time data warehouses, compares Lambda and Kappa architectures, describes Ctrip Hotel's Lambda‑plus‑OLAP variant built with Flink and StarRocks, and details practical solutions for ordering, wide‑table generation, and data validation that enable billion‑row, low‑latency analytics.

Ctrip · Flink · Lambda architecture

0 likes · 10 min read

Alimama Tech

Jan 11, 2023 · Big Data

Dolphin Streaming: Real-Time SQL-Based Data Development Platform for Alibaba Advertising

Dolphin Streaming provides Alibaba’s advertising merchants with a DB‑like, SQL‑driven real‑time data platform built on Flink that abstracts storage and compute, enabling non‑engineers to develop, debug, and deploy streaming feature jobs quickly, boosting query volume, QPS, and revenue.

Dolphin Streaming · Flink · Real-time Streaming

0 likes · 13 min read

DataFunSummit

Jan 10, 2023 · Big Data

Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans

This article presents a comprehensive overview of Iceberg's adoption in Huawei Terminal Cloud, covering its architectural overview, key features such as Git‑style data management, real‑time processing, acceleration layers, and future development directions, along with a Q&A session addressing performance and implementation details.

Big Data · Data Lake · Flink

0 likes · 15 min read

Bilibili Tech

Jan 10, 2023 · Big Data

Technical Evolution of Bilibili's PolarStar User Behavior Analysis Platform

Bilibili’s PolarStar platform evolved from Spark‑based batch jobs to a Flink‑driven real‑time pipeline and finally to a unified Iceberg‑on‑ClickHouse model, cutting query latency to seconds, saving thousands of CPU cores and hundreds of gigabytes of Redis memory while enabling complex, near‑real‑time user‑behavior analyses and scalable data‑import, rebalancing, and compression optimizations.

Flink · Iceberg · clickhouse

0 likes · 30 min read

Alibaba Cloud Big Data AI Platform

Jan 10, 2023 · Big Data

How Alibaba’s Dolphin Engine Uses Flink + Hologres for Real‑Time Big Data

The Dolphin engine, built by Alibaba’s Data Engine team, combines Flink and Hologres to deliver ultra‑large‑scale OLAP, streaming, batch, and AI capabilities for real‑time advertising analytics, offering smart materialization, intelligent indexing, and vector recall while supporting millions of advertisers and petabyte‑level data.

Big Data · Flink · Hologres

0 likes · 13 min read

DataFunTalk

Jan 6, 2023 · Big Data

ZhongAn's Hundred‑Billion‑Scale Data Integration Service: Architecture, Business Support, and Evolution

This article presents the architecture and practical experience of ZhongAn's hundred‑billion‑scale data integration service, covering common integration technologies, business support scenarios for offline and real‑time data, technical challenges, evolution from single‑machine to service‑oriented designs, and future directions using Flink and DataX.

Data Integration · Data Platform · DataX

0 likes · 31 min read

Big Data Technology & Architecture

Jan 3, 2023 · Big Data

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.

Batch Processing · Big Data · Flink

0 likes · 19 min read

DataFunTalk

Jan 1, 2023 · Big Data

Zhihu's Real-Time Computing Platform: From Skytree 1.0 to Mipha 2.0

Zhihu’s real‑time computing platform, initially built as Skytree 1.0 on Kubernetes and later re‑engineered as Mipha 2.0 with Flink SQL, unified metadata management, dynamic jar loading, UDF support, Protobuf format, CDC integration, and extensive operational optimizations, now processes petabyte‑scale data with high reliability.

Flink · Kubernetes · Real‑Time Computing

0 likes · 21 min read

DataFunSummit

Dec 31, 2022 · Big Data

The Evolution of Data Platforms: From Early Computing to the Modern Big Data Stack

This article reviews the history of data platforms—from the first general‑purpose computers and early relational databases through traditional BI, agile BI, and big‑data technologies like Hadoop, Spark, and Flink, up to today’s cloud‑native modern data stack and its future outlook.

Big Data · Data Platform · Flink

0 likes · 26 min read

Alibaba Cloud Big Data AI Platform

Dec 30, 2022 · Big Data

How Manbang Built a Cloud‑Native Real‑Time Data Platform with Flink & Hologres

Manbang's logistics platform leverages a cloud‑native architecture built on Alibaba Cloud Flink and Hologres to deliver minute‑level real‑time data, feature computation, and decision‑making that dramatically improves SLA, reduces operational costs, and powers intelligent driver‑cargo matching across the ecosystem.

Flink · Hologres · Logistics

0 likes · 16 min read

Big Data Technology & Architecture

Dec 28, 2022 · Big Data

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store

Flink 1.16 introduces adaptive batch scheduling, speculative execution, hybrid shuffle, dynamic partition pruning, improved Hive SQL compatibility, advanced checkpoint mechanisms including changelog backend, and integrates CDC with Kafka and Table Store, offering faster, more stable, and easier-to-use stream‑batch processing capabilities.

Big Data · CDC · Checkpoint

0 likes · 8 min read

DataFunTalk

Dec 27, 2022 · Big Data

Multi‑Stream Join and Concurrency Control in Apache Hudi: Design, Implementation, and Usage

This article presents a comprehensive solution for multi‑stream joins in Apache Hudi, detailing the challenges of dimension and multi‑stream joins, the novel storage‑layer join approach, timeline‑based concurrency control, marker mechanisms, early conflict detection, payload customization, and practical usage with Flink and Spark, along with performance benefits and future directions.

Apache Hudi · Data Lake · Flink

0 likes · 31 min read

Tencent Advertising Technology

Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big Data · Data Lake · Flink

0 likes · 20 min read

Data Thinking Notes

Dec 23, 2022 · Big Data

How Real-Time Data Warehouses Power Modern Business: Architecture, Cases, and Best Practices

This article explains why real‑time data warehouses are becoming essential, outlines their goals, compares them with traditional offline warehouses, and presents detailed design patterns, naming conventions, and case studies from Didi, Kuaishou, Tencent, Youzan and other enterprises, highlighting challenges and solutions for streaming, storage, and query layers.

Big Data Architecture · Data Lake · ETL

0 likes · 49 min read

DataFunTalk

Dec 23, 2022 · Big Data

Building a Lakehouse on Alibaba Cloud AnalyticDB (ADB) with Apache Hudi: Architecture, Challenges, and Practices

This article presents a comprehensive technical overview of Alibaba Cloud AnalyticDB's Lakehouse edition, detailing its unified architecture, key advantages, the challenges of ingesting billions of records with Apache Hudi, and the engineering solutions—including Flink integration, hotspot mitigation, memory optimization, OSS throttling handling, concurrent write support, lifecycle management, and TableService—that enable a cost‑effective, high‑performance lake‑to‑warehouse platform.

Apache Hudi · Flink · Lakehouse

0 likes · 19 min read

ITPUB

Dec 21, 2022 · Big Data

How Bilibili Optimized Flink Runtime for Massive Real‑Time Jobs

This article details Bilibili's extensive enhancements to the Flink runtime—including checkpoint recoverability, max‑parallelism calculations, State Processor API extensions, Full and Regional Checkpoints, hybrid HA, task‑level recovery, load‑balanced partitioners, and large‑scale cluster maintenance—to improve reliability and performance of its billion‑scale streaming workloads.

Big Data · Checkpoint · Flink

0 likes · 33 min read

DataFunTalk

Dec 20, 2022 · Big Data

ByteDance's Practices for Tracking Data Governance and Pipeline Management

This article explains ByteDance's end‑to‑end tracking data lifecycle management, including pre‑report validation, the rationale for using BMQ over Kafka, quality governance examples, and how Flink‑based pipelines ensure data accuracy through SLA monitoring and checkpoint strategies.

Data Governance · Data Tracking · Flink

0 likes · 5 min read

Big Data Technology & Architecture

Dec 19, 2022 · Big Data

Near Real-Time Data Lake Practices in TikTok E-commerce: Architecture, Techniques, and Case Studies

This article presents a comprehensive overview of TikTok e-commerce's near‑real‑time data lake implementation, detailing data lake characteristics, architecture choices, practical use cases across analysis and operations, and future challenges and plans.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

ITPUB

Dec 18, 2022 · Big Data

How to Build a Real‑Time Data Warehouse with EasyData: A Step‑by‑Step Guide

Learn how to design and implement a real‑time data warehouse for an app’s AB‑test monitoring using EasyData, covering data flow layers, CDC task creation, stream table registration, Flink SQL processing, and BI reporting, with detailed steps, code snippets, and practical tips.

CDC · EasyData · Flink

0 likes · 13 min read

Big Data Technology & Architecture

Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Batch Processing · Big Data · Data Lake

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Dec 9, 2022 · Operations

How Alibaba’s Flink Cluster Inspector Eliminates Hotspot Machines in Real‑Time Streaming

This article details Alibaba Cloud's Flink Cluster Inspector, explaining the business challenges of hotspot machines, the analysis of resource over‑use, and the four‑stage solution—pre‑profiling, in‑process self‑healing, post‑recovery, and observability—that reduces latency, cuts costs, and improves operational efficiency.

Cluster · Flink · HotSpot

0 likes · 19 min read

DataFunTalk

Dec 8, 2022 · Big Data

Arctic: NetEase’s Real-Time Lakehouse System Built on Apache Iceberg

This article introduces NetEase’s Arctic, a real‑time lakehouse system built on Apache Iceberg that unifies streaming and batch processing, explains the challenges of Lambda architecture, details Arctic’s features such as change/base stores, hidden queue, transaction handling, and shares internal practice cases and future roadmap.

Apache Iceberg · Arctic · Data Lake

0 likes · 12 min read

政采云技术

Dec 8, 2022 · Big Data

Understanding Flink's Asynchronous Barrier Snapshotting (ABS) Checkpoint Algorithm

This article explains the Asynchronous Barrier Snapshotting algorithm used by Apache Flink for checkpointing, detailing its origins from the Chandy‑Lamport algorithm, its operation in both acyclic and cyclic dataflow graphs, barrier alignment, and the fault‑recovery process.

Asynchronous Barrier Snapshotting · Checkpoint · Distributed Systems

0 likes · 10 min read

DataFunSummit

Dec 2, 2022 · Big Data

BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities

BitSail, ByteDance’s open‑source data integration engine, unifies batch, streaming, and incremental data synchronization across heterogeneous sources, detailing its evolution from early Flink‑based prototypes to a mature, plugin‑driven architecture with multi‑engine support, low‑cost co‑development, and robust CDC lakehouse capabilities.

Big Data · CDC · Flink

0 likes · 19 min read

Tencent Cloud Developer

Dec 2, 2022 · Big Data

Design and Implementation of a Hundred‑Billion‑Scale Real‑Time Monitoring System

The paper presents the design and deployment of a hundred‑billion‑scale real‑time monitoring platform that meets stringent data‑collection, analysis, storage, alerting and visualization requirements, compares Oceanus + Elastic Stack against a Zabbix‑Prometheus‑Grafana stack, selects the former, and details the performance and cost optimizations that enable massive, low‑latency monitoring while maintaining high availability.

Elasticsearch · Flink · Oceanus

0 likes · 20 min read

Bilibili Tech

Nov 29, 2022 · Big Data

How Bilibili Supercharged Flink: Checkpoint, HA, and Runtime Optimizations

This article details Bilibili's extensive enhancements to Flink's runtime—including checkpoint recoverability, operator ID stability, state processor extensions, hybrid high‑availability, regional checkpointing, and load‑based channel selection—to improve scalability, reliability, and operational efficiency of large‑scale streaming jobs.

Big Data · Checkpoint · Flink

0 likes · 32 min read

DaTaobao Tech

Nov 23, 2022 · Big Data

Real-time Log Aggregation and Monitoring with Blink (Flink) on Mobile Endpoints

The article explains how Blink, Alibaba’s optimized Flink variant, uses dynamic tables and streaming‑SQL to ingest mobile telemetry via source tables, compute per‑minute metrics such as API success rates with tumbling windows, and write results to Alibaba Cloud Log Service, enabling real‑time dashboards and extensible use cases like fraud detection.

Flink · Real-time Streaming · blink

0 likes · 10 min read

21CTO

Nov 20, 2022 · Big Data

How Meituan’s Logan Real‑Time Log System Boosts Debugging Across Mobile, Web, and IoT

This article details the design, architecture, and implementation of Meituan's Logan real‑time logging platform, covering its workflow, multi‑terminal collection SDK, ingestion, Flink‑based processing, consumption layers, stability measures, and future roadmap, illustrating how it improves fault diagnosis and system reliability.

Elasticsearch · Flink · Kafka

0 likes · 18 min read

ITPUB

Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch Processing · Big Data · Data Lake

0 likes · 23 min read

Liulishuo Tech Team

Nov 17, 2022 · Big Data

Real‑time Data Warehouse Architecture and Technical Solution at Liulishuo

This article describes Liulishuo's migration to a Flink‑based real‑time data warehouse, covering background, benefits, technology selection (storage, Flink platform, dimension table connectors), overall architecture, concrete Hudi and Elasticsearch ingestion examples, processing SQL, and future outlook for unified batch‑streaming storage.

Elasticsearch · Flink · Hudi

0 likes · 15 min read

DataFunTalk

Nov 15, 2022 · Artificial Intelligence

Flink ML: Iterative Execution Engine, Design, API, and Efficient Algorithm Library

This article introduces Flink ML, a DataStream‑based iterative engine and machine‑learning algorithm library, covering its overview, iterative execution engine design and API, performance comparisons with Spark ML, online logistic regression and K‑Means demos, and future development roadmap.

Flink · Iterative Engine · KMeans

0 likes · 22 min read

DataFunTalk

Nov 13, 2022 · Big Data

Iceberg Data Lake: Technology Overview, Xiaomi Practices, and Stream‑Batch Integration

This article presents an overview of the Iceberg table format, its core architecture and advantages, details Xiaomi’s large‑scale deployment and use cases, explores stream‑batch integration with Spark and Flink, outlines data correction methods, future plans, and answers common technical questions.

Data Lake · Flink · Iceberg

0 likes · 20 min read

AsiaInfo Technology: New Tech Exploration

Nov 11, 2022 · Industry Insights

How Real-Time Data Middle Platforms are Transforming the Telecom Industry

This article analyzes why telecom operators need a real‑time data middle platform, outlines its layered architecture and model design, examines the shift from Lambda to Kappa and lakehouse approaches, and highlights how these innovations enable faster, scenario‑driven insights and competitive advantage.

Big Data Architecture · Data Middle Platform · Flink

0 likes · 15 min read

DataFunTalk

Nov 9, 2022 · Artificial Intelligence

Design and Usage of Flink ML Java and Python APIs, Ecosystem Construction, and Future Directions

This article introduces the Flink Machine Learning Library, detailing the design and usage of its Java and Python APIs, core interfaces such as WithParams, Stage, Estimator, and AlgoOperator, workflow for training and inference, pipeline/graph construction, ecosystem initiatives, and upcoming development plans.

Flink · Java API · Python API

0 likes · 12 min read

High Availability Architecture

Nov 7, 2022 · Backend Development

Design and Implementation of Meituan's Logan Real-Time Log System

This article describes how Meituan built Logan, a high‑performance, end‑to‑end real‑time logging platform for mobile, web, mini‑programs and IoT, covering its background, architecture, data collection, processing, consumption, monitoring, deployment strategies, achieved results and future roadmap.

Backend Architecture · Elasticsearch · Flink

0 likes · 15 min read

DataFunTalk

Nov 6, 2022 · Big Data

BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities

BitSail, an open‑source data integration engine from ByteDance, provides a unified solution for batch, streaming, full‑load, and incremental data synchronization across heterogeneous sources, detailing its background, technical evolution, architecture, low‑cost co‑building features, compatibility strategies, and future roadmap.

CDC · Data Integration · Flink

0 likes · 18 min read

Alibaba Cloud Big Data AI Platform

Nov 5, 2022 · Big Data

How Alibaba’s Open‑Source Big Data Ecosystem Is Accelerating Like Moore’s Law

At the Yunqi Conference summit, Alibaba’s open‑source big data team reviewed 13 years of development, highlighted cloud‑native, real‑time, data‑lake and AI trends, and unveiled a new “Moore’s Law”‑style acceleration in open‑source big data technologies.

Flink

0 likes · 7 min read

Meituan Technology Team

Nov 3, 2022 · Backend Development

Design and Implementation of Logan Real-Time Log System at Meituan

The article details Meituan’s end‑to‑end design and implementation of Logan, a high‑performance real‑time logging service for mobile apps, web, mini‑programs and IoT, covering background challenges, architecture layers, technology choices such as Flink and Elasticsearch, stability measures, deployment practices, achieved results and future plans.

Blue‑Green deployment · Elasticsearch · Flink

0 likes · 21 min read

ByteDance Data Platform

Oct 28, 2022 · Big Data

How ByteDance’s BitSail is Revolutionizing Data Integration at Scale

BitSail, ByteDance’s open‑source data integration engine built on Flink, has evolved through three major versions to support batch, streaming and CDC modes, handling over 200,000 daily tasks across 20+ data sources, and aims to meet real‑time, cloud‑native integration demands.

Cloud Native · Data Integration · Flink

0 likes · 14 min read

DataFunSummit

Oct 21, 2022 · Big Data

Exploring Real‑Time Data Lake Practices at Xiaohongshu Using Apache Iceberg

This article details Xiaohongshu's data platform architecture and three real‑time lake initiatives—log ingestion, CDC ingestion, and lake analysis—showcasing how Apache Iceberg, Flink, and custom shuffling algorithms solve small‑file and cross‑cloud challenges while enabling schema evolution and future multi‑cloud optimizations.

Apache Iceberg · Big Data · CDC

0 likes · 16 min read

DataFunTalk

Oct 19, 2022 · Big Data

Understanding Flink Table Store: Design, Usage, and Roadmap

Flink Table Store, an Apache Flink subproject, provides a unified stream‑batch storage layer with SQL‑based table APIs, addressing real‑time and offline data needs, detailing its design goals, usage patterns, architectural layers, implementation choices, and upcoming roadmap.

Flink · LSM‑Tree · Streaming

0 likes · 14 min read

DataFunSummit

Oct 18, 2022 · Big Data

Feature Overview of Apache Kyuubi (Incubating) v1.5.0

The article presents a detailed technical walkthrough of Apache Kyuubi 1.5.0, covering its service‑oriented architecture, high‑availability design, multi‑engine extensions for Spark, Flink, Trino and Hive, enhanced engine‑sharing policies, POOL mode configuration, and the project’s future roadmap.

Apache Kyuubi · Big Data · Engine Architecture

0 likes · 13 min read

Liulishuo Tech Team

Oct 18, 2022 · Big Data

How to Build a Near‑Real‑Time Metric Management System with Flink, Kafka, and Trino

This article outlines the design and implementation of a near‑real‑time metric management platform at Liulishuo, detailing its data flow—from Kafka ingestion through Flink‑SQL processing into Hudi tables, Trino querying, metric configuration, lineage, visualization, alerting, scheduling, and future optimization plans.

Flink · Hudi · Kafka

0 likes · 7 min read

DataFunTalk

Oct 17, 2022 · Big Data

Thoughts and Practices on ByteDance Streaming Data Warehouse and Real‑Time Service Analysis

The article presents ByteDance's challenges with massive real‑time data processing and describes how they integrated a streaming data warehouse with Flink Table Store, cloud‑native architecture, and real‑time service analysis to achieve low‑latency, high‑throughput analytics and end‑to‑end consistency.

Flink · Real-time analytics · Streaming

0 likes · 13 min read

ITPUB

Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache Hudi · Big Data · Data Lake

0 likes · 21 min read

Xingsheng Youxuan Technology Community

Oct 14, 2022 · Big Data

How a Leading E‑commerce Platform Built a Scalable Data Warehouse with Lambda & Hudi

This article explains how an e‑commerce company designed and implemented a modern data warehouse—combining batch Spark jobs, real‑time Flink streams, and Hudi data‑lake storage—to handle terabytes of daily logs, ensure data quality, and provide fast, reliable analytics for business decision‑making.

Data Lake · ETL · Flink

0 likes · 16 min read

DataFunTalk

Oct 14, 2022 · Big Data

Exploring Flink and Apache Hudi for Streaming Data Lakes: Design, Practices, and Roadmap

This article presents a comprehensive overview of using Flink with Apache Hudi to build streaming data lake solutions, covering Hudi's background, core features, Flink‑Hudi integration design, practical use cases, recent roadmap updates, and a Q&A session.

Apache Hudi · Data Lake · Flink

0 likes · 19 min read

Shopee Tech Team

Oct 13, 2022 · Big Data

Improving Flink Unaligned Checkpoint: Problems, Principles, Optimizations, and Production Practices at Shopee

Shopee tackled frequent Flink checkpoint failures caused by back‑pressure by adopting and extending the community’s Unaligned Checkpoint mechanism—adding overdraft buffers, improving legacy sources, introducing an aligned‑checkpoint timeout, enabling output‑buffer switching, merging small HDFS files, and fixing network‑buffer deadlocks—now running hundreds of jobs with stable UC deployment and plans to enable it universally.

Big Data · Checkpoint Optimization · Flink

0 likes · 18 min read

Alibaba Cloud Developer

Oct 13, 2022 · Big Data

Ensuring Correctness in Stream Computing: Data Integrity Challenges and Engine Solutions

This article explores how stream computing systems achieve correct results by addressing data integrity, distinguishing consistency from correctness, formalizing integrity inference, and comparing implementations across major engines such as Flink, Kafka Streams, MillWheel, and Spark Structured Streaming.

Flink · correctness · data integrity

0 likes · 28 min read

DataFunSummit

Oct 10, 2022 · Big Data

Stability Optimization Practices for Flink Jobs at Tencent

This article presents Tencent's practical experience in improving Flink job stability, covering the Oceanus platform, stability challenges, and concrete optimization techniques such as reducing failures, minimizing impact, accelerating recovery, and proactive issue detection, followed by a summary and future outlook.

Big Data · Flink · Real‑Time Computing

0 likes · 12 min read

DeWu Technology

Oct 10, 2022 · Big Data

Offline and Real-Time User Profile Fusion Architecture

The architecture combines a nightly batch job that generates offline user profiles stored in HBase with a Flink‑based stream layer that lazily loads those profiles on app start and creates real‑time updates, then fuses both streams into a unified, timestamp‑ordered profile in Redis, forming a Lambda‑style pipeline.

Batch Processing · Flink · HBase

0 likes · 10 min read

MaGe Linux Operations

Oct 9, 2022 · Big Data

Master Flink on Kubernetes: Step‑by‑Step Deployment Guide

This guide walks you through deploying Apache Flink on Kubernetes, covering runtime modes, building Docker images, creating ConfigMaps and Services, launching session and application clusters, submitting jobs, monitoring the Web UI, and cleaning up resources, all with practical code snippets and commands.

Big Data · Docker · Flink

0 likes · 26 min read

vivo Internet Technology

Oct 9, 2022 · Big Data

Design and Implementation of a Real-Time Marketing Automation Engine at vivo

This fifth installment explains vivo’s real‑time marketing automation engine, detailing its business need, layered architecture (access, processing, output, management, warehouse), scalable event‑queue design, dynamic configuration, unified dispatch, Flink‑based metric enrichment, and rule‑engine integration to achieve low‑latency, high‑throughput personalized targeting.

Event-Driven Architecture · Flink · Message Queue

0 likes · 13 min read

Big Data Technology & Architecture

Oct 8, 2022 · Big Data

Flink CDC Tutorial: Sync MySQL Data to Hudi Data Lake Using SQL

This article provides a comprehensive guide on using Flink CDC with Debezium to capture MySQL changes, covering serialization, adding dependencies, configuring SQL client and Java/Scala APIs, creating source and sink tables, enabling checkpoints, and streaming data into a Hudi data lake.

CDC · DataLake · Flink

0 likes · 10 min read

DataFunTalk

Sep 28, 2022 · Big Data

Privacy Computing in Big Data AI: Challenges, Solutions, and PPML Case Studies

This presentation explores the background and current state of privacy computing, its relevance to big data and AI, discusses SGX and LibOS technologies, introduces the BigDL PPML solution for secure Spark/Flink workloads, and reviews real-world applications and future outlook.

Big Data · Flink · PPML

0 likes · 13 min read

ITPUB

Sep 24, 2022 · Big Data

How ByteDance Scales Real‑Time Data Warehouses with Hudi and Flink

This article details ByteDance's practical experience building real‑time data warehouses on a data lake using Hudi, Flink, and related optimizations, covering scenario analysis, architecture, performance challenges, and future roadmap for scalable, low‑latency analytics.

Flink · Hudi

0 likes · 19 min read

ITPUB

Sep 22, 2022 · Big Data

What Is a Real‑Time Data Warehouse? Product, Solution, and Use Cases Explained

The article explains the concept of real‑time data warehouses, traces their evolution from early relational databases to modern streaming‑batch engines, discusses whether they are products or solutions, outlines typical application scenarios, selection criteria, and future trends in the big‑data ecosystem.

Flink · Spark · cloud

0 likes · 10 min read

DataFunTalk

Sep 11, 2022 · Big Data

Flink Table Store v0.2: Application Scenarios, Core Features, and Future Roadmap

This article introduces Flink Table Store v0.2, explains its four primary application scenarios—offline warehouse acceleration, partial update, pre‑aggregation rollup, and real‑time warehouse enhancement—details the core lake‑storage architecture, bucket management, append‑only mode, and outlines the project’s future roadmap and trade‑off considerations.

Batch · Flink · Lake Storage

0 likes · 16 min read

Sohu Tech Products

Sep 7, 2022 · Big Data

Introducing the Fire Framework: Annotation‑Driven Development for Spark and Flink

The Fire framework, open‑sourced by ZTO Express, provides a unified annotation‑based programming model for real‑time Spark and Flink jobs, dramatically reducing boilerplate, simplifying configuration, and enabling rapid development of large‑scale data processing tasks with concise Scala code examples.

Fire Framework · Flink · Real-time Processing

0 likes · 12 min read

Bilibili Tech

Sep 6, 2022 · Big Data

Lancer: Evolution of Bilibili's Real-Time Streaming Architecture

Lancer, Bilibili’s real‑time streaming backbone, has evolved from a monolithic Flume pipeline to a log‑id‑isolated, Kubernetes‑native architecture where Go edge agents feed synchronous Kafka‑proxied gateways into per‑logid topics processed by dedicated Flink‑SQL jobs, delivering exactly‑once, back‑pressured, highly scalable data ingestion for billions of daily requests.

Big Data · Flink · Kafka

0 likes · 29 min read

Alibaba Cloud Big Data AI Platform

Sep 5, 2022 · Big Data

Scaling Alibaba TCC to Millions of RPS with a High‑Availability Real‑Time Data Warehouse

This article details how Alibaba's TCC platform evolved its architecture over multiple phases—from a legacy database to a high‑availability real‑time data warehouse built on Flink and Hologres—highlighting the challenges, solutions, and cost‑saving measures that enabled millions of RPS, terabytes of storage, and sub‑second query latency.

Flink · Hologres · Real-Time

0 likes · 21 min read

Meituan Technology Team

Sep 1, 2022 · Databases

AI-Powered Database Anomaly Detection Service: Feature Analysis, Algorithm Selection, and Real-Time Monitoring

The article details Meituan's database platform team's end‑to‑end design of an AI‑driven anomaly detection service, covering feature analysis of time‑series patterns, algorithm selection (MAD, boxplot, EVT), model training, real‑time detection with Flink, operational metrics, and future enhancements.

AI Algorithms · Boxplot · Database Anomaly Detection

0 likes · 19 min read

Huolala Tech

Sep 1, 2022 · Big Data

How HuoLala Built a Real‑Time Metrics Monitoring Platform for Flink

This article explains how HuoLala’s real‑time R&D platform redesigns Flink metric collection, routing, and alerting using a custom Kafka‑based pipeline, flexible dashboards, and multi‑level metric governance to improve observability, reduce latency, and ensure data quality.

Flink · Kafka · Real-Time

0 likes · 22 min read
