Architect-Kip
Mar 2, 2026 · Big Data

How to Build a Scalable Tiered Archive & Query System for MySQL Data

This article presents a comprehensive design for a layered storage and unified scheduling platform that archives MySQL historical data, reduces storage costs, ensures high‑performance queries, and enables efficient data analysis through tiered hot, warm, and cold storage using big‑data technologies.

Flink · Hive · Spark
0 likes · 13 min read
DataFunSummit
Mar 1, 2026 · Big Data

How Ant Group’s Flex Engine Supercharges Flink with Vectorization

This article details Ant Group’s Flex vectorized engine built on Velox, covering the current state of vectorization, Flex’s architecture (Flink + Velox), core feature development, correctness guarantees, large‑scale deployment results, and future directions for full‑link vectorization and broader hardware support.

Big Data · Flex · Flink
0 likes · 18 min read
Big Data Tech Team
Feb 26, 2026 · Big Data

How to Design Practical Data Architecture Diagrams: A Step‑by‑Step Guide

This guide walks data engineers through the entire process of creating clear, production‑ready data architecture diagrams—from identifying the diagram type and defining layers, to selecting tools, drawing step‑by‑step components, applying visual standards, avoiding common pitfalls, and validating the final design for stakeholders.

Diagram · big-data · data-architecture
0 likes · 11 min read
Data STUDIO
Feb 21, 2026 · Big Data

Boost Python Performance Up to 50× Without Changing Your Code

Python’s reputation for slowness can be overcome by selecting the right tools—Numba, PyPy, CuPy, JAX, Ray, Joblib, async I/O, memory profilers, and big‑data frameworks—delivering speedups from 6× to over 50× with minimal or no code modifications.

Async · GPU · Performance
0 likes · 22 min read
ITPUB
Feb 13, 2026 · Big Data

Real‑Time Sync of New MySQL Tables to Doris Using Flink CDC

This article explains how to extend a Flink CDC job that already syncs an entire MySQL database to Doris so that newly created tables are automatically created in Doris in real time, using the CdcTools utility, side‑output streams, and asynchronous I/O.

CDC · CdcTools · Flink
0 likes · 9 min read
Big Data Tech Team
Feb 12, 2026 · Big Data

Mastering the DWS Layer: Core Strategies for Scalable Data Warehouses

This article provides a comprehensive, business‑driven analysis of the Data Warehouse Service (DWS) layer, covering its core positioning, design goals, modeling and aggregation tactics, storage optimizations, typical challenges with practical solutions, and best‑practice recommendations for building efficient, cost‑effective data services.

DWS Layer · Data Warehouse · Performance Optimization
0 likes · 8 min read
StarRocks
Feb 11, 2026 · Big Data

How StarRocks and Apache Paimon Build a True Lakehouse Native Engine

This article details the deep integration of StarRocks with Apache Paimon, describing the unified architecture, version evolution, performance enhancements, time‑travel queries, native readers/writers, distributed planning, and future roadmap for achieving lakehouse‑native analytics at scale.

Apache Paimon · Lakehouse · StarRocks
0 likes · 10 min read
DeWu Technology
Feb 9, 2026 · Big Data

How to Build a Production‑Ready Flink ClickHouse Sink with Dynamic Sharding, Batch‑by‑Size, and Robust Retry

This article presents a production‑grade Flink ClickHouse sink that addresses three common pain points: no size‑based batching, static table schemas, and distributed‑table latency. It introduces data‑size batching, dynamic table routing, local‑table writes, load‑balanced node discovery, back‑pressure queues, dual‑trigger flush, and recursive retry with node exclusion, all integrated with Flink checkpoint semantics for at‑least‑once guarantees.

Batching · Checkpoint · ClickHouse
0 likes · 25 min read
DataFunSummit
Feb 8, 2026 · Big Data

Kuaishou’s Data Lake Upgrade with Hudi: Solving AI & BI Challenges

The article explains how Kuaishou modernized its data lake by partnering with Apache Hudi to address latency, storage cost, and consistency issues in both AI and BI pipelines, detailing architectural changes, new ingestion tools, partitioning strategies, compaction mechanisms, performance gains and future plans.

AI · BI · Big Data
0 likes · 20 min read
DataFunSummit
Feb 7, 2026 · Big Data

How Flink Enables Real‑Time AI Inference and Agent Construction

This article explains Apache Flink’s stream processing fundamentals, introduces the open‑source Flink Agents framework for building event‑driven AI agents, details Alibaba Cloud’s Flink AI Function for real‑time LLM inference, and showcases demos, architecture, integration patterns, and practical use cases such as VOC analysis, live‑stream analytics, and intelligent operations.

Apache Flink · Big Data · Cloud Computing
0 likes · 24 min read
Alibaba Cloud Big Data AI Platform
Feb 4, 2026 · Big Data

How Paimon + StarRocks Power Real‑Time OLAP for Double‑11 Mega‑Sales

During Double‑11 mega‑sales, Taobao Group faced exploding OLAP query traffic, costly data sync pipelines, and slow near‑real‑time analytics. The team unified real‑time and batch data in Paimon, used StarRocks for high‑performance lake queries, and tuned cluster settings, saving nearly ten million yuan annually while cutting refresh latency by 80%.

Big Data · OLAP · Paimon
0 likes · 22 min read
Alibaba Cloud Big Data AI Platform
Feb 2, 2026 · Big Data

Real‑Time Analytics with Alibaba Cloud Serverless Spark & Paimon for Taobao Flash Sale

This article details how Alibaba Cloud EMR Serverless Spark combined with the Paimon lakehouse framework enables Taobao Flash Sale’s retail data team to achieve low‑latency, high‑throughput real‑time analytics, batch processing, and feature generation, outlining architecture evolution, performance gains, and practical Spark tuning techniques.

Big Data · Lakehouse · Paimon
0 likes · 18 min read
Alibaba Cloud Big Data AI Platform
Feb 2, 2026 · Big Data

How We Built a Scalable Lakehouse Architecture with StarRocks, Paimon, and Flink

This article details the evolution of a data warehouse at RenliJia from a MaxCompute‑centric setup to a modern lakehouse using StarRocks, Paimon, Flink, and Fluss, describing design goals, technical evaluations, implementation steps for offline, OLAP, and real‑time workloads, and the challenges and future plans that emerged.

Big Data · Data Warehouse · Flink
0 likes · 25 min read
ByteDance Data Platform
Feb 2, 2026 · Big Data

How StreamShield Powers Production‑Grade Resilience for Apache Flink at Massive Scale

ByteDance’s StreamShield delivers a three‑layer resiliency framework (engine self‑healing, hybrid replication at the cluster level, and chaos‑tested releases) that enables over 70,000 concurrent Flink jobs on 11 million CPU cores to meet strict SLAs, with startup times on the order of seconds and robust fault tolerance.

Apache Flink · ByteDance · Real‑Time Computing
0 likes · 6 min read
Raymond Ops
Jan 30, 2026 · Big Data

Build an Enterprise‑Grade HDFS HA and YARN Scheduler from Scratch

This guide walks you through designing and deploying a highly available HDFS architecture with dual NameNodes, ZooKeeper‑based failover, and a tuned YARN resource scheduler, covering detailed configuration files, failover testing, performance tuning, monitoring, automated health checks, capacity planning, and best‑practice checklists for production‑grade big‑data platforms.

Automation · Big Data · HA
0 likes · 28 min read
Radish, Keep Going!
Jan 30, 2026 · Big Data

How Uber Scaled Data Replication to Petabytes Daily with Distcp Optimizations

Uber tackled the challenge of replicating over 350 PB of data across on‑premises and cloud data lakes by redesigning Hadoop Distcp: moving resource‑intensive tasks to the Application Master, parallelising the copy‑listing and commit phases, and leveraging Uber‑mapper jobs to dramatically cut latency and improve resource efficiency.

Big Data · Distcp · Hadoop
0 likes · 17 min read