Tagged articles
3675 articles
DataFunTalk
Mar 24, 2022 · Big Data

Real‑time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents a JD.com BI engineer's case study on applying Flink SQL to real‑time dimension modeling, detailing two complex streaming scenarios, the technical difficulties of handling historical data and performance, and a component‑based solution architecture with future roadmap considerations.

Big Data · Flink · Real-Time
0 likes · 13 min read

StarRocks
Mar 23, 2022 · Databases

Accelerating Zepp Health’s Analytics with StarRocks: An OLAP Case Study

Facing inflexible point‑lookup limits and slow query times on HBase, Zepp Health redesigned its massive event‑tracking data pipeline—migrating ingestion through Kafka, Flink, and Hudi to a StarRocks‑based OLAP layer—achieving sub‑100 ms average query latency, 20 % storage savings, and dramatically faster multi‑dimensional analytics.

Big Data · Flink · Hudi
0 likes · 9 min read

DataFunTalk
Mar 23, 2022 · Big Data

Iceberg Data Lake Query Optimization Practices and Governance

This talk by Tencent senior engineer Chen Liang covers Iceberg table format fundamentals, data lake ingestion, query processing, hidden partitioning, time‑travel, major features, optimization techniques such as compaction, bin‑packing, sorting and Z‑ordering, and outlines a future roadmap for improving performance and governance in big‑data environments.

Big Data · Data Lake · Flink
0 likes · 12 min read

Tencent Tech
Mar 21, 2022 · R&D Management

Inside Tencent’s 2021 R&D Report: Coding Trends, AI Advances & Innovation

Tencent’s 2021 R&D Report details a 41% rise in engineering staff, 32 billion new lines of code, Go becoming the top language, massive growth in open‑source contributions, breakthroughs in cloud OS, databases, and AI, and a commitment to carbon‑neutral, technology‑driven social impact.

AI · Big Data · R&D
0 likes · 8 min read

Alibaba Cloud Developer
Mar 15, 2022 · Big Data

How Modern Data Lake Engines Accelerate Analytics: Inside StarRocks Architecture

This article explains why data lakes are essential for today’s analytics, outlines the three main user demands, defines data lakes, compares rule‑based and cost‑based optimizers, explores record‑oriented versus block‑oriented processing, and details StarRocks’ frontend‑backend architecture and benchmark results.

Analytics Engine · Big Data · Data Lake
0 likes · 17 min read

DataFunTalk
Mar 15, 2022 · Big Data

Bilibili's Billion‑Scale Data Synchronization Using Apache SeaTunnel

This article details Bilibili's implementation of a hundred‑terabyte‑per‑day data synchronization pipeline, covering tool selection between DataX‑based Rider and SeaTunnel‑based AlterEgo, architecture design, performance tuning, logging optimization, rate‑limiting strategies, and comprehensive monitoring for large‑scale offline data ingestion and export.

Apache SeaTunnel · Big Data · ClickHouse
0 likes · 13 min read

IT Architects Alliance
Mar 14, 2022 · Big Data

Comprehensive Guide to Kafka Architecture, Core Concepts, and Production Deployment

This article provides an in‑depth overview of Kafka, covering why messaging systems are needed, core concepts, cluster architecture, performance optimizations such as sequential disk writes and zero‑copy, hardware sizing, replication, consumer groups, offset management, rebalance strategies, and practical deployment and operational guidelines.

Big Data · Cluster Deployment · Distributed Messaging
0 likes · 35 min read

BaiPing Technology
Mar 14, 2022 · Big Data

Mastering DataWorks & MaxCompute: A Complete Guide to Big Data Architecture and Governance

DataWorks, Alibaba Cloud’s comprehensive PaaS platform, combined with the serverless MaxCompute data warehouse, offers an integrated solution for data integration, development, quality, and services, while detailed naming and layer conventions ensure scalable, maintainable big‑data architectures and effective governance across ODS, CDM, DWD, DWS, and ADS layers.

Big Data · Data Governance · DataWorks
0 likes · 8 min read

DataFunTalk
Mar 13, 2022 · Big Data

Tencent Data Lake Metadata Governance Practice and Architecture

This article presents Tencent's data lake metadata governance practice, covering data lake fundamentals, the 3+2 architecture of storage, compute and unified metadata, multi‑tenant design, the re‑implemented Hive Metastore for online catalog, performance optimizations, and offline data‑governance capabilities.

Big Data · Cloud Computing · Data Lake
0 likes · 18 min read

DevOps
Mar 11, 2022 · Cloud Computing

Informationization vs. Digital Transformation: Definitions, Differences, and Their Impact on Chinese Enterprises

The article explains the definitions of informationization and digital transformation, compares their technical, demand, core‑goal, and ecosystem differences, and analyzes how digital technologies such as cloud, big data and AI are reshaping industries, enterprise strategies, talent needs, and overall competitiveness in China.

Big Data · China · Digital Transformation
0 likes · 14 min read

vivo Internet Technology
Mar 9, 2022 · Big Data

Incremental Synchronization of Massive HBase Data to a Data Warehouse: Solution Overview and Performance Evaluation

The paper proposes a generic, timeRange‑based incremental extraction method for synchronizing tens of billions of HBase rows to a data warehouse. It shows that the method avoids full‑table scans, automatically detects schema changes, and delivers significantly lower latency than Hive mapping or timestamp‑based approaches; the method has been integrated into a unified big‑data platform.

Big Data · HBase · Incremental Sync
0 likes · 8 min read

DataFunTalk
Mar 3, 2022 · Big Data

Youzan Data Platform and DP Data Development Platform: Architecture, Core Modules, and Scheduling System Upgrade

This article presents an in‑depth overview of Youzan's data platform, introduces the DP data development platform with its key features and workflow, details the core module architecture—including service, scheduling, and component layers—and explains the migration from Airflow to DolphinScheduler to improve performance, stability, and scalability.

Big Data · Data Development · Data Platform
0 likes · 14 min read

IT Xianyu
Mar 3, 2022 · Databases

Introducing SPL: An Open‑Source Structured Data Processing Language with Full SQL‑92 Capabilities

SPL is an open‑source structured data processing language that extends full SQL‑92 functionality to a wide range of data sources—including CSV, Excel, JSON, NoSQL and Hadoop—allowing developers to perform complex queries, multi‑step calculations, and mixed‑source analytics without a traditional relational database.

Big Data · Data Integration · SPL
0 likes · 14 min read

AntTech
Mar 1, 2022 · Big Data

Graph Computing at Ant Group: From Fraud Prevention to Industry‑Wide Impact

The article explains how Ant Group leverages large‑scale graph computing—through its GeaBase and TuGraph platforms and a dedicated research team—to enhance real‑time fraud detection, drive industry standards, and explore future applications across finance, energy, and public services.

Ant Group · Big Data · TuGraph
0 likes · 7 min read

DataFunTalk
Mar 1, 2022 · Cloud Native

Alibaba Cloud Native Data Lake with Apache Iceberg: Architecture, Challenges, and Solutions

The presentation outlines Alibaba Cloud's native data lake solution built on Apache Iceberg, covering data lake fundamentals, cloud migration challenges, Iceberg's architecture and features, real‑time ingestion with Flink, unified metadata management, security guarantees, and testing practices to ensure reliable, scalable big‑data analytics.

Apache Iceberg · Big Data · Data Lake
0 likes · 16 min read

Architects Research Society
Feb 26, 2022 · Big Data

Introduction to Azure Data Lake Analytics (ADLA) and Its Architecture

This article introduces Azure Data Lake Analytics, explains how data lakes differ from traditional warehouses, outlines the ETL process, highlights the benefits of schema‑on‑read storage, and describes the four‑stage Azure data platform architecture for ingesting, storing, processing, and analyzing massive datasets.

Azure · Big Data · U-SQL
0 likes · 5 min read

Kuaishou Big Data
Feb 25, 2022 · Big Data

How Kuaishou Scales Data Sync: Architecture, Challenges, and Future Plans

This article details the design, evolution, and optimization of Kuaishou's data synchronization platform, covering business overview, architecture, key technologies, performance tuning, data source protection, incremental data lake integration, and future roadmap for a unified data fabric.

Big Data · Real-time Processing · architecture
0 likes · 15 min read

DataFunTalk
Feb 25, 2022 · Big Data

Tencent's Application of Apache Iceberg for Real‑Time Data Lake Ingestion, Governance, and Query Optimization

This article explains how Tencent leverages Apache Iceberg together with Flink to build a real‑time data lake pipeline, covering data ingestion, Iceberg's snapshot‑based read/write model, compaction and governance services, Z‑order based query optimization, performance results, and future roadmap.

Apache Iceberg · Big Data · Data Lake
0 likes · 24 min read

Big Data Technology & Architecture
Feb 23, 2022 · Big Data

Understanding Mini‑Batch Streaming Aggregation in Flink SQL

This article explains Flink SQL’s streaming aggregation Mini‑Batch feature, covering its purpose, configuration parameters, underlying optimizer rules, operator implementations, watermark handling, buffer processing, and the optional Local‑Global two‑phase aggregation optimization for improving throughput and reducing state overhead in large‑scale data pipelines.

Big Data · Flink · Mini-Batch
0 likes · 10 min read

DataFunTalk
Feb 23, 2022 · Big Data

NetEase Data Platform DataOps Practices for Improving Data Quality

This DataFunTalk presentation from NetEase examines the growing data quality challenges in data development and shows how DataOps principles (pre‑ and post‑control mechanisms, sandbox environments, and automated governance tools) are applied to systematically reduce defects, optimize resources, and ensure reliable data delivery across the company's diverse business lines.

Big Data · Data Platform · DataOps
0 likes · 14 min read

Architects' Tech Alliance
Feb 22, 2022 · Cloud Computing

Understanding China's “East Data West Computing” Initiative: Goals, Rationale, and Implementation

The “East Data West Computing” program is a national strategy that relocates computing workloads from data‑intensive eastern regions to resource‑rich western areas by building a network of data‑center hubs and clusters, aiming to balance supply and demand, improve energy efficiency, and boost overall computing capacity.

Big Data · Data Centers · East Data West Computing
0 likes · 7 min read

ByteDance Data Platform
Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · ETL · Performance
0 likes · 19 min read

DataFunTalk
Feb 19, 2022 · Big Data

Fundamentals of Data Middle Platform: Logic, Principles, and Practice

This article explains what a data middle platform is, why organizations need it, its core principles, technical architecture, and practical implementation guidelines, highlighting how it solves issues like inconsistent metrics, duplicate data construction, low query efficiency, poor data quality, and high development costs.

Big Data · Data Architecture · Data Middle Platform
0 likes · 14 min read

Bilibili Tech
Feb 18, 2022 · Big Data

Evolution of Bilibili's Data Retrieval Services and Lakehouse Architecture

Bilibili’s data retrieval journey progressed from a fragmented, chimney‑style pipeline to a unified Flink‑based service layer with the Ark construction system and Akuya SQL engine, and finally to an Iceberg‑driven lakehouse that eliminates data duplication, streamlines cross‑engine optimization, and offers platformized, low‑latency analytics.

Big Data · Bilibili · Data Retrieval
0 likes · 14 min read

Alimama Tech
Feb 16, 2022 · Big Data

Target Group Discovery: Framework, Models, and Case Study

The article presents a target‑group discovery framework that combines goal definition, rule‑ or model‑based segmentation, tiered metrics, benchmarking and quadrant analysis to identify and characterize advantageous, problematic, or weak consumer, product, or merchant sub‑groups, illustrated by an FMCG e‑commerce case study diagnosing high‑share, low‑growth categories.

Benchmarking · Big Data · data segmentation
0 likes · 13 min read

Big Data Technology & Architecture
Feb 16, 2022 · Big Data

Using Flink CDC to Capture MySQL Changes and Sync Them to ClickHouse

This article introduces Change Data Capture (CDC), compares query‑based and log‑based approaches, explains Debezium and ClickHouse, and provides detailed Flink CDC and Flink SQL CDC examples—including Java source code, custom deserialization schema, ClickHouse sink implementation, and required Maven dependencies—to synchronize MySQL data into ClickHouse in real time.

Big Data · CDC · ClickHouse
0 likes · 17 min read

dbaplus Community
Feb 15, 2022 · Big Data

Mastering Data Warehouse Architecture: Concepts, Modeling Techniques, and Real‑Time Strategies

This comprehensive guide explains data warehouse fundamentals, architecture layers, modeling methods such as dimensional and entity modeling, metadata management, and the transition from offline to real‑time processing with Lambda and Kappa architectures, providing practical steps, best practices, and key terminology for building robust analytical platforms.

Big Data · ETL · Real-time Processing
0 likes · 63 min read

Big Data Technology & Architecture
Feb 15, 2022 · Big Data

Understanding Flink TaskManager Memory Model (Post‑1.10)

This article explains the official Flink memory model diagram, shows real‑world TaskManager memory parameters, and breaks down the major memory components (process, Flink, JVM heap, off‑heap, Metaspace, and overhead), providing configuration guidance for optimal resource allocation.

Big Data · Flink · TaskManager
0 likes · 8 min read

IT Architects Alliance
Feb 15, 2022 · Artificial Intelligence

How a Scalable Recommendation Engine Evolved: From V1.0 to V3.0

This article details the evolution of an e‑commerce recommendation system through three architectural versions, highlighting the initial simple design, the challenges that prompted vertical and horizontal splits, the introduction of a configurable pipeline and AB testing, and the final micro‑service‑based, dynamically configurable V3.0 architecture.

AI · Big Data · Pipeline
0 likes · 14 min read

DataFunTalk
Feb 13, 2022 · Big Data

How Kuaishou Built a Standardized Data Governance Evaluation System

This article outlines Kuaishou’s approach to establishing a standardized data governance evaluation framework, detailing the challenges of large‑scale data management, the design of assessment metrics across model, quality, and cost dimensions, and the practical strategies and operational mechanisms used to improve data asset health and business value.

Big Data · Evaluation Framework · Kuaishou
0 likes · 21 min read

Big Data Technology & Architecture
Feb 13, 2022 · Big Data

What's New in Elasticsearch 8.0 – Key Features and Changes

The article provides a comprehensive overview of Elasticsearch 8.0, highlighting major updates such as 7.x REST API compatibility headers, default-enabled security, system‑index protection, a new KNN search API, storage and indexing optimizations, PyTorch model support, and numerous deprecations and feature removals across the stack.

8.0 · API · Big Data
0 likes · 10 min read

DataFunTalk
Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg · Arctic · Big Data
0 likes · 10 min read

Programmer DD
Feb 12, 2022 · Databases

What’s New in Elasticsearch 8.0? Key Features and Migration Tips

Elasticsearch 8.0 introduces major changes such as 7.x REST API compatibility headers, default‑enabled security with registration tokens, protected system indices, a technical preview of KNN search, storage‑saving field encodings, faster geo‑point indexing, PyTorch model support for NLP, and numerous deprecations and improvements across aggregations, allocation, analysis, authentication, cluster coordination, and packaging.

API · Big Data · Elasticsearch
0 likes · 10 min read

21CTO
Feb 11, 2022 · Cloud Computing

What Will Shape Software Development in 2022? 20 Key Trends Revealed

The article surveys 2022 software‑development forecasts, covering centralized and edge cloud infrastructure, multi‑cloud adoption, containers, security, blockchain, AI, low‑code, databases, big‑data engines, streaming, DevOps observability, programming languages, front‑end frameworks, and mobile development, offering a comprehensive outlook for practitioners and decision‑makers.

2022 trends · Big Data · software development
0 likes · 21 min read

Zhengcaiyun Technology
Feb 8, 2022 · Industry Insights

Unlocking Enterprise Value with a Data Middle Platform: Architecture & Indicators

This article traces the evolution from traditional data warehouses to modern data lakes and data middle platforms, explains why siloed data development hampers efficiency, and details the architecture and indicator‑library design used by Zhengcaiyun to achieve unified, reusable data services.

Big Data · Data Governance · Data Lakehouse
0 likes · 14 min read

IT Architects Alliance
Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Backend Architecture · Big Data · Spark
0 likes · 32 min read

DataFunTalk
Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data · Data Lake · Flink
0 likes · 13 min read

DataFunTalk
Jan 28, 2022 · Big Data

Real-Time Customer Data Platform (RT‑CDP) Architecture and Implementation at iFanFan

This article explains the concept, challenges, and key business goals of a real‑time Customer Data Platform, details the technology stack selection—including Nebula Graph, Apache Flink, Apache Beam, Kudu, and Doris—and describes the modular architecture, data model, identity service, streaming computation, storage layers, rule engine, operational results, and future directions.

Big Data · CDP · Data Integration
0 likes · 43 min read

JD Retail Technology
Jan 27, 2022 · Big Data

How JD’s Custom Spark Engine Tackles Data Skew for Massive Offline Jobs

This article explains JD’s self‑developed data‑skew mitigation solution for Spark, detailing the problem of uneven key distribution, the limitations of the open‑source AQE implementation, and JD’s OptimizeSkewedJoinV2 algorithm that dramatically reduces stage latency in large‑scale join workloads.

Adaptive Query Execution · Big Data · Data Skew
0 likes · 13 min read

DataFunTalk
Jan 27, 2022 · Big Data

Kyuubi: NetEase’s Open‑Source Multi‑Tenant SQL Engine for Large‑Scale Data Processing

This article introduces Kyuubi, the first NetEase project contributed to the Apache Foundation, describing its core features, multi‑tenant architecture, Spark‑based execution engine, cloud‑native capabilities, and real‑world use cases within NetEase’s data‑warehouse, ad‑hoc, and internal systems, along with performance gains and community resources.

Apache · Big Data · Kyuubi
0 likes · 23 min read

IT Xianyu
Jan 27, 2022 · Big Data

Installing Apache Hive on macOS with Hadoop and MySQL Metastore

This tutorial provides step‑by‑step instructions for installing Hadoop 3.1.1, Homebrew, Hive, and configuring MySQL as Hive's metastore on macOS, including environment variable setup, hive‑site.xml configuration, MySQL connector placement, schema initialization, and verification commands.

Big Data · Hadoop · Installation
0 likes · 6 min read

dbaplus Community
Jan 26, 2022 · Big Data

Why Does Elasticsearch Aggregate Faster with Fewer Terms? Uncover the Secrets

This article examines a real‑world Elasticsearch cluster handling hundreds of terabytes, explains why high‑cardinality aggregations can be slower, and shows how setting execution_hint=map and tuning doc_values dramatically improves aggregation performance for ultra‑high‑concurrency workloads.

Big Data · Data Analytics · Elasticsearch
0 likes · 12 min read

Architects Research Society
Jan 25, 2022 · Big Data

Azure Data Lake Storage Gen2: Design Guide, Best Practices, and Operational Considerations

This guide provides a comprehensive overview of Azure Data Lake Storage Gen2, covering when to use it, key design considerations, data organization strategies, access control models, file formats, cost‑optimization techniques, monitoring approaches, and performance‑tuning tips for large‑scale big‑data workloads.

ADLS Gen2 · Azure · Big Data
0 likes · 41 min read

DataFunTalk
Jan 25, 2022 · Big Data

Summary of Flink Forward Asia 2021: Community Growth, Cloud‑Native Deployment, Streaming‑Batch Integration, and Machine Learning

The article provides a comprehensive English summary of the 2021 Flink Forward Asia conference, covering community statistics, cloud‑native deployment modes, fault‑tolerance checkpoint advances, the evolution of streaming‑batch integration, the introduction of Streaming Warehouse, Flink ML 2.0, real‑time use cases at ByteDance and ICBC, Pravega storage innovations, and concluding reflections on the future of real‑time big data processing.

Apache Flink · Big Data
0 likes · 25 min read

IT Architects Alliance
Jan 25, 2022 · Operations

Design and Architecture of a Shared Resource Platform and Its Technical System

This document outlines the logical and technical architecture of a government shared resource platform, describing application system upgrades, data collection and analysis, multi‑layer system design, standards compliance, interface management, and overall system integration for improved service quality and decision support.

Big Data · Data Integration · Government IT
0 likes · 23 min read

DataFunSummit
Jan 23, 2022 · Big Data

MobTech's Integrated Data Governance Practices and Architecture

This article presents MobTech's comprehensive data governance and security practices, covering the necessity of governance, challenges in large‑scale data environments, the full‑link governance chain, modular architecture, and specific implementations for financial risk‑control scenarios.

Big Data · Data Architecture · Data Governance
0 likes · 19 min read

DataFunTalk
Jan 22, 2022 · Big Data

Alibaba Cloud Data Integration (DataX) Architecture, Design Principles, and Solution Overview

This presentation details Alibaba Cloud DataWorks Data Integration (DataX), covering its architecture, core design principles, offline and real‑time synchronization mechanisms, deployment modes, product positioning, use‑case scenarios, and its role within the broader DataWorks ecosystem, highlighting its capabilities for large‑scale data movement and processing.

Alibaba Cloud · Big Data · Data Integration
0 likes · 19 min read

Big Data Technology & Architecture
Jan 18, 2022 · Big Data

Data Warehouse Data Quality Measurement Standards

The article outlines four key dimensions for evaluating data warehouse data quality—correctness, completeness, timeliness, and consistency—explains common consistency issues such as differing metric values across models, cross‑dimensional aggregations, and real‑time versus batch calculations, and proposes organizational and review mechanisms to mitigate these problems.

Big Data · Consistency · Data Governance
0 likes · 9 min read

DataFunTalk
Jan 16, 2022 · Big Data

Time Series Database Capabilities and Application Scenarios in IoT, Smart Cities, and Edge Computing

This article explains the fundamentals of time‑series data, outlines the architecture and core technical advantages of Baidu Cloud's TSDB, and demonstrates how the database powers IoT, smart‑city, industrial, power‑grid, and autonomous‑driving use cases through multi‑level storage, distributed query optimization, and edge‑cloud integration.

Big Data · Cloud Computing · Data Analytics
0 likes · 11 min read

21CTO
Jan 13, 2022 · Fundamentals

How to Achieve Data Maturity: Turning Data into a Strategic Product

The article explains why data maturity is essential for modern enterprises, defines its three pillars (people, tools, and readiness), shows how treating data as a product follows the same principles that great products do, and outlines the four S's (Speed, Scale, Simplicity, SQL) that guide a mature data ecosystem.

Big Data · Data Governance · Data Product
0 likes · 6 min read

TAL Education Technology
Jan 13, 2022 · Cloud Native

Offline Mixed Deployment with Kubernetes: Architecture, Implementation, and Performance Evaluation for Big Data Workloads

This article describes a cloud‑native offline mixed‑deployment solution that leverages Kubernetes to share resources between big‑data clusters and business services, outlines its implementation steps, presents detailed performance comparisons between Yarn and Kubernetes using TPC‑DS, Spark, and Terasort workloads, and discusses production experience and future plans.

Big Data · Cloud Native · Performance Testing
0 likes · 8 min read

Shopee Tech Team
Jan 13, 2022 · Big Data

Engineering Practices and Performance Optimizations of Apache Druid for Real‑Time OLAP at Shopee

Shopee’s engineering team scaled a 100‑node Apache Druid cluster for real‑time OLAP by redesigning the Coordinator load‑balancing algorithm, adding incremental metadata pulls, introducing a segment‑merged result cache, and building exact‑count and flexible sliding‑window operators, while planning cloud‑native deployment.

Apache Druid · Big Data · Bitmap Index
0 likes · 17 min read

DataFunSummit
Jan 12, 2022 · Big Data

Exploring JD's Big Data Security and Distributed Permission System: Architecture, Principles, and Practices

This article presents JD's comprehensive big‑data security framework and distributed permission system, detailing the overall planning of the security center, data lifecycle protection strategies, core modules such as subjects, resources, policy language, and high‑performance access control, and how they address national compliance, business scalability, and technical challenges.

Big Data · Distributed Systems · JD.com
0 likes · 11 min read
StarRocks
Jan 12, 2022 · Big Data

How Flink + StarRocks Deliver Lightning‑Fast Real‑Time Data Warehousing

This article explains the evolution, challenges, and technical solutions for building an end‑to‑end real‑time data warehouse by combining Apache Flink's stream processing with StarRocks' ultra‑fast OLAP engine, covering architecture, data models, integration methods, best‑practice cases, and future roadmap.

Big Data · Flink · OLAP
0 likes · 21 min read
DataFunTalk
Jan 11, 2022 · Big Data

Interview with Wang Feng (Mo Wen): The Future of Apache Flink and Streaming Warehouses

In an exclusive InfoQ interview, Apache Flink community leader Wang Feng (aka Mo Wen) outlines the evolution of Flink toward a Streaming Warehouse, detailing recent technical advances, use‑case scenarios, and the upcoming Dynamic Table storage that aim to unify stream and batch processing for real‑time data‑warehouse workloads.

Apache Flink · Big Data · Dynamic Table
0 likes · 16 min read
Top Architect
Jan 9, 2022 · Information Security

Technical Analysis and Recent Updates of Xi'an “One Code Pass” System

The article reviews the Xi'an “One Code Pass” health‑code platform, covering its award recognition, recent service outages, capacity‑planning calculations, security‑platform procurement, an inspection by Ministry engineers, and the identified technical bottlenecks, such as the lack of a CDN for static assets and insufficient outbound bandwidth.

Big Data · Information Security · One Code Pass
0 likes · 7 min read
21CTO
Jan 8, 2022 · Big Data

How Amazon’s Intelligent Lakehouse Redefines Big Data Architecture

The article examines Amazon’s Intelligent Lakehouse architecture, tracing its evolution from early data‑lake‑warehouse integrations to a modern, serverless, secure, and AI‑enhanced platform that unifies data storage, governance, and analytics to lower big‑data costs and boost agility.

Big Data · Data Governance · Data Lake
0 likes · 12 min read
DataFunTalk
Jan 8, 2022 · Big Data

Lakehouse: Concepts, Architecture, Implementation, and Cloud Practices

This article provides a comprehensive overview of the Lakehouse paradigm, tracing its origins from traditional data warehouses and data lakes, comparing architectures, detailing core components such as Delta Lake and Iceberg, and illustrating practical cloud implementations and future directions.

Apache Iceberg · Big Data · Cloud Data Platform
0 likes · 14 min read
Programmer DD
Jan 8, 2022 · Big Data

How Flink’s Streaming Warehouse Is Redefining Real‑Time Data Lakes

This interview explores Apache Flink’s evolution toward a Streaming Warehouse, detailing its stream‑batch integration, new CDC‑based data integration, the Dynamic Table storage architecture, and how these innovations aim to simplify and accelerate real‑time big‑data analytics.

Apache Flink · Big Data · Dynamic Table
0 likes · 17 min read
HomeTech
Jan 6, 2022 · Operations

Design and Implementation of a Centralized Database Log Collection and Analysis Platform

This article describes the background, architecture, and implementation of a centralized database log collection and analysis platform built in 2021, detailing how logs from hosts, containers, and databases are normalized, streamed through Kafka, processed with Flink, stored in Elasticsearch, visualized with Kibana, and extended with alerting and configuration management to improve fault diagnosis and lay the groundwork for future AI‑driven operations.

Big Data · Kibana · Monitoring
0 likes · 5 min read
Alibaba Cloud Developer
Jan 6, 2022 · Big Data

Inside Alibaba Cloud’s MRACC Engine: How It Won the TPCx‑BB Benchmark

Alibaba Cloud’s self‑developed MRACC (Apsara Compute MapReduce Accelerator) combined hardware‑software integration, Spark and Hadoop optimizations, and eRDMA networking to achieve the top TPCx‑BB SF3000 performance, delivering 2‑3× faster SQL queries and 30% faster Spark shuffle, with significant cost‑efficiency gains.

Big Data · RDMA · benchmark
0 likes · 9 min read
Volcano Engine Developer Services
Jan 4, 2022 · Big Data

How ByteDance Scales EB-Level Data: Architecture, BP Model & Real-Time Insights

ByteDance’s data platform, built over seven years, now handles exabyte-scale data and over 100 million TPS, using a hybrid “middle‑platform + Business Partner” model, custom engines like ClickHouse/ByteHouse, agile governance, and a suite of products to support internal and external businesses, illustrating large-scale big-data engineering practices.

Big Data · ByteDance · ClickHouse
0 likes · 22 min read
Big Data Technology & Architecture
Jan 4, 2022 · Big Data

Big Data Mastery Roadmap: Learning Path, Resources, Future Trends and Interview Guidance

This comprehensive guide outlines a step‑by‑step learning roadmap for aspiring big data professionals, covering fundamentals, programming languages, Linux, databases, distributed theory, networking, offline and real‑time computing, data governance, warehouses, toolchains, video/book recommendations, future industry trends, interview tips, and community resources.

Big Data · Data Governance · Distributed Systems
0 likes · 42 min read
DataFunTalk
Jan 3, 2022 · Databases

Pegasus: Architecture, New Features, Ecosystem, and Community Overview

This article introduces Pegasus, a distributed key‑value store, covering its background, system architecture, double‑WAL design, performance benchmarks, recent features such as hot backup, bulk load, access control, partition split, as well as its ecosystem tools and community development plans.

Big Data · Hot Backup · PEGASUS
0 likes · 12 min read
JavaEdge
Jan 2, 2022 · Big Data

Mastering ZooKeeper: Core Concepts, Architecture, and Practical Setup

This article provides a comprehensive overview of ZooKeeper, covering its role in distributed systems, common use cases, source code setup, serialization and persistence mechanisms, network communication models, and the watcher workflow, enabling developers to understand and deploy ZooKeeper effectively.

Big Data · Persistence · Watcher
0 likes · 12 min read
DataFunTalk
Jan 1, 2022 · Big Data

JD's Flink Journey: Evolution, Optimizations, and Future Directions

This article details JD's adoption of Flink for real‑time computing, covering its evolution from Storm to Flink on Kubernetes, the platform architecture, major optimization techniques such as preview topology, backpressure handling, dynamic rebalance, checkpoint‑as‑savepoint, and outlines future plans including stream‑batch integration, stability improvements, intelligent operations, and AI integration.

Big Data · Flink · JD
0 likes · 10 min read
Big Data Technology & Architecture
Dec 31, 2021 · Big Data

Apache SeaTunnel Joins the Apache Incubator: Overview, Features, and Real‑World Use Cases

SeaTunnel, a data‑integration platform that originated in China and is built on Spark and Flink, has been accepted into the Apache Incubator. This article introduces its history, architecture, plugin ecosystem, deployment requirements, and numerous enterprise deployments across batch and streaming big‑data scenarios.

Apache · Big Data · Data Integration
0 likes · 7 min read
IT Architects Alliance
Dec 31, 2021 · Industry Insights

A Complete 19‑Part Knowledge Map for Software Architects

The article presents a detailed 19‑section knowledge map for software architects, covering everything from core responsibilities and fundamentals to distributed caching, messaging, load balancing, performance testing, OS, algorithms, networking, databases, JVM, micro‑services, DDD, security, high availability, big data, and blockchain, with visual mind‑maps for each topic.

Big Data · Blockchain · Distributed Systems
0 likes · 4 min read