Tagged articles
3675 articles
Baidu Geek Talk
Jun 15, 2022 · Big Data

Replacing Classic Data Warehouse with a One‑Layer Wide Table Model: Architecture, Benefits, and Challenges

The article proposes replacing the traditional multi‑layered data‑warehouse architecture (ODS‑DWD‑DWS‑ADS) with a single column‑store wide table per business theme, achieving roughly 30% storage savings and faster queries, while acknowledging higher ETL complexity, back‑tracking costs, and production timing challenges.

Big Data · ETL · Parquet
0 likes · 11 min read
dbaplus Community
Jun 14, 2022 · Big Data

How Qunar Built a Scalable BI Platform for Real‑Time Analytics and Self‑Service Reporting

This article details Qunar's multi‑year journey of designing and evolving a full‑stack BI platform—covering data ingestion, storage, query engines, self‑service analytics, and real‑time OLAP—by iterating through three development phases, selecting technologies such as Impala, Kudu, ClickHouse and Apache Druid, and addressing performance, usability and governance challenges to empower business users with fast, reliable data insights.

Apache Druid · BI · Big Data
0 likes · 24 min read
AntTech
Jun 14, 2022 · Big Data

Insights on Graph Computing: Technology, Applications, and Future Directions

Professor Chen Wenguang discusses how graph computing—originating from graph theory—offers a powerful way to model relationships across industries, its rapid development in China, challenges in scaling, integration with AI via graph neural networks, and the collaborative efforts needed between academia and industry to advance the field.

AI · Big Data · Graph Processing
0 likes · 17 min read
DeWu Technology
Jun 13, 2022 · Operations

How to Build a Minute‑Level Order Fulfillment Simulation Platform with DataWorks

This article outlines the design and implementation of a minute‑level order‑fulfillment timeliness simulation platform, detailing its background, objectives, challenges, architecture built on Alibaba Cloud DataWorks, core workflow nodes, domain model, ER diagram, JSON task templates, and future extensions for supply‑chain routing.

Big Data · DataWorks · architecture
0 likes · 11 min read
NetEase Game Operations Platform
Jun 10, 2022 · Databases

Apache Doris Deployment and Optimization at NetEase Interactive Entertainment

This article details NetEase Interactive Entertainment's adoption of Apache Doris for large‑scale game data analytics, covering background, Doris architecture, cluster governance, tablet and compaction tuning, scaling strategies, monitoring, alerting, and fault‑handling practices to improve performance and stability.

Apache Doris · Big Data · Cluster Management
0 likes · 22 min read
DataFunTalk
Jun 5, 2022 · Big Data

JD Big Data Platform: Cross‑Region and Tiered Storage Architecture and Practices

This article presents JD's large‑scale big‑data platform, detailing its overall architecture, the challenges of cross‑region storage, the design of a unified cross‑domain data synchronization mechanism, and the implementation of tiered storage to improve performance, cost efficiency, and data reliability across multi‑datacenter clusters.

Big Data · Data Platform · HDFS
0 likes · 15 min read
DataFunSummit
Jun 3, 2022 · Big Data

Building and Optimizing JD Retail OLAP Platform: Architecture, Management, and Performance Techniques

This article details JD Retail's OLAP platform construction, covering control plane design, architecture, business and operation management, real‑time data updates, materialized view usage, join optimizations, high‑concurrency and high‑throughput scenarios, and promotional preparation strategies, illustrated with diagrams and performance metrics.

Big Data · ClickHouse · Distributed Systems
0 likes · 20 min read
vivo Internet Technology
May 31, 2022 · Big Data

Kafka Load Balancing and Cruise Control: Concepts, Manual Migration, and Deployment

Kafka’s server‑side load imbalance stems from static replica placement on broker disks, and manual replica migration is infeasible at scale. Cruise Control automates metric collection, analysis, and execution of fine‑grained rebalance plans, including broker de‑commissioning and leader dispersion, allowing large clusters to expand and operate efficiently.

Big Data · Cluster Management · Cruise Control
0 likes · 21 min read
Tencent Cloud Developer
May 31, 2022 · Artificial Intelligence

Scalable Graph Neural Architecture Search System (PaSca) – WWW 2022 Best Student Paper

PaSca is a scalable graph neural architecture search system that separates message aggregation from updates and explores over 150,000 GNN designs with multi‑objective optimization. Its models outperform traditional GNNs in accuracy, memory and speed; it has been open‑sourced, deployed at Tencent for risk control, recommendation and fraud detection, and earned the WWW 2022 Best Student Paper award.

Big Data · Neural Architecture Search · Scalable Systems
0 likes · 11 min read
Big Data Technology & Architecture
May 31, 2022 · Databases

Vectorization and Roaring Bitmap Techniques in Database Query Execution

This article explains how classic SQL execution engines use the volcano model and expression trees, discusses their performance drawbacks, introduces vectorized execution to reduce overhead, and describes Roaring Bitmap compression methods with container types for efficient storage and processing of integer sets.

Big Data · Database Engine · Operator Tree
0 likes · 10 min read
Bilibili Tech
May 31, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Operational Practices

Bilibili migrated its massive offline platform from Hive to Spark using an automated SQL rewrite and dual‑run verification, cutting execution time by over 40% and resource use by 30%, while introducing small‑file merging, shuffle stability, runtime filters, data‑skipping, lineage tracking, auto‑parameter tuning, and metastore federation for robust large‑scale processing.

Big Data · Spark · data engineering
0 likes · 30 min read
Big Data Technology & Architecture
May 30, 2022 · Big Data

Doris Architecture, Principles, and Key Features Overview

This article provides a comprehensive overview of Doris's architecture—including its FE and BE components, metadata management, data organization, execution planning—and details its major features such as adaptive join aggregation, vectorized execution, materialized views, and Elasticsearch integration, supplemented with example DDL and query code.

Big Data · Database Architecture · Elasticsearch
0 likes · 7 min read
Architect's Tech Stack
May 28, 2022 · Big Data

Data Lake Challenges and the Open SPL Computing Engine

The article examines the inherent trade‑offs of data lakes—maintaining raw data, enabling efficient computation, and keeping costs low—explains why traditional data‑warehouse approaches fall short, and introduces the open‑source SPL engine that provides multi‑source, file‑based, high‑performance analytics to overcome these limitations.

Big Data · Data Lake · ETL
0 likes · 12 min read
vivo Internet Technology
May 25, 2022 · Big Data

Understanding Druid Metadata Management and Architecture

Apache Druid manages metadata through a layered, distributed system: the Overlord coordinates ingestion tasks, MiddleManagers launch Peons to create segments, Coordinators and Historical nodes store and serve segment data, and Brokers route queries, while MySQL, ZooKeeper, memory, and local files keep metadata synchronized for fault‑tolerant, high‑performance OLAP analytics.

Big Data · Druid · Metadata Management
0 likes · 19 min read
Architect
May 25, 2022 · Big Data

Metadata Infrastructure and Governance in Bilibili's Data Platform

The article details how Bilibili built a unified metadata infrastructure—including a URN‑based model, collection pipelines, quality assurance, storage in TiDB/ES/HugeGraph, and query services—to support data discovery, lineage, impact analysis, and governance across its growing data platform.

Big Data · Data Catalog · Data Governance
0 likes · 21 min read
DataFunTalk
May 24, 2022 · Big Data

Integrating Apache Flink with Apache Hudi: From Data Warehouse to Data Lake

This article explains how Apache Flink integrates with Apache Hudi to enable real‑time data lake ingestion, covering the evolution from traditional data warehouses to data lakes, Hudi’s core concepts such as timeline and file grouping, copy‑on‑write vs merge‑on‑read modes, and Flink’s CDC‑based ETL pipeline.

Big Data · CDC · Data Lake
0 likes · 18 min read
21CTO
May 23, 2022 · Big Data

What Walmart’s Beer‑and‑Diaper Insight Reveals About Big Data and Statistics

An amusing Walmart story about beer and diapers illustrates how big‑data analysis uncovers hidden consumer patterns, leading to targeted promotions, while the article expands on why statistics remains essential in the data‑science era, the challenges of learning it, and recommends a comprehensive R‑based statistics guide.

Big Data · Learning Resources · R language
0 likes · 6 min read
Baidu Geek Talk
May 23, 2022 · Industry Insights

How Baidu Scales Real-Time Content Safety for Millions of Mini‑Programs

This article explains Baidu's evolving inspection scheduling system for its smart mini‑programs, detailing the challenges of massive page volumes, the V1.0 offline architecture, the V2.0 real‑time enhancements, resource constraints, deduplication logic, and the measurable improvements in risk detection and ecosystem health.

Big Data · Cloud Computing · Content Safety
0 likes · 17 min read
DataFunTalk
May 21, 2022 · Big Data

Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

This talk presents Xiaomi's design and deployment of an elastic scheduling system for Hadoop YARN, covering background analysis, resource‑pool strategy, auto‑scaling architecture, stability challenges, label‑based resource isolation, Spark shuffle handling, cost‑saving results and future plans.

Big Data · Hadoop · Resource Management
0 likes · 16 min read
DataFunTalk
May 19, 2022 · Big Data

SeaTunnel: Distributed Data Integration Platform and Its Application in Traffic Management

This article introduces Apache SeaTunnel, a distributed, high‑performance data integration platform built on Spark and Flink, outlines its technical features, workflow, and plugin ecosystem, and details a concrete traffic‑management use case involving incremental Oracle‑to‑warehouse data synchronization with Spark resources and scheduled shell scripts.

Apache Flink · Apache Spark · Big Data
0 likes · 12 min read
IT Architects Alliance
May 19, 2022 · Big Data

How Apache Kylin Enables Sub‑Second OLAP on Massive Data Sets

The article explains how Apache Kylin uses pre‑computed OLAP cubes on Hadoop/Spark/Flink to deliver sub‑second query responses on massive datasets, covering its architecture, integration with BI platforms, user security, cube building, monitoring, and HBase‑based storage.

Apache Kylin · Big Data · HBase
0 likes · 12 min read
Big Data Technology & Architecture
May 18, 2022 · Databases

Understanding ClickHouse Distributed JOIN Implementation and Best Practices

This article explains ClickHouse's single‑node and distributed JOIN mechanisms, compares ordinary, GLOBAL, Broadcast, Shuffle and Colocate JOINs, illustrates execution flows with code examples, and provides practical recommendations to reduce join size, avoid query amplification, and leverage data pre‑distribution for optimal performance.

Big Data · ClickHouse · Performance
0 likes · 10 min read
Alibaba Cloud Developer
May 18, 2022 · Big Data

Why Delta Lake Is Revolutionizing Data Lakes with ACID Guarantees

This article explains how Delta Lake adds reliability to data lakes by offering ACID transactions, scalable metadata, and unified batch‑and‑stream processing, outlines the challenges it solves, details its implementation principles, and demonstrates a practical demo for building an integrated data warehouse.

ACID · Big Data · Data Lake
0 likes · 9 min read
DataFunTalk
May 18, 2022 · Big Data

Building and Optimizing JD Retail OLAP Platform: Architecture, Real‑time Updates, Materialized Views, and Join Optimization

This article presents JD Retail's OLAP platform construction and practical scenarios, covering control‑plane design, architecture, business management, operational safeguards, real‑time data updates, materialized view acceleration, join optimization techniques, high‑concurrency queries, and large‑scale write throughput for e‑commerce peak periods.

Big Data · ClickHouse · OLAP
0 likes · 21 min read
DataFunSummit
May 17, 2022 · Information Security

Data Security Governance Practices and Frameworks: A Comprehensive Overview

This article presents a detailed overview of data security governance in China, covering policy milestones, major security incidents, current challenges, a three‑layer governance model, practical workflow steps, classification methods, emerging zero‑trust concepts, and real‑world case studies, offering actionable insights for organizations seeking robust data protection.

Big Data · Zero Trust · data security
0 likes · 11 min read
Big Data Technology & Architecture
May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache Hudi · Big Data · Copy-on-Write
0 likes · 43 min read
DataFunTalk
May 17, 2022 · Big Data

Exploring JuiceFS in Data Lake Storage Architecture

This presentation provides a comprehensive overview of JuiceFS, an open‑source cloud‑native distributed file system, detailing its role in modern data lake and lakehouse architectures, comparing it with HDFS and object storage, and highlighting its performance, integration, and community ecosystem.

Big Data · Data Lake · Distributed File System
0 likes · 19 min read
DataFunSummit
May 15, 2022 · Databases

Design and Evolution of a Custom Storage Engine for IoT Device Metadata

This article presents a detailed case study of an IoT device metadata management platform, describing the business scenario, the evolution from a single‑node MySQL solution through sharded MySQL, HBase and Elasticsearch, to a self‑developed distributed storage engine that separates compute and storage, supports LSM, multi‑dimensional indexing, routing keys, and parallel scans to meet massive write‑read throughput and complex query requirements.

Big Data · Distributed Systems · IoT
0 likes · 14 min read
Big Data Technology & Architecture
May 15, 2022 · Big Data

Understanding Flink Window Table-Valued Functions (TVF) and Incremental Optimization

This article explains the concept of window table-valued functions in Flink, compares the old grouped‑window syntax with the new TVF syntax, details the physical and execution plans, introduces sliced windows for state reduction, and presents a small incremental‑output improvement with code examples.

Big Data · Flink · Incremental Aggregation
0 likes · 12 min read
DataFunSummit
May 14, 2022 · Databases

Design of Cloud‑Native ClickHouse: Architecture, Storage‑Compute Separation, and MPP Query Layer

This article presents the cloud‑native redesign of ClickHouse, covering its current technical limitations in storage and computation, the proposed storage‑compute separation with DDL task management, multi‑replica and CommitLog mechanisms, and a new MPP query layer to meet future data‑warehouse demands such as real‑time analytics, flexibility, high throughput, low cost, and support for semi‑structured data.

Big Data · ClickHouse · Cloud Native
0 likes · 15 min read
DaTaobao Tech
May 13, 2022 · Big Data

Taobao Big Data Model Governance and DataWorks Co‑development

Taobao’s rapidly expanding technical data system faced naming inconsistencies, low table reuse, and costly, inefficient data usage, prompting a joint effort with DataWorks to digitize model evaluation, enforce standardized governance, deliver intelligent end‑to‑end modeling tools, and launch a development assistant, resulting in a health‑monitoring dashboard, upgraded data maps, and a roadmap for further automation and architecture refinement.

Big Data · Data Governance · Data Platform
0 likes · 12 min read
dbaplus Community
May 12, 2022 · Big Data

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

This article details Bilibili's end‑to‑end Presto on Hadoop architecture, covering the multi‑engine SQL stack, dispatcher routing, cluster scale, stability enhancements like coordinator HA and real‑time punish, query limits, Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, RaptorX local cache, JDK upgrades, dynamic filtering, and future roadmap, illustrating how these innovations boosted query throughput and reduced latency.

Big Data · Cluster Management · Distributed Systems
0 likes · 32 min read
dbaplus Community
May 11, 2022 · Big Data

How JD Logistics Tackled Billion-Scale Data Challenges with Doris

This article details JD Logistics' journey from fragmented, massive‑scale data to a unified, real‑time analytics platform, covering business needs, pain points, tool evaluation, a new Doris‑based architecture, table management, data import procedures, automation scripts, and future roadmap for data engineering.

BI Tools · Big Data · data-warehouse
0 likes · 16 min read
Baidu Geek Talk
May 9, 2022 · Big Data

How a Spark Offline Framework Boosts Data Backtracking Efficiency

This article introduces a Spark offline development framework that separates configuration from code, supports SQL and Java applications, and provides fast, automated data backtracking with reduced environment preparation time, lower failure rates, and significant performance gains for large‑scale data warehouses.

Big Data · Data Backtracking · Offline Framework
0 likes · 17 min read
StarRocks
May 7, 2022 · Databases

How 360 Built a Lightning‑Fast Unified Analytics Platform with StarRocks

Facing massive data storage and query challenges, 360 upgraded its analytics architecture by adopting StarRocks, achieving multi‑dimensional, high‑concurrency analysis, simplified data pipelines, and significant performance and cost improvements across its radar and user‑portrait platforms.

Analytics · Big Data · OLAP
0 likes · 10 min read
58 Tech
May 5, 2022 · Big Data

Low-Code Real-Time Data Warehouse Construction System Using Flink

This article describes a low‑code, Flink‑based real‑time data‑warehouse construction system that abstracts the warehouse building process into ODS, DWD, DWS, and ADS layers, leverages a domain‑specific language and plugin engine to reduce development effort, and details its architecture, DSL design, plugin extensibility, dimension‑table completion, stream merging, aggregation, and storage strategies.

Big Data · DSL · Flink
0 likes · 11 min read
Big Data Technology & Architecture
May 4, 2022 · Big Data

Apache Hudi 0.11.0 Release Highlights: Multi‑Mode Index, Data Skipping, Async Index, Spark & Flink Integration, and New Utilities

The Apache Hudi 0.11.0 release introduces multi‑mode metadata indexing, enhanced data‑skipping, asynchronous indexing, extensive Spark and Flink integration improvements, new bundle utilities, and expanded metadata synchronization with BigQuery, AWS Glue, and DataHub, while also adding bucket indexing and encryption support.

Apache Hudi · Async Index · Big Data
0 likes · 13 min read
DataFunSummit
Apr 29, 2022 · Big Data

Optimizing Query Performance in Apache Iceberg with Z‑Order Data Organization

This article explains how Apache Iceberg’s DataSkipping technique can lose efficiency when many filter columns are used, and presents a data‑organization optimization using space‑filling curves and Z‑Order to improve query I/O, details the OPTIMIZE implementation, and shares performance benchmark results and future plans.

Apache Iceberg · Big Data · Data Skipping
0 likes · 12 min read
ITPUB
Apr 27, 2022 · Databases

Mastering Data Warehouse Standards: Architecture, Layer Design, and Naming Conventions

This comprehensive guide explains data‑warehouse construction standards, covering model architecture principles, public development rules, layer‑by‑layer design specifications, and systematic naming conventions for tables, dimensions, and metrics to ensure consistency, scalability, and reliable data governance.

Big Data · Database Standards · ETL
0 likes · 26 min read
Big Data Technology & Architecture
Apr 26, 2022 · Big Data

ByteDance's Internal Presto OLAP Engine: Deployment, Performance Boosts, and Operational Practices

The article details ByteDance's large‑scale deployment of the Presto OLAP engine for ad‑hoc, BI, and near‑real‑time analytics, describing its architecture, multi‑coordinator high‑availability design, routing gateway, adaptive cancel, history server, materialized‑view support, Hudi connector integration, and how these innovations improve performance, stability, and operational efficiency.

Big Data · Hudi Connector · Materialized Views
0 likes · 11 min read
Architects Research Society
Apr 25, 2022 · Artificial Intelligence

Reflecting on a Decade of Data Science and Implications for Future Visualization Tools

The article reviews a decade‑long growth of data science, defines its multidisciplinary nature, outlines the four high‑level and fourteen low‑level processes, describes nine distinct data‑science roles, and discusses how these insights can guide the design of next‑generation data‑visualization and analysis tools.

Big Data · Data Science · Roles
0 likes · 10 min read
Bilibili Tech
Apr 25, 2022 · Big Data

Optimizing Full Partition Tables with Zipper Tables, Hudi+Flink CDC, and Data Warehouse Strategies

Facing server‑hardware constraints, Bilibili’s data platform replaced wasteful full‑partition tables with a zipper‑table approach—preserving change history while cutting storage from petabytes to terabytes—and complemented it with Hudi + Flink CDC for near‑real‑time updates, dramatically lowering I/O, compute usage and latency.

Big Data · Flink CDC · Hudi
0 likes · 11 min read
Top Architect
Apr 23, 2022 · Big Data

Ensuring No Duplicate and No Loss in Baidu Log Middle Platform: Architecture, Challenges, and Solutions

This article explains the design, implementation, and future plans of Baidu's log middle platform, detailing its lifecycle management, service architecture, data reliability goals of eliminating duplication and loss, and the technical measures taken across SDKs, servers, and streaming pipelines to achieve near‑100% data integrity.

Backend Architecture · Big Data · Data Reliability
0 likes · 15 min read
DataFunSummit
Apr 20, 2022 · Big Data

SuperSQL: A Cross‑Engine, Cross‑DC High‑Performance Big Data SQL Middleware

The article presents SuperSQL, a high‑performance big‑data SQL middleware that enables cross‑engine and cross‑data‑center query processing, detailing its architecture, metadata management, cost‑based optimization, operator push‑down, distributed execution, performance benchmarks, and future roadmap within modern data‑intensive environments.

Big Data · Cross-DC · SQL Middleware
0 likes · 24 min read
Big Data Technology & Architecture
Apr 20, 2022 · Big Data

Fine‑Grained Resource Management in Apache Flink: Scenarios, Mechanism, Efficiency, Allocation Strategies, and Limitations

This article explains Apache Flink's fine‑grained resource management, describing typical use cases, the slot‑based mechanism, how it improves resource efficiency, the default allocation strategy, current limitations, and provides example code for configuring slot sharing groups.

Apache Flink · Big Data · Fine-Grained Resource Management
0 likes · 12 min read
Huawei Cloud Developer Alliance
Apr 19, 2022 · Cloud Native

How Volcano Is Shaping Cloud‑Native Batch Computing in CNCF

Volcano, the first cloud‑native batch computing project contributed by Huawei Cloud, has been promoted to a CNCF incubating project, signaling broad industry recognition and outlining a roadmap that includes cross‑cloud scheduling, GPU virtualization, fine‑grained resource management, and workflow orchestration for AI and big‑data workloads.

AI · Batch Computing · Big Data
0 likes · 8 min read
ITPUB
Apr 19, 2022 · Big Data

Which Real-Time Data Warehouse Architecture Fits Your Needs? A Deep Dive

This article explains why modern enterprises need real‑time data‑warehouse architectures, breaks down traditional layered warehouse concepts, compares Lambda and Kappa models, evaluates five practical real‑time solutions—including Iceberg‑based lakehouse and MPP databases—provides code snippets, and offers selection guidance with real‑world company examples.

Big Data · Flink · Iceberg
0 likes · 19 min read
Architect
Apr 18, 2022 · Big Data

Ensuring Data Accuracy and Reliability in Baidu's Log Middle Platform

This article describes Baidu's log middle platform architecture, its data lifecycle management, integration status, terminology, service overview, core challenges of ensuring data accuracy, and the implemented optimizations for persistent storage, service decomposition, and SDK reporting to achieve near‑100% no‑repeat no‑loss reliability.

Backend Architecture · Big Data · Data Reliability
0 likes · 15 min read
DeWu Technology
Apr 18, 2022 · Artificial Intelligence

Warehouse Storage Location Recommendation: Architecture, Recall, and Ranking Strategies

The article outlines DeWu’s warehouse‑management recommendation system, which combines an online‑near‑line‑offline architecture to quickly recall viable shelf slots and rank them by space utilization, travel time, and sales potential, enabling automated, constraint‑aware placement that cuts picking time and inventory costs.

AI · Big Data · Storage Optimization
0 likes · 16 min read
Big Data Technology & Architecture
Apr 18, 2022 · Big Data

Overview of Meituan Merchant Version Data Metrics System

This article provides a comprehensive overview of Meituan's merchant platform data metric system, detailing its five main modules—overview, business, traffic, customer, and product—along with competitor analysis and actionable insights for merchants to improve operations and growth.

Big Data · Data Analytics · Meituan
0 likes · 11 min read
JavaEdge
Apr 17, 2022 · Big Data

Why Spark Overtook MapReduce: Core Advantages and RDD Programming Model

The article explains how Spark, developed by UC Berkeley's AMP Lab, quickly surpassed MapReduce by offering faster execution, a simpler Scala‑based programming model, lazy RDD transformations, a rich ecosystem including SQL, Streaming, MLlib and GraphX, and practical code examples such as a three‑line WordCount.

Big Data · MapReduce · RDD
0 likes · 7 min read
Big Data Technology & Architecture
Apr 15, 2022 · Big Data

Configuring Flink SQL Client with Iceberg: Catalogs, DDL, Data Insertion and Query

This guide explains how to set up the Flink SQL client to work with Apache Iceberg, covering Scala version requirements, downloading and deploying Iceberg jars, configuring Hive and HDFS catalogs, creating databases and tables, performing insert and overwrite operations, and querying data in both batch and streaming modes.

Big Data, Catalog, Flink
18 min read
ByteDance Data Platform
Apr 15, 2022 · Cloud Native

How ByteHouse Evolved From ClickHouse Into a Next‑Gen Cloud‑Native Data Warehouse

ByteHouse, born from ByteDance’s extensive use of ClickHouse, transformed a high‑performance OLAP engine into a cloud‑native, scalable data warehouse by addressing scalability, elasticity, high availability, and multi‑tenant challenges through architectural redesign, custom storage layers, and advanced metadata management.

Big Data, ByteHouse, ClickHouse
19 min read
vivo Internet Technology
Apr 13, 2022 · Big Data

Understanding Join Algorithms in Presto: Theory, Implementation, and Engineering Practices

The article explains Presto’s join processing by detailing the business need to limit multi‑table joins, then describing nested‑loop, sort‑merge, and hash join algorithms with Java examples, and finally showing how the Volcano model, columnar pages, and planner integration enable scalable, efficient OLAP join execution.

Big Data, Hash Join, Join Algorithms
17 min read
Zuoyebang Tech Team
Apr 13, 2022 · Big Data

How Delta Lake Transformed Our Offline Data Warehouse Performance

This article details how Zuoyebang's engineering team migrated their Hive‑based offline data warehouse to Delta Lake, tackling latency, scalability, and query‑performance challenges through stream‑to‑batch processing, data‑lake architecture, and optimizations such as dynamic partition pruning (DPP) and Z‑ordering.

Big Data, Delta Lake, Presto
15 min read
Cloud Native Technology Community
Apr 13, 2022 · Big Data

Introduction to ClickHouse: Features, Architecture, Installation, Data Types, and Cluster Deployment

This article provides a comprehensive overview of ClickHouse, an open‑source column‑oriented MPP analytical database, covering its advantages and drawbacks, key features, typical use cases, data access flow, installation steps, core directories, indexes, data types, database and table engines, as well as detailed cluster architecture and deployment patterns.

Big Data, ClickHouse, Cluster
29 min read
StarRocks
Apr 13, 2022 · Big Data

How StarRocks Achieves Lightning‑Fast Data Lake Analytics

This article explains StarRocks' streamlined architecture, cost‑based optimizer, massively parallel processing and vectorized engine, and how they enable high‑performance queries over data stored in Hive, Iceberg, Hudi and other lake formats, backed by benchmark results and future roadmap details.

Big Data, CBO, Data Lake
19 min read
DataFunTalk
Apr 13, 2022 · Databases

Adopting StarRocks for Real‑Time Analytics in ZhongAn’s JiZhi Platform: A Performance Comparison with ClickHouse

This article describes how ZhongAn Insurance’s JiZhi data‑analysis platform migrated from ClickHouse to the MPP OLAP engine StarRocks, detailing the business requirements, architectural challenges, benchmark results across single‑table and multi‑table queries, and the resulting improvements in latency, concurrency, and operational simplicity for real‑time analytics.

Big Data, ClickHouse, OLAP
14 min read
IT Services Circle
Apr 12, 2022 · Big Data

Finding Missing Unsigned Integers in a 4‑Billion‑Element File Using Interval Counting and Bitmap Technique

The article explains how to locate all missing 32‑bit unsigned integers in a 4 billion‑entry file by first partitioning the range into intervals, counting entries per interval with a tiny int[64] array, and then applying a bitmap method only to under‑filled intervals, achieving a memory footprint of just a few hundred bytes.

Big Data, Memory Optimization, algorithm
5 min read
High Availability Architecture
Apr 11, 2022 · Big Data

Ensuring Data Accuracy and Reliability in Baidu Log Platform: Architecture, Challenges, and Solutions

This article introduces the current state of Baidu's log platform, explains its lifecycle from data collection to downstream applications, analyzes the challenges of achieving near‑zero duplication and loss, and presents architectural optimizations and best‑practice recommendations to improve data stability and accuracy across the system.

Big Data, Data Reliability, System Architecture
19 min read
DataFunSummit
Apr 9, 2022 · Big Data

Impala Deployment and Optimization: Practical Experience with Sensor Data Multi‑dimensional Analysis Platform

This article presents a comprehensive technical walkthrough of Sensor Data's multi‑dimensional analysis platform, covering product architecture, an Impala‑based real‑time query engine, query performance tuning, resource‑estimation strategies, and future plans, with concrete diagrams, test results, and community contributions.

Big Data, Data Architecture, Impala
19 min read
DataFunTalk
Apr 9, 2022 · Big Data

Optimizing Apache Iceberg Query Performance with Z‑Order Data Organization

This talk explains how Apache Iceberg’s DataSkipping can lose efficiency with many filter columns, and presents a data‑organization redesign using space‑filling curves and Z‑Order to improve query I/O, detailing the OPTIMIZE syntax, implementation steps, performance benchmarks, and future roadmap.

Apache Iceberg, Big Data, Data Skipping
12 min read
Bilibili Tech
Apr 9, 2022 · Big Data

Bilibili Presto on Hadoop: Architecture, Scaling, and Performance Enhancements

Bilibili’s Presto on Hadoop deployment combines a multi‑engine offline platform with Kubernetes‑managed YARN scheduling, Ranger security, and a custom dispatcher, and has scaled to over 400 nodes handling 160k daily queries over 10 PB of data. Enhancements include coordinator HA, resource‑group punishment, query limits, Alluxio caching, dynamic filtering, and numerous SQL‑level improvements; future work targets auto‑scaling and materialized‑view automation.

Big Data, Hadoop, Presto
30 min read
DataFunTalk
Apr 7, 2022 · Big Data

Apache Kyuubi: Architecture, Use Cases, Community, and Mobile Cloud Deployment

This article introduces Apache Kyuubi—a multi‑tenant Thrift JDBC/ODBC service built on Spark—detailing its architecture, advantages over Spark Thrift Server, real‑world use cases, open‑source community progress, and practical deployment strategies on mobile cloud, Kubernetes, and with Trino.

Apache Spark, Big Data, Kyuubi
16 min read
DataFunSummit
Apr 6, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Challenges and Solutions

This article presents a JD.com case study on applying Flink SQL for real‑time dimension modeling, detailing two complex streaming scenarios—full‑join of multiple streams and full‑group aggregation—along with the associated challenges of historical data handling, state management, and performance optimization, and proposes component‑based architectural solutions.

Big Data, Flink, Real-Time
14 min read
Big Data Technology & Architecture
Apr 5, 2022 · Big Data

Using ElasticsearchSink with Apache Flink: Configuration, Retry Strategies, and Failure Handling

This article introduces the ElasticsearchSink for Apache Flink, explains how to add Maven dependencies, implement the sink with configuration and retry settings, details failure handlers, and highlights important considerations such as exception handling and checkpoint requirements for reliable streaming pipelines.

Big Data, Elasticsearch, Failure Handling
9 min read
DataFunTalk
Apr 4, 2022 · Big Data

Impala Deployment and Optimization in Sensors Data's Multi-Dimensional Analytics Platform

This article details the architecture of Sensors Data's analytics platform, the implementation of a real‑time Impala query engine, multiple query‑performance optimizations—including storage redesign, user‑behavior sequence tuning, join elimination and expression push‑down—and a resource‑estimation framework that dramatically reduces query failures and latency.

Big Data, Data Platform, Impala
16 min read
DataFunTalk
Apr 2, 2022 · Big Data

SuperSQL: A High‑Performance Cross‑Engine, Cross‑Data‑Center SQL Middleware for Big Data

The article introduces SuperSQL, a federated SQL middleware that unifies heterogeneous data sources across multiple data centers, leverages Apache Calcite for cost‑based optimization, pushes down operators to various engines, manages metadata with a Trie model, and demonstrates significant performance gains over traditional solutions.

Big Data, Cross‑Data‑Center, SQL Middleware
27 min read
DataFunTalk
Apr 1, 2022 · Operations

Integrated Digital Supply Chain: JD Logistics' Intelligent Planning, Algorithm Platform, and Digital Twin Practices

This article explores JD Logistics' integrated digital supply chain, detailing its evolution, the construction of an algorithm middle‑platform, engineering platforms, digital twin system, real‑world case studies, and future talent and ecosystem directions, illustrating how AI and big‑data technologies drive end‑to‑end logistics optimization.

AI Optimization, Algorithm Platform, Big Data
16 min read
Big Data Technology & Architecture
Mar 31, 2022 · Big Data

Bilibili’s Lakehouse Architecture: Integrating Data Lake and Warehouse with Apache Iceberg

To address the high cost and low efficiency of traditional Hadoop‑based data pipelines, Bilibili designed a lakehouse solution using Apache Iceberg, integrating Spark, Flink, Trino, and Alluxio to unify flexible data lake storage with warehouse‑level query performance, reducing data duplication and improving interactive analytics.

Big Data, Iceberg, Lakehouse
17 min read
DataFunTalk
Mar 30, 2022 · Big Data

NetEase Big Data Platform: HDFS Optimization and Practice

This article presents NetEase's big data platform architecture, detailing multi‑layer storage and compute design, HDFS deployment challenges, NameNode and NameSpace performance optimizations, cluster scaling strategies, data tiering, hardware upgrades, and real‑world business use cases, illustrating practical large‑scale big data engineering.

Big Data, Cluster Optimization, Data Management
23 min read
21CTO
Mar 30, 2022 · Big Data

What Drives Taobao App Users? Insights from AARRR and RFM Analyses

This article analyzes 2 million Taobao app user‑behavior records using AARRR funnel metrics and RFM segmentation, revealing daily and hourly usage patterns, conversion bottlenecks, product‑search mismatches, and offering data‑driven marketing recommendations to boost retention and sales.

AARRR, Big Data, RFM
25 min read
Bilibili Tech
Mar 30, 2022 · Big Data

HDFS Architecture, Optimizations, and Future Plans at Bilibili

Bilibili’s HDFS now runs a three‑tier architecture—access, metadata, and data layers—enhanced with a custom MergeFS router, observer NameNode, dynamic load balancing, fast‑failover pipelines, and storage‑aware policies, while future work targets transparent erasure coding, tiered data routing, lock refinements, and a Hadoop 3.x migration.

Big Data, Distributed File System, HDFS
22 min read
Efficient Ops
Mar 29, 2022 · Big Data

How Tencent Cloud Boosted APM Metric Computation Speed 2‑3× with Flink Optimizations

This article explains how Tencent Cloud's APM metric calculation, which transforms massive Span data into aggregated metrics using Flink, faced performance bottlenecks and was optimized through job splitting, batch merging, and dimension pruning, ultimately achieving a 2‑3× speed increase and cutting resource usage to about 30% of the original.

APM, Big Data, Flink
10 min read
DataFunTalk
Mar 29, 2022 · Big Data

FlinkX Multi-Source Heterogeneous Data Synchronization Framework: Architecture, Features, and Cloud‑Native Enhancements

This article introduces the FlinkX framework for multi‑source heterogeneous data synchronization, detailing its background, core functions such as checkpoint‑based resume, metric monitoring, rate limiting, plugin architecture, cloud‑native K8s deployment, Hudi integration, and future roadmap, while also addressing common Q&A topics.

Batch, Big Data, Data Lake
14 min read
58 Tech
Mar 29, 2022 · Big Data

Design and Implementation of the 58 Group Penalty Data Center

This article presents the design, architecture, and implementation of a unified penalty data center for 58 Group, detailing the challenges of heterogeneous data sources, the selection of Flink for real‑time ETL, the use of a DSL and LRU aggregation, and the adoption of MVEL for feature recognition to achieve standardized, high‑performance penalty data processing.

Big Data, ETL, Flink
13 min read
Big Data Technology & Architecture
Mar 28, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents JD's real-time dimension modeling case using Flink SQL, detailing two complex streaming scenarios, the difficulties of handling historical data and state management, and a component‑based solution that leverages external KV stores and optimized Flink operators to improve performance and scalability.

Big Data, Flink, Real-Time
13 min read
Architects' Tech Alliance
Mar 28, 2022 · Artificial Intelligence

Digital Twin: Ten Fundamental Questions and Insights for Researchers, Decision‑Makers, and Practitioners

This article analyzes ten fundamental questions about digital twins, covering definitions, stakeholders, global interest, relationship with smart manufacturing, integration with New IT, scientific challenges, standards, and commercial tools, aiming to guide researchers, policymakers, and practitioners in understanding and applying digital twin technology.

AI, Big Data, Digital Twin
22 min read
Bilibili Tech
Mar 25, 2022 · Big Data

Bilibili's YARN Scheduling Optimization Practice: From Heartbeat-Driven to Global Scheduling

Bilibili transformed its YARN CapacityScheduler from a heartbeat‑driven design into a multi‑threaded global scheduler by separating lock handling, adopting Weighted Round‑Robin with DRF, adding batch node selection, fixing proposal inconsistencies, and tuning GC and logging. The changes reduced application allocation time by about 38% on clusters of up to 8,000 nodes.

Big Data, CapacityScheduler, Hadoop
15 min read