Tagged articles

3675 articles

Jun 15, 2022 · Big Data

Replacing Classic Data Warehouse with a One‑Layer Wide Table Model: Architecture, Benefits, and Challenges

The article proposes replacing the traditional multi‑layered data‑warehouse architecture (ODS‑DWD‑DWS‑ADS) with a single column‑store wide table per business theme, achieving roughly 30% storage savings and faster queries, while acknowledging higher ETL complexity, back‑tracking costs, and production timing challenges.

Big Data · ETL · Parquet

0 likes · 11 min read

Big Data Technology Architecture

Jun 15, 2022 · Big Data

Building Real-Time Data Warehouses with Flink CDC and StarRocks: Architecture, Challenges, and Solutions

This article explains how to construct a real‑time data warehouse by combining Flink CDC for end‑to‑end change data capture with StarRocks' high‑performance OLAP engine, detailing the architectural challenges, optimization techniques, and a practical e‑commerce case study.

Big Data · Data Warehousing · Flink CDC

0 likes · 16 min read

dbaplus Community

Jun 14, 2022 · Big Data

How Qunar Built a Scalable BI Platform for Real‑Time Analytics and Self‑Service Reporting

This article details Qunar's multi‑year journey of designing and evolving a full‑stack BI platform—covering data ingestion, storage, query engines, self‑service analytics, and real‑time OLAP—by iterating through three development phases, selecting technologies such as Impala, Kudu, ClickHouse and Apache Druid, and addressing performance, usability and governance challenges to empower business users with fast, reliable data insights.

Apache Druid · BI · Big Data

0 likes · 24 min read

Alibaba Cloud Developer

Jun 14, 2022 · Big Data

Can a Streaming Data Warehouse Balance Freshness, Latency, and Cost?

This article examines the core trade‑offs of data warehouses—freshness, query latency, and cost—compares offline and real‑time architectures, introduces the concept of a streaming data warehouse, and details how Apache Flink Table Store aims to provide a unified, low‑cost solution.

Big Data · Flink · Streaming

0 likes · 19 min read

AntTech

Jun 14, 2022 · Big Data

Insights on Graph Computing: Technology, Applications, and Future Directions

Professor Chen Wenguang discusses how graph computing—originating from graph theory—offers a powerful way to model relationships across industries, its rapid development in China, challenges in scaling, integration with AI via graph neural networks, and the collaborative efforts needed between academia and industry to advance the field.

AI · Big Data · Graph Processing

0 likes · 17 min read

DeWu Technology

Jun 13, 2022 · Operations

How to Build a Minute‑Level Order Fulfillment Simulation Platform with DataWorks

This article outlines the design and implementation of a minute‑level order‑fulfillment timeliness simulation platform, detailing its background, objectives, challenges, architecture built on Alibaba Cloud DataWorks, core workflow nodes, domain model, ER diagram, JSON task templates, and future extensions for supply‑chain routing.

Big Data · DataWorks · architecture

0 likes · 11 min read

NetEase Game Operations Platform

Jun 10, 2022 · Databases

Apache Doris Deployment and Optimization at NetEase Interactive Entertainment

This article details NetEase Interactive Entertainment's adoption of Apache Doris for large‑scale game data analytics, covering background, Doris architecture, cluster governance, tablet and compaction tuning, scaling strategies, monitoring, alerting, and fault‑handling practices to improve performance and stability.

Apache Doris · Big Data · Cluster Management

0 likes · 22 min read

Alibaba Terminal Technology

Jun 10, 2022 · Big Data

Build a Raspberry Pi Temp‑Humidity Dashboard Using Alibaba Cloud Big Data

This guide walks through creating a home temperature‑humidity monitoring system with a Raspberry Pi and DHT11 sensor, collecting data via JavaScript, sending it to Alibaba Cloud SLS, processing it through DataWorks and MaxCompute, and visualizing the results in QuickBI dashboards.

Alibaba Cloud · Big Data · DHT11

0 likes · 11 min read

Big Data Technology & Architecture

Jun 8, 2022 · Databases

Understanding ClickHouse Block and LSM: Batch Processing, Pre‑sorting, and Compression

The article explains how ClickHouse uses block‑based batch processing combined with LSM‑style pre‑sorting and columnar compression to accelerate range queries on massive datasets, while also discussing the trade‑offs such as write latency and limitations for transactional workloads.

Big Data · Block · ClickHouse

0 likes · 14 min read

DataFunTalk

Jun 6, 2022 · Big Data

Understanding Flink's Exactly-Once Guarantees: Checkpoint, Two‑Phase Commit, and Kafka Integration

This article explains how Apache Flink achieves end‑to‑end exactly‑once semantics by using source replay support, checkpoint‑based snapshots, asynchronous incremental checkpoints, and two‑phase commit sinks, and describes the interaction with external systems such as Kafka to ensure transactional writes.

Big Data · Checkpoint · Exactly-Once

0 likes · 7 min read

IT Architects Alliance

Jun 5, 2022 · Big Data

Understanding the Data Middle Platform: Definition, Benefits, and Implementation Principles

The article explains what a data middle platform is, why enterprises need it for big‑data‑driven decision making, AI, and operational efficiency, and outlines four key reasons and practical principles for building and evolving such a platform within modern digital organizations.

Artificial Intelligence · Big Data · Data Platform

0 likes · 9 min read

DataFunTalk

Jun 5, 2022 · Big Data

JD Big Data Platform: Cross‑Region and Tiered Storage Architecture and Practices

This article presents JD's large‑scale big‑data platform, detailing its overall architecture, the challenges of cross‑region storage, the design of a unified cross‑domain data synchronization mechanism, and the implementation of tiered storage to improve performance, cost efficiency, and data reliability across multi‑datacenter clusters.

Big Data · Data Platform · HDFS

0 likes · 15 min read

DataFunSummit

Jun 3, 2022 · Big Data

Building and Optimizing JD Retail OLAP Platform: Architecture, Management, and Performance Techniques

This article details JD Retail's OLAP platform construction, covering control plane design, architecture, business and operation management, real‑time data updates, materialized view usage, join optimizations, high‑concurrency and high‑throughput scenarios, and promotional preparation strategies, illustrated with diagrams and performance metrics.

Big Data · ClickHouse · Distributed Systems

0 likes · 20 min read

IT Architects Alliance

Jun 1, 2022 · Big Data

Kafka Core Concepts: Producers, Consumers, Topics, Partitions, and Architecture

This article explains the fundamental concepts of Apache Kafka, covering its role as a streaming platform, the producer‑consumer model, how topics and partitions work, consumer groups for load balancing, message ordering, replication with leaders and followers, and the coordination role of ZooKeeper.

Big Data · Consumer · Kafka

0 likes · 5 min read

vivo Internet Technology

May 31, 2022 · Big Data

Kafka Load Balancing and Cruise Control: Concepts, Manual Migration, and Deployment

Static replica placement on broker disks leaves Kafka clusters with server‑side load imbalance, and manual replica migration is infeasible at scale; Cruise Control automates metric collection, analysis, and execution of fine‑grained rebalance plans, including broker decommissioning and leader dispersion, allowing large clusters to expand and operate efficiently.

Big Data · Cluster Management · Cruise Control

0 likes · 21 min read

Tencent Cloud Developer

May 31, 2022 · Artificial Intelligence

Scalable Graph Neural Architecture Search System (PaSca) – WWW 2022 Best Student Paper

PaSca, a scalable graph neural architecture search system that separates message aggregation from updates, explores over 150,000 GNN designs with multi‑objective optimization, delivers models that outperform traditional GNNs in accuracy, memory and speed, has been open‑sourced and deployed at Tencent for risk control, recommendation and fraud detection, and earned the WWW 2022 Best Student Paper award.

Big Data · Neural Architecture Search · Scalable Systems

0 likes · 11 min read

Big Data Technology & Architecture

May 31, 2022 · Databases

Vectorization and Roaring Bitmap Techniques in Database Query Execution

This article explains how classic SQL execution engines use the volcano model and expression trees, discusses their performance drawbacks, introduces vectorized execution to reduce overhead, and describes Roaring Bitmap compression methods with container types for efficient storage and processing of integer sets.

Big Data · Database Engine · Operator Tree

0 likes · 10 min read

Bilibili Tech

May 31, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Operational Practices

Bilibili migrated its massive offline platform from Hive to Spark using automated SQL rewriting and dual‑run verification, cutting execution time by over 40% and resource use by 30%, while introducing small‑file merging, shuffle stability, runtime filters, data‑skipping, lineage tracking, auto‑parameter tuning, and metastore federation for robust large‑scale processing.

Big Data · Spark · data engineering

0 likes · 30 min read

DataFunSummit

May 30, 2022 · Big Data

Lakehouse Architecture at Bilibili: Query Acceleration and Index Enhancement Practices

This article explains Bilibili's lakehouse architecture, describing how Iceberg, Z‑Order sorting, and advanced indexing techniques such as Bloom filter and bitmap indexes are used to accelerate queries and improve data organization in large‑scale analytics workloads.

Big Data · Data Lake · Iceberg

0 likes · 18 min read

Volcano Engine Developer Services

May 30, 2022 · Databases

How ByteDance Scaled to 10 EB: Evolution of Its Cloud‑Native Database Architecture

This article chronicles ByteDance's journey from early MySQL‑based databases to a sophisticated, cloud‑native, distributed database platform that now supports over 10 EB of storage, detailing the challenges, architectural milestones, and future directions of its database infrastructure.

Big Data · Distributed Systems · Scalability

0 likes · 17 min read

Big Data Technology & Architecture

May 30, 2022 · Big Data

Doris Architecture, Principles, and Key Features Overview

This article provides a comprehensive overview of Doris's architecture—including its FE and BE components, metadata management, data organization, execution planning—and details its major features such as adaptive join aggregation, vectorized execution, materialized views, and Elasticsearch integration, supplemented with example DDL and query code.

Big Data · Database Architecture · Elasticsearch

0 likes · 7 min read

Architect's Tech Stack

May 28, 2022 · Big Data

Data Lake Challenges and the Open SPL Computing Engine

The article examines the inherent trade‑offs of data lakes—maintaining raw data, enabling efficient computation, and keeping costs low—explains why traditional data‑warehouse approaches fall short, and introduces the open‑source SPL engine that provides multi‑source, file‑based, high‑performance analytics to overcome these limitations.

Big Data · Data Lake · ETL

0 likes · 12 min read

Java High-Performance Architecture

May 26, 2022 · Big Data

Processing 10 GB Age Data on a 4 GB PC: Java Multithreaded Solution

This article demonstrates how to generate, read, and analyze a 10 GB file containing age statistics on a machine with only 4 GB RAM, comparing single‑threaded and multithreaded Java implementations, measuring performance, memory usage, and addressing GC and concurrency challenges.

Big Data · File I/O · java

0 likes · 21 min read

IT Architects Alliance

May 25, 2022 · Big Data

Processing 10GB Age Data on a 4GB PC: Single‑Thread vs Multi‑Thread Solutions

This article walks through generating a 10GB file of age data, reading it line‑by‑line on a machine with only 4GB RAM, and compares a single‑thread counting approach with a multithreaded producer‑consumer design, showing performance gains, memory usage, and practical tips.

Big Data · File I/O · Performance

0 likes · 18 min read

vivo Internet Technology

May 25, 2022 · Big Data

Understanding Druid Metadata Management and Architecture

Apache Druid manages metadata through a layered, distributed system where the Overlord coordinates ingestion tasks, MiddleManagers launch Peons to create segments, Coordinators and Historical nodes store and serve segment data, and Brokers route queries, while MySQL, ZooKeeper, memory, and local files synchronize metadata for fault‑tolerant, high‑performance OLAP analytics.

Big Data · Druid · Metadata Management

0 likes · 19 min read

Architect

May 25, 2022 · Big Data

Metadata Infrastructure and Governance in Bilibili's Data Platform

The article details how Bilibili built a unified metadata infrastructure—including a URN‑based model, collection pipelines, quality assurance, storage in TiDB/ES/HugeGraph, and query services—to support data discovery, lineage, impact analysis, and governance across its growing data platform.

Big Data · Data Catalog · Data Governance

0 likes · 21 min read

DataFunTalk

May 24, 2022 · Big Data

Integrating Apache Flink with Apache Hudi: From Data Warehouse to Data Lake

This article explains how Apache Flink integrates with Apache Hudi to enable real‑time data lake ingestion, covering the evolution from traditional data warehouses to data lakes, Hudi’s core concepts such as timeline and file grouping, copy‑on‑write vs merge‑on‑read modes, and Flink’s CDC‑based ETL pipeline.

Big Data · CDC · Data Lake

0 likes · 18 min read

21CTO

May 23, 2022 · Big Data

What Walmart’s Beer‑and‑Diaper Insight Reveals About Big Data and Statistics

An amusing Walmart story about beer and diapers illustrates how big‑data analysis uncovers hidden consumer patterns, leading to targeted promotions, while the article expands on why statistics remains essential in the data‑science era, the challenges of learning it, and recommends a comprehensive R‑based statistics guide.

Big Data · Learning Resources · R language

0 likes · 6 min read

Baidu Geek Talk

May 23, 2022 · Industry Insights

How Baidu Scales Real-Time Content Safety for Millions of Mini‑Programs

This article explains Baidu's evolving inspection scheduling system for its smart mini‑programs, detailing the challenges of massive page volumes, the V1.0 offline architecture, the V2.0 real‑time enhancements, resource constraints, deduplication logic, and the measurable improvements in risk detection and ecosystem health.

Big Data · Cloud Computing · Content Safety

0 likes · 17 min read

DataFunTalk

May 21, 2022 · Big Data

Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

This talk presents Xiaomi's design and deployment of an elastic scheduling system for Hadoop YARN, covering background analysis, resource‑pool strategy, auto‑scaling architecture, stability challenges, label‑based resource isolation, Spark shuffle handling, cost‑saving results and future plans.

Big Data · Hadoop · Resource Management

0 likes · 16 min read

DataFunTalk

May 19, 2022 · Big Data

SeaTunnel: Distributed Data Integration Platform and Its Application in Traffic Management

This article introduces Apache SeaTunnel, a distributed, high‑performance data integration platform built on Spark and Flink, outlines its technical features, workflow, and plugin ecosystem, and details a concrete traffic‑management use case involving incremental Oracle‑to‑warehouse data synchronization with Spark resources and scheduled shell scripts.

Apache Flink · Apache Spark · Big Data

0 likes · 12 min read

IT Architects Alliance

May 19, 2022 · Big Data

How Apache Kylin Enables Sub‑Second OLAP on Massive Data Sets

This article explains how Apache Kylin uses pre‑computed OLAP cubes on Hadoop/Spark/Flink to deliver sub‑second query responses on massive datasets, covering its architecture, integration with BI platforms, user security, cube building, monitoring, and HBase‑based storage, and showing how it overcomes big‑data analytical challenges.

Apache Kylin · Big Data · HBase

0 likes · 12 min read

Big Data Technology & Architecture

May 18, 2022 · Databases

Understanding ClickHouse Distributed JOIN Implementation and Best Practices

This article explains ClickHouse's single‑node and distributed JOIN mechanisms, compares ordinary, GLOBAL, Broadcast, Shuffle and Colocate JOINs, illustrates execution flows with code examples, and provides practical recommendations to reduce join size, avoid query amplification, and leverage data pre‑distribution for optimal performance.

Big Data · ClickHouse · Performance

0 likes · 10 min read

Alibaba Cloud Developer

May 18, 2022 · Big Data

Why Delta Lake Is Revolutionizing Data Lakes with ACID Guarantees

This article explains how Delta Lake adds reliability to data lakes by offering ACID transactions, scalable metadata, and unified batch‑and‑stream processing, outlines the challenges it solves, details its implementation principles, and demonstrates a practical demo for building an integrated data warehouse.

ACID · Big Data · Data Lake

0 likes · 9 min read

DataFunTalk

May 18, 2022 · Big Data

Building and Optimizing JD Retail OLAP Platform: Architecture, Real‑time Updates, Materialized Views, and Join Optimization

This article presents JD Retail's OLAP platform construction and practical scenarios, covering control‑plane design, architecture, business management, operational safeguards, real‑time data updates, materialized view acceleration, join optimization techniques, high‑concurrency queries, and large‑scale write throughput for e‑commerce peak periods.

Big Data · ClickHouse · OLAP

0 likes · 21 min read

DataFunSummit

May 17, 2022 · Information Security

Data Security Governance Practices and Frameworks: A Comprehensive Overview

This article presents a detailed overview of data security governance in China, covering policy milestones, major security incidents, current challenges, a three‑layer governance model, practical workflow steps, classification methods, emerging zero‑trust concepts, and real‑world case studies, offering actionable insights for organizations seeking robust data protection.

Big Data · Zero Trust · data security

0 likes · 11 min read

Big Data Technology & Architecture

May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache Hudi · Big Data · Copy-on-Write

0 likes · 43 min read

DataFunTalk

May 17, 2022 · Big Data

Exploring JuiceFS in Data Lake Storage Architecture

This presentation provides a comprehensive overview of JuiceFS, an open‑source cloud‑native distributed file system, detailing its role in modern data lake and lakehouse architectures, comparing it with HDFS and object storage, and highlighting its performance, integration, and community ecosystem.

Big Data · Data Lake · Distributed File System

0 likes · 19 min read

DataFunSummit

May 15, 2022 · Databases

Design and Evolution of a Custom Storage Engine for IoT Device Metadata

This article presents a detailed case study of an IoT device metadata management platform, describing the business scenario, the evolution from a single‑node MySQL solution through sharded MySQL, HBase and Elasticsearch, to a self‑developed distributed storage engine that separates compute and storage, supports LSM, multi‑dimensional indexing, routing keys, and parallel scans to meet massive write‑read throughput and complex query requirements.

Big Data · Distributed Systems · IoT

0 likes · 14 min read

Big Data Technology & Architecture

May 15, 2022 · Big Data

Understanding Flink Window Table-Valued Functions (TVF) and Incremental Optimization

This article explains the concept of window table-valued functions in Flink, compares the old grouped‑window syntax with the new TVF syntax, details the physical and execution plans, introduces sliced windows for state reduction, and presents a small incremental‑output improvement with code examples.

Big Data · Flink · Incremental Aggregation

0 likes · 12 min read

DataFunSummit

May 14, 2022 · Databases

Design of Cloud‑Native ClickHouse: Architecture, Storage‑Compute Separation, and MPP Query Layer

This article presents the cloud‑native redesign of ClickHouse, covering its current technical limitations in storage and computation, the proposed storage‑compute separation with DDL task management, multi‑replica and CommitLog mechanisms, and a new MPP query layer to meet future data‑warehouse demands such as real‑time analytics, flexibility, high throughput, low cost, and support for semi‑structured data.

Big Data · ClickHouse · Cloud Native

0 likes · 15 min read

DaTaobao Tech

May 13, 2022 · Big Data

Taobao Big Data Model Governance and DataWorks Co‑development

Taobao’s rapidly expanding technical data system faced naming inconsistencies, low table reuse, and costly, inefficient data usage, prompting a joint effort with DataWorks to digitize model evaluation, enforce standardized governance, deliver intelligent end‑to‑end modeling tools, and launch a development assistant, resulting in a health‑monitoring dashboard, upgraded data maps, and a roadmap for further automation and architecture refinement.

Big Data · Data Governance · Data Platform

0 likes · 12 min read

dbaplus Community

May 12, 2022 · Big Data

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

This article details Bilibili's end‑to‑end Presto on Hadoop architecture, covering the multi‑engine SQL stack, dispatcher routing, cluster scale, stability enhancements such as coordinator HA and real‑time query penalization, query limits, Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, RaptorX local cache, JDK upgrades, dynamic filtering, and the future roadmap, and illustrates how these changes boosted query throughput and reduced latency.

Big Data · Cluster Management · Distributed Systems

0 likes · 32 min read

dbaplus Community

May 11, 2022 · Big Data

How JD Logistics Tackled Billion-Scale Data Challenges with Doris

This article details JD Logistics' journey from fragmented, massive‑scale data to a unified, real‑time analytics platform, covering business needs, pain points, tool evaluation, a new Doris‑based architecture, table management, data import procedures, automation scripts, and future roadmap for data engineering.

BI Tools · Big Data · data-warehouse

0 likes · 16 min read

vivo Internet Technology

May 11, 2022 · Big Data

How We Rolled Out a Massive HDFS 2.6→3.1 Upgrade on a 10,000‑Node Cluster

This article details the end‑to‑end process of migrating a 10,000‑node offline data‑warehouse from CDH 5.14.4 (HDFS 2.6.0) to HDP 3.1.4 (HDFS 3.1.1), covering version selection, rolling‑upgrade strategy, incompatibility fixes, client handling, tool coexistence, testing, automation, and lessons learned.

Big Data · Cluster Migration · HDFS

0 likes · 25 min read

Baidu Geek Talk

May 9, 2022 · Big Data

How a Spark Offline Framework Boosts Data Backtracking Efficiency

This article introduces a Spark offline development framework that separates configuration from code, supports SQL and Java applications, and provides fast, automated data backtracking with reduced environment preparation time, lower failure rates, and significant performance gains for large‑scale data warehouses.

Big Data · Data Backtracking · Offline Framework

0 likes · 17 min read

StarRocks

May 7, 2022 · Databases

How 360 Built a Lightning‑Fast Unified Analytics Platform with StarRocks

Facing massive data storage and query challenges, 360 upgraded its analytics architecture by adopting StarRocks, achieving multi‑dimensional, high‑concurrency analysis, simplified data pipelines, and significant performance and cost improvements across its radar and user‑portrait platforms.

Analytics · Big Data · OLAP

0 likes · 10 min read

58 Tech

May 5, 2022 · Big Data

Low-Code Real-Time Data Warehouse Construction System Using Flink

This article describes a low‑code, Flink‑based real‑time data‑warehouse construction system that abstracts the warehouse building process into ODS, DWD, DWS, and ADS layers, leverages a domain‑specific language and plugin engine to reduce development effort, and details its architecture, DSL design, plugin extensibility, dimension‑table completion, stream merging, aggregation, and storage strategies.

Big Data · DSL · Flink

0 likes · 11 min read

Big Data Technology & Architecture

May 4, 2022 · Big Data

Apache Hudi 0.11.0 Release Highlights: Multi‑Mode Index, Data Skipping, Async Index, Spark & Flink Integration, and New Utilities

The Apache Hudi 0.11.0 release introduces multi‑mode metadata indexing, enhanced data‑skipping, asynchronous indexing, extensive Spark and Flink integration improvements, new bundle utilities, and expanded metadata synchronization with BigQuery, AWS Glue, and DataHub, while also adding bucket indexing and encryption support.

Apache Hudi · Async Index · Big Data

0 likes · 13 min read

DataFunSummit

Apr 29, 2022 · Big Data

Optimizing Query Performance in Apache Iceberg with Z‑Order Data Organization

This article explains how Apache Iceberg’s DataSkipping technique can lose efficiency when many filter columns are used, and presents a data‑organization optimization using space‑filling curves and Z‑Order to improve query I/O, details the OPTIMIZE implementation, and shares performance benchmark results and future plans.

Apache Iceberg · Big Data · Data Skipping

0 likes · 12 min read

ITPUB

Apr 27, 2022 · Databases

Mastering Data Warehouse Standards: Architecture, Layer Design, and Naming Conventions

This comprehensive guide explains data‑warehouse construction standards, covering model architecture principles, public development rules, layer‑by‑layer design specifications, and systematic naming conventions for tables, dimensions, and metrics to ensure consistency, scalability, and reliable data governance.

Big Data · Database Standards · ETL

0 likes · 26 min read

Big Data Technology & Architecture

Apr 27, 2022 · Big Data

Understanding Window Table-Valued Functions (TVF) in Flink and Their Optimizations

This article explains Flink's window table-valued functions (TVF), shows how they replace the old grouped‑window syntax with concrete SQL examples, describes the physical planning rules, introduces sliced windows for state efficiency, and presents a small incremental‑output improvement for cumulative windows.

Big Data · Flink · Streaming

0 likes · 11 min read

Big Data Technology & Architecture

Apr 26, 2022 · Big Data

ByteDance's Internal Presto OLAP Engine: Deployment, Performance Boosts, and Operational Practices

The article details ByteDance's large‑scale deployment of the Presto OLAP engine for ad‑hoc, BI, and near‑real‑time analytics, describing its architecture, multi‑coordinator high‑availability design, routing gateway, adaptive cancel, history server, materialized‑view support, Hudi connector integration, and how these innovations improve performance, stability, and operational efficiency.

Big Data · Hudi Connector · Materialized Views

0 likes · 11 min read

Architects Research Society

Apr 25, 2022 · Artificial Intelligence

Reflecting on a Decade of Data Science and Implications for Future Visualization Tools

The article reviews a decade of growth in data science, defines its multidisciplinary nature, outlines its four high‑level and fourteen low‑level processes, describes nine distinct data‑science roles, and discusses how these insights can guide the design of next‑generation data‑visualization and analysis tools.

Big Data · Data Science · Roles

0 likes · 10 min read

DataFunTalk

Apr 25, 2022 · Big Data

Comprehensive Guide to Flink Deployment, State Programming, Checkpointing, and Performance Tuning

This article provides an extensive overview of Apache Flink, covering deployment modes, cluster sizing, job submission workflows, state programming concepts, checkpoint mechanisms, backpressure handling, comparison with Spark, and practical code snippets for configuration and optimization.

Big Data · Checkpoint · Flink

0 likes · 48 min read

Bilibili Tech

Apr 25, 2022 · Big Data

Optimizing Full Partition Tables with Zipper Tables, Hudi+Flink CDC, and Data Warehouse Strategies

Facing server‑hardware constraints, Bilibili’s data platform replaced wasteful full‑partition tables with a zipper‑table approach—preserving change history while cutting storage from petabytes to terabytes—and complemented it with Hudi + Flink CDC for near‑real‑time updates, dramatically lowering I/O, compute usage and latency.

Big Data · Flink CDC · Hudi

0 likes · 11 min read
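
The zipper‑table idea in this summary can be illustrated with a minimal sketch. It is not Bilibili's implementation: the `ZipperRow` class, the `apply_snapshot` and `as_of` helpers, the open‑ended date `9999-12-31`, and the convention that `end_date` is exclusive are all assumptions made for illustration, and deleted keys are not handled.

```python
from dataclasses import dataclass
from typing import Dict, List

OPEN_END = "9999-12-31"  # assumed marker for "this version is still current"


@dataclass
class ZipperRow:
    key: str
    value: str
    start_date: str            # first day this version is valid (inclusive)
    end_date: str = OPEN_END   # day the version stopped being valid (exclusive)


def apply_snapshot(history: List[ZipperRow],
                   snapshot: Dict[str, str],
                   day: str) -> List[ZipperRow]:
    """Merge one day's full snapshot into the zipper history."""
    open_rows = {row.key: row for row in history if row.end_date == OPEN_END}
    for key, value in snapshot.items():
        current = open_rows.get(key)
        if current is not None and current.value == value:
            continue                # unchanged: no new row is stored
        if current is not None:
            current.end_date = day  # close the previous version
        history.append(ZipperRow(key, value, day))
    return history


def as_of(history: List[ZipperRow], day: str) -> Dict[str, str]:
    """Rebuild the full snapshot for a given day (ISO dates compare as strings)."""
    return {row.key: row.value
            for row in history
            if row.start_date <= day < row.end_date}
```

A full‑partition table would store every key again each day; here an unchanged key adds no rows, while `as_of` can still reconstruct any day's view.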

Top Architect

Apr 23, 2022 · Big Data

Ensuring No Duplicate and No Loss in Baidu Log Middle Platform: Architecture, Challenges, and Solutions

This article explains the design, implementation, and future plans of Baidu's log middle platform, detailing its lifecycle management, service architecture, data reliability goals of eliminating duplication and loss, and the technical measures taken across SDKs, servers, and streaming pipelines to achieve near‑100% data integrity.

Backend Architecture · Big Data · Data Reliability

0 likes · 15 min read

Snowball Engineer Team

Apr 21, 2022 · Big Data

Migrating from Hive3 on Tez to Spark SQL: Practices, Challenges, and Performance Evaluation

This article details the Snowball data team's migration from Hive3 on Tez to Spark SQL, covering the motivations, comparative performance tests, encountered compatibility issues, configuration work‑arounds, and future plans for consolidating ETL workloads on Spark.

Big Data · ETL · Performance

0 likes · 13 min read

DataFunSummit

Apr 20, 2022 · Big Data

SuperSQL: A Cross‑Engine, Cross‑DC High‑Performance Big Data SQL Middleware

The article presents SuperSQL, a high‑performance big‑data SQL middleware that enables cross‑engine and cross‑data‑center query processing, detailing its architecture, metadata management, cost‑based optimization, operator push‑down, distributed execution, performance benchmarks, and future roadmap within modern data‑intensive environments.

Big Data · Cross-DC · SQL Middleware

0 likes · 24 min read

Big Data Technology & Architecture

Apr 20, 2022 · Big Data

Fine‑Grained Resource Management in Apache Flink: Scenarios, Mechanism, Efficiency, Allocation Strategies, and Limitations

This article explains Apache Flink's fine‑grained resource management, describing typical use cases, the slot‑based mechanism, how it improves resource efficiency, the default allocation strategy, current limitations, and provides example code for configuring slot sharing groups.

Apache Flink · Big Data · Fine-Grained Resource Management

0 likes · 12 min read

Huawei Cloud Developer Alliance

Apr 19, 2022 · Cloud Native

How Volcano Is Shaping Cloud‑Native Batch Computing in CNCF

Volcano, the first cloud‑native batch computing project contributed by Huawei Cloud, has been promoted to a CNCF incubating project, signaling broad industry recognition and outlining a roadmap that includes cross‑cloud scheduling, GPU virtualization, fine‑grained resource management, and workflow orchestration for AI and big‑data workloads.

AI · Batch Computing · Big Data

0 likes · 8 min read

ITPUB

Apr 19, 2022 · Big Data

Which Real-Time Data Warehouse Architecture Fits Your Needs? A Deep Dive

This article explains why modern enterprises need real‑time data‑warehouse architectures, breaks down traditional layered warehouse concepts, compares Lambda and Kappa models, evaluates five practical real‑time solutions—including Iceberg‑based lakehouse and MPP databases—provides code snippets, and offers selection guidance with real‑world company examples.

Big Data · Flink · Iceberg

0 likes · 19 min read

Big Data Technology & Architecture

Apr 19, 2022 · Big Data

Understanding Flink Checkpoint and Unaligned Checkpoint Mechanisms

This article explains Flink's fundamental checkpoint mechanism, its coupling with backpressure, and how the introduction of Unaligned Checkpoint in Flink 1.11 decouples checkpointing from backpressure to improve latency and resource utilization in high‑backpressure streaming jobs.

Big Data · Checkpoint · Flink

0 likes · 14 min read

Architect

Apr 18, 2022 · Big Data

Ensuring Data Accuracy and Reliability in Baidu's Log Middle Platform

This article describes Baidu's log middle platform architecture, its data lifecycle management, integration status, terminology, service overview, core challenges of ensuring data accuracy, and the implemented optimizations for persistent storage, service decomposition, and SDK reporting to achieve near‑100% no‑repeat no‑loss reliability.

Backend Architecture · Big Data · Data Reliability

0 likes · 15 min read

DeWu Technology

Apr 18, 2022 · Artificial Intelligence

Warehouse Storage Location Recommendation: Architecture, Recall, and Ranking Strategies

The article outlines DeWu’s warehouse‑management recommendation system, which combines an online‑near‑line‑offline architecture to quickly recall viable shelf slots and rank them by space utilization, travel time, and sales potential, enabling automated, constraint‑aware placement that cuts picking time and inventory costs.

AI · Big Data · Storage Optimization

0 likes · 16 min read

Big Data Technology & Architecture

Apr 18, 2022 · Big Data

Overview of Meituan Merchant Version Data Metrics System

This article provides a comprehensive overview of Meituan's merchant platform data metric system, detailing its five main modules—overview, business, traffic, customer, and product—along with competitor analysis and actionable insights for merchants to improve operations and growth.

Big Data · Data Analytics · Meituan

0 likes · 11 min read

JavaEdge

Apr 17, 2022 · Big Data

Why Spark Overtook MapReduce: Core Advantages and RDD Programming Model

The article explains how Spark, developed by UC Berkeley's AMP Lab, quickly surpassed MapReduce by offering faster execution, a simpler Scala‑based programming model, lazy RDD transformations, a rich ecosystem including SQL, Streaming, MLlib and GraphX, and practical code examples such as a three‑line WordCount.

Big Data · MapReduce · RDD

0 likes · 7 min read

Big Data Technology & Architecture

Apr 15, 2022 · Big Data

Configuring Flink SQL Client with Iceberg: Catalogs, DDL, Data Insertion and Query

This guide explains how to set up the Flink SQL client to work with Apache Iceberg, covering Scala version requirements, downloading and deploying Iceberg jars, configuring Hive and HDFS catalogs, creating databases and tables, performing insert and overwrite operations, and querying data in both batch and streaming modes.

Big Data · Catalog · Flink

0 likes · 18 min read

ByteDance Data Platform

Apr 15, 2022 · Cloud Native

How ByteHouse Evolved From ClickHouse Into a Next‑Gen Cloud‑Native Data Warehouse

ByteHouse, born from ByteDance’s extensive use of ClickHouse, transformed a high‑performance OLAP engine into a cloud‑native, scalable data warehouse by addressing scalability, elasticity, high availability, and multi‑tenant challenges through architectural redesign, custom storage layers, and advanced metadata management.

Big Data · ByteHouse · ClickHouse

0 likes · 19 min read

Big Data Technology & Architecture

Apr 14, 2022 · Big Data

Practical Guide to Monitoring Flink Performance, Detecting Backpressure, and Configuring Alerts

This article explains how to use Flink's Web UI, Kafka metrics, and YARN monitoring to observe performance, diagnose backpressure, and set alert thresholds, providing code examples and practical tips for reliable stream processing in production environments.

Big Data · Flink · Kafka

0 likes · 9 min read

vivo Internet Technology

Apr 13, 2022 · Big Data

Understanding Join Algorithms in Presto: Theory, Implementation, and Engineering Practices

The article explains Presto’s join processing by detailing the business need to limit multi‑table joins, then describing nested‑loop, sort‑merge, and hash join algorithms with Java examples, and finally showing how the Volcano model, columnar pages, and planner integration enable scalable, efficient OLAP join execution.

Big Data · Hash Join · Join Algorithms

0 likes · 17 min read

Zuoyebang Tech Team

Apr 13, 2022 · Big Data

How Delta Lake Transformed Our Offline Data Warehouse Performance

This article details how Zuoyebang's engineering team migrated their Hive‑based offline data warehouse to Delta Lake, tackling latency, scalability, and query‑performance challenges through stream‑to‑batch processing, data‑lake architecture, and optimizations like DPP and Z‑ordering.

Big Data · Delta Lake · Presto

0 likes · 15 min read

Cloud Native Technology Community

Apr 13, 2022 · Big Data

Introduction to ClickHouse: Features, Architecture, Installation, Data Types, and Cluster Deployment

This article provides a comprehensive overview of ClickHouse, an open‑source column‑oriented MPP analytical database, covering its advantages and drawbacks, key features, typical use cases, data access flow, installation steps, core directories, indexes, data types, database and table engines, as well as detailed cluster architecture and deployment patterns.

Big Data · ClickHouse · Cluster

0 likes · 29 min read

StarRocks

Apr 13, 2022 · Big Data

How StarRocks Achieves Lightning‑Fast Data Lake Analytics

This article explains StarRocks' streamlined architecture, cost‑based optimizer, massively parallel processing and vectorized engine, and how they enable high‑performance queries over data stored in Hive, Iceberg, Hudi and other lake formats, backed by benchmark results and future roadmap details.

Big Data · CBO · Data Lake

0 likes · 19 min read

DataFunTalk

Apr 13, 2022 · Databases

Adopting StarRocks for Real‑Time Analytics in ZhongAn’s JiZhi Platform: A Performance Comparison with ClickHouse

This article describes how ZhongAn Insurance’s JiZhi data‑analysis platform migrated from ClickHouse to the MPP OLAP engine StarRocks, detailing the business requirements, architectural challenges, benchmark results across single‑table and multi‑table queries, and the resulting improvements in latency, concurrency, and operational simplicity for real‑time analytics.

Big Data · ClickHouse · OLAP

0 likes · 14 min read

IT Services Circle

Apr 12, 2022 · Big Data

Finding Missing Unsigned Integers in a 4‑Billion‑Element File Using Interval Counting and Bitmap Technique

The article explains how to locate all missing 32‑bit unsigned integers in a 4 billion‑entry file by first partitioning the range into 64 intervals, counting entries per interval with a small int[64] array, and then applying a bitmap only to the under‑filled intervals, so the counting pass needs just a few hundred bytes and each bitmap covers a single interval rather than the full 32‑bit range.

Big Data · Memory Optimization · algorithm

0 likes · 5 min read
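
The two‑pass approach described in this summary might look roughly like the sketch below. It is an illustration under stated assumptions, not the article's code: `find_missing` and the `read_values` callback (which re‑opens the input for each scan) are hypothetical names, the input is assumed to contain no duplicates (otherwise a per‑interval count cannot prove an interval is complete), and `bits` is exposed only so the idea can be tried on a small range.

```python
from typing import Callable, Iterable, List


def find_missing(read_values: Callable[[], Iterable[int]],
                 bits: int = 32,
                 num_intervals: int = 64) -> List[int]:
    """Return every integer in [0, 2**bits) that never appears in the input.

    Assumes duplicate-free input and that 2**bits is divisible by num_intervals.
    """
    width = (1 << bits) // num_intervals   # how many values one interval covers

    # Pass 1: one small counter per interval.
    counts = [0] * num_intervals
    for value in read_values():
        counts[value // width] += 1

    missing: List[int] = []
    for index, count in enumerate(counts):
        if count == width:
            continue                        # interval is full, nothing missing
        low = index * width
        # Pass 2: a bitmap that covers only this under-filled interval.
        bitmap = bytearray((width + 7) // 8)
        for value in read_values():
            if low <= value < low + width:
                offset = value - low
                bitmap[offset >> 3] |= 1 << (offset & 7)
        for offset in range(width):
            if not bitmap[offset >> 3] & (1 << (offset & 7)):
                missing.append(low + offset)
    return missing
```

With the defaults, each interval covers 2^26 values, so one bitmap takes 8 MiB instead of the 512 MiB needed to cover the whole 32‑bit range at once; the trade‑off is one extra scan of the file per under‑filled interval.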

High Availability Architecture

Apr 11, 2022 · Big Data

Ensuring Data Accuracy and Reliability in Baidu Log Platform: Architecture, Challenges, and Solutions

This article introduces the current state of Baidu's log platform, explains its lifecycle from data collection to downstream applications, analyzes the challenges of achieving near‑zero duplication and loss, and presents architectural optimizations and best‑practice recommendations to improve data stability and accuracy across the system.

Big Data · Data Reliability · System Architecture

0 likes · 19 min read

DataFunSummit

Apr 9, 2022 · Big Data

Impala Deployment and Optimization: Practical Experience with Sensors Data's Multi‑dimensional Analysis Platform

This article presents a comprehensive technical walkthrough of Sensors Data's multi‑dimensional analysis platform, covering product architecture, an Impala‑based real‑time query engine, query performance tuning, resource‑estimation strategies, and future plans, with concrete diagrams, test results, and community contributions.

Big Data · Data Architecture · Impala

0 likes · 19 min read

DataFunTalk

Apr 9, 2022 · Big Data

Optimizing Apache Iceberg Query Performance with Z‑Order Data Organization

This talk explains how Apache Iceberg’s DataSkipping can lose efficiency with many filter columns, and presents a data‑organization redesign using space‑filling curves and Z‑Order to improve query I/O, detailing the OPTIMIZE syntax, implementation steps, performance benchmarks, and future roadmap.

Apache Iceberg · Big Data · Data Skipping

0 likes · 12 min read
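
To make the Z‑Order idea concrete, here is a minimal sketch of the bit interleaving behind a Z‑order (Morton) value for two integer columns. It is a generic illustration rather than the implementation from the talk; `z_value` is a hypothetical helper and the 16‑bit width is an assumption.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order value.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1,
    so rows that are close in both columns get nearby Z-values.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sorting rows by Z-value before writing files clusters both columns at once.
rows = [(0, 1), (1, 1), (1, 0), (0, 0)]
rows_in_z_order = sorted(rows, key=lambda row: z_value(*row))
```

Because neither column dominates the sort order, per‑file min/max statistics stay reasonably tight for a filter on either column, which is what lets data skipping prune files.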

Bilibili Tech

Apr 9, 2022 · Big Data

Bilibili Presto on Hadoop: Architecture, Scaling, and Performance Enhancements

Bilibili’s Presto on Hadoop combines a multi‑engine offline platform with Kubernetes‑managed YARN scheduling, Ranger security, and a custom dispatcher, and has scaled to over 400 nodes handling 160k daily queries on 10 PB. Enhancements include coordinator HA, resource‑group punishment, query limits, Alluxio caching, dynamic filtering, and numerous SQL‑level improvements; auto‑scaling and materialized‑view automation are planned.

Big Data · Hadoop · Presto

0 likes · 30 min read

ByteDance Data Platform

Apr 8, 2022 · Operations

How Baseline Monitoring Transforms Data Pipeline Reliability at ByteDance

This article explains ByteDance's baseline monitoring system for data pipelines, detailing its motivation, core concepts, architecture, instance generation, alert types, and handling of complex task dependencies to reduce operational costs and improve SLA compliance across hundreds of projects.

Alerting · Big Data · baseline monitoring

0 likes · 21 min read

Big Data Technology & Architecture

Apr 7, 2022 · Big Data

Understanding Kafka Producer Idempotence: Mechanisms and Implementation Details

This article explains how Kafka achieves producer idempotence by assigning unique producer IDs and sequence numbers, describes the broker’s validation process, and walks through the relevant producer‑side and broker‑side code paths, highlighting configuration considerations and limitations.

Big Data · Broker · Idempotence

0 likes · 13 min read
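
The producer‑ID‑plus‑sequence‑number check described in this summary can be sketched as a toy model. This is not Kafka's broker code: `PartitionLog` and its return strings are hypothetical, and the model deliberately leaves out details of the real protocol such as producer epochs and batch‑level sequence ranges.

```python
from typing import Dict, List


class PartitionLog:
    """Toy stand-in for the state a broker keeps for one partition."""

    def __init__(self) -> None:
        self.records: List[str] = []
        # Last sequence number accepted from each producer ID.
        self.last_seq: Dict[int, int] = {}

    def append(self, producer_id: int, seq: int, payload: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"       # a retry of a record already written
        if seq != last + 1:
            return "out_of_order"    # a gap: an earlier record never arrived
        self.records.append(payload)
        self.last_seq[producer_id] = seq
        return "accepted"
```

A retried send reuses its original sequence number, so the second copy is recognized and dropped instead of being appended twice; the state is per producer ID and per partition, which is why the guarantee does not extend across producer restarts or partitions on its own.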

DataFunTalk

Apr 7, 2022 · Big Data

Apache Kyuubi: Architecture, Use Cases, Community, and Mobile Cloud Deployment

This article introduces Apache Kyuubi—a multi‑tenant Thrift JDBC/ODBC service built on Spark—detailing its architecture, advantages over Spark Thrift Server, real‑world use cases, open‑source community progress, and practical deployment strategies on mobile cloud, Kubernetes, and with Trino.

Apache Spark · Big Data · Kyuubi

0 likes · 16 min read

DataFunSummit

Apr 6, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Challenges and Solutions

This article presents a JD.com case study on applying Flink SQL for real‑time dimension modeling, detailing two complex streaming scenarios—full‑join of multiple streams and full‑group aggregation—along with the associated challenges of historical data handling, state management, and performance optimization, and proposes component‑based architectural solutions.

Big Data · Flink · Real-Time

0 likes · 14 min read

MaGe Linux Operations

Apr 5, 2022 · Big Data

Recreating Google Ngram Trends with Python: Analyzing 1.4 Billion Rows Efficiently

This article demonstrates how to use Python, the PyTubes library, and NumPy to load, process, and visualize the massive Google Ngram 1‑gram dataset—over 1.4 billion records—showing performance considerations, data‑cleaning steps, and comparative language trends for Python, Pascal, and Perl.

Big Data · NGram · NumPy

0 likes · 10 min read

Big Data Technology & Architecture

Apr 5, 2022 · Big Data

Using ElasticsearchSink with Apache Flink: Configuration, Retry Strategies, and Failure Handling

This article introduces the ElasticsearchSink for Apache Flink, explains how to add Maven dependencies, implement the sink with configuration and retry settings, details failure handlers, and highlights important considerations such as exception handling and checkpoint requirements for reliable streaming pipelines.

Big Data · Elasticsearch · Failure Handling

0 likes · 9 min read

DataFunTalk

Apr 4, 2022 · Big Data

Impala Deployment and Optimization in Sensors Data's Multi-Dimensional Analytics Platform

This article details the architecture of Sensors Data's analytics platform, the implementation of a real‑time Impala query engine, multiple query‑performance optimizations—including storage redesign, user‑behavior sequence tuning, join elimination and expression push‑down—and a resource‑estimation framework that dramatically reduces query failures and latency.

Big Data · Data Platform · Impala

0 likes · 16 min read

DataFunTalk

Apr 2, 2022 · Big Data

SuperSQL: A High‑Performance Cross‑Engine, Cross‑Data‑Center SQL Middleware for Big Data

The article introduces SuperSQL, a federated SQL middleware that unifies heterogeneous data sources across multiple data centers, leverages Apache Calcite for cost‑based optimization, pushes down operators to various engines, manages metadata with a Trie model, and demonstrates significant performance gains over traditional solutions.

Big Data · Cross‑Data‑Center · SQL Middleware

0 likes · 27 min read

DataFunTalk

Apr 1, 2022 · Operations

Integrated Digital Supply Chain: JD Logistics' Intelligent Planning, Algorithm Platform, and Digital Twin Practices

This article explores JD Logistics' integrated digital supply chain, detailing its evolution, the construction of an algorithm middle‑platform, engineering platforms, digital twin system, real‑world case studies, and future talent and ecosystem directions, illustrating how AI and big‑data technologies drive end‑to‑end logistics optimization.

AI Optimization · Algorithm Platform · Big Data

0 likes · 16 min read

Big Data Technology & Architecture

Mar 31, 2022 · Big Data

Bilibili’s Lakehouse Architecture: Integrating Data Lake and Warehouse with Apache Iceberg

To address the high cost and low efficiency of traditional Hadoop‑based data pipelines, Bilibili designed a lakehouse solution using Apache Iceberg, integrating Spark, Flink, Trino, and Alluxio to unify flexible data lake storage with warehouse‑level query performance, reducing data duplication and improving interactive analytics.

Big Data · Iceberg · Lakehouse

0 likes · 17 min read

DataFunTalk

Mar 30, 2022 · Big Data

NetEase Big Data Platform: HDFS Optimization and Practice

This article presents NetEase's big data platform architecture, detailing multi‑layer storage and compute design, HDFS deployment challenges, NameNode and NameSpace performance optimizations, cluster scaling strategies, data tiering, hardware upgrades, and real‑world business use cases, illustrating practical large‑scale big data engineering.

Big Data · Cluster Optimization · Data Management

0 likes · 23 min read

21CTO

Mar 30, 2022 · Big Data

What Drives Taobao App Users? Insights from AARRR and RFM Analyses

This article analyzes 2 million Taobao app user‑behavior records using AARRR funnel metrics and RFM segmentation, revealing daily and hourly usage patterns, conversion bottlenecks, product‑search mismatches, and offering data‑driven marketing recommendations to boost retention and sales.

AARRR · Big Data · RFM

0 likes · 25 min read

Bilibili Tech

Mar 30, 2022 · Big Data

HDFS Architecture, Optimizations, and Future Plans at Bilibili

Bilibili’s HDFS now runs a three‑tier architecture—access, metadata, and data layers—enhanced with a custom MergeFS router, observer NameNode, dynamic load balancing, fast‑failover pipelines, and storage‑aware policies, while future work targets transparent erasure coding, tiered data routing, lock refinements, and a Hadoop 3.x migration.

Big Data · Distributed File System · HDFS

0 likes · 22 min read

Efficient Ops

Mar 29, 2022 · Big Data

How Tencent Cloud Boosted APM Metric Computation Speed 2‑3× with Flink Optimizations

This article explains how Tencent Cloud's APM metric calculation, which transforms massive Span data into aggregated metrics using Flink, faced performance bottlenecks and was optimized through job splitting, batch merging, and dimension pruning, ultimately achieving a 2‑3× speed increase and cutting resource usage to about 30% of the original.

APM · Big Data · Flink

0 likes · 10 min read

DataFunTalk

Mar 29, 2022 · Big Data

FlinkX Multi-Source Heterogeneous Data Synchronization Framework: Architecture, Features, and Cloud‑Native Enhancements

This article introduces the FlinkX framework for multi‑source heterogeneous data synchronization, detailing its background, core functions such as checkpoint‑based resume, metric monitoring, rate limiting, plugin architecture, cloud‑native K8s deployment, Hudi integration, and future roadmap, while also addressing common Q&A topics.

Batch · Big Data · Data Lake

0 likes · 14 min read

58 Tech

Mar 29, 2022 · Big Data

Design and Implementation of the 58 Group Penalty Data Center

This article presents the design, architecture, and implementation of a unified penalty data center for 58 Group, detailing the challenges of heterogeneous data sources, the selection of Flink for real‑time ETL, the use of a DSL and LRU aggregation, and the adoption of MVEL for feature recognition to achieve standardized, high‑performance penalty data processing.

Big Data · ETL · Flink

0 likes · 13 min read

NetEase Smart Enterprise Tech+

Mar 29, 2022 · Big Data

Automating Consumer Insight Testing with Spark, Hive, and ClickHouse

This article explains how to build a big‑data consumer insight platform using Spark applications, Hive, MySQL and ClickHouse, and how to automate data validation and algorithm testing to improve coverage, efficiency, and reliability of insight services.

Automated Testing · Big Data · ClickHouse

0 likes · 8 min read

Big Data Technology & Architecture

Mar 28, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents JD's real-time dimension modeling case using Flink SQL, detailing two complex streaming scenarios, the difficulties of handling historical data and state management, and a component‑based solution that leverages external KV stores and optimized Flink operators to improve performance and scalability.

Big Data · Flink · Real-Time

0 likes · 13 min read

Architects' Tech Alliance

Mar 28, 2022 · Artificial Intelligence

Digital Twin: Ten Fundamental Questions and Insights for Researchers, Decision‑Makers, and Practitioners

This article analyzes ten fundamental questions about digital twins, covering definitions, stakeholders, global interest, relationship with smart manufacturing, integration with New IT, scientific challenges, standards, and commercial tools, aiming to guide researchers, policymakers, and practitioners in understanding and applying digital twin technology.

AI · Big Data · Digital Twin

0 likes · 22 min read

Bilibili Tech

Mar 25, 2022 · Big Data

Bilibili's YARN Scheduling Optimization Practice: From Heartbeat-Driven to Global Scheduling

Bilibili rebuilt its YARN CapacityScheduler, moving from a heartbeat‑driven design to a multi‑threaded global scheduler: it separated lock handling, adopted Weighted Round‑Robin with DRF, added batch node selection, fixed proposal inconsistencies, and tuned GC and logging, cutting application allocation time by about 38% on clusters of up to 8,000 nodes.

Big Data · CapacityScheduler · Hadoop

0 likes · 15 min read
