Tagged articles
3675 articles
Top Architect
Jan 2, 2023 · Big Data

Optimizing Kafka at Meituan: Challenges and Solutions for a Large‑Scale Data Platform

This article details Meituan's use of Kafka as a unified data cache and distribution layer, outlines the challenges of massive scale and latency, and presents comprehensive optimizations across application, system, and cluster management layers, including disk balancing, migration acceleration, fetcher isolation, and full‑link monitoring.

Big Data · Distributed Systems · Kafka
22 min read
ITPUB
Dec 31, 2022 · Databases

Why HBase? Strengths, Weaknesses, Real‑World Scenarios, and Architecture Explained

This article examines HBase’s high reliability and performance as a column‑oriented NoSQL store, outlines its advantages and limitations, presents two practical use cases from e‑commerce, and details its data model, architecture components, and design considerations for effective deployment.

Big Data · HBase · NoSQL
12 min read
Aikesheng Open Source Community
Dec 31, 2022 · Databases

Understanding ClickHouse Performance: Storage Engine and Compute Engine Perspectives

This article explains why ClickHouse delivers high query speed by detailing storage‑engine optimizations such as pre‑sorting, columnar layout and compression, and compute‑engine techniques like vectorized execution, built‑in functions and minimal join usage, while also promoting the related book and giveaway.

Big Data · ClickHouse · OLAP
9 min read
Architect's Tech Stack
Dec 30, 2022 · Big Data

Distributed Computing Is Not a Panacea for Big Data: Prioritize Single‑Node Performance First

While distributed clusters are popular for big‑data processing, they are not a universal solution; tasks that are hard to partition or involve heavy cross‑node communication often perform better on a well‑optimized single machine, making a careful analysis of workload characteristics essential before scaling out.

Algorithm Optimization · Big Data · Performance Tuning
14 min read
DataFunTalk
Dec 29, 2022 · Big Data

Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

Big Data · Cost reduction · Hadoop
17 min read
ByteDance Data Platform
Dec 28, 2022 · Big Data

How Cloud Data Warehouses Are Shaping the Future of Big Data and DataOps

This article examines the four‑stage evolution of data warehouses, highlights the cost‑effective, scalable advantages of cloud‑native warehouses, explores the rapid growth of data‑management infrastructure, and discusses the emerging practices of DataOps and AI integration that are redefining modern data stacks.

AI · Big Data · Data Management
15 min read
Big Data Technology & Architecture
Dec 28, 2022 · Big Data

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store

Flink 1.16 introduces adaptive batch scheduling, speculative execution, hybrid shuffle, dynamic partition pruning, improved Hive SQL compatibility, advanced checkpoint mechanisms including changelog backend, and integrates CDC with Kafka and Table Store, offering faster, more stable, and easier-to-use stream‑batch processing capabilities.

Big Data · CDC · Checkpoint
8 min read
High Availability Architecture
Dec 27, 2022 · Big Data

Design and Implementation of a Data Service Middle Platform for Scalable Data SaaS

This article presents a comprehensive overview of a data service middle platform, detailing its background, architectural design, data construction, model definition and acceleration, API creation, query processing, service gateway, common solutions for standardization and cost reduction, as well as achieved results and future plans.

API · Architecture · Big Data
22 min read
Tencent Advertising Technology
Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big Data · Data Lake · Flink
20 min read
DataFunTalk
Dec 24, 2022 · Big Data

Evolution of Data Platforms: From Early Computers to the Modern Data Stack

This article traces the history of data platforms—from the first general‑purpose computers and traditional BI, through the rise of data warehouses, big‑data frameworks like Hadoop, Spark and Flink, to the modern data‑stack era with cloud‑native architectures, Lambda/Kappa models, and emerging tools—highlighting key technologies, architectural shifts, and future prospects.

Big Data · Cloud Computing · ETL
26 min read
DataFunSummit
Dec 24, 2022 · Operations

Understanding DataOps: Evolution, Technology Stacks, and Industry Applications

This article explores DataOps from its historical evolution through the digital 3.0 era, outlines its core technology stacks such as Data Fabric, Data Mesh, and Modern Data Stack, and demonstrates practical applications across finance, manufacturing, telecom, and public services, highlighting its role in agile, cloud‑native data management.

Big Data · Data Governance · DataOps
18 min read
Bilibili Tech
Dec 23, 2022 · Big Data

Data Service Platform Architecture and Design

The article outlines a standardized data‑service platform built atop a warehouse, detailing its construction, query, and gateway layers—supporting model definition, acceleration, reusable APIs, unified DSL/SQL interfaces, and observability—to solve ingestion, definition, and lineage issues, achieving 500+ APIs, sub‑day creation, and 18% cost reduction.

Big Data · Data Service · api-gateway
22 min read
DataFunSummit
Dec 22, 2022 · Big Data

SeaTunnel: An Open‑Source Ultra‑Scale Data Integration Platform – Design Goals, Architecture, and Future Roadmap

This article introduces SeaTunnel, an open‑source ultra‑large‑scale data integration platform, covering its design objectives, current status with over 50 connectors and multi‑engine support, overall architecture, execution flow, connector translation, source and sink APIs, global commit strategies, table & catalog APIs, and the upcoming roadmap for connector expansion, a web UI, and a dedicated engine.

Big Data · Connector · Open-source
10 min read
ITPUB
Dec 21, 2022 · Big Data

How Bilibili Optimized Flink Runtime for Massive Real‑Time Jobs

This article details Bilibili's extensive enhancements to the Flink runtime—including checkpoint recoverability, max‑parallelism calculations, State Processor API extensions, Full and Regional Checkpoints, hybrid HA, task‑level recovery, load‑balanced partitioners, and large‑scale cluster maintenance—to improve reliability and performance of its billion‑scale streaming workloads.

Big Data · Checkpoint · Flink
33 min read
DataFunSummit
Dec 21, 2022 · Big Data

Big Data Platform Architecture: Expert Insights on Components, Challenges, and Trends

An expert interview series examines the architecture of big data platforms, detailing core modules such as data integration, storage, computation, scheduling, and query analysis, while highlighting current challenges, best‑practice tools, and future trends like cloud‑native, object storage, and real‑time processing.

Big Data · Query Engines · Scheduling
12 min read
Xianyu Technology
Dec 21, 2022 · Artificial Intelligence

Xianyu Recommendation System: Architecture, Challenges, and Deployment

The Xianyu recommendation system, built by backend expert Wan Xiaoyong, evolved from offline scoring to a full‑graph, serverless recall‑ranking pipeline that tackles C2C uncertainties through centralized feature engineering, model compression, staged deployment, flexible experimentation, robust governance, and plans for automated attribution and interpretability.

AI · Big Data · Model Deployment
10 min read
DataFunSummit
Dec 20, 2022 · Big Data

JD Retail Big Data OLAP Application and Practice

This talk presents JD Retail’s big‑data OLAP solution, covering the massive, variable and complex traffic data challenges, the custom data‑ingestion and versioned update tools, ClickHouse query‑architecture upgrades, optimization techniques, and future plans for multi‑cluster querying and pre‑computation.

Big Data · ClickHouse · JD Retail
21 min read
Top Architect
Dec 20, 2022 · Databases

Elasticsearch DSL Query Syntax Overview (Version 7.x)

This article provides a comprehensive beginner-friendly guide to Elasticsearch 7.x DSL query syntax, covering core keywords, mapping types, query examples, boolean logic, and code snippets to help readers understand and construct effective search queries.

Big Data · DSL · database
8 min read
Data Thinking Notes
Dec 19, 2022 · Big Data

Data Quality Mastery: From Expectations to Operational Assurance

This article outlines a comprehensive data quality management framework, covering expectations, measurement, assurance, and operational practices, and provides concrete templates, rule designs, and governance processes to help data teams systematically assess, monitor, and improve data reliability throughout the lifecycle.

Big Data · Data Governance · Data Quality
18 min read
ITPUB
Dec 18, 2022 · Databases

Why ClickHouse Is So Fast: Deep Dive into Storage and Compute Engine Optimizations

This article explains how ClickHouse achieves high query performance by leveraging storage‑engine designs such as pre‑sorting, columnar layout, and block‑level compression, and by exploiting a vectorized compute engine while avoiding joins and using built‑in functions.

Big Data · ClickHouse · Columnar Storage
9 min read
Big Data Technology & Architecture
Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Batch Processing · Big Data · Data Lake
13 min read
DataFunTalk
Dec 14, 2022 · Big Data

Cloud‑Native Big Data Solutions for the Financial Industry: Architecture, Deployment, Scheduling, and Resource Management

This article explains why the financial sector is moving its big‑data workloads to cloud‑native platforms, compares cloud‑native systems with traditional Hadoop, describes deployment options such as Serverless YARN and Arcee Operator, and details the high‑performance GRO scheduler, agent, and ResLake resource‑lake architecture that together improve resource utilization, reduce costs, and ensure reliable, low‑latency processing for finance workloads.

Big Data · Cloud Native · resource scheduling
19 min read
dbaplus Community
Dec 13, 2022 · Big Data

How ClickHouse Powers Real-Time Self-Service Analytics at Scale

Facing massive daily data volumes and complex, ad‑hoc analytical needs, Zhaozhuan’s engineering team evaluated multiple OLAP engines and chose ClickHouse, then built a four‑layer self‑service analytics platform, detailing architecture, use‑cases, performance tuning, large‑scale joins, and future roadmap challenges.

Big Data · ClickHouse · Data Architecture
14 min read
DataFunSummit
Dec 13, 2022 · Big Data

Introducing the Star River Big Data Development Platform: Architecture, Core Capabilities, and Future Plans

This article presents an in‑depth overview of 58.com’s self‑built Star River big data platform, covering its evolution across three eras, resource management hierarchy, core technical capabilities such as metadata services, data maps and lineage, governance practices, and the roadmap for further enhancements.

Architecture · Big Data · Data Governance
14 min read
DataFunTalk
Dec 12, 2022 · Big Data

Cloud‑Native and Intelligent Fusion: Key Trends Shaping the Future of Big Data

The article explains how cloud‑native architectures, data governance, intelligent fusion, and privacy computing are driving the evolution of big data, recounting the history from Google’s early papers and Hadoop to modern managed services, compute‑storage separation, AI‑powered recommendation platforms, and real‑world success cases.

Big Data · Cloud Computing · Cloud Native
10 min read
AntTech
Dec 11, 2022 · Information Security

Occlum v1.0: Open‑Source Trusted Execution Environment OS with Major Performance Gains and Spark Big Data Integration

Occlum v1.0, the open‑source trusted execution environment operating system released by Ant Group, delivers up to five‑fold performance improvements, supports over 150 Linux syscalls, introduces async I/O, dynamic memory management, and a Spark‑BigDL big‑data analysis solution, while outlining future GPU and TDX extensions.

Big Data · Confidential Computing · Occlum
11 min read
DataFunSummit
Dec 10, 2022 · Big Data

Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

Apache Spark · Big Data · Data Platform
14 min read
ITPUB
Dec 10, 2022 · Big Data

How ClickHouse Powers Real-Time Self-Service Analytics at Scale

This article examines why ClickHouse was chosen as the OLAP engine for a massive self‑service analytics platform, describes the system architecture, shares concrete memory and performance tuning parameters, and outlines current challenges and future roadmap for large‑scale real‑time data analysis.

Big Data · ClickHouse · Data Architecture
14 min read
php Courses
Dec 9, 2022 · Databases

Elasticsearch Index and Document Operations Tutorial

This tutorial explains how to create, query, update, and delete Elasticsearch indices and documents using RESTful HTTP requests, covering basic CRUD operations, various query types, pagination, sorting, aggregations, highlighting, and mapping definitions with practical JSON examples.

Big Data · Elasticsearch · JSON
8 min read
DataFunSummit
Dec 7, 2022 · Big Data

Modern Data Governance at NetEase DataFan: Evolution, Challenges, and Solutions

This article details NetEase DataFan's journey in building a full‑stack big‑data platform, explains the design‑first data‑mid‑platform approach, analyzes cost, quality, and security problems encountered, and presents the modern data‑governance framework that integrates development, governance, and consumption into a closed loop.

Big Data · Cost Management · Data Governance
22 min read
Alibaba Cloud Developer
Dec 7, 2022 · Databases

How Lindorm Cut Costs and Boosted Performance for Alibaba’s Massive Data Workloads

This article reviews Lindorm’s evolution from its HBase‑based 1.0 architecture to the cloud‑native 2.0 version, outlines 2022’s cost‑saving and efficiency challenges, details compression, storage, time‑series and SQL enhancements, and shares real‑world case studies demonstrating significant cost reductions and performance gains.

Big Data · Cost reduction · Lindorm
24 min read
Data Thinking Notes
Dec 5, 2022 · Big Data

How NetEase Cloud Music Cut Storage Costs by 30% Through Data Governance

This article details NetEase Cloud Music's year‑long data governance initiative, covering data background, governance strategy, project plan, practical actions, results, and future outlook, and shows how metadata‑driven management reduced storage by over 30% while improving reliability and efficiency.

Big Data · Cost Optimization · Data Governance
17 min read
DataFunSummit
Dec 5, 2022 · Big Data

Impala Cluster Performance Optimization Based on Historical Queries: Practices and Solutions

This article presents a comprehensive overview of Impala cluster performance optimization using historical query analysis, covering background, high‑performance data‑warehouse construction principles, identified pain points, HBO implementation details, optimization techniques, and future development plans for the Impala ecosystem.

Big Data · HBO · Historical Queries
16 min read
Top Architect
Dec 4, 2022 · Databases

Deep Dive into Elasticsearch Pagination: from/size, Scroll, and Search After

This article explains how Elasticsearch handles deep pagination, compares the traditional from/size method with Scroll and Search After techniques, details their internal query and fetch phases, provides practical code examples, and offers guidance on choosing the right approach for large‑scale search workloads.

Big Data · pagination · scroll
15 min read
Architects Research Society
Dec 3, 2022 · Databases

Solr vs Elasticsearch: Choosing the Right Search Engine for Your Organization

This article compares Solr and Elasticsearch, examining their cloud, analytics, and cognitive search capabilities, and provides guidance on selecting the most suitable engine based on factors such as deployment complexity, resource requirements, scalability, integration with Hadoop ecosystems, and specific organizational use cases.

Big Data · Comparison · Elasticsearch
9 min read
DataFunSummit
Dec 2, 2022 · Big Data

BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities

BitSail, ByteDance’s open‑source data integration engine, unifies batch, streaming, and incremental data synchronization across heterogeneous sources, detailing its evolution from early Flink‑based prototypes to a mature, plugin‑driven architecture with multi‑engine support, low‑cost co‑development, and robust CDC lakehouse capabilities.

Big Data · CDC · Flink
19 min read
DataFunSummit
Dec 1, 2022 · Big Data

City Data Acquisition Platform: Architecture, Core Technologies, and Incremental Synchronization Strategies

This article presents an overview of a smart city unified perception platform, detailing its modular architecture, solutions for multi-source heterogeneity, incremental synchronization strategies, and real-time API data collection, while discussing extensibility and practical implementation considerations.

Big Data · Data Platform · Incremental Sync
20 min read
Architecture Digest
Dec 1, 2022 · Big Data

Understanding Data Warehouse Architecture and Layered Design

This article explains the concepts, architecture, and layered design of data warehouses, covering data flow, ETL processes, ODS, DWD, DWM, DWS, ADS layers, their characteristics, differences from databases, and the role of data marts in supporting OLAP and decision‑making.

Analytics · Big Data · Data Layers
13 min read
21CTO
Nov 30, 2022 · Big Data

Mastering Data Sharding: Hash, Range, and Consistent Hash Techniques

This article explains core data sharding concepts and models—including hash‑based, range‑based, and consistent hashing—detailing their mappings, routing strategies, scalability considerations, and practical implementation examples for handling massive datasets in distributed systems.

Big Data · Hashing · consistent hashing
11 min read
DeWu Technology
Nov 30, 2022 · Big Data

Fundamentals and Implementation of Data Lineage in Big Data Environments

Data lineage in big‑data environments tracks how data moves and transforms—from source tables through SQL processing to final storage—enabling management tasks such as domain segmentation, performance tuning, anomaly detection, and dependency verification, with implementations ranging from simple regex extraction to robust AST parsing and optimization, as used by tools like Alibaba DataWorks and Apache Atlas.

AST · Big Data · Data Lineage
7 min read
Alibaba Cloud Big Data AI Platform
Nov 30, 2022 · Big Data

What’s New in Apache Flink 2022? Highlights from the Flink Forward Asia Summit

The 2022 Flink Forward Asia summit showcased Apache Flink’s rapid community growth, key technical breakthroughs such as distributed snapshot upgrades, cloud‑native state storage, hybrid shuffle, Flink CDC 2.0, and Flink ML 2.0, and real‑world deployments at companies like Midea, miHoYo and Disney.

Apache Flink · Big Data · Flink Forward Asia
25 min read
Bilibili Tech
Nov 29, 2022 · Big Data

How Bilibili Supercharged Flink: Checkpoint, HA, and Runtime Optimizations

This article details Bilibili's extensive enhancements to Flink's runtime—including checkpoint recoverability, operator ID stability, state processor extensions, hybrid high‑availability, regional checkpointing, and load‑based channel selection—to improve scalability, reliability, and operational efficiency of large‑scale streaming jobs.

Big Data · Checkpoint · Flink
32 min read
Alibaba Cloud Big Data AI Platform
Nov 29, 2022 · Big Data

How Flink’s Stream‑Batch Fusion Is Transforming Real‑Time Big Data

The article explores Apache Flink’s eight‑year journey to becoming a top‑level Apache project, Alibaba’s extensive contributions, the rise of stream‑batch unified computing, its impact on real‑time data integration, cloud‑native deployment, and the emerging Flink‑based data‑warehouse and serverless solutions.

Apache Flink · Big Data · Cloud Native
15 min read
Data Thinking Notes
Nov 28, 2022 · Big Data

Unlocking Data Value: How Metadata Drives Efficient Data Management and Quality

This comprehensive guide explains how metadata connects source data, warehouses, and applications, outlines its technical and business classifications, demonstrates its value for data management, profiling, portals, and ETL development, and details optimization, storage, lifecycle, and quality practices essential for robust big‑data operations.

Big Data · Data Quality · Operations
35 min read
Big Data Technology & Architecture
Nov 28, 2022 · Big Data

Comprehensive Guide to Big Data Interview Topics: Log Collection, Data Synchronization, Offline Development, Real‑time Technology, Data Services, and Data Mining

This article provides an extensive overview of big‑data interview subjects, covering browser and mobile log collection methods, data synchronization techniques (batch, real‑time, sharding), offline data development platforms, streaming architectures, data service evolution, performance optimization, and data‑mining layers and applications.

Big Data · Streaming · data mining
17 min read
Volcano Engine Developer Services
Nov 28, 2022 · Cloud Native

How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok

ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.

Big Data · Cloud Native · Spark
17 min read
DataFunTalk
Nov 25, 2022 · Operations

Overview of Volcano Engine A/B Experiment System Platform

This article presents a comprehensive overview of Volcano Engine's A/B testing platform, detailing its four core stages—reliable experiment system, efficient data construction, scientific statistical analysis, and fine-grained governance—while explaining execution components, data pipelines, statistical methods, and operational best practices for large‑scale experimentation.

A/B testing · Big Data · Experiment Platform
16 min read
Data Thinking Notes
Nov 23, 2022 · Big Data

Mastering Fact Table Design: From Basics to Advanced Strategies

This comprehensive guide explains the fundamentals, design rules, and various types of fact tables—including transaction, snapshot, and aggregate tables—while detailing Kimball's four-step modeling process, grain declaration, handling of additive measures, and practical examples for effective data warehouse implementation.

Big Data · Fact Table · Kimball
16 min read
Data Thinking Notes
Nov 22, 2022 · Big Data

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Big Data · Data Skew · Performance Tuning
5 min read
DataFunSummit
Nov 22, 2022 · Big Data

BI Platform Practice at Xiaomi: Evolution, Architecture, and Future Directions

This article details Xiaomi's multi‑year journey in building a group‑wide Business Intelligence platform, covering its historical evolution, technical challenges in performance, modeling, visualization and permissions, the current four‑layer architecture, and future plans to make the platform more business‑centric and simpler.

Analytics · BI · Big Data
15 min read
Top Architect
Nov 22, 2022 · Big Data

Efficient Massive Excel Import/Export with POI and EasyExcel in Java

This article explains how to efficiently import and export massive datasets (up to millions of rows) between Excel and databases using Apache POI, SXSSF, and Alibaba's EasyExcel, comparing workbook types, outlining performance considerations, and providing Java code examples for batch processing, paging, and transaction management.

Batch Processing · Big Data · Excel
23 min read
Bilibili Tech
Nov 22, 2022 · Big Data

Overview of the Berserker Big Data Platform and Its Data Development Architecture

The Berserker big‑data platform provides a one‑stop data development and governance solution built on over 40 micro‑services, featuring the Archer scheduler with CN and EN nodes, Raft‑based state management, Docker‑isolated task execution, smart routing, and plans to make EN stateless, migrate to Kubernetes, and unify batch and streaming services.

Archer · Big Data · Docker
17 min read
DevOps Cloud Academy
Nov 22, 2022 · Big Data

Components and Key Terminology in Apache Airflow

Apache Airflow’s architecture consists of schedulers, executors, workers, a web server, and a metadata database, enabling scalable workflow orchestration, while essential terminology such as DAGs, operators, and sensors defines how tasks are organized, executed, and monitored within data pipelines.

Apache Airflow · Big Data · DAG
0 likes · 8 min read
Architects' Tech Alliance
Nov 20, 2022 · Databases

Columnar Storage vs Row Storage: Overview, Write/Read Comparison, Pros, Cons, and Use Cases

This article explains the differences between row-based and column-based storage, comparing their write and read performance, outlining advantages and disadvantages, and describing suitable scenarios such as OLAP queries, column families, compression, and indexing, to help choose the appropriate storage model.

Big Data · Columnar Storage · OLAP
0 likes · 10 min read
ITPUB
Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch Processing · Big Data · Data Lake
0 likes · 23 min read
ByteDance Terminal Technology
Nov 18, 2022 · Big Data

Practices and Techniques for Large‑Scale Distributed Trace Data Analysis at ByteDance

This article presents ByteDance’s experience building a massive trace‑data analysis platform, covering observability fundamentals, the evolution of its distributed tracing system, various aggregation computation models, technical architecture choices, and concrete use‑cases such as precise topology, traffic estimation, dependency analysis, performance anti‑patterns, bottleneck detection, and error propagation.

Big Data · Distributed Tracing · Graph Database
0 likes · 21 min read
360 Smart Cloud
Nov 17, 2022 · Databases

Exploring StarRocks Applications, Performance Tests, and Cloud‑Native Integration at 360

This article reviews practical applications of and experiments with StarRocks at 360, describing the cloud‑native lake‑warehouse product Yunzhou, its three‑tier architecture, performance comparisons with Trino on the TPC‑H benchmark at 100 GB, challenges of Kubernetes integration, and future directions for storage‑compute separation.

Big Data · Cloud Native · Kubernetes
0 likes · 7 min read
Xiaohongshu Tech REDtech
Nov 16, 2022 · Operations

Design and Implementation of a Continuous Performance Optimization and Tracking Platform for Xiaohongshu Services

To curb rising resource costs as Xiaohongshu scales, engineers built a Continuous Performance Optimization & Tracking Platform that continuously profiles services, stores diff‑analyzed data in ClickHouse, automatically detects tiny regressions, and links them to code changes. It has already saved and flagged roughly 20,000 CPU cores across search, recommendation and advertising workloads.

Big Data · Continuous Monitoring · cloud-native
0 likes · 16 min read
DataFunSummit
Nov 15, 2022 · Big Data

Industrial Data Governance: Challenges, Practices, and Insights

Industrial data governance, essential for digital transformation, faces challenges such as data heterogeneity, volume, quality, and integration across the value chain, and the presentation outlines background, practical approaches, strategic thinking, and a phased, demand‑driven model to enhance data quality, assetization, and business value.

Big Data · Data Governance · Digital Transformation
0 likes · 24 min read
Java Architect Essentials
Nov 14, 2022 · Big Data

Efficient Import and Export of Millions of Records Using Apache POI and EasyExcel

This article explains how to handle massive Excel import and export tasks in Java by comparing traditional POI implementations, selecting the appropriate Workbook type based on data volume, and leveraging Alibaba's EasyExcel library together with batch JDBC operations to process over three million rows with minimal memory usage and high performance.

Apache POI · Big Data · Data Export
0 likes · 22 min read
Huolala Tech
Nov 11, 2022 · Big Data

How Huolala Boosted Offline Scheduling Performance: Strategies & Lessons

Huolala’s big‑data offline platform, built from scratch, faced escalating scheduling delays as task instances grew, prompting a series of short‑ and mid‑term optimizations—including zombie task cleanup, retention policies, memory caching, algorithmic tweaks, and high‑availability enhancements—to dramatically reduce dependency computation time and sustain million‑scale daily workloads.

Big Data · Distributed Systems · offline scheduling
0 likes · 12 min read
Meituan Technology Team
Nov 10, 2022 · Big Data

Optimizing Spark mapPartitions: Memory Management and Best Practices

The article details how Meituan’s Turing machine‑learning platform cut offline resource use by 80% and task time by 63% through memory‑level techniques such as column pruning, adaptive caching, and a deep dive into Spark’s mapPartitions operator, including source‑code analysis, GC behavior, and a low‑memory batch‑iterator best practice.

Big Data · Memory Optimization · Performance Tuning
0 likes · 19 min read
21CTO
Nov 9, 2022 · Operations

How Ctrip Handles Billions of Logs Daily: Real‑Time Monitoring, Clog, CAT & TSDB

This article details Ctrip’s large‑scale log monitoring architecture, covering a system overview, the Clog log system, the CAT tracing platform, and the internal TSDB solution, and explains how billions of logs are processed in real time with low latency, high reliability, and efficient querying.

Big Data · Distributed Systems · Log Monitoring
0 likes · 12 min read
政采云技术
Nov 8, 2022 · Industry Insights

How Small Big‑Data Frontend Teams Can Thrive: A Survival Guide

This guide outlines the essential concepts of big data, the roles of a front‑end data team, practical workflow steps, platform architecture, industry benchmarks, and actionable strategies for small teams to improve efficiency, visualization capabilities, and digital operations.

Big Data · Data Platform · Data visualization
0 likes · 14 min read
政采云技术
Nov 8, 2022 · Big Data

User Path Analysis in the Hunyi System: Design, Computation Logic, and StarRocks Implementation

This article explains user path analysis as a method to visualize and optimize user flow, describes its productization in the Hunyi analytics platform, details the underlying computation logic, presents a complex StarRocks SQL solution, discusses performance challenges, and suggests future improvements and recruitment opportunities.

Big Data · StarRocks · performance optimization
0 likes · 21 min read
DataFunSummit
Nov 7, 2022 · Big Data

Huolala's Data Governance Practices: Data Quality, Metadata, and Cost Management Platforms

This article details Huolala's end‑to‑end data governance practice, covering the construction of a data governance framework, the implementation of a zero‑code data quality platform, a metadata management platform, and a cost‑governance system that together improve data reliability, reduce waste, and support scalable big‑data operations.

Big Data · Cost Management · Data Governance
0 likes · 14 min read
Tencent Cloud Developer
Nov 7, 2022 · Big Data

Data Engineering and Data Warehouse Design: Principles, Practices, and Governance

The article outlines comprehensive data‑engineering and warehouse‑design principles—covering collection (four Ws and methods like SDK, point‑code, binlog), reporting strategies, source selection, modeling with fact, aggregation, dimension and model tables, quality checks, and governance practices such as standardized SDKs, metric libraries, automated lineage, and cost optimization—to share actionable experience for any organization.

Big Data · Data Governance · ETL
0 likes · 32 min read
DataFunSummit
Nov 6, 2022 · Artificial Intelligence

Guangfa Group’s Federated Learning Exploration, Platform Construction, and the Book “Federated Learning Principles and Applications”

This article outlines Guangfa Group’s initiatives in privacy computing and federated learning: the development of its federated learning platform, contributions to the open‑source FATE project and to industry standards, and application scenarios such as joint statistics, precise marketing, risk control, and cross‑domain verification. It also introduces the group’s newly published book on federated learning principles and applications.

Artificial Intelligence · Big Data · FATE
0 likes · 23 min read
Architects' Tech Alliance
Nov 5, 2022 · Databases

Data Replication: Fundamentals, Technologies, and Future Trends

This article explains the concept of data replication, its three-stage process, key principles of compliance, timeliness, and diversity, various replication methods, layered technologies across storage, operating system, and database levels, emerging cloud and big‑data solutions, and heterogeneous use‑case scenarios.

Big Data · data replication · databases
0 likes · 15 min read
StarRocks
Nov 4, 2022 · Big Data

Building a High‑Performance, Cost‑Effective Cloud Lakehouse with StarRocks and EMR

This article explains how to design and implement a cloud‑native Lakehouse using StarRocks and Tencent Cloud EMR, covering core technical requirements, a five‑layer architecture, data ingestion with Iceberg/Hudi, performance tricks like Z‑order clustering, cost‑control through elastic scaling, and the key product features of EMR StarRocks.

Big Data · Cloud Computing · EMR
0 likes · 24 min read
dbaplus Community
Nov 3, 2022 · Big Data

Why Kafka Stores Data the Way It Does: A Deep Dive into Its Log Architecture

This article thoroughly examines Kafka's storage system, explaining why it uses sequential log writes combined with sparse indexing, how different log formats evolved, and the mechanisms for log retention and compaction that enable high‑throughput, fault‑tolerant streaming at massive scale.

Big Data · Distributed Systems · Kafka
0 likes · 22 min read
Alibaba Cloud Big Data AI Platform
Nov 3, 2022 · Big Data

How Alibaba Cloud’s ODPS Upgrade Redefines Big Data Processing and AI Integration

Alibaba Cloud announced that its ODPS platform has been upgraded into an integrated big‑data solution that supports massive batch jobs, real‑time analytics, and AI workloads, delivering record‑breaking performance and enabling use cases from smart city traffic optimization to accelerated autonomous‑driving model training.

AI · Big Data · performance benchmark
0 likes · 5 min read
Zhongtong Tech
Nov 3, 2022 · Databases

How ZTO’s Database Operations Platform Evolved from Manual to Intelligent Automation

The article recounts Chen Jianhua’s presentation at the GOPS Global Operations Conference, detailing ZTO’s three‑stage journey in building a database operations platform—from initial automation to self‑service and finally to fine‑grained, data‑driven intelligent management—while sharing lessons and future plans.

Automation · Big Data · Database operations
0 likes · 4 min read
DataFunSummit
Nov 2, 2022 · Big Data

Evolution and Construction of Huolala's Doris‑Based OLAP System

This article details Huolala's journey from a MySQL‑centric analytics pipeline to a multi‑engine OLAP platform built on Doris, covering system architecture, data flow, stage‑wise evolution, engine selection, POC validation, performance tuning, stability measures, and future roadmap for self‑service analytics.

Big Data · OLAP · doris
0 likes · 15 min read
DataFunSummit
Nov 1, 2022 · Big Data

Case Study of DCMM Standard Implementation at State Grid Tianjin Electric Power

This article details State Grid Tianjin Electric Power's early adoption and successful certification of the national DCMM data management maturity model, outlining background, certification milestones, systematic practices, and lessons learned that illustrate how data governance, architecture, and application strategies drive digital transformation.

Big Data · DCMM · Data Governance
0 likes · 11 min read
Java Architect Essentials
Oct 31, 2022 · Big Data

How to Process 10 GB of Age Data on a 4 GB Machine Using Java

This article walks through generating a 10 GB file of age values and reading it line‑by‑line on a 2‑core machine with 4 GB of RAM. After measuring single‑thread performance, it redesigns the pipeline with a producer‑consumer model, blocking queues, and multithreaded string splitting to boost CPU utilization and cut processing time while keeping memory consumption under control.

Big Data · File Processing · Memory Optimization
0 likes · 12 min read
Architects' Tech Alliance
Oct 31, 2022 · Industry Insights

What Drives Distributed Storage: Product Forms, Ecosystem, and Key Use Cases

Distributed storage encompasses integrated appliances and pure‑software solutions, each with distinct hardware strategies, and forms a multi‑dimensional industry ecosystem that spans commercial and open‑source software, specialized and generic hardware, serving critical scenarios such as virtualization/cloud, high‑performance computing, and big‑data analytics.

Big Data · Cloud Computing · High‑performance computing
0 likes · 15 min read