Tagged articles
607 articles
Page 3 of 7
DataFunSummit
Oct 12, 2022 · Big Data

Practical Application of Kyuubi in Xiaomi’s Big Data Platform

This article details how Xiaomi integrated the open‑source Kyuubi SQL gateway into its evolving big‑data platform, describing the challenges of running multiple SQL services, the architectural redesign for a unified, high‑availability service, the resulting performance gains, new features such as engine pooling and Z‑ordering, and the future roadmap.

Big Data · Data Platform · Kyuubi
15 min read
Big Data Technology Architecture
Oct 10, 2022 · Big Data

Integrating Apache Hudi with MinIO: A Comprehensive Tutorial

This tutorial explains how to set up Apache Hudi on cloud‑native object storage with MinIO, covering Hudi’s architecture, file format, timeline, write and read paths, core features, schema evolution, and step‑by‑step Spark commands for ingesting, updating, deleting, and querying data in a streaming data‑lake environment.

Apache Hudi · Minio · Spark
26 min read
Youzan Coder
Sep 29, 2022 · Big Data

Implementing Spark Data Lineage with Spline: A Step‑by‑Step Guide

This article explains the growing importance of data lineage in large data warehouses, evaluates three Spark lineage extraction approaches, and provides a detailed, step‑by‑step guide to integrating the open‑source Spline agent—including codeless and programmatic initialization, configuration, dispatcher setup, post‑processing, and known limitations.

Apache Spark · Big Data · Data Governance
16 min read
ITPUB
Sep 22, 2022 · Big Data

What Is a Real‑Time Data Warehouse? Product, Solution, and Use Cases Explained

The article explains the concept of real‑time data warehouses, traces their evolution from early relational databases to modern streaming‑batch engines, discusses whether they are products or solutions, and outlines typical application scenarios, selection criteria, and future trends in the big‑data ecosystem.

Flink · Spark · cloud
10 min read
DataFunSummit
Sep 21, 2022 · Big Data

Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

Apache Doris · Big Data · DMP
10 min read
ByteDance Cloud Native
Sep 2, 2022 · Big Data

How ByteDance’s Cloud Shuffle Service Boosts Big Data Job Stability and Performance

ByteDance’s Cloud Shuffle Service (CSS) replaces the traditional Pull‑Based Sort Shuffle in Spark, FlinkBatch and MapReduce with a Push‑Based remote shuffle that improves stability, performance and elasticity, supports compute‑storage separation, and delivers significant speedups in large‑scale TPC‑DS benchmarks.

Distributed Systems · Performance Optimization · Remote Shuffle
11 min read
Architecture Digest
Aug 27, 2022 · Artificial Intelligence

Understanding Collaborative Filtering, Matrix Factorization, and Spark ALS for Recommendation Systems

This article explains the fundamentals of recommendation systems, introduces collaborative filtering (both user‑based and item‑based), derives the matrix‑factorization model with ALS optimization, provides a complete Python implementation, and demonstrates how to apply Spark ALS in both demo and production environments.

ALS · Spark · collaborative filtering
29 min read
Big Data Technology Architecture
Aug 23, 2022 · Big Data

Apache Hudi 0.12.0 Release Highlights: Presto Connector, Archive Beyond Savepoint, File‑System Locks, Deltastreamer Termination, Spark & Flink Support, Performance Improvements, and Configuration Updates

The Apache Hudi 0.12.0 release introduces a native Presto connector, archive‑beyond‑savepoint capability, file‑system based locking, new deltastreamer termination strategies, expanded Spark and Flink support, numerous performance enhancements, and a series of configuration and API updates for better data‑lake management.

Apache Hudi · Flink · Presto
12 min read
GuanYuan Data Tech Team
Aug 18, 2022 · Big Data

Why Spark’s compatiblePartitions Causes CPU Spikes and How to Fix It

The article investigates a Spark driver CPU overload caused by the compatiblePartitions method’s expensive permutation logic in window functions, explains the underlying O(n!) complexity, and presents a simplified implementation that eliminates the issue and has been merged into the official Spark codebase.

CPU optimization · Spark · Window Functions
7 min read
ITPUB
Aug 1, 2022 · Big Data

How Bilibili Scaled Offline Computing: Migrating from Hive to Spark and Boosting Performance

This article details Bilibili's evolution from a Hadoop‑based offline platform to a Spark‑driven architecture, covering the Hive‑to‑Spark migration, automated SQL conversion, result validation, stability enhancements, performance tuning, meta‑store federation, and future directions for large‑scale data processing.

Big Data · Data Skipping · Hive
31 min read
Big Data Technology & Architecture
Jul 28, 2022 · Big Data

Spark SQL UNION Causing driver.maxResultSize Error and Its Resolution

When executing a Spark SQL query with dozens of UNION subqueries that each contain JOIN operations on Spark 3.1.2, the job fails because the total serialized result size of the tasks exceeds the driver’s maxResultSize limit. The issue can be resolved by reducing the initial partition number used by Adaptive Query Execution.

DriverMaxResultSize · PerformanceTuning · SQL
10 min read
ITPUB
Jul 23, 2022 · Information Security

How Bilibili Secured Hadoop: Ranger‑Based HDFS and Hive Access Control Deep Dive

This article details Bilibili's implementation of Apache Ranger for fine‑grained access control across Hadoop, HDFS, Hive, Spark, and Presto, covering architecture, API redesign, admin optimizations, gray‑release strategies, permission pre‑checks, data masking, and future plans for incremental policy loading.

HDFS · Hive · Presto
16 min read
DataFunTalk
Jul 23, 2022 · Artificial Intelligence

Graph Algorithm Deployment and Practices on the DataFun Security Spark Cluster

This article presents a comprehensive overview of deploying and running graph learning algorithms—both inductive and transductive—on the secure Spark cluster, covering framework choices, data sampling strategies, distributed training techniques, model evaluation metrics, and future directions.

Big Data · Distributed Training · Spark
13 min read
Bilibili Tech
Jul 22, 2022 · Information Security

Design and Optimization of Ranger‑Based Access Control for HDFS and Hive in Bilibili's Data Platform

Bilibili’s data platform redesigns Ranger‑based access control by simplifying HDFS and Hive policy APIs, parallelizing policy loading, adding gray‑release and pre‑check mechanisms, integrating fine‑grained Hive authorization with data‑masking, extending support to Spark and Presto, and planning incremental loading, policy fusion, and a NameNode proxy to boost security and performance.

HDFS · Hive · Presto
15 min read
vivo Internet Technology
Jul 20, 2022 · Artificial Intelligence

Collaborative Filtering and Matrix Factorization: Theory and Spark ALS Implementation

The article introduces collaborative filtering, derives the matrix‑factorization model R≈X·Yᵀ with L2‑regularized ALS updates, demonstrates a full Python example on a small rating matrix, then shows how to implement and scale Spark’s ALS for massive user‑item data, ending with production tips and references.

ALS · Recommendation Systems · Spark
25 min read
Big Data Technology Architecture
Jul 15, 2022 · Big Data

Using and Designing the Apache SeaTunnel Examples Module

This article introduces Apache SeaTunnel's Examples module, compares SeaTunnel with DataX, explains its multi‑engine design, demonstrates Flink and Spark example implementations, and shares the speaker's experiences contributing to the open‑source community, providing practical guidance for big‑data integration projects.

Apache SeaTunnel · Data Integration · Flink
10 min read
GuanYuan Data Tech Team
Jul 14, 2022 · Big Data

How to Train Massive GBDT Models on Spark: A Complete Step‑by‑Step Guide

This article walks through using Apache Spark for large‑scale GBDT training, covering the challenges of massive data, Spark deployment, PySpark code examples, differences from Pandas, feature engineering, mmlspark installation, early‑stopping tricks, performance bottlenecks, and a systematic evaluation of alternative frameworks.

Big Data · GBDT · Performance Optimization
38 min read
dbaplus Community
Jul 13, 2022 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

From data ingestion to real‑time analytics, this guide breaks down the essential layers of a typical big‑data platform—covering collection methods, HDFS storage, Hive/Spark analysis, data sharing mechanisms, application use‑cases, streaming with Spark Streaming, and the need for robust scheduling and monitoring.

Big Data · Data Integration · Data Warehouse
9 min read
Big Data Technology & Architecture
Jul 12, 2022 · Big Data

Analyzing Spark's Iceberg Data Reading Process and Small‑File Merging

This article explains how Spark reads data from Apache Iceberg tables by parsing snapshots and manifest files into DataFile objects, creates Batch and InputPartition objects, uses readers to materialize InternalRows, and then demonstrates how Iceberg's RewriteDataFilesAction can merge tiny Parquet files into larger ones through Spark‑driven tasks.

Big Data · Data Lake · Iceberg
17 min read
Hulu Beijing
Jul 7, 2022 · Big Data

How Hulu Upgraded Hadoop 2.6 to 3.0: Lessons in Compatibility and Migration

This article details Hulu's five‑year journey from Hadoop 2.6 to 3.3.2, covering major feature evolutions, the original cluster architecture, a comprehensive upgrade plan, compatibility challenges across HDFS, YARN, Hive, Spark and Flink, and the testing and rollout strategies that ensured a smooth migration.

Big Data · Cluster Upgrade · Compatibility
17 min read
GuanYuan Data Tech Team
Jun 30, 2022 · Big Data

Why Spark 3.2 OOMs After Upgrade: Deep Dive into AQE and StageMetrics

After upgrading Spark from 3.0.1 to 3.2.1 an ETL job began failing with OutOfMemory errors; this article examines the root causes, including AQE‑related metric accumulation, skipped stages, and stage‑metric growth, and presents a debugging process and a code‑level fix to mitigate memory pressure.

AQE · Big Data · OutOfMemory
13 min read
DataFunTalk
Jun 28, 2022 · Big Data

JD Retail Traffic Data Warehouse Architecture and Processing Practices

This article presents a comprehensive technical overview of JD.com’s retail traffic data processing pipeline, detailing the multi‑layer data warehouse architecture, real‑time and offline data flows, a large‑scale back‑fill case using Iceberg and OLAP, data‑skew detection and mitigation techniques, and future directions involving unified Flink‑Spark streaming‑batch solutions.

Data Skew · Flink · Iceberg
12 min read
ITPUB
Jun 25, 2022 · Big Data

How Spark SQL’s Catalyst Optimizer Accelerates Big Data Queries

This article explains Apache Spark’s role in large‑scale data processing, traces the evolution from Shark to Spark SQL’s DataFrame and Dataset APIs, and details the internal Catalyst optimizer—including its rule‑based and cost‑based strategies—through step‑by‑step examples and code snippets.

Catalyst · Dataset · SQL
11 min read
Big Data Technology Architecture
Jun 8, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Comprehensive Performance Optimizations

The article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing migration tools, SQL conversion, result and resource comparison, shuffle stability, small‑file handling, runtime filters, data skipping, ZSTD support, Hive Metastore federation, traffic control, and future optimization directions.

Data Migration · Hive · Resource Management
29 min read
Bilibili Tech
May 31, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Operational Practices

Bilibili migrated its massive offline platform from Hive to Spark using an automated SQL rewrite and dual‑run verification, cutting execution time by more than 40% and resource use by 30%, while introducing small‑file merging, shuffle stability, runtime filters, data‑skipping, lineage tracking, auto‑parameter tuning, and metastore federation for robust large‑scale processing.

Big Data · Hive · Spark
30 min read
Architecture Digest
May 23, 2022 · Big Data

Overview of Core Technologies in a Big Data Platform Architecture

This article explains the main layers of a typical big data platform—data collection, storage and analysis, sharing, and application—detailing common tools such as Flume, DataX, Hive, Spark, SparkSQL, Impala, and Spark Streaming, and discusses task scheduling and monitoring in the ecosystem.

Data Platform · DataX · Hadoop
10 min read
Alibaba Cloud Developer
May 17, 2022 · Artificial Intelligence

How Databricks and Prophet Power Retail Demand Forecasting for Store‑Item Sales

This article walks through why accurate demand forecasting is critical for retailers, shows how to prepare and visualize sales data, demonstrates building a store‑item model with Databricks DDI and Facebook Prophet, and explains scaling the model to predict every product across all stores, highlighting performance metrics and practical tips.

Databricks · Prophet · Spark
7 min read
Big Data Technology & Architecture
May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache Hudi · Big Data · Copy-on-Write
43 min read
Baidu Geek Talk
May 9, 2022 · Big Data

How a Spark Offline Framework Boosts Data Backtracking Efficiency

This article introduces a Spark offline development framework that separates configuration from code, supports SQL and Java applications, and provides fast, automated data backtracking with reduced environment preparation time, lower failure rates, and significant performance gains for large‑scale data warehouses.

Big Data · Data Backtracking · Java
17 min read
Big Data Technology & Architecture
May 4, 2022 · Big Data

Apache Hudi 0.11.0 Release Highlights: Multi‑Mode Index, Data Skipping, Async Index, Spark & Flink Integration, and New Utilities

The Apache Hudi 0.11.0 release introduces multi‑mode metadata indexing, enhanced data‑skipping, asynchronous indexing, extensive Spark and Flink integration improvements, new bundle utilities, and expanded metadata synchronization with BigQuery, AWS Glue, and DataHub, while also adding bucket indexing and encryption support.

Apache Hudi · Async Index · Big Data
13 min read
ITPUB
Apr 26, 2022 · Big Data

Mastering Delta Lake: From Data Lake Basics to Hands‑On Implementation

This article explains the fundamentals of data lakes and data warehouses, compares their architectures, outlines the challenges of data lakes, and then dives deep into Delta Lake's core features, storage model, ACID guarantees, concurrency handling, and provides step‑by‑step Spark code examples for practical use.

ACID · Copy-on-Write · Data Lake
18 min read
ITPUB
Apr 19, 2022 · Big Data

Which Real-Time Data Warehouse Architecture Fits Your Needs? A Deep Dive

This article explains why modern enterprises need real‑time data‑warehouse architectures, breaks down traditional layered warehouse concepts, compares Lambda and Kappa models, evaluates five practical real‑time solutions—including Iceberg‑based lakehouse and MPP databases—provides code snippets, and offers selection guidance with real‑world company examples.

Big Data · Flink · Iceberg
19 min read
JavaEdge
Apr 17, 2022 · Big Data

Why Spark Overtook MapReduce: Core Advantages and RDD Programming Model

The article explains how Spark, developed by UC Berkeley's AMP Lab, quickly surpassed MapReduce by offering faster execution, a simpler Scala‑based programming model, lazy RDD transformations, a rich ecosystem including SQL, Streaming, MLlib and GraphX, and practical code examples such as a three‑line WordCount.

Big Data · MapReduce · RDD
7 min read
DataFunSummit
Apr 16, 2022 · Big Data

Angel Graph: A Scalable Graph Computing Platform – Architecture, Optimizations, and Applications

The article introduces Angel Graph, a large‑scale graph computing platform built on Angel's parameter‑server architecture and Spark, detailing its evolution, framework components (including Spark‑on‑Angel and PyTorch‑on‑Angel), data and model partitioning strategies, communication and computation optimizations, stability mechanisms, usability features, and real‑world applications across recommendation, risk control, social and gaming domains.

Parameter Server · PyTorch · Spark
15 min read
Zuoyebang Tech Team
Apr 13, 2022 · Big Data

How Delta Lake Transformed Our Offline Data Warehouse Performance

This article details how ZuoYeBang's engineering team migrated their Hive‑based offline data warehouse to Delta Lake, tackling latency, scalability, and query‑performance challenges through stream‑to‑batch processing, data‑lake architecture, and optimizations like DPP and Z‑ordering.

Big Data · Delta Lake · Hive
15 min read
DataFunTalk
Apr 10, 2022 · Big Data

Angel Graph: A Large-Scale Graph Computing Platform by Tencent

This article introduces Tencent's Angel Graph platform, detailing its evolution from early versions to a mature large‑scale graph computing system, its architecture combining Angel PS with Spark and PyTorch, data and model partitioning strategies, communication and computation optimizations, stability features, usability, and real‑world applications.

Angel Graph · Graph Neural Network · Spark
15 min read
Big Data Technology & Architecture
Mar 31, 2022 · Big Data

Bilibili’s Lakehouse Architecture: Integrating Data Lake and Warehouse with Apache Iceberg

To address the high cost and low efficiency of traditional Hadoop‑based data pipelines, Bilibili designed a lakehouse solution using Apache Iceberg, integrating Spark, Flink, Trino, and Alluxio to unify flexible data lake storage with warehouse‑level query performance, reducing data duplication and improving interactive analytics.

Big Data · Data Warehouse · Iceberg
17 min read
IT Services Circle
Mar 21, 2022 · Big Data

Understanding Spark Shuffle: Hash, Sort, and Tungsten Sort Mechanisms

This article explains the evolution and inner workings of Spark's shuffle phase, comparing the original Hash‑based shuffle, the default Sort‑based shuffle, the optimized Tungsten‑Sort shuffle, and related configuration options that affect performance and file handling in large‑scale data processing.

Hash Shuffle · Shuffle · Sort-Shuffle
17 min read
vivo Internet Technology
Feb 23, 2022 · Big Data

Kafka-based Real-Time Data Warehouse: Architecture and Practice for Search

The article explains how Kafka serves as the core of a real‑time data warehouse for search, detailing its advantages over traditional databases, integration with Flink for low‑latency stream processing, architectural patterns such as Lambda/Kappa, scaling challenges, and comprehensive monitoring using Kafka Eagle.

Apache Kafka · Data Integration · Flink
15 min read
DataFunTalk
Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg · Arctic · Big Data
10 min read
JD Retail Technology
Feb 11, 2022 · Big Data

Runtime Filter Join Optimization in JD Spark Using Bloom Filters

This article details JD Spark's Runtime Filter Join optimization, which leverages Bloom filters to prune large‑table data before shuffle, reducing I/O and execution time across batch and real‑time workloads, and presents architecture, implementation challenges, code examples, and performance gains in both benchmark and production environments.

Runtime Filter Join · Shuffle Reduction · Spark
15 min read
IT Architects Alliance
Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Backend Architecture · Big Data · Hive
32 min read
DataFunTalk
Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data · Data Lake · Flink
13 min read
JD Retail Technology
Jan 27, 2022 · Big Data

How JD’s Custom Spark Engine Tackles Data Skew for Massive Offline Jobs

This article explains JD’s self‑developed data‑skew mitigation solution for Spark, detailing the problem of uneven key distribution, the limitations of the open‑source AQE implementation, and JD’s OptimizeSkewedJoinV2 algorithm that dramatically reduces stage latency in large‑scale join workloads.

Adaptive Query Execution · Big Data · Data Skew
13 min read
DataFunTalk
Jan 27, 2022 · Big Data

Kyuubi: NetEase’s Open‑Source Multi‑Tenant SQL Engine for Large‑Scale Data Processing

This article introduces Kyuubi, the first NetEase project contributed to the Apache Foundation, describing its core features, multi‑tenant architecture, Spark‑based execution engine, cloud‑native capabilities, and real‑world use cases within NetEase’s data‑warehouse, ad‑hoc, and internal systems, along with performance gains and community resources.

Apache · Big Data · Kyuubi
23 min read
Architect
Jan 7, 2022 · Big Data

Spark Performance Optimization: Principles, Memory Model, Resource Tuning, Data Skew and Shuffle Tuning

This article provides an in‑depth guide to Spark performance optimization, covering ten development principles, the static and unified memory models, resource parameter tuning, data skew detection and mitigation techniques, and shuffle‑related configuration adjustments, supplemented with practical code examples and diagrams.

Data Skew · Memory Model · Shuffle
40 min read
Big Data Technology & Architecture
Dec 31, 2021 · Big Data

Apache SeaTunnel Joins the Apache Incubator: Overview, Features, and Real‑World Use Cases

SeaTunnel, a data‑integration platform that originated in China and is built on Spark and Flink, has been accepted into the Apache Incubator. This article introduces its history, architecture, plugin ecosystem, deployment requirements, and numerous enterprise deployments across batch and streaming big‑data scenarios.

Apache · Big Data · Data Integration
7 min read
DataFunTalk
Dec 25, 2021 · Artificial Intelligence

Optimizing Spark‑ML Linear Models with Project Matrix: Background, Progress, and Future Plans

This article introduces the Project Matrix initiative that re‑examines and restructures Spark‑ML linear models, detailing the background of Spark‑ML usage at JD, the performance‑focused optimizations such as blockification and virtual centering, and outlines upcoming work to further improve scalability and accuracy.

Big Data · Performance Optimization · Spark
9 min read
Big Data Technology & Architecture
Dec 23, 2021 · Big Data

Key Spark Configuration Parameters and Their Explanations

This article presents a comprehensive list of essential Spark configuration settings—including executor memory, off‑heap memory, memory fractions, shuffle options, and adaptive query execution parameters—each accompanied by a concise description to help users fine‑tune Spark performance.

Adaptive Query Execution · Big Data · Memory Management
6 min read
Big Data Technology & Architecture
Dec 21, 2021 · Big Data

Understanding Spark 3.0 Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP)

This article explains the two most important Spark 3.0 features—Adaptive Query Execution and Dynamic Partition Pruning—detailing how AQE dynamically optimizes join strategies, partition coalescing, and skew handling, while DPP reduces I/O by pruning irrelevant fact‑table partitions at runtime.

Adaptive Query Execution · Big Data · Dynamic Partition Pruning
10 min read
JD Cloud Developers
Dec 15, 2021 · Big Data

How JD Retail Scales Billion‑Item Selection with ClickHouse & Elasticsearch

This article details JD Retail's strategic "Nirvana" product‑selection platform, describing the technical challenges of handling billions of items and hundreds of tags, and presenting a dual‑engine solution using ClickHouse and Elasticsearch with Spark‑driven data pipelines to achieve fast filtering, multidimensional analytics, and efficient storage.

Big Data · ClickHouse · Elasticsearch
15 min read
Big Data Technology & Architecture
Dec 6, 2021 · Big Data

Understanding Spark’s Memory Model: Unified Memory Management, On‑Heap and Off‑Heap Memory, and Configuration

This article explains Spark’s unified memory management model, detailing the division between on‑heap and off‑heap memory, the roles of execution, storage, user, and reserved memory, configuration parameters, dynamic allocation, and how these concepts affect performance and resource utilization.

Execution Memory · Memory Management · Off-Heap
17 min read
Big Data Technology & Architecture
Dec 4, 2021 · Big Data

Understanding Spark's BlockManager, MemoryStore, and DiskStore

This article explains Spark's storage architecture, detailing the roles and interactions of BlockManager, MemoryStore, and DiskStore, including their initialization, data management mechanisms, code implementations, and eviction strategies, to help readers grasp how Spark efficiently handles in‑memory and on‑disk data.

Big Data · BlockManager · DiskStore
0 likes · 12 min read
Big Data Technology & Architecture
Dec 1, 2021 · Big Data

Understanding Spark Shuffle: Mechanisms, Evolution, and Optimization

This article provides a comprehensive overview of Spark's shuffle process, explaining its definition, internal mechanisms such as shuffle write and read, the evolution of shuffle managers, and practical optimization techniques including parameter tuning and broadcast variables, all aimed at improving performance in large‑scale data processing.

Big Data · Shuffle · Shuffle Reader
0 likes · 18 min read
Big Data Technology & Architecture
Nov 22, 2021 · Big Data

Comprehensive Big Data Learning Path and Resource Guide

This article presents a detailed learning roadmap for aspiring big‑data experts, covering foundational programming languages, data structures, Linux basics, databases, distributed system theory, and essential frameworks such as Hadoop, Spark, Flink, and Kafka, with curated Bilibili video links and reference materials.

Big Data · Flink · Hadoop
0 likes · 9 min read
DataFunTalk
Nov 20, 2021 · Big Data

How to Build a Big Data Platform from Zero to One: Architecture, Components, and Best Practices

This article provides a comprehensive guide to designing and implementing a big‑data platform, covering architecture overview, data ingestion with Flume, storage on HDFS/Hive/HBase, processing engines such as Hive, Spark and Flink, scheduling solutions like Azkaban and Airflow, and the construction of self‑service analytics systems.

Big Data · ETL · Hadoop
0 likes · 29 min read
Big Data Technology Architecture
Nov 16, 2021 · Big Data

Understanding Adaptive Query Execution and Dynamic Partition Pruning in Apache Spark 3.0

This article explains how Apache Spark 3.0 improves SQL workload performance through Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP), detailing their design principles, runtime optimizations, configuration parameters, and practical examples that demonstrate reduced shuffle partitions, smarter join strategies, and handling of data skew.

Adaptive Query Execution · Dynamic Partition Pruning · SQL Optimization
0 likes · 9 min read
Big Data Technology & Architecture
Nov 8, 2021 · Big Data

Why Choose Apache Iceberg? Tencent’s Optimizations and Real‑World Practices

This article examines the strengths and weaknesses of Apache Iceberg, explains why Tencent selected it over alternatives, details Tencent’s own enhancements and integration with Flink, Spark, and other engines, and shares multiple real‑world implementations for building enterprise‑grade real‑time data lakes.

Apache Iceberg · Data Lake · Flink
0 likes · 17 min read
Big Data Technology & Architecture
Oct 23, 2021 · Big Data

Understanding Hive Execution Engines: MapReduce, Tez, and Spark – Principles, Optimization, and Explain Usage

This article provides a comprehensive overview of Hive's execution engines—including MapReduce, Tez, and Spark—detailing their architectures, the six-stage Hive SQL compilation process, practical Explain syntax examples, and extensive tuning parameters for each engine to improve performance in big‑data environments.

Hive · MapReduce · SQL Optimization
0 likes · 48 min read
Java High-Performance Architecture
Oct 12, 2021 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

This article breaks down a typical big data platform architecture into its four layers—data collection, storage and analysis, sharing, and real‑time computation—detailing the essential tools such as Flume, HDFS, Hive, Spark, DataX, and task scheduling systems that enable scalable, low‑latency data processing and delivery.

Big Data · Data Architecture · DataX
0 likes · 8 min read
Architecture Digest
Oct 11, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers—data collection, storage & analysis, data sharing, and application—and describing the key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task scheduling components.

Big Data · Data Architecture · DataX
0 likes · 8 min read
Big Data Technology Architecture
Sep 28, 2021 · Big Data

Integrating Apache Kyuubi with CDH 6 and Spark 3: Deployment, Configuration, and Performance Tuning

This guide explains how to deploy Apache Kyuubi on a CDH 6 cluster, replace HiveServer2 with Kyuubi, integrate Spark 3, apply necessary patches, configure environment and Spark settings, and optimize engine sharing for various workloads, providing complete code snippets and step‑by‑step instructions.

CDH · HiveServer2 · Kyuubi
0 likes · 19 min read
Big Data Technology & Architecture
Sep 13, 2021 · Big Data

Understanding Bytecode, Code Generation, Serialization, and Data Processing Techniques in Spark and Flink

This article explains how bytecode and code‑generation improve Spark SQL performance, compares Java I/O and MapReduce InputFormats, reviews serialization choices in Spark and Flink, and describes reflection‑based DataFrame creation, storage‑memory eviction, fail‑fast design, and ConcurrentHashMap usage in big‑data frameworks.

Code Generation · Flink · Java
0 likes · 11 min read
Ctrip Technology
Sep 9, 2021 · Big Data

Building Data Lineage at Ctrip: Architecture, Implementation, and Real‑World Applications

This article describes how Ctrip built a data lineage system for its big data platform, covering the concept of data lineage, collection methods, open‑source tools such as Apache Atlas and DataHub, the in‑house table‑level and field‑level solutions, implementation details for Hive, Spark and Presto, storage in JanusGraph, and practical applications in data governance, metadata management, scheduling and sensitivity labeling.

Big Data · Hive · JanusGraph
0 likes · 16 min read
IT Architects Alliance
Sep 5, 2021 · Big Data

Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture, detailing its core layers—data acquisition, storage and analysis, sharing, application, real‑time computation, and task scheduling—while introducing key technologies such as Flume, HDFS, Hive, Spark, DataX, and monitoring considerations.

Big Data · Data Platform · Hadoop
0 likes · 9 min read
Architects' Tech Alliance
Sep 2, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

The article outlines a typical big data platform architecture, detailing its core layers—data collection, storage and analysis, sharing, application, real-time computation, and task scheduling—while describing key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and Redis.

Data Architecture · Data Integration · Hadoop
0 likes · 9 min read
Big Data Technology Architecture
Aug 24, 2021 · Big Data

Comprehensive Guide to Spark Performance Optimization, Data Skew Mitigation, and Troubleshooting

This article presents a detailed collection of Spark performance‑tuning techniques—including submit‑script parameters, RDD and operator optimizations, parallelism and memory settings, broadcast variables, Kryo serialization, locality wait adjustments—as well as systematic methods for detecting and resolving data skew and common runtime issues such as shuffle failures, serialization errors, and JVM memory problems.

Data Skew · Shuffle · Spark
0 likes · 21 min read
Big Data Technology Architecture
Aug 12, 2021 · Big Data

Enterprise Data Lake Architecture, Delta Lake Core Capabilities, and Stream‑Batch Integrated Analytics on Alibaba Cloud

This article explains the rapid growth of data, the limitations of traditional warehouses, and how a cloud‑based data lake built on object storage with Delta Lake format provides low‑cost, flexible, and ACID‑compliant analytics, followed by a step‑by‑step guide to ingest, manage, and analyze data using Alibaba Cloud DLF and Databricks DDI with Spark streaming and batch jobs.

Alibaba Cloud · Delta Lake · Spark
0 likes · 19 min read
Big Data Technology & Architecture
Aug 2, 2021 · Big Data

Comprehensive Big Data Interview Question Guide for Major Tech Companies

This article compiles extensive interview questions and topics covering Hadoop, Spark, Flink, Hive, Kafka, MySQL, Redis, Java fundamentals, and algorithms, organized by companies such as Xiaomi, ByteDance, Alibaba, Shopee, Tencent, Meituan, NetEase, and Baidu, to help candidates prepare effectively for big‑data engineering roles.

Big Data · Flink · Hadoop
0 likes · 22 min read
Big Data Technology Architecture
Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem—including Hadoop’s storage and compute framework, Hive data warehouse, HBase NoSQL database, Spark unified engine, Kafka messaging platform, and Elasticsearch search engine—explaining their core concepts, architectures, and typical use cases.

Big Data · Elasticsearch · HBase
0 likes · 9 min read