Tagged articles

Spark

623 articles · Page 3 of 7

Nov 28, 2022 · Cloud Native

How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok

ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.

Big Data · Cloud Native · Resource Scheduling

0 likes · 17 min read

ITPUB

Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch Processing · Big Data · Data Lake

0 likes · 23 min read

Data Thinking Notes

Nov 15, 2022 · Operations

Why Is Airflow Draining CPU? A Step‑by‑Step Diagnosis and Fix

A high‑CPU anomaly on a Spark‑enabled machine was traced through application checks, network TIME_WAIT analysis, and Airflow inspection, leading to kernel tweaks and an Airflow configuration change that finally restored normal CPU usage.

CPU · Linux · Spark

0 likes · 4 min read

Meituan Technology Team

Nov 10, 2022 · Big Data

Optimizing Spark mapPartitions: Memory Management and Best Practices

The article details how Meituan’s Turing machine‑learning platform cut offline resource use by 80% and task time by 63% through memory‑level techniques such as column pruning, adaptive caching, and a deep dive into Spark’s mapPartitions operator, including source‑code analysis, GC behavior, and a low‑memory batch‑iterator best practice.

Big Data · Memory optimization · Performance Tuning

0 likes · 19 min read

Data Thinking Notes

Nov 8, 2022 · Big Data

Effective Spark GC Tuning: Experiments, Results, and Best Practices

This article walks through a Spark job’s garbage‑collection tuning workflow, presents step‑by‑step experiments with different JVM options and collectors, compares performance under tight and normal memory conditions, and offers practical recommendations for choosing the optimal GC strategy in big‑data workloads.

GC · Spark · big-data

0 likes · 12 min read

dbaplus Community

Oct 30, 2022 · Big Data

Why Layered Data Warehouse Modeling Boosts Performance and Cuts Costs

This article explains the importance of layering in data warehouse modeling, outlines the four ETL steps, describes common pitfalls, presents a typical technical stack, and details each warehouse layer (ODS, DWD, DWS, ADS) along with best‑practice naming conventions and implementation tips for big‑data environments.

ETL · Hive · Spark

0 likes · 38 min read

DataFunSummit

Oct 30, 2022 · Big Data

Integrating Apache Spark with Cloud‑Native Technologies: Principles, Kubernetes Deployments, EMR on ACK, and Serverless Spark on DLF

This article examines the challenges of traditional Spark clusters and explains how integrating Spark with cloud‑native platforms—through Kubernetes deployment modes, EMR on ACK practices, Remote Shuffle Service, and serverless Spark on DLF—provides elastic scaling, lower operational costs, and advanced features such as executor rolling and custom scheduler support.

Big Data · DLF · Serverless

0 likes · 18 min read

DataFunSummit

Oct 29, 2022 · Big Data

Apache Iceberg in Tencent: Architecture, Spark Read/Write, Production Practices, and Data Governance

This article presents an in‑depth overview of Apache Iceberg as used at Tencent, covering its table format architecture, Spark read/write mechanisms, production challenges and optimizations such as schema evolution, file filtering, upsert strategies, and the surrounding data‑governance services.

Apache Iceberg · Big Data · Data Governance

0 likes · 19 min read

Xingsheng Youxuan Technology Community

Oct 21, 2022 · Big Data

How We Cut Hudi Data Lake Write Costs by Over 85% with Custom Architecture

This article examines the challenges of using Apache Hudi for real‑time data lake writes, analyzes the COW and MOR write models, and presents a custom master‑worker architecture with index optimization and repartitioning that reduces write resource consumption by over 85% while boosting throughput up to 300‑fold.

COW · Data Lake · Hudi

0 likes · 14 min read

Bilibili Tech

Oct 21, 2022 · Big Data

Kyuubi at Bilibili: Architecture, Enhancements, and Production Practices for Large‑Scale Data Processing

Bilibili adopted the open‑source Kyuubi proxy to replace its unstable STS layer, enabling multi‑tenant, multi‑engine (Spark, Presto, Flink) SQL/Scala processing with Hive Thrift compatibility, fine‑grained queue isolation, UI monitoring, stability safeguards, and Kubernetes/YARN deployment, while planning further cloud‑native extensions.

Big Data · Kyuubi · SQL

0 likes · 20 min read

Hulu Beijing

Oct 21, 2022 · Big Data

How Hulu Scales Spark on Kubernetes: Cloud‑Native Big Data at Disney‑Scale

Hulu’s data platform team describes how they migrated large‑scale Spark workloads from YARN to native Spark on Kubernetes, leveraging AWS services such as EKS, S3, and custom operators to achieve dynamic scaling, unified monitoring, cost‑effective resource management, and improved stability for search, recommendation, and advertising pipelines.

AWS · Big Data · Cloud Native

0 likes · 18 min read

DataFunSummit

Oct 18, 2022 · Big Data

Feature Overview of Apache Kyuubi (Incubating) v1.5.0

The article presents a detailed technical walkthrough of Apache Kyuubi 1.5.0, covering its service‑oriented architecture, high‑availability design, multi‑engine extensions for Spark, Flink, Trino and Hive, enhanced engine‑sharing policies, POOL mode configuration, and the project’s future roadmap.

Apache Kyuubi · Big Data · Engine Architecture

0 likes · 13 min read

NetEase LeiHuo UX Big Data Technology

Oct 17, 2022 · Big Data

Understanding Data Skew and Its Mitigation Strategies in Distributed Computing

The article explains what data skew is in distributed computing, analyzes its logical and data‑level causes, and presents preventive and remedial techniques such as data partitioning, logical replacement, two‑stage aggregation, increasing parallelism, and data cleaning to improve processing efficiency.

Data Skew · Performance Optimization · Spark

0 likes · 8 min read

Xingsheng Youxuan Technology Community

Oct 14, 2022 · Big Data

How a Leading E‑commerce Platform Built a Scalable Data Warehouse with Lambda & Hudi

This article explains how an e‑commerce company designed and implemented a modern data warehouse—combining batch Spark jobs, real‑time Flink streams, and Hudi data‑lake storage—to handle terabytes of daily logs, ensure data quality, and provide fast, reliable analytics for business decision‑making.

Data Lake · Data Warehouse · ETL

0 likes · 16 min read

Big Data Technology & Architecture

Oct 13, 2022 · Big Data

Hudi Clustering After Batch Processing: Merging Small Files Before Streaming

This guide details how to execute Apache Hudi file clustering after a batch job and before streaming, using Spark commands to merge numerous small HDFS files into larger ones, configure clustering and cleaning policies, and verify the results with HDFS counts.

Apache Hudi · Big Data · Data Lake

0 likes · 15 min read

DataFunSummit

Oct 12, 2022 · Big Data

Practical Application of Kyuubi in Xiaomi’s Big Data Platform

This article details how Xiaomi integrated the open‑source Kyuubi SQL gateway into its evolving big‑data platform, describing the challenges of multiple SQL services, the architectural redesign for a unified, high‑availability service, performance gains, new features such as engine pooling and Z‑ordering, and future roadmap plans.

Big Data · Data Platform · Kyuubi

0 likes · 15 min read

Big Data Technology Architecture

Oct 10, 2022 · Big Data

Integrating Apache Hudi with MinIO: A Comprehensive Tutorial

This tutorial explains how to set up Apache Hudi on cloud‑native object storage with MinIO, covering Hudi’s architecture, file format, timeline, write and read paths, core features, schema evolution, and step‑by‑step Spark commands for ingesting, updating, deleting, and querying data in a streaming data‑lake environment.

Apache Hudi · Spark · minio

0 likes · 26 min read

Youzan Coder

Sep 29, 2022 · Big Data

Implementing Spark Data Lineage with Spline: A Step‑by‑Step Guide

This article explains the growing importance of data lineage in large data warehouses, evaluates three Spark lineage extraction approaches, and provides a detailed, step‑by‑step guide to integrating the open‑source Spline agent—including codeless and programmatic initialization, configuration, dispatcher setup, post‑processing, and known limitations.

Apache Spark · Big Data · Data Governance

0 likes · 16 min read

ITPUB

Sep 22, 2022 · Big Data

What Is a Real‑Time Data Warehouse? Product, Solution, and Use Cases Explained

The article explains the concept of real‑time data warehouses, traces their evolution from early relational databases to modern streaming‑batch engines, discusses whether they are products or solutions, outlines typical application scenarios, selection criteria, and future trends in the big‑data ecosystem.

Cloud · Flink · Spark

0 likes · 10 min read

DataFunSummit

Sep 21, 2022 · Big Data

Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

Apache Doris · Big Data · DMP

0 likes · 10 min read

Past Memory Big Data

Sep 13, 2022 · Databases

Velox: An Open‑Source Unified Execution Engine for Data Systems

Velox is Meta's open‑source unified execution engine that consolidates common data‑intensive components, integrates with engines like Presto, Spark, and TorchArrow, and delivers up to ten‑fold speedups on CPU‑bound queries while simplifying development and fostering a reusable, community‑driven ecosystem.

Data Management · Spark · Unified Execution Engine

0 likes · 9 min read

Sohu Tech Products

Sep 7, 2022 · Big Data

Introducing the Fire Framework: Annotation‑Driven Development for Spark and Flink

The Fire framework, open‑sourced by ZTO Express, provides a unified annotation‑based programming model for real‑time Spark and Flink jobs, dramatically reducing boilerplate, simplifying configuration, and enabling rapid development of large‑scale data processing tasks with concise Scala code examples.

Fire Framework · Flink · Real-time Processing

0 likes · 12 min read

Zhengcaiyun Technology

Sep 6, 2022 · Big Data

Compiling and Deploying Spark 3.3.0 on CDH 6.3.2 (Cloudera) – Step‑by‑Step Guide

This guide explains how to download JDK, Maven, Scala and Spark 3.3.0, modify the Spark pom and configuration files for CDH 6.3.2, compile Spark with Maven, deploy the binaries to a client node, set up spark‑sql and spark‑submit scripts, and address common runtime issues.

CDH · Compilation · Hadoop

0 likes · 13 min read

ByteDance Cloud Native

Sep 2, 2022 · Big Data

How ByteDance’s Cloud Shuffle Service Boosts Big Data Job Stability and Performance

ByteDance’s Cloud Shuffle Service (CSS) replaces the traditional Pull‑Based Sort Shuffle in Spark, FlinkBatch and MapReduce with a Push‑Based remote shuffle that improves stability, performance and elasticity, supports compute‑storage separation, and delivers significant speedups in large‑scale TPC‑DS benchmarks.

Performance Optimization · Remote Shuffle · Shuffle Service

0 likes · 11 min read

Architecture Digest

Aug 27, 2022 · Artificial Intelligence

Understanding Collaborative Filtering, Matrix Factorization, and Spark ALS for Recommendation Systems

This article explains the fundamentals of recommendation systems, introduces collaborative filtering (both user‑based and item‑based), derives the matrix‑factorization model with ALS optimization, provides a complete Python implementation, and demonstrates how to apply Spark ALS in both demo and production environments.

ALS · Spark · collaborative filtering

0 likes · 29 min read

Big Data Technology Architecture

Aug 23, 2022 · Big Data

Apache Hudi 0.12.0 Release Highlights: Presto Connector, Archive Beyond Savepoint, File‑System Locks, Deltastreamer Termination, Spark & Flink Support, Performance Improvements, and Configuration Updates

The Apache Hudi 0.12.0 release introduces a native Presto connector, archive‑beyond‑savepoint capability, file‑system based locking, new deltastreamer termination strategies, expanded Spark and Flink support, numerous performance enhancements, and a series of configuration and API updates for better data‑lake management.

Apache Hudi · Flink · Spark

0 likes · 12 min read

GuanYuan Data Tech Team

Aug 18, 2022 · Big Data

Why Spark’s compatiblePartitions Causes CPU Spikes and How to Fix It

The article investigates a Spark driver CPU overload caused by the compatiblePartitions method’s expensive permutation logic in window functions, explains the underlying O(n!) complexity, and presents a simplified implementation that eliminates the issue and has been merged into the official Spark codebase.

CPU optimization · Spark · Window Functions

0 likes · 7 min read

Hulu Beijing

Aug 4, 2022 · Big Data

Unlock Seamless Object Serialization & Checkpoint Recovery in Spark with Neutrino

This article explains how Neutrino’s SerializableProvider API enables passing final classes, managing mutable object state, and supporting Spark checkpoint recovery through dependency injection, while also showing practical code patterns and injection of core Spark components.

Big Data · Checkpoint · Dependency Injection

0 likes · 8 min read

ITPUB

Aug 1, 2022 · Big Data

How Bilibili Scaled Offline Computing: Migrating from Hive to Spark and Boosting Performance

This article details Bilibili's evolution from a Hadoop‑based offline platform to a Spark‑driven architecture, covering the Hive‑to‑Spark migration, automated SQL conversion, result validation, stability enhancements, performance tuning, meta‑store federation, and future directions for large‑scale data processing.

Big Data · Data Skipping · Hive

0 likes · 31 min read

Big Data Technology & Architecture

Jul 28, 2022 · Big Data

Spark SQL UNION Causing driver.maxResultSize Error and Its Resolution

When executing a Spark SQL query with dozens of UNION subqueries that each contain JOIN operations on Spark 3.1.2, the job fails because the total serialized result size of the tasks exceeds the driver’s maxResultSize limit, and the issue can be resolved by reducing the initial partition number used by Adaptive Query Execution.

DriverMaxResultSize · PerformanceTuning · SQL

0 likes · 10 min read

ITPUB

Jul 23, 2022 · Information Security

How Bilibili Secured Hadoop: Ranger‑Based HDFS and Hive Access Control Deep Dive

This article details Bilibili's implementation of Apache Ranger for fine‑grained access control across Hadoop, HDFS, Hive, Spark, and Presto, covering architecture, API redesign, admin optimizations, gray‑release strategies, permission pre‑checks, data masking, and future plans for incremental policy loading.

Access Control · Data Security · HDFS

0 likes · 16 min read

DataFunTalk

Jul 23, 2022 · Artificial Intelligence

Graph Algorithm Deployment and Practices on the DataFun Security Spark Cluster

This article presents a comprehensive overview of deploying and running graph learning algorithms—both inductive and transductive—on the secure Spark cluster, covering framework choices, data sampling strategies, distributed training techniques, model evaluation metrics, and future directions.

Big Data · Spark · distributed training

0 likes · 13 min read

Bilibili Tech

Jul 22, 2022 · Information Security

Design and Optimization of Ranger‑Based Access Control for HDFS and Hive in Bilibili's Data Platform

Bilibili’s data platform redesigns Ranger‑based access control by simplifying HDFS and Hive policy APIs, parallelizing policy loading, adding gray‑release and pre‑check mechanisms, integrating fine‑grained Hive authorization with data‑masking, extending support to Spark and Presto, and planning incremental loading, policy fusion, and a NameNode proxy to boost security and performance.

Access Control · HDFS · Hive

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jul 21, 2022 · Big Data

Boosting Offline Data Warehouse Performance with DeltaLake: Key Strategies

This article details how Zuoyebang migrated its Hive‑based offline data warehouse to DeltaLake, addressing latency, incremental updates, and query performance through stream‑to‑batch processing, dynamic partition pruning, and Z‑order optimization, resulting in faster data readiness and analyst queries.

Big Data · DeltaLake · Hive

0 likes · 17 min read

vivo Internet Technology

Jul 20, 2022 · Artificial Intelligence

Collaborative Filtering and Matrix Factorization: Theory and Spark ALS Implementation

The article introduces collaborative filtering, derives the matrix‑factorization model R≈X·Yᵀ with L2‑regularized ALS updates, demonstrates a full Python example on a small rating matrix, then shows how to implement and scale Spark’s ALS for massive user‑item data, ending with production tips and references.

ALS · Recommendation Systems · Spark

0 likes · 25 min read

DataFunTalk

Jul 17, 2022 · Big Data

Redesigning Apache SeaTunnel: Decoupling Source and Sink APIs for Multi‑Engine Support

The presentation details the motivations, goals, and architectural redesign of Apache SeaTunnel (Incubating) to decouple its Source and Sink APIs from underlying engines, introducing unified APIs, version‑agnostic connectors, and enhanced support for Spark and Flink in both batch and streaming scenarios.

Apache SeaTunnel · Big Data · Data Integration

0 likes · 12 min read

Big Data Technology Architecture

Jul 15, 2022 · Big Data

Using and Designing the Apache SeaTunnel Examples Module

This article introduces Apache SeaTunnel's Examples module, compares SeaTunnel with DataX, explains its multi‑engine design, demonstrates Flink and Spark example implementations, and shares the speaker's experiences contributing to the open‑source community, providing practical guidance for big‑data integration projects.

Apache SeaTunnel · Data Integration · Flink

0 likes · 10 min read

GuanYuan Data Tech Team

Jul 14, 2022 · Big Data

How to Train Massive GBDT Models on Spark: A Complete Step‑by‑Step Guide

This article walks through using Apache Spark for large‑scale GBDT training, covering the challenges of massive data, Spark deployment, PySpark code examples, differences from Pandas, feature engineering, mmlspark installation, early‑stopping tricks, performance bottlenecks, and a systematic evaluation of alternative frameworks.

Big Data · GBDT · Performance Optimization

0 likes · 38 min read

dbaplus Community

Jul 13, 2022 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

From data ingestion to real‑time analytics, this guide breaks down the essential layers of a typical big‑data platform—covering collection methods, HDFS storage, Hive/Spark analysis, data sharing mechanisms, application use‑cases, streaming with Spark Streaming, and the need for robust scheduling and monitoring.

Big Data · Data Integration · Data Warehouse

0 likes · 9 min read

DataFunSummit

Jul 12, 2022 · Big Data

Practical Use of Apache Iceberg in Microvision's Data Warehouse: Architecture, Real‑time Integration, and Table Maintenance

This article details why Microvision adopted Apache Iceberg, how it replaces parts of their Lambda‑architecture data pipeline, the real‑time and offline use cases, table‑maintenance practices such as snapshot cleanup and small‑file merging, and lessons learned from the implementation.

Big Data · Data Lake · Flink

0 likes · 17 min read

Big Data Technology & Architecture

Jul 12, 2022 · Big Data

Analyzing Spark's Iceberg Data Reading Process and Small‑File Merging

This article explains how Spark reads data from Apache Iceberg tables by parsing snapshots and manifest files into DataFile objects, creates Batch and InputPartition objects, uses readers to materialize InternalRows, and then demonstrates how Iceberg's RewriteDataFilesAction can merge tiny Parquet files into larger ones through Spark‑driven tasks.

Big Data · Data Lake · Iceberg

0 likes · 17 min read

Hulu Beijing

Jul 7, 2022 · Big Data

How Hulu Upgraded Hadoop 2.6 to 3.0: Lessons in Compatibility and Migration

This article details Hulu's five‑year journey from Hadoop 2.6 to 3.3.2, covering major feature evolutions, the original cluster architecture, a comprehensive upgrade plan, compatibility challenges across HDFS, YARN, Hive, Spark and Flink, and the testing and rollout strategies that ensured a smooth migration.

Big Data · Flink · Hadoop

0 likes · 17 min read

Big Data Technology & Architecture

Jul 6, 2022 · Big Data

Understanding Apache Iceberg File Storage Format and Write Processes in Spark and Flink

This article explains the Apache Iceberg file storage format, its metadata hierarchy, and demonstrates how Spark and Flink write data to Iceberg tables, including detailed code examples, manifest handling, snapshot management, and commit processes for efficient data lake operations.

Apache Iceberg · Big Data · Data Lake

0 likes · 31 min read

GuanYuan Data Tech Team

Jun 30, 2022 · Big Data

Why Spark 3.2 OOMs After Upgrade: Deep Dive into AQE and StageMetrics

After upgrading Spark from 3.0.1 to 3.2.1 an ETL job began failing with OutOfMemory errors; this article examines the root causes, including AQE‑related metric accumulation, skipped stages, and stage‑metric growth, and presents a debugging process and a code‑level fix to mitigate memory pressure.

AQE · Big Data · OutOfMemory

0 likes · 13 min read

DataFunTalk

Jun 28, 2022 · Big Data

JD Retail Traffic Data Warehouse Architecture and Processing Practices

This article presents a comprehensive technical overview of JD.com’s retail traffic data processing pipeline, detailing the multi‑layer data warehouse architecture, real‑time and offline data flows, a large‑scale back‑fill case using Iceberg and OLAP, data‑skew detection and mitigation techniques, and future directions involving unified Flink‑Spark streaming‑batch solutions.

Data Skew · Flink · Iceberg

0 likes · 12 min read

ITPUB

Jun 25, 2022 · Big Data

How Spark SQL’s Catalyst Optimizer Accelerates Big Data Queries

This article explains Apache Spark’s role in large‑scale data processing, traces the evolution from Shark to Spark SQL’s DataFrame and Dataset APIs, and details the internal Catalyst optimizer—including its rule‑based and cost‑based strategies—through step‑by‑step examples and code snippets.

Catalyst · Optimization · SQL

0 likes · 11 min read

Big Data Technology Architecture

Jun 8, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Comprehensive Performance Optimizations

The article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing migration tools, SQL conversion, result and resource comparison, shuffle stability, small‑file handling, runtime filters, data skipping, ZSTD support, Hive Metastore federation, traffic control, and future optimization directions.

Data Migration · Hive · Resource Management

0 likes · 29 min read

Bilibili Tech

May 31, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Operational Practices

Bilibili migrated its massive offline platform from Hive to Spark using an automated SQL rewrite and dual‑run verification, cutting execution time over 40% and resource use 30%, while introducing small‑file merging, shuffle stability, runtime filters, data‑skipping, lineage tracking, auto‑parameter tuning, and metastore federation for robust large‑scale processing.

Big Data · Data Engineering · Hive

0 likes · 30 min read

Architecture Digest

May 23, 2022 · Big Data

Overview of Core Technologies in a Big Data Platform Architecture

This article explains the main layers of a typical big data platform—data collection, storage and analysis, sharing, and application—detailing common tools such as Flume, DataX, Hive, Spark, SparkSQL, Impala, and Spark Streaming, and discusses task scheduling and monitoring in the ecosystem.

Data Platform · DataX · Hadoop

0 likes · 10 min read

Alibaba Cloud Developer

May 17, 2022 · Artificial Intelligence

How Databricks and Prophet Power Retail Demand Forecasting for Store‑Item Sales

This article walks through why accurate demand forecasting is critical for retailers, shows how to prepare and visualize sales data, demonstrates building a store‑item model with Databricks DDI and Facebook Prophet, and explains scaling the model to predict every product across all stores, highlighting performance metrics and practical tips.

Databricks · Prophet · Spark

0 likes · 7 min read

Big Data Technology & Architecture

May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache Hudi · Big Data · Copy-on-Write

0 likes · 43 min read

Baidu Geek Talk

May 9, 2022 · Big Data

How a Spark Offline Framework Boosts Data Backtracking Efficiency

This article introduces a Spark offline development framework that separates configuration from code, supports SQL and Java applications, and provides fast, automated data backtracking with reduced environment preparation time, lower failure rates, and significant performance gains for large‑scale data warehouses.

Big Data · Data Backtracking · Java

0 likes · 17 min read

Big Data Technology & Architecture

May 4, 2022 · Big Data

Apache Hudi 0.11.0 Release Highlights: Multi‑Mode Index, Data Skipping, Async Index, Spark & Flink Integration, and New Utilities

The Apache Hudi 0.11.0 release introduces multi‑mode metadata indexing, enhanced data‑skipping, asynchronous indexing, extensive Spark and Flink integration improvements, new bundle utilities, and expanded metadata synchronization with BigQuery, AWS Glue, and DataHub, while also adding bucket indexing and encryption support.

Apache Hudi · Async Index · Big Data

0 likes · 13 min read

ITPUB

Apr 26, 2022 · Big Data

Mastering Delta Lake: From Data Lake Basics to Hands‑On Implementation

This article explains the fundamentals of data lakes and data warehouses, compares their architectures, outlines the challenges of data lakes, and then dives deep into Delta Lake's core features, storage model, ACID guarantees, concurrency handling, and provides step‑by‑step Spark code examples for practical use.

ACID · Copy-on-Write · Data Lake

0 likes · 18 min read

ITPUB

Apr 19, 2022 · Big Data

Which Real-Time Data Warehouse Architecture Fits Your Needs? A Deep Dive

This article explains why modern enterprises need real‑time data‑warehouse architectures, breaks down traditional layered warehouse concepts, compares Lambda and Kappa models, evaluates five practical real‑time solutions—including Iceberg‑based lakehouse and MPP databases—provides code snippets, and offers selection guidance with real‑world company examples.

Big Data · Flink · Iceberg

0 likes · 19 min read

Alibaba Cloud Developer

Apr 18, 2022 · Big Data

What Is Delta Lake? A Deep Dive into the Lakehouse Evolution and Features

This article explains the evolution from traditional data warehouses to data lakes and the modern Lakehouse architecture, introduces Delta Lake's core concepts, multi‑hop medallion tables, ACID transactions, generated columns, standalone support, and future open‑source directions.

Data Management · Delta Lake · Generated Columns

0 likes · 13 min read

JavaEdge

Apr 17, 2022 · Big Data

Why Spark Overtook MapReduce: Core Advantages and RDD Programming Model

The article explains how Spark, developed by UC Berkeley's AMP Lab, quickly surpassed MapReduce by offering faster execution, a simpler Scala‑based programming model, lazy RDD transformations, a rich ecosystem including SQL, Streaming, MLlib and GraphX, and practical code examples such as a three‑line WordCount.

Big Data · MapReduce · RDD

0 likes · 7 min read

DataFunSummit

Apr 16, 2022 · Big Data

Angel Graph: A Scalable Graph Computing Platform – Architecture, Optimizations, and Applications

The article introduces Angel Graph, a large‑scale graph computing platform built on Angel's parameter‑server architecture and Spark, detailing its evolution, framework components (including Spark‑on‑Angel and PyTorch‑on‑Angel), data and model partitioning strategies, communication and computation optimizations, stability mechanisms, usability features, and real‑world applications across recommendation, risk control, social and gaming domains.

Optimization · Parameter Server · PyTorch

0 likes · 15 min read

Zuoyebang Tech Team

Apr 13, 2022 · Big Data

How Delta Lake Transformed Our Offline Data Warehouse Performance

This article details how ZuoYeBang's engineering team migrated their Hive‑based offline data warehouse to Delta Lake, tackling latency, scalability, and query‑performance challenges through stream‑to‑batch processing, data‑lake architecture, and optimizations like DPP and Z‑ordering.

Big Data · Delta Lake · Hive

0 likes · 15 min read

Volcano Engine Developer Services

Apr 11, 2022 · Big Data

How ByteDance Cut Spark History Storage by 90% with a Cloud‑Native UIService

ByteDance rebuilt Spark's History Server into a cloud‑native UIService that stores only essential UI metadata, reducing storage usage by over 90%, cutting UI latency by up to 94%, and enabling seamless horizontal scaling for large‑scale analytics workloads.

History Server · Performance Optimization · Spark

0 likes · 12 min read

DataFunTalk

Apr 10, 2022 · Big Data

Angel Graph: A Large-Scale Graph Computing Platform by Tencent

This article introduces Tencent's Angel Graph platform, detailing its evolution from early versions to a mature large‑scale graph computing system, its architecture combining Angel PS with Spark and PyTorch, data and model partitioning strategies, communication and computation optimizations, stability features, usability, and real‑world applications.

Angel Graph · Graph Neural Network · Spark

0 likes · 15 min read

Big Data Technology & Architecture

Mar 31, 2022 · Big Data

Bilibili’s Lakehouse Architecture: Integrating Data Lake and Warehouse with Apache Iceberg

To address the high cost and low efficiency of traditional Hadoop‑based data pipelines, Bilibili designed a lakehouse solution using Apache Iceberg, integrating Spark, Flink, Trino, and Alluxio to unify flexible data lake storage with warehouse‑level query performance, reducing data duplication and improving interactive analytics.

Big Data · Data Warehouse · Iceberg

0 likes · 17 min read

NetEase Smart Enterprise Tech+

Mar 29, 2022 · Big Data

Automating Consumer Insight Testing with Spark, Hive, and ClickHouse

This article explains how to build a big‑data consumer insight platform using Spark applications, Hive, MySQL and ClickHouse, and how to automate data validation and algorithm testing to improve coverage, efficiency, and reliability of insight services.

Big Data · ClickHouse · Spark

0 likes · 8 min read

GuanYuan Data Tech Team

Mar 24, 2022 · Big Data

Why Do Spark Card Queries Take 10 Seconds? Uncovering a NAS Mount Issue

A customer’s Spark card queries were consistently taking around 10 seconds, prompting a step‑by‑step investigation that revealed a misconfigured NAS mount option (lookupcache=none) as the root cause of the severe slowdown.

Big Data · NAS · Spark

0 likes · 7 min read

Big Data Technology & Architecture

Mar 22, 2022 · Big Data

Integrating Hive Data Warehouse with ClickHouse Using Seatunnel: A Step‑by‑Step Guide

This article provides a comprehensive, hands‑on tutorial for connecting a Hive data warehouse to ClickHouse via Seatunnel, covering environment setup, Hive and ClickHouse table creation, full and incremental data import scripts, execution examples, and practical troubleshooting tips.

Big Data · ClickHouse · Data Integration

0 likes · 10 min read

IT Services Circle

Mar 21, 2022 · Big Data

Understanding Spark Shuffle: Hash, Sort, and Tungsten Sort Mechanisms

This article explains the evolution and inner workings of Spark's shuffle phase, comparing the original Hash‑based shuffle, the default Sort‑based shuffle, the optimized Tungsten‑Sort shuffle, and related configuration options that affect performance and file handling in large‑scale data processing.

Hash Shuffle · Shuffle · Sort-Shuffle

0 likes · 17 min read

ByteDance Data Platform

Mar 14, 2022 · Big Data

How ByteDance Cut Spark History Server Storage by 90% and Boost UI Speed

ByteDance’s Spark History Server was re‑engineered into a cloud‑native UIService that reduces storage usage by over 90%, cuts UI latency by up to 94%, and enables horizontal scaling, dramatically improving the user experience for large‑scale Spark jobs.

Cloud Native · History Server · Performance Optimization

0 likes · 12 min read

Big Data Technology & Architecture

Mar 7, 2022 · Big Data

Apache Griffin: An Overview of the Big Data Data‑Quality Monitoring Tool

This article introduces Apache Griffin, a model‑driven big‑data data‑quality monitoring platform, explains its key features, architecture, installation requirements, and provides step‑by‑step usage examples with Hive, Kafka and Spark integration.

Apache Griffin · Big Data · Data Quality

0 likes · 9 min read

Big Data Technology & Architecture

Mar 4, 2022 · Big Data

Managing Small Files in Apache Hudi and Spark Optimization Guide

The article explains how Apache Hudi automatically manages file sizes to avoid small‑file issues, details key configuration parameters, provides a step‑by‑step example of file merging, and offers practical Spark tuning recommendations for optimal performance in data‑lake workloads.

Apache Hudi · Big Data · Data Lake

0 likes · 11 min read

vivo Internet Technology

Feb 23, 2022 · Big Data

Kafka-based Real-Time Data Warehouse: Architecture and Practice for Search

The article explains how Kafka serves as the core of a real‑time data warehouse for search, detailing its advantages over traditional databases, integration with Flink for low‑latency stream processing, architectural patterns such as Lambda/Kappa, scaling challenges, and comprehensive monitoring using Kafka Eagle.

Apache Kafka · Data Integration · Flink

0 likes · 15 min read

HomeTech

Feb 15, 2022 · Artificial Intelligence

Horovod Distributed Deep Learning Training: Architecture, Performance, and Kubernetes Deployment

This article provides a comprehensive overview of Horovod, Uber's open-source distributed deep learning framework, covering its architecture, communication mechanisms, performance benchmarks, and deployment on Kubernetes and Spark for accelerated multi-GPU training.

GPU Acceleration · Horovod · Ring AllReduce

0 likes · 17 min read

DataFunTalk

Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg · Arctic · Big Data

0 likes · 10 min read

JD Retail Technology

Feb 11, 2022 · Big Data

Runtime Filter Join Optimization in JD Spark Using Bloom Filters

This article details JD Spark's Runtime Filter Join optimization, which leverages Bloom filters to prune large‑table data before shuffle, reducing I/O and execution time across batch and real‑time workloads, and presents architecture, implementation challenges, code examples, and performance gains in both benchmark and production environments.

Runtime Filter Join · Shuffle Reduction · Spark

0 likes · 15 min read

IT Architects Alliance

Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Big Data · Hive · Spark

0 likes · 32 min read

DataFunTalk

Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data · Data Lake · Flink

0 likes · 13 min read

JD Retail Technology

Jan 27, 2022 · Big Data

How JD’s Custom Spark Engine Tackles Data Skew for Massive Offline Jobs

This article explains JD’s self‑developed data‑skew mitigation solution for Spark, detailing the problem of uneven key distribution, the limitations of the open‑source AQE implementation, and JD’s OptimizeSkewedJoinV2 algorithm that dramatically reduces stage latency in large‑scale join workloads.

Adaptive Query Execution · Big Data · Data Skew

0 likes · 13 min read

DataFunTalk

Jan 27, 2022 · Big Data

Kyuubi: NetEase’s Open‑Source Multi‑Tenant SQL Engine for Large‑Scale Data Processing

This article introduces Kyuubi, the first NetEase project contributed to the Apache Software Foundation, describing its core features, multi‑tenant architecture, Spark‑based execution engine, cloud‑native capabilities, and real‑world use cases within NetEase’s data‑warehouse, ad‑hoc, and internal systems, along with performance gains and community resources.

Big Data · Kyuubi · Multi‑Tenant

0 likes · 23 min read

Architect

Jan 7, 2022 · Big Data

Spark Performance Optimization: Principles, Memory Model, Resource Tuning, Data Skew and Shuffle Tuning

This article provides an in‑depth guide to Spark performance optimization, covering the ten development principles, static and unified memory models, resource parameter tuning, data skew detection and mitigation techniques, as well as shuffle‑related configuration adjustments, supplemented with practical code examples and diagrams.

Data Skew · Memory Model · Performance Tuning

0 likes · 40 min read

Big Data Technology & Architecture

Dec 31, 2021 · Big Data

Apache SeaTunnel Joins the Apache Incubator: Overview, Features, and Real‑World Use Cases

SeaTunnel, a data‑integration platform built on Spark and Flink that originated in China, has been accepted into the Apache Incubator. This article introduces its history, architecture, plugin ecosystem, deployment requirements, and numerous enterprise deployments across batch and streaming big‑data scenarios.

Big Data · Data Integration · ETL

0 likes · 7 min read

DataFunTalk

Dec 27, 2021 · Big Data

Comprehensive Big Data Interview Q&A: Hadoop, Spark, Kafka, Hive, and Related Technologies

This article presents a detailed interview-style walkthrough covering Hadoop cluster setup, HDFS components, MapReduce workflow, YARN advantages, Spark fundamentals, Kafka replication, Hive table types, and related big‑data concepts, providing concise explanations and practical insights for data engineers.

Big Data · Hadoop · Hive

0 likes · 20 min read

DataFunTalk

Dec 25, 2021 · Artificial Intelligence

Optimizing Spark‑ML Linear Models with Project Matrix: Background, Progress, and Future Plans

This article introduces the Project Matrix initiative that re‑examines and restructures Spark‑ML linear models, detailing the background of Spark‑ML usage at JD, the performance‑focused optimizations such as blockification and virtual centering, and outlines upcoming work to further improve scalability and accuracy.

Big Data · Performance Optimization · Spark

0 likes · 9 min read

Big Data Technology & Architecture

Dec 23, 2021 · Big Data

Key Spark Configuration Parameters and Their Explanations

This article presents a comprehensive list of essential Spark configuration settings—including executor memory, off‑heap memory, memory fractions, shuffle options, and adaptive query execution parameters—each accompanied by a concise description to help users fine‑tune Spark performance.

Adaptive Query Execution · Big Data · Memory Management

0 likes · 6 min read

Big Data Technology & Architecture

Dec 21, 2021 · Big Data

Understanding Spark 3.0 Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP)

This article explains the two most important Spark 3.0 features—Adaptive Query Execution and Dynamic Partition Pruning—detailing how AQE dynamically optimizes join strategies, partition coalescing, and skew handling, while DPP reduces I/O by pruning irrelevant fact‑table partitions at runtime.

Adaptive Query Execution · Big Data · Dynamic Partition Pruning

0 likes · 10 min read

Big Data Technology & Architecture

Dec 19, 2021 · Big Data

Understanding Spark Catalyst and Tungsten Optimizations in Spark SQL

This article explains how Spark SQL's Catalyst optimizer performs logical and physical planning, details the Tungsten engine's data‑structure and whole‑stage code generation improvements, compares them with the Volcano iterator model, and provides code examples and PDF resources for deeper study.

Big Data · Catalyst · Spark

0 likes · 12 min read

Big Data Technology & Architecture

Dec 16, 2021 · Big Data

Understanding Spark SQL Join Strategies, Catalyst Optimizer, and Tungsten for Big Data Processing

This article explains Spark SQL join classifications, the mechanics of Nested Loop Join, Sort‑Merge Join, and Hash Join, and describes how the Catalyst optimizer and Tungsten project improve query execution and memory efficiency in large‑scale data environments.

Big Data · Catalyst · JOIN

0 likes · 9 min read

JD Cloud Developers

Dec 15, 2021 · Big Data

How JD Retail Scales Billion‑Item Selection with ClickHouse & Elasticsearch

This article details JD Retail's strategic "Nirvana" product‑selection platform, describing the technical challenges of handling billions of items and hundreds of tags, and presenting a dual‑engine solution using ClickHouse and Elasticsearch with Spark‑driven data pipelines to achieve fast filtering, multidimensional analytics, and efficient storage.

Big Data · ClickHouse · Data Engineering

0 likes · 15 min read

Big Data Technology & Architecture

Dec 15, 2021 · Big Data

Understanding Spark DataFrames: Creation Methods, Optimizations, and Common Operations

This article explains the origins of Spark DataFrames, compares them with RDDs, describes how Spark SQL optimizes DataFrame execution, and provides detailed examples of creating DataFrames from RDDs, files, and JDBC sources along with common DataFrame operations and code snippets.

Big Data · SQL · Scala

0 likes · 10 min read

Big Data Technology & Architecture

Dec 6, 2021 · Big Data

Understanding Spark’s Memory Model: Unified Memory Management, On‑Heap and Off‑Heap Memory, and Configuration

This article explains Spark’s unified memory management model, detailing the division between on‑heap and off‑heap memory, the roles of execution, storage, user, and reserved memory, configuration parameters, dynamic allocation, and how these concepts affect performance and resource utilization.

Execution Memory · Memory Management · Off-Heap

0 likes · 17 min read

Big Data Technology & Architecture

Dec 4, 2021 · Big Data

Understanding Spark's BlockManager, MemoryStore, and DiskStore

This article explains Spark's storage architecture, detailing the roles and interactions of BlockManager, MemoryStore, and DiskStore, including their initialization, data management mechanisms, code implementations, and eviction strategies, to help readers grasp how Spark efficiently handles in‑memory and on‑disk data.

Big Data · BlockManager · DiskStore

0 likes · 12 min read

Big Data Technology & Architecture

Dec 1, 2021 · Big Data

Understanding Spark Shuffle: Mechanisms, Evolution, and Optimization

This article provides a comprehensive overview of Spark's shuffle process, explaining its definition, internal mechanisms such as shuffle write and read, the evolution of shuffle managers, and practical optimization techniques including parameter tuning and broadcast variables, all aimed at improving performance in large‑scale data processing.

Big Data · Shuffle · Shuffle Reader

0 likes · 18 min read

Big Data Technology & Architecture

Dec 1, 2021 · Big Data

Understanding Spark Core, RDD, and Scheduler Components: A Practical Guide

This article introduces Spark's core concepts, explains the RDD abstraction and its four main properties, and details the roles of DAGScheduler, SchedulerBackend, TaskScheduler, and ExecutorBackend, providing practical insights for beginners and intermediate users in big‑data processing.

Big Data · DAGScheduler · RDD

0 likes · 9 min read

Big Data Technology & Architecture

Nov 30, 2021 · Big Data

Curated Learning Resources for Big Data and Data Engineering

This article compiles a comprehensive list of Chinese-language articles and tutorials covering big‑data technologies such as Flink, Spark, Hive, ClickHouse, data governance, and related interview preparation resources, providing a structured learning path for aspiring data engineers.

Big Data · ClickHouse · Data Governance

0 likes · 4 min read

Big Data Technology Architecture

Nov 24, 2021 · Big Data

Using Iceberg Catalogs with HiveCatalog and HadoopCatalog: Table Creation, Data Ingestion, and Querying

This article explains the concept of Iceberg catalogs, compares HiveCatalog and HadoopCatalog, and provides step‑by‑step Spark examples for downloading the Iceberg jar, creating tables, loading data, querying, and examining the underlying metadata and directory structures.

HadoopCatalog · HiveCatalog · Iceberg

0 likes · 15 min read

Big Data Technology & Architecture

Nov 22, 2021 · Big Data

Comprehensive Big Data Learning Path and Resource Guide

This article presents a detailed learning roadmap for aspiring big‑data experts, covering foundational programming languages, data structures, Linux basics, databases, distributed system theory, and essential frameworks such as Hadoop, Spark, Flink, and Kafka, and provides curated Bilibili video links and reference materials.

Big Data · Flink · Hadoop

0 likes · 9 min read

DataFunTalk

Nov 20, 2021 · Big Data

How to Build a Big Data Platform from Zero to One: Architecture, Components, and Best Practices

This article provides a comprehensive guide to designing and implementing a big‑data platform, covering architecture overview, data ingestion with Flume, storage on HDFS/Hive/HBase, processing engines such as Hive, Spark and Flink, scheduling solutions like Azkaban and Airflow, and the construction of self‑service analytics systems.

Big Data · Data Engineering · ETL

0 likes · 29 min read

Big Data Technology Architecture

Nov 16, 2021 · Big Data

Understanding Adaptive Query Execution and Dynamic Partition Pruning in Apache Spark 3.0

This article explains how Apache Spark 3.0 improves SQL workload performance through Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP), detailing their design principles, runtime optimizations, configuration parameters, and practical examples that demonstrate reduced shuffle partitions, smarter join strategies, and handling of data skew.

Adaptive Query Execution · Dynamic Partition Pruning · Spark

0 likes · 9 min read

Big Data Technology Architecture

Nov 13, 2021 · Big Data

Analysis of Spark Out‑Of‑Memory (OOM) Issues and Tuning Strategies

This article explains Spark's memory model, identifies common driver and executor OOM scenarios, and provides detailed configuration and code‑level recommendations—including memory‑related parameters and shuffle‑tuning options—to prevent and resolve out‑of‑memory failures in Spark applications.

Memory Management · OOM · Performance Tuning

0 likes · 11 min read

Big Data Technology & Architecture

Nov 8, 2021 · Big Data

Why Choose Apache Iceberg? Tencent’s Optimizations and Real‑World Practices

This article examines the strengths and weaknesses of Apache Iceberg, explains why Tencent selected it over alternatives, details Tencent’s own enhancements and integration with Flink, Spark, and other engines, and shares multiple real‑world implementations for building enterprise‑grade real‑time data lakes.

Apache Iceberg · Data Lake · Flink

0 likes · 17 min read

Big Data Technology & Architecture

Oct 23, 2021 · Big Data

Understanding Hive Execution Engines: MapReduce, Tez, and Spark – Principles, Optimization, and Explain Usage

This article provides a comprehensive overview of Hive's execution engines—including MapReduce, Tez, and Spark—detailing their architectures, the six-stage Hive SQL compilation process, practical Explain syntax examples, and extensive tuning parameters for each engine to improve performance in big‑data environments.

EXPLAIN · Hive · MapReduce

0 likes · 48 min read

DataFunTalk

Oct 18, 2021 · Big Data

Building an Intelligent Data Warehouse at Yixin Group: A Big Data Platform Case Study

The article describes how Yixin Group’s product team created an in‑house intelligent data warehouse using Hadoop, Flink/Spark, and standardized data services to transform scattered automotive‑finance data into a secure, scalable platform that supports real‑time analytics and drives business growth.

Big Data · Data Engineering · Flink

0 likes · 10 min read
