Tagged articles

Shuffle

63 articles · Page 1 of 1

Nov 24, 2025 · Fundamentals

How to Randomly Shuffle Array Elements in PHP with shuffle()

Learn how to use PHP's built-in shuffle() function to randomly reorder array elements, with a clear explanation of the syntax, step-by-step code examples, sample output, and important considerations such as in‑place modification, handling associative or multidimensional arrays, and preserving the original data when needed.

Array · Shuffle · backend

0 likes · 3 min read

php Courses

Nov 17, 2025 · Backend Development

How to Randomly Shuffle Arrays in PHP: shuffle() and array_rand() Explained

Learn how to randomize array order in PHP using built-in functions like shuffle() and array_rand(), with clear code examples and a practical scenario of assigning students to classes, plus step-by-step explanations of each method’s behavior and usage.

Array · PHP · Shuffle

0 likes · 2 min read

php Courses

May 19, 2025 · Backend Development

Using PHP shuffle() to Randomly Rearrange Array Elements

This article explains the PHP shuffle() function, detailing its syntax, its in‑place modification of the original indexed array, its return value, and its use with both indexed and associative arrays, and provides multiple code examples demonstrating random reordering and the effect on array keys.

Array · PHP · Shuffle

0 likes · 5 min read

php Courses

Mar 24, 2025 · Backend Development

Using PHP shuffle() to Randomly Rearrange Array Elements

This article explains PHP's shuffle() function, detailing its syntax, its in‑place modification of the original array, its return value, and its use with indexed and associative arrays, and provides multiple code examples illustrating how to randomize array elements.

Array · PHP · Shuffle

0 likes · 4 min read

Past Memory Big Data

Dec 24, 2024 · Big Data

Magnet: A Push‑Based Shuffle Service that Scales to Petabyte‑Level Data Processing

LinkedIn’s massive Spark workloads suffer from shuffle bottlenecks caused by tiny shuffle blocks, unreliable RPC connections, and data skew, so the authors design Magnet—a push‑merge shuffle service that merges blocks into large chunks, improves disk I/O, tolerates failures, and cuts end‑to‑end job time by nearly 30% regardless of hardware.

Disk I/O optimization · Large‑scale data processing · Push‑based service

0 likes · 56 min read

php Courses

Oct 28, 2024 · Backend Development

How to Use PHP shuffle() to Randomly Rearrange Array Elements

This article explains the PHP shuffle() function, detailing its syntax, its in‑place modification of the original indexed array, and its return value, and provides multiple code examples (including handling of non-indexed arrays) that demonstrate how to randomly reorder array elements in PHP.

Array · Function · PHP

0 likes · 4 min read

php Courses

Aug 28, 2024 · Backend Development

How to Use PHP shuffle() to Randomly Rearrange Array Elements

This article explains PHP's shuffle() function, its syntax, behavior on indexed and associative arrays, and provides code examples demonstrating how to randomize array elements and handle the function's boolean return value in practice.

Array · Shuffle · backend

0 likes · 5 min read

360 Smart Cloud

Jul 9, 2024 · Big Data

Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)

This article examines the critical role of shuffle in big‑data processing, compares Spark's native shuffle with the External Shuffle Service (ESS) and Remote Shuffle Service (RSS) solutions, introduces Uniffle's architecture and configuration, and shares practical deployment experiences and performance results within a 360 internal environment.

Big Data · External Shuffle Service · Remote Shuffle Service

0 likes · 15 min read

DataFunTalk

Jun 22, 2024 · Big Data

Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits

This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Celeborn, covering background problems, evaluation of open‑source implementations, deployment architecture, encountered issues, solutions, performance gains, and future plans.

Big Data · Performance · RSS

0 likes · 19 min read

DataFunSummit

Mar 20, 2024 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's large‑scale evolution of Spark Shuffle to a cloud‑native architecture, describing background, stability and mixed‑resource scenarios, challenges such as CPU and I/O limits, custom ESS enhancements, shuffle throttling, spill‑split mechanisms, and the Cloud Shuffle Service with its push‑based design and performance gains.

Big Data · Kubernetes · Performance Optimization

0 likes · 21 min read

Huolala Tech

Mar 7, 2024 · Big Data

Integrating Apache Tez with Remote Shuffle Service via Uniffle: HuoLala’s Experience

Facing exploding data volumes and rising cluster costs, HuoLala integrated Apache Tez with a Remote Shuffle Service built on Apache Uniffle, redesigning the Tez client to operate without source modifications. The article details the architecture, implementation challenges, testing, stability measures, and future plans to enhance big‑data shuffle performance and cost efficiency.

Apache Tez · Big Data · Data Engineering

0 likes · 14 min read

php Courses

Mar 7, 2024 · Backend Development

How to Randomly Shuffle an Array in PHP Using the shuffle Function

This article explains the PHP shuffle function, its syntax, how it directly modifies an array to randomize element order, provides example code with output, and discusses important considerations such as preserving the original array and handling associative or multidimensional arrays.

Array · PHP · Shuffle

0 likes · 3 min read

php Courses

Jan 29, 2024 · Backend Development

How to Use PHP shuffle() to Randomly Sort Arrays and Generate Random Numbers

This article explains the PHP shuffle() function, demonstrates how to create arrays, use shuffle() to randomize their elements, display the results, and shows additional uses such as generating random numbers with range() and shuffle(), providing clear code examples throughout.

Arrays · PHP · Shuffle

0 likes · 4 min read

Alibaba Cloud Developer

Jan 11, 2024 · Big Data

Unlock ODPS SQL Performance: Deep Dive into Execution Plans & Optimizations

This article examines ODPS SQL performance by dissecting logical execution plans and Logview visualizations, explaining the underlying principles of various optimization techniques such as multi‑distinct handling, shuffle reduction, system parameters, and different join strategies, and demonstrates how to apply these methods to improve query efficiency in real‑world data engineering tasks.

Execution Plan · ODPS · Shuffle

0 likes · 17 min read

php Courses

Dec 25, 2023 · Backend Development

How to Randomly Shuffle Array Elements Using PHP's shuffle Function

This article explains how to use PHP's built-in shuffle() function to randomly reorder array elements, covering its syntax, return value, example code for indexed and associative arrays, handling of multidimensional arrays, and important considerations such as in‑place modification and preserving original data.

Array · PHP · Shuffle

0 likes · 3 min read

Zhongtong Tech

Dec 14, 2023 · Big Data

How Celeborn Transformed Spark Shuffle Performance at ZTO Express

Facing massive daily Spark shuffle volumes and unstable ETL performance, ZTO Express migrated from the community External Shuffle Service to Celeborn's Remote Shuffle Service, achieving higher disk I/O efficiency, better reliability, reduced network connections, and significant reductions in task failures and job latency.

Big Data · Remote Shuffle Service · Shuffle

0 likes · 15 min read

php Courses

Dec 8, 2023 · Backend Development

Using PHP shuffle() to Randomly Rearrange Array Elements

This article explains PHP's shuffle() function, its syntax, behavior, return value, and demonstrates how it randomizes both indexed and associative arrays with code examples, highlighting that it modifies the original array and reindexes non‑sequential keys.

Array · Shuffle · backend

0 likes · 5 min read

DataFunTalk

Nov 18, 2023 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's extensive migration of Spark Shuffle to a cloud‑native architecture, describing the massive data volumes, the underlying ESS and CSS services, the challenges of resource isolation, monitoring, throttling, spill‑splitting, and the performance gains achieved across stable and mixed‑resource clusters.

Big Data · ByteDance · Cloud Native

0 likes · 20 min read

php Courses

Aug 1, 2023 · Backend Development

Using PHP shuffle() Function to Randomly Reorder Array Elements

This article explains the PHP shuffle() function, detailing its syntax, return behavior, usage examples, and important considerations such as its effect on the original array, limitations with associative arrays, and handling of duplicate elements, providing a practical code demonstration.

Array · PHP · Shuffle

0 likes · 3 min read

JD Tech

Jun 14, 2023 · Big Data

Understanding and Solving Data Skew in Offline Big Data Development (Hive & Spark)

This article explains the concept of data skew in offline big‑data jobs, describes its symptoms and root causes, and provides practical optimization techniques for Hive and Spark—including partitioning strategies, map‑join usage, adaptive query settings, and monitoring approaches—to prevent performance degradation and runtime failures.

Data Skew · Hive · Optimization

0 likes · 17 min read

Data Thinking Notes

Oct 24, 2022 · Big Data

How to Diagnose and Fix Spark Data Skew: Practical Optimization Techniques

This article explains the causes of Spark data skew, how to locate skewed tasks using the Web UI, and presents six optimization methods—including increasing shuffle parallelism, filtering abnormal keys, two‑stage aggregation, map‑join, key sampling, and random‑prefix joins—plus a real‑world case study.

Big Data · Data Skew · JOIN

0 likes · 21 min read

DataFunTalk

Sep 15, 2022 · Big Data

Bilibili Offline Platform: Migration from Hive to Spark and Large‑Scale Optimizations

This article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing the migration process, automated SQL conversion, result verification, stability and performance enhancements, meta‑store optimizations, and future work on remote shuffle and vectorized execution.

Data Skipping · Hive · MetaStore

0 likes · 28 min read

IT Services Circle

Mar 21, 2022 · Big Data

Understanding Spark Shuffle: Hash, Sort, and Tungsten Sort Mechanisms

This article explains the evolution and inner workings of Spark's shuffle phase, comparing the original Hash‑based shuffle, the default Sort‑based shuffle, the optimized Tungsten‑Sort shuffle, and related configuration options that affect performance and file handling in large‑scale data processing.

Hash Shuffle · Shuffle · Sort-Shuffle

0 likes · 17 min read

Architect

Jan 7, 2022 · Big Data

Spark Performance Optimization: Principles, Memory Model, Resource Tuning, Data Skew and Shuffle Tuning

This article provides an in‑depth guide to Spark performance optimization, covering the ten development principles, static and unified memory models, resource parameter tuning, data skew detection and mitigation techniques, as well as shuffle‑related configuration adjustments, supplemented with practical code examples and diagrams.

Data Skew · Memory Model · Performance Tuning

0 likes · 40 min read

Big Data Technology & Architecture

Dec 23, 2021 · Big Data

Key Spark Configuration Parameters and Their Explanations

This article presents a comprehensive list of essential Spark configuration settings—including executor memory, off‑heap memory, memory fractions, shuffle options, and adaptive query execution parameters—each accompanied by a concise description to help users fine‑tune Spark performance.

Adaptive Query Execution · Big Data · Memory Management

0 likes · 6 min read

Big Data Technology & Architecture

Dec 1, 2021 · Big Data

Understanding Spark Shuffle: Mechanisms, Evolution, and Optimization

This article provides a comprehensive overview of Spark's shuffle process, explaining its definition, internal mechanisms such as shuffle write and read, the evolution of shuffle managers, and practical optimization techniques including parameter tuning and broadcast variables, all aimed at improving performance in large‑scale data processing.

Big Data · Shuffle · Shuffle Reader

0 likes · 18 min read

Big Data Technology & Architecture

Sep 17, 2021 · Big Data

Key Reliability Mechanisms of HDFS, YARN Failover Strategies, and Hadoop Shuffle Process

This article explains HDFS reliability features such as replica policies, rack awareness, heartbeat, safe mode, checksums, trash, metadata protection and snapshots, then details YARN failover handling for ApplicationMaster, NodeManager and ResourceManager, and finally describes the Hadoop MapReduce shuffle workflow and tuning tips.

HDFS · MapReduce · Reliability

0 likes · 13 min read

Big Data Technology & Architecture

Sep 16, 2021 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer data structure to store serialized key/value pairs and their metadata in memory, describes its initialization, write path, spill handling, and the underlying algorithms that ensure efficient in‑memory sorting and disk spilling.

Hadoop · In-Memory Buffer · MapReduce

0 likes · 24 min read

Big Data Technology Architecture

Aug 24, 2021 · Big Data

Comprehensive Guide to Spark Performance Optimization, Data Skew Mitigation, and Troubleshooting

This article presents a detailed collection of Spark performance‑tuning techniques—including submit‑script parameters, RDD and operator optimizations, parallelism and memory settings, broadcast variables, Kryo serialization, locality wait adjustments—as well as systematic methods for detecting and resolving data skew and common runtime issues such as shuffle failures, serialization errors, and JVM memory problems.

Data Skew · JVM Tuning · Shuffle

0 likes · 21 min read

Big Data Technology & Architecture

Jun 4, 2021 · Big Data

Comprehensive Spark Interview Questions and Answers

This article provides a detailed collection of Spark interview questions covering deployment modes, performance advantages over MapReduce, shuffle mechanisms, RDD characteristics, optimization techniques, resource management, and various practical aspects of Spark on YARN, Mesos, and Kubernetes.

Optimization · RDD · Shuffle

0 likes · 21 min read

Big Data Technology & Architecture

Apr 16, 2021 · Big Data

Spark Job Execution Architecture: From Submission to Shuffle and Task Processing

This article explains how Spark coordinates master, worker, driver, and executor components to generate, submit, and run jobs, detailing the creation of logical and physical execution graphs, task allocation, result handling, and the shuffle read process with code examples and diagrams.

Job Execution · Shuffle · Spark

0 likes · 14 min read

dbaplus Community

Apr 14, 2021 · Big Data

Master Spark Performance: Key Tuning, Shuffle & Join Optimization

This guide compiles practical Spark tuning techniques, covering essential configuration parameters, programming best‑practices, detailed shuffle mechanics, and join optimization strategies, while also addressing common errors and mitigation steps, enabling developers to improve performance and resource utilization in large‑scale data processing jobs.

Big Data · Error handling · Shuffle

0 likes · 25 min read

Big Data Technology & Architecture

Apr 14, 2021 · Big Data

Understanding Spark Shuffle: Write and Read Mechanisms Compared to Hadoop MapReduce

This article explains how Spark implements shuffle write and shuffle read, compares its high‑level and low‑level processes with Hadoop MapReduce, and details the internal data structures, memory‑disk trade‑offs, and configuration options that affect performance.

MapReduce · MemoryManagement · RDD

0 likes · 21 min read

Big Data Technology & Architecture

Apr 11, 2021 · Big Data

Understanding Spark RDD Logical Execution Graph and Dependency Types

This article explains how Spark builds the logical execution graph for RDDs, describes the four-step job processing pipeline, details the various dependency types such as NarrowDependency and ShuffleDependency, and reviews common transformations and their data‑flow characteristics.

RDD · Shuffle · Spark

0 likes · 19 min read

Architect

Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data · Data Skew · Performance Tuning

0 likes · 47 min read

Architect

Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · Optimization · Performance

0 likes · 33 min read

Big Data Technology Architecture

Mar 10, 2021 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Solutions, and Shuffle Tuning

This guide presents a complete Spark performance optimization handbook covering development‑time best practices, resource‑parameter tuning, detailed data‑skew detection and mitigation techniques, advanced shuffle‑engine configurations, and practical code examples to help engineers build faster, more reliable Spark jobs.

Data Skew · Resource Tuning · Shuffle

0 likes · 69 min read

Laravel Tech Community

Dec 29, 2020 · Backend Development

PHP shuffle() Function – Randomly Shuffle an Array

This article explains the PHP shuffle() function, describing its purpose of randomly reordering array elements, the required array parameter, the boolean return value, and provides a complete example with sample output to illustrate its usage.

Array · PHP · Shuffle

0 likes · 2 min read

Big Data Technology & Architecture

Dec 29, 2020 · Big Data

Spark Performance Tuning: Common Parameters, Programming Tips, Shuffle and Join Optimization

This article provides a comprehensive guide to Spark performance tuning, covering essential configuration parameters, best‑practice programming recommendations, detailed shuffle mechanics, join optimization strategies, and common error troubleshooting for big‑data workloads.

JOIN · Optimization · Performance Tuning

0 likes · 20 min read

Programmer DD

Nov 27, 2020 · Fundamentals

7 Tiny Code Gems That Pack Massive Power: From Shuffle to Fast Inverse Square Root

This article showcases seven ultra‑compact yet powerful code examples: a zero‑code deployment tool, a two‑line shuffle algorithm, sleep sort, a one‑line Python AI snippet, a simple sleep‑until‑tomorrow call, the legendary fast inverse square‑root constant, and the classic hello‑world program.

Fast Inverse Square Root · Shuffle · algorithms

0 likes · 6 min read

ITPUB

Nov 16, 2020 · Fundamentals

7 Unexpected Code Hacks: No‑Code Deployment, Shuffle, Sleep Sort, AI One‑Liner & More

This article showcases seven intriguing code tricks—from a zero‑code deployment project and a concise shuffle algorithm to a sleep‑sort implementation, a one‑line AI chatbot, a simple next‑day timer, the legendary fast inverse square‑root constant, and the classic hello‑world example—each illustrated with brief explanations and runnable snippets.

Fast Inverse Square Root · Java · Python

0 likes · 6 min read

Big Data Technology & Architecture

Jul 16, 2020 · Big Data

Spark Configuration Parameters and Performance Tuning Guidelines

This article explains the purpose, default values, and practical tuning recommendations for common Spark submit options such as executor counts, memory settings, shuffle behavior, speculation, and various Spark SQL configurations to help users optimize big‑data workloads.

Big Data · Configuration · Executor

0 likes · 14 min read

Laravel Tech Community

Jun 12, 2020 · Backend Development

PHP shuffle() Function: Randomly Reordering Arrays with Detailed Examples

This article explains the PHP shuffle() function, its syntax, parameters, return values, and demonstrates three practical examples showing how to randomize indexed and associative arrays, highlighting the effect on array keys and values after shuffling.

Array · Shuffle · random

0 likes · 3 min read

Big Data Technology Architecture

Apr 28, 2020 · Big Data

Understanding Shuffle in Hadoop MapReduce and Spark

This article explains the concept and workflow of shuffle in Hadoop MapReduce and Spark, covering map‑side buffering, spill and merge, reduce‑side copy‑merge‑reduce, the reasons for sorting and file merging, and compares Hash‑Shuffle and Sort‑Shuffle implementations with performance considerations.

Hash Shuffle · Shuffle · Sort-Shuffle

0 likes · 16 min read

Big Data Technology & Architecture

Mar 31, 2020 · Big Data

Comprehensive Spark Optimization Guide: Development, Resource, Skew, Shuffle, and Additional Tips

This article presents a detailed summary of Meituan's Spark optimization techniques, covering development‑level RDD tuning, resource parameter configuration, data‑skew mitigation, shuffle improvements, and the advantages of using DataFrame/Dataset APIs for better performance.

Big Data · Optimization · Performance Tuning

0 likes · 12 min read

dbaplus Community

Mar 23, 2020 · Big Data

How to Detect and Resolve Data Skew in Spark and Hadoop

This article explains what data skew is in distributed big‑data systems like Spark and Hadoop, why it hurts performance, how to spot it using the Web UI or key statistics, and presents eight practical mitigation techniques ranging from filtering and shuffle parallelism to custom partitioners and broadcast joins.

Broadcast Join · Data Skew · Hadoop

0 likes · 19 min read

Big Data Technology & Architecture

Feb 9, 2020 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer to store serialized key/value pairs and their metadata, detailing its structure, initialization, write path, spill logic, and the background thread that sorts and writes data to disk.

Big Data · Hadoop · Java

0 likes · 24 min read

Big Data Technology & Architecture

Jan 30, 2020 · Big Data

Comprehensive Guide to Spark Performance Optimization (Development, Resource, Data Skew, and Shuffle Tuning)

This article provides an in‑depth, step‑by‑step guide to optimizing Spark jobs, covering development‑time best practices, resource‑parameter tuning, data‑skew detection and mitigation techniques, and shuffle‑stage performance tweaks, complete with Scala code examples and practical recommendations.

Big Data · Data Skew · Performance Optimization

0 likes · 67 min read

vivo Internet Technology

Dec 25, 2019 · Big Data

Understanding and Mitigating Data Skew in Spark and Hadoop

Data skew in Spark and Hadoop occurs when a few keys dominate shuffle traffic, causing slow tasks, OOM errors, and job failures; the article describes how to detect skew via UI metrics or sampling and offers mitigation tactics such as filtering keys, increasing shuffle partitions, custom partitioners, broadcast joins, salted keys, and Hadoop‑specific settings.

Data Skew · Performance Optimization · Shuffle

0 likes · 18 min read

Big Data Technology & Architecture

Nov 3, 2019 · Big Data

Understanding Spark Shuffle and Smart Shuffle: Design, Implementation, and Performance Analysis

This article explains the evolution of Spark Shuffle from hash‑based to sort‑based, introduces the Smart Shuffle optimization, details their implementations and configurations, and presents performance comparisons using TPC‑DS benchmarks, highlighting significant speedups and reduced I/O overhead.

Big Data · Shuffle · Smart Shuffle

0 likes · 7 min read

Big Data Technology & Architecture

Jun 9, 2019 · Big Data

Optimizing Spark Shuffle: Can Fetch, Efficient Fetch, and Reliable Fetch

This article analyzes three Spark shuffle bottlenecks—oversized partitions that exceed Netty's 2 GB limit, excessive retry latency caused by dead executors, and insufficient data‑corruption checks—and presents concrete configuration changes, new block identifiers, executor‑liveness checks, and CRC‑32 verification to improve fetchability, efficiency, and reliability at scale.

Big Data · Shuffle · Spark

0 likes · 18 min read

Big Data Technology & Architecture

May 30, 2019 · Big Data

Data Skew Optimization Techniques in Spark

This article explains the phenomenon, causes, detection methods, and a comprehensive set of solutions—including Hive preprocessing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, sampling, random prefixing, and combined strategies—to mitigate data skew in Spark jobs and improve performance.

Big Data · Data Skew · Shuffle

0 likes · 31 min read

Big Data Technology & Architecture

May 28, 2019 · Big Data

Optimizing Flink Shuffle: New Flow‑Control Mechanism, Serialization Improvements, and Architecture Refactoring

The article explains how Flink's shuffle pipeline—from upstream data serialization to downstream consumption—is optimized through a credit‑based flow‑control mechanism, zero‑copy network buffers, broadcast serialization reduction, external shuffle service, and a plugin‑based shuffle manager, resulting in significant performance gains for both streaming and batch jobs.

Big Data · Flink · Flow Control

0 likes · 15 min read

Big Data Technology & Architecture

Apr 2, 2019 · Big Data

Understanding Hadoop MapReduce: Programming Model, WordCount Example, and Job Execution Mechanism

The article explains Hadoop's MapReduce framework as both a programming model and execution engine, detailing its map and reduce phases, the WordCount example code, job startup components, data shuffling, partitioning, and how large‑scale distributed computations are orchestrated across a cluster.

Big Data · Distributed Computing · Hadoop

0 likes · 10 min read

58 Tech

Mar 15, 2019 · Big Data

Optimizing Spark Join Operations in Spark Core and Spark SQL

This article explains how to improve Spark join performance by reducing shuffle, using appropriate partitioners, applying broadcast hash joins for small tables, and selecting the optimal join strategy (broadcast, shuffle hash, or sort‑merge) in both Spark Core and Spark SQL.

JOIN · Optimization · Shuffle

0 likes · 6 min read

Youzan Coder

Mar 8, 2019 · Big Data

Why Spark Shuffle Often Runs Out of Memory and How to Fix It

This article examines Spark's memory management and the shuffle process, identifies the components that consume the most memory during shuffle write and read, analyzes common OOM scenarios such as task concurrency and data skew, and offers configuration tips to prevent out‑of‑memory failures.

MemoryManagement · OutOfMemory · Performance

0 likes · 14 min read

Sohu Tech Products

Feb 13, 2019 · Big Data

Evolution and Implementation Details of Spark Shuffle Mechanisms

This article examines the historical evolution of Spark's shuffle implementations—from early Hash‑Based Shuffle to modern SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter—explaining their design choices, selection criteria, and the corresponding shuffle reader architecture in a production‑grade Spark 2.1.1 environment.

Big Data · Distributed Computing · Shuffle

0 likes · 13 min read

21CTO

May 17, 2018 · Big Data

Understanding Hadoop MapReduce and YARN: Architecture, Shuffle, and Scaling

This article explains Hadoop's core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker architecture to improve scalability and resource utilization in large‑scale data processing clusters.

Distributed Computing · Hadoop · Shuffle

0 likes · 12 min read

ITPUB

Mar 29, 2018 · Big Data

Demystifying Hadoop: MapReduce, Shuffle, and YARN Architecture

This article explains Hadoop’s core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker design to improve scalability and resource utilization in large‑scale data processing clusters.

Big Data · Hadoop · MapReduce

0 likes · 15 min read

dbaplus Community

Aug 21, 2017 · Big Data

How to Tackle Spark Data Skew: Practical Solutions and Real‑World Examples

This article explains what Spark data skew is, why it hurts performance, and presents six practical mitigation techniques—including adjusting parallelism, custom partitioners, map‑side joins, and adding random prefixes—backed by detailed experiments, code snippets, and performance comparisons.

Data Skew · Map-side Join · Partitioner

0 likes · 18 min read

37 Interactive Technology Team

Jun 13, 2017 · Big Data

MapReduce Principles and Hadoop Execution Process with WordCount Example

The article explains MapReduce’s divide‑and‑conquer model and Hadoop’s execution pipeline—including map, partition, spill, merge, shuffle, and reduce phases—illustrated with a WordCount example that shows how mappers emit word‑1 pairs and reducers aggregate counts to produce final frequencies on HDFS.

Distributed Computing · Hadoop · MapReduce

0 likes · 7 min read

Architecture Digest

May 4, 2016 · Big Data

Upgrading Spark from 1.4.1 to 1.6.1: Memory, Storage, and Operational Challenges

The article details the author’s experience upgrading a production Spark cluster from version 1.4.1 to 1.6.1, exposing memory‑spill, unified memory, BlockManager deadlock, Yarn‑kill, UI quirks, and Spark‑SQL compatibility issues, and proposes concrete code‑level fixes for each problem.

Big Data · Distributed Computing · Memory Management

0 likes · 14 min read

Baidu Tech Salon

Jan 13, 2015 · Big Data

Inside Spark 1.2: New APIs, In‑Memory Columnar Storage, and Baidu’s High‑Performance Shuffle

This article reviews Spark 1.2’s major enhancements—including the External Data Source API, column pruning, predicate pushdown, and in‑memory columnar storage—while also detailing Baidu’s large‑scale Spark deployments, its custom high‑performance Shuffle service, and the integration of Spark with the Tachyon memory file system.

Baidu · Big Data · External Data Source API

0 likes · 16 min read
