Tagged articles
49 articles
JD Tech
Apr 23, 2026 · Backend Development

How JD Upgraded Its B‑Side Order Storage Architecture to Tackle Elasticsearch High‑Concurrency Pressure

Facing explosive merchant growth and soaring order volumes, JD redesigned its B‑side POP order storage by isolating large tenants, applying double‑hash routing, expanding clusters, buffering updates, and automating data archiving, ultimately delivering a high‑performance, scalable Elasticsearch platform that sustains massive traffic spikes.

Backend Architecture · Data Skew · Elasticsearch
0 likes · 16 min read
Java Architect Handbook
Apr 6, 2026 · Databases

Why MySQL Indexes Still Slow Queries and How to Fix Them

This guide explains the six common reasons why MySQL indexes may fail to improve query speed, shows how interviewers evaluate index knowledge, and provides concrete SQL examples, EXPLAIN analysis, and practical optimization techniques such as redesigning indexes, using covering indexes, avoiding implicit type conversion, and tuning database configuration.

Data Skew · Database Interview · Index Optimization
0 likes · 15 min read
Big Data Technology & Architecture
Dec 31, 2024 · Big Data

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

This article explains how Spark 3.3 and later use Storage Partitioned Join (SPJ) to avoid costly shuffle operations when joining partitioned V2 source tables such as Apache Iceberg, detailing the required conditions, configuration settings, practical code examples, and various join scenarios including mismatched partitions and data skew.

Bucketing · Data Skew · Shuffle Optimization
0 likes · 15 min read
DaTaobao Tech
Jun 21, 2024 · Big Data

Flink Real-Time Data Development: Cases on Data Skew, Watermark Failure, and GroupBy Issues

The article walks through three Flink streaming pitfalls—data‑skew‑induced back‑pressure, lost watermarks after interval joins, and ineffective group‑by causing duplicate rows—and shows how to resolve them with two‑stage distinct aggregation, hash‑based key distribution, processing‑time windows or split jobs, and mini‑batch buffering.

Data Skew · Flink · Real-Time
0 likes · 14 min read
Alibaba Cloud Developer
Apr 30, 2024 · Big Data

Mastering ODPS SQL: Proven Tips to Slash Query Time and Tackle Data Skew

This article explores practical SQL optimization techniques for Alibaba's ODPS platform, covering fundamentals, common pitfalls like null handling and select *, advanced strategies such as multi‑insert, partition limiting, UDF placement, data‑skew mitigation, parameter tuning, and real‑world case studies that dramatically reduce query runtimes.

Big Data · Data Skew · MaxCompute
0 likes · 23 min read
Big Data Technology & Architecture
Jun 16, 2023 · Big Data

Optimizing Big Data SQL: Handling Data Skew and Data Explosion

This article examines common performance issues in big data SQL queries, such as data skew and data explosion, and provides systematic troubleshooting steps and practical optimization techniques across the Map, Reduce, and Join stages, including partition merging, column pruning, predicate pushdown, and join strategies.

Data Explosion · Data Skew · distributed computing
0 likes · 10 min read
JD Tech
Jun 14, 2023 · Big Data

Understanding and Solving Data Skew in Offline Big Data Development (Hive & Spark)

This article explains the concept of data skew in offline big‑data jobs, describes its symptoms and root causes, and provides practical optimization techniques for Hive and Spark—including partitioning strategies, map‑join usage, adaptive query settings, and monitoring approaches—to prevent performance degradation and runtime failures.

Data Skew · Shuffle · Spark
0 likes · 17 min read
JD Retail Technology
Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew
0 likes · 15 min read
JD Cloud Developers
Apr 4, 2023 · Databases

How to Scale B‑Token Systems with Horizontal Sharding and Consistent Hashing

This article examines the challenges of growing B‑token data volumes, including table size limits and data skew, and proposes a solution using horizontal sharding with a consistent‑hash ring, dynamic table allocation, water‑level thresholds, periodic archiving, and monitoring to support future growth without costly migrations.

Data Skew · consistent hashing · scalable architecture
0 likes · 13 min read
Data Thinking Notes
Dec 21, 2022 · Big Data

Why Your Spark Batch Job Fails: Memory Limits, Data Skew, and Practical Fixes

This article examines a recurring Spark batch task failure caused by OutOfMemory errors and data skew, details the investigation steps—including increasing executor memory, raising parallelism, and analyzing shuffle metrics—and proposes solutions such as data validation, filtering oversized keys, and memory adjustments.

Batch Processing · Data Skew · OutOfMemory
0 likes · 4 min read
Data Thinking Notes
Nov 22, 2022 · Big Data

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Big Data · Data Skew · RDS
0 likes · 5 min read
Data Thinking Notes
Oct 24, 2022 · Big Data

How to Diagnose and Fix Spark Data Skew: Practical Optimization Techniques

This article explains the causes of Spark data skew, how to locate skewed tasks using the Web UI, and presents six optimization methods—including increasing shuffle parallelism, filtering abnormal keys, two‑stage aggregation, map‑join, key sampling, and random‑prefix joins—plus a real‑world case study.

Big Data · Data Skew · JOIN
0 likes · 21 min read
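The two-stage aggregation mentioned in the summary above can be illustrated with a minimal plain-Python sketch. This is not code from the article and does not use the Spark API; the function name `two_stage_sum`, the `num_salts` parameter, and the sample records are assumptions made for illustration. The idea is to attach a random prefix (salt) to each key, aggregate per salted key, then strip the salt and aggregate again.

```python
import random
from collections import Counter


def two_stage_sum(records, num_salts=8, seed=0):
    """Sum values per key in two stages to spread a hot key across buckets."""
    rng = random.Random(seed)

    # Stage 1: attach a random salt to every key and aggregate per (salt, key).
    # In a distributed job, each salted key could be handled by a different task.
    partial = Counter()
    for key, value in records:
        partial[(rng.randrange(num_salts), key)] += value

    # Stage 2: drop the salt and combine the partial sums per original key.
    totals = Counter()
    for (_salt, key), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals), dict(partial)


# Hypothetical skewed input: one hot key and one rare key.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, partial = two_stage_sum(records)

# Size of the largest stage-1 bucket for the hot key.
largest_hot_bucket = max(v for (_s, k), v in partial.items() if k == "hot")
```

The final totals match a direct aggregation, while no single stage-1 bucket has to process every record of the hot key. This only applies to aggregations that can be combined from partial results, such as sums and counts.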

Understanding Data Skew and Its Mitigation Strategies in Distributed Computing

The article explains what data skew is in distributed computing, analyzes its logical and data‑level causes, and presents preventive and remedial techniques such as data partitioning, logical replacement, two‑stage aggregation, increasing parallelism, and data cleaning to improve processing efficiency.

Data Skew · Spark · performance optimization
0 likes · 8 min read
DaTaobao Tech
Sep 6, 2022 · Big Data

SQL Optimization Techniques for ODPS (Open Data Processing Service)

The article presents practical ODPS SQL optimization strategies—including explicit column selection, partition limiting, multi‑insert, proper handling of nulls, join‑type choices, map‑join and skew hints, bucketed tables, and tuned task parameters—illustrated with three real‑world cases that dramatically cut execution time and resource usage.

Big Data · Data Skew · ODPS
0 likes · 23 min read
DataFunTalk
Jun 28, 2022 · Big Data

JD Retail Traffic Data Warehouse Architecture and Processing Practices

This article presents a comprehensive technical overview of JD.com’s retail traffic data processing pipeline, detailing the multi‑layer data warehouse architecture, real‑time and offline data flows, a large‑scale back‑fill case using Iceberg and OLAP, data‑skew detection and mitigation techniques, and future directions involving unified Flink‑Spark streaming‑batch solutions.

Data Skew · Flink · Iceberg
0 likes · 12 min read
JD Retail Technology
Jan 27, 2022 · Big Data

How JD’s Custom Spark Engine Tackles Data Skew for Massive Offline Jobs

This article explains JD’s self‑developed data‑skew mitigation solution for Spark, detailing the problem of uneven key distribution, the limitations of the open‑source AQE implementation, and JD’s OptimizeSkewedJoinV2 algorithm that dramatically reduces stage latency in large‑scale join workloads.

Adaptive Query Execution · Big Data · Data Skew
0 likes · 13 min read
Architect
Jan 7, 2022 · Big Data

Spark Performance Optimization: Principles, Memory Model, Resource Tuning, Data Skew and Shuffle Tuning

This article provides an in‑depth guide to Spark performance optimization, covering the ten development principles, static and unified memory models, resource parameter tuning, data skew detection and mitigation techniques, as well as shuffle‑related configuration adjustments, supplemented with practical code examples and diagrams.

Data Skew · Memory Model · Shuffle
0 likes · 40 min read
Big Data Technology Architecture
Aug 24, 2021 · Big Data

Comprehensive Guide to Spark Performance Optimization, Data Skew Mitigation, and Troubleshooting

This article presents a detailed collection of Spark performance‑tuning techniques—including submit‑script parameters, RDD and operator optimizations, parallelism and memory settings, broadcast variables, Kryo serialization, locality wait adjustments—as well as systematic methods for detecting and resolving data skew and common runtime issues such as shuffle failures, serialization errors, and JVM memory problems.

Data Skew · Shuffle · Spark
0 likes · 21 min read
Laravel Tech Community
May 9, 2021 · Backend Development

Understanding Consistent Hashing: From Simple Modulo Hash to Optimizations

This article explains the drawbacks of a basic modulo hash algorithm for key distribution, demonstrates how consistent hashing resolves scaling and node‑failure issues, and discusses virtual‑node techniques to mitigate data skew and improve load balancing in distributed cache systems.

Data Skew · consistent hashing · distributed caching
0 likes · 5 min read
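As a rough illustration of the virtual-node technique described in the summary above, the following plain-Python sketch builds a small consistent-hash ring. It is not taken from the article: the class name `HashRing`, the `vnodes` parameter, the use of MD5 as the hash, and the node and key names are all assumptions chosen for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is used here only as a convenient, well-spread hash (an assumption).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring where each node owns several virtual points."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) tuples
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each virtual node is a distinct point on the ring for the same node.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        # Find the first ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]


keys = [f"key-{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")  # scale out by one node
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

With a modulo hash, changing the node count remaps most keys. Here, only keys that now belong to the new node change owner, and the virtual nodes help spread that share more evenly across the ring.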
Architect
Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data · Data Skew · Scala
0 likes · 47 min read
Big Data Technology Architecture
Apr 1, 2021 · Big Data

Spark Adaptive Execution: Dynamic Shuffle Partition, Broadcast Join, and Skew Handling

The article explains the limitations of static shuffle partitions, execution‑plan estimation, and data skew in Spark SQL, and describes how Spark Adaptive Execution can automatically adjust shuffle partition numbers, switch join strategies, and mitigate skew through configurable parameters and code examples.

Adaptive Execution · Broadcast Join · Data Skew
0 likes · 11 min read
Big Data Technology & Architecture
Mar 30, 2021 · Big Data

Implementing Real-Time Data Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how Soul's data engineering team replaced nightly batch ETL with real-time Delta Lake ingestion on EMR, detailing the motivations, comparative analysis of Delta, Hudi, Iceberg, the implementation architecture, encountered issues such as data skew and schema evolution, and the solutions adopted to improve performance and reliability.

Data Lake · Data Skew · Delta Lake
0 likes · 13 min read
Big Data Technology & Architecture
Mar 18, 2021 · Big Data

Flink Job Troubleshooting and Performance Optimization: Data Skew, Kafka Configuration, Resource Management, and Checkpoint Issues

This article details common Flink streaming problems such as data skew causing task back‑pressure, oversized Kafka messages, high‑throughput ack settings, slot removal errors, checkpoint timeouts, and resource constraints, and provides concrete configuration changes and architectural adjustments to resolve them.

Checkpoint · Data Skew · Flink
0 likes · 18 min read
Big Data Technology Architecture
Mar 10, 2021 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Solutions, and Shuffle Tuning

This guide presents a complete Spark performance optimization handbook covering development‑time best practices, resource‑parameter tuning, detailed data‑skew detection and mitigation techniques, advanced shuffle‑engine configurations, and practical code examples to help engineers build faster, more reliable Spark jobs.

Data Skew · Resource Tuning · Shuffle
0 likes · 69 min read
vivo Internet Technology
Nov 11, 2020 · Big Data

Understanding Distributed Hash Tables (DHT) and Their Improvements

The article explains how Distributed Hash Tables replace simple modulo hashing with a ring‑based scheme, demonstrates severe data skew in basic implementations, and shows that adding multiple virtual nodes plus a load‑boundary factor dramatically balances storage and request distribution across cluster nodes.

DHT · Data Skew · Distributed Hash Table
0 likes · 9 min read
Big Data Technology & Architecture
Jun 15, 2020 · Big Data

Hive Optimization Techniques and Best Practices for Big Data Processing

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, tuning MapReduce jobs, handling data skew in joins, and selecting appropriate storage formats for large‑scale data warehouses.

Big Data · Data Skew · HiveQL
0 likes · 19 min read
dbaplus Community
Mar 23, 2020 · Big Data

How to Detect and Resolve Data Skew in Spark and Hadoop

This article explains what data skew is in distributed big‑data systems like Spark and Hadoop, why it hurts performance, how to spot it using the Web UI or key statistics, and presents eight practical mitigation techniques ranging from filtering and shuffle parallelism to custom partitioners and broadcast joins.

Broadcast Join · Data Skew · Hadoop
0 likes · 19 min read
Big Data Technology Architecture
Mar 21, 2020 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Handling, and Shuffle Tuning

This article presents a complete guide to Spark performance optimization, covering development‑time best practices, resource‑parameter tuning, systematic detection and resolution of data skew, and detailed shuffle‑related parameter adjustments, all illustrated with Scala code examples.

Data Skew · Spark · performance optimization
0 likes · 67 min read
Big Data Technology Architecture
Mar 19, 2020 · Big Data

Handling Data Skew in Hive: Join, Group By, and COUNT(DISTINCT) Optimizations

Data skew in Hive MapReduce jobs, caused by uneven key distribution during joins, group‑by, or COUNT(DISTINCT) operations, can severely slow tasks, and the article explains common scenarios and practical solutions such as using MapJoin, enabling map‑side aggregation, load‑balancing, and rewriting queries to mitigate skew.

Data Skew · MapJoin · MapReduce
0 likes · 7 min read
Big Data Technology & Architecture
Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data · Data Skew · Hadoop
0 likes · 11 min read
Big Data Technology & Architecture
Jan 30, 2020 · Big Data

Comprehensive Guide to Spark Performance Optimization (Development, Resource, Data Skew, and Shuffle Tuning)

This article provides an in‑depth, step‑by‑step guide to optimizing Spark jobs, covering development‑time best practices, resource‑parameter tuning, data‑skew detection and mitigation techniques, and shuffle‑stage performance tweaks, complete with Scala code examples and practical recommendations.

Big Data · Data Skew · Resource Tuning
0 likes · 67 min read
vivo Internet Technology
Dec 25, 2019 · Big Data

Understanding and Mitigating Data Skew in Spark and Hadoop

Data skew in Spark and Hadoop occurs when a few keys dominate shuffle traffic, causing slow tasks, OOM errors, and job failures; the article describes how to detect skew via UI metrics or sampling and offers mitigation tactics such as filtering keys, increasing shuffle partitions, custom partitioners, broadcast joins, salted keys, and Hadoop‑specific settings.

Data Skew · Partitioning · Shuffle
0 likes · 18 min read
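The salted-key join mentioned in the summary above can be sketched in plain Python as follows. This is a conceptual illustration under stated assumptions, not the article's code and not a Spark or Hadoop API: `salted_join`, `num_salts`, and the sample rows are hypothetical. Rows on the large, skewed side get a random salt, and each row on the small side is replicated once per salt so that every salted key still finds its match.

```python
import random
from collections import defaultdict


def salted_join(big, small, num_salts=4, seed=0):
    """Inner-join two lists of (key, value) pairs using salted keys."""
    rng = random.Random(seed)

    # Small side: replicate every row under each possible salt.
    small_index = defaultdict(list)
    for key, right in small:
        for salt in range(num_salts):
            small_index[(salt, key)].append(right)

    # Big side: give each row one random salt, so a hot key is spread
    # over up to num_salts distinct join keys.
    joined = []
    for key, left in big:
        salt = rng.randrange(num_salts)
        for right in small_index.get((salt, key), []):
            joined.append((key, left, right))
    return joined


# Hypothetical inputs: "hot" dominates the big side.
big = [("hot", i) for i in range(100)] + [("cold", 0)]
small = [("hot", "H"), ("cold", "C"), ("unused", "U")]

result = sorted(salted_join(big, small))
expected = sorted((k, l, r) for k, l in big for k2, r in small if k == k2)
```

The result equals an ordinary inner join. The cost is that the small side grows by a factor of `num_salts`, so this trade-off tends to make sense only when that side is small relative to the skewed one.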
Big Data Technology & Architecture
Oct 14, 2019 · Big Data

Optimizing Spark PageRank: Cache, Checkpoint, Data Skew, and Resource Utilization

This article presents a comprehensive analysis of Spark PageRank performance, detailing the algorithm's basics, the original example code, and four key optimizations—caching with checkpointing, memory‑efficient data structures, handling data skew, and maximizing executor and driver resource usage—backed by experimental results and practical recommendations.

Big Data · Cache · Checkpoint
0 likes · 18 min read
Big Data Technology & Architecture
May 30, 2019 · Big Data

Data Skew Optimization Techniques in Spark

This article explains the phenomenon, causes, detection methods, and a comprehensive set of solutions—including Hive preprocessing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, sampling, random prefixing, and combined strategies—to mitigate data skew in Spark jobs and improve performance.

Big Data · Data Skew · Shuffle
0 likes · 31 min read
Alibaba Cloud Developer
May 23, 2019 · Big Data

How Blink Powers Alibaba’s Real‑Time Supply‑Chain Data Warehouse

This article explains how Alibaba's Blink engine tackles the challenges of building a real-time supply-chain data warehouse, including retraction, dimension-table joins, data skew, timeout statistics, and midnight (00:00) optimizations, through SQL-based stream processing and intelligent resource tuning, and outlines future directions.

Data Skew · Dimension join · Flink
0 likes · 14 min read
dbaplus Community
Mar 27, 2019 · Big Data

How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

Big Data · Data Skew · GC tuning
0 likes · 12 min read
ITPUB
Dec 1, 2017 · Databases

How to Shrink Oracle Indexes for Skewed Columns Using Function Indexes

This article explains why conventional indexes waste space and perform poorly on highly skewed columns, introduces a decode‑based function index that excludes high‑frequency values, details the experimental setup with millions of rows, compares index size and query performance, and outlines the method's limitations.

Data Skew · Function Index · Oracle
0 likes · 10 min read
dbaplus Community
Aug 21, 2017 · Big Data

How to Tackle Spark Data Skew: Practical Solutions and Real‑World Examples

This article explains what Spark data skew is, why it hurts performance, and presents six practical mitigation techniques—including adjusting parallelism, custom partitioners, map‑side joins, and adding random prefixes—backed by detailed experiments, code snippets, and performance comparisons.

Data Skew · Map-side Join · Partitioner
0 likes · 18 min read
Architecture Digest
Apr 24, 2017 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical symptoms, and presents practical strategies—including business‑level adjustments, code tweaks, and platform‑specific tuning—to mitigate and resolve skew in big‑data processing.

Big Data · Data Skew · Hadoop
0 likes · 11 min read
Architecture Digest
May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article provides a comprehensive guide on tackling Spark performance bottlenecks by diagnosing data skew, locating the offending stages and operators, and applying a range of practical solutions—including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies—followed by an in‑depth discussion of shuffle manager evolution and key configuration parameters for fine‑tuning.

Big Data · Data Skew · Shuffle Optimization
0 likes · 35 min read
Meituan Technology Team
May 13, 2016 · Big Data

Spark Performance Optimization Guide: Data Skew and Shuffle Tuning

This advanced Spark performance guide explains how data skew arises during shuffles and presents eight practical solutions—including Hive preprocessing, key filtering, increased shuffle parallelism, two‑stage aggregation, map joins, sampling, random prefixes, and combined strategies—while also detailing key shuffle‑tuning parameters such as spark.shuffle.file.buffer, spark.reducer.maxSizeInFlight, and spark.shuffle.manager to improve memory usage and execution speed.

Big Data · Data Skew · Shuffle Tuning
0 likes · 33 min read