Past Memory Big Data
Author

A popular big-data architecture channel followed by more than 100,000 developers. It publishes articles on Spark, Hadoop, Flink, Kafka, and more. Visit the Past Memory Big Data blog at https://www.iteblog.com, or search for "Past Memory" on Google or Baidu.

58 articles · 0 likes · 22 views · 0 comments
Recent Articles

Latest from Past Memory Big Data

58 recent articles
Past Memory Big Data
Dec 27, 2024 · Big Data

How Uber Cuts Storage Costs with ZSTD Compression in Apache Parquet

Uber’s Hadoop data lake stores hundreds of petabytes in Parquet files. By adopting ZSTD compression, column pruning, and column reordering, Uber achieves up to 79% storage reduction and significant vCore savings. The article presents detailed benchmarks that guide the choice of compression level and describes Uber’s open‑source contributions.

Apache Parquet · Big Data · Compression
0 likes · 14 min read
Past Memory Big Data
Dec 26, 2024 · Big Data

Eliminate Shuffle: Deep Dive into Spark’s Storage Partition Join (SPJ)

This article explains how Storage Partition Join (SPJ), available in Spark 3.3 and later, avoids costly shuffle operations when joining Iceberg tables. It outlines the required table properties and Spark configurations, demonstrates the effect with code examples and execution plans, and explores several realistic join scenarios.

Apache Iceberg · Big Data · SPJ
0 likes · 16 min read
Past Memory Big Data
Dec 24, 2024 · Big Data

Magnet: A Push‑Based Shuffle Service that Scales to Petabyte‑Level Data Processing

LinkedIn’s massive Spark workloads suffer from shuffle bottlenecks caused by tiny shuffle blocks, unreliable RPC connections, and data skew. To address them, the authors designed Magnet, a push‑merge shuffle service that merges blocks into large chunks, improves disk I/O, tolerates failures, and cuts end‑to‑end job time by nearly 30% regardless of hardware.

Disk I/O optimization · Large‑scale data processing · Performance evaluation
0 likes · 56 min read
Past Memory Big Data
Nov 8, 2024 · Big Data

How Spark on Kubernetes Transformed Duodian DMALL’s Big Data Platform

The article details Duodian DMALL’s migration from a traditional Hadoop stack to a cloud‑native Spark‑on‑Kubernetes architecture, explaining the motivations, design choices, component selections, operational challenges, and lessons learned through concrete examples and performance observations.

Apache Celeborn · Big Data · Fluent Bit
0 likes · 21 min read
Past Memory Big Data
Sep 13, 2024 · Backend Development

How Didi Scales Online Search with Elasticsearch: Architecture, Performance, and Stability

The article details Didi's use of Elasticsearch across all of its online retrieval scenarios. It covers the physical‑machine architecture, gateway and control layers, data synchronization methods, cross‑datacenter replication, performance upgrades with JDK 17 and ZGC, cost‑saving ZSTD compression, multi‑tenant isolation, custom security, ongoing stability practices, and a planned upgrade to Elasticsearch 8.13.

Cross‑Datacenter Replication · Didi · Elasticsearch
0 likes · 16 min read
Past Memory Big Data
Jun 27, 2024 · Big Data

Inside Presto 2.0: The Native C++ Query Engine Explained

This article provides a detailed technical overview of Presto 2.0, the native C++ query engine built on the Velox library, covering its motivation, vectorized architecture, memory management, performance benchmarks from Meta and IBM, and deployment practices for large‑scale data warehouses.

Big Data · C++ · Data Warehouse
0 likes · 15 min read
Past Memory Big Data
Jun 20, 2024 · Big Data

How Meituan Scaled Spark with Vectorized Execution Using Gluten + Velox

This article details Meituan's production‑grade adoption of Spark vectorized execution via the open‑source Gluten and Velox stack, explaining SIMD fundamentals, performance motivations, the end‑to‑end integration workflow, staged rollout, encountered challenges, and the resulting resource savings and speedups.

Big Data · Gluten · ORC
0 likes · 33 min read
Past Memory Big Data
Jun 6, 2024 · Operations

How Uber Tuned GC to Boost Presto Cluster Stability

Uber runs over 20 Presto clusters serving more than 500,000 daily queries, but frequent full GCs and OOMs threatened stability; by analyzing G1GC behavior and adjusting IHOP, heap waste, free space, and young‑gen size on JDK 8 and JDK 11, they cut full GC occurrences by up to 80% and markedly improved overall reliability.

Cluster stability · G1GC · JDK11
0 likes · 13 min read