Author

Past Memory Big Data

A popular big‑data architecture channel followed by more than 100,000 developers. It publishes articles on Spark, Hadoop, Flink, Kafka, and more. Visit the Past Memory Big Data blog at https://www.iteblog.com, or search for "Past Memory" on Google or Baidu.


Latest from Past Memory Big Data

59 recent articles


Apr 19, 2025 · Artificial Intelligence

Databricks Acquires Fennel: Is Real-Time Computing + AI the Ultimate Data Platform?

The article examines Databricks' acquisition of Fennel, an incremental computation engine. It details how Fennel's unified batch and stream processing, incremental updates, Python‑native development, and built‑in data governance can eliminate data silos, cut costs by up to 90%, and accelerate real‑time feature engineering for AI models, and it discusses the industry impact and future roadmap.

AI-infrastructure · Databricks · Fennel

0 likes · 6 min read



Dec 27, 2024 · Big Data

How Uber Cuts Storage Costs with ZSTD Compression in Apache Parquet

Uber’s Hadoop data lake stores hundreds of petabytes in Parquet files. By adopting ZSTD compression, column pruning, and column reordering, Uber achieves up to 79% storage reduction and significant vCore savings. The article includes detailed benchmarks for choosing compression levels and describes the related open‑source contributions.

Apache Parquet · Big Data · Hadoop

0 likes · 14 min read



Dec 26, 2024 · Big Data

Eliminate Shuffle: Deep Dive into Spark’s Storage Partitioned Join (SPJ)

This article explains how Storage Partitioned Join (SPJ), available in Spark 3.3 and later, avoids costly shuffle operations when joining Iceberg tables. It outlines the required table properties and Spark configurations, demonstrates the effect with code examples and execution plans, and explores several realistic join scenarios.

Apache Iceberg · Big Data · SPJ

0 likes · 16 min read



Dec 24, 2024 · Big Data

Magnet: A Push‑Based Shuffle Service that Scales to Petabyte‑Level Data Processing

LinkedIn’s massive Spark workloads suffer from shuffle bottlenecks caused by tiny shuffle blocks, unreliable RPC connections, and data skew. To address them, the authors designed Magnet, a push‑merge shuffle service that merges blocks into large chunks, improves disk I/O, tolerates failures, and cuts end‑to‑end job time by nearly 30% regardless of hardware.

Disk I/O optimization · Large‑scale data processing · Push‑based service

0 likes · 56 min read



Nov 8, 2024 · Big Data

How Spark on Kubernetes Transformed Duodian DMALL’s Big Data Platform

The article details Duodian DMALL’s migration from a traditional Hadoop stack to a cloud‑native Spark‑on‑Kubernetes architecture, explaining the motivations, design choices, component selections, operational challenges, and lessons learned through concrete examples and performance observations.

Apache Celeborn · Big Data · Cloud Native

0 likes · 21 min read



Sep 13, 2024 · Backend Development

How Didi Scales Online Search with Elasticsearch: Architecture, Performance, and Stability

The article details Didi's use of Elasticsearch across all of its online retrieval scenarios. It covers the physical‑machine architecture, gateway and control layers, data synchronization methods, cross‑datacenter replication, performance upgrades with JDK 17 and ZGC, cost‑saving ZSTD compression, multi‑tenant isolation, custom security, and ongoing stability practices, and it closes with the planned upgrade to Elasticsearch 8.13.

Cross‑Datacenter Replication · Didi · Elasticsearch

0 likes · 16 min read



Aug 2, 2024 · Big Data

How Haijing Tech Built a Real-Time Telecom Analytics Platform with ByConity

Haijing Technology faced Hadoop's real‑time limits and ClickHouse's operational pain points, so it adopted the open‑source ByConity platform, which provides a unified table engine, fast multi‑table joins, and seamless scaling to deliver a carrier‑grade real‑time analytics solution.

Big Data · ByConity · ClickHouse

0 likes · 11 min read



Jun 27, 2024 · Big Data

Inside Presto 2.0: The Native C++ Query Engine Explained

This article provides a detailed technical overview of Presto 2.0, the native C++ query engine built on the Velox library, covering its motivation, vectorized architecture, memory management, performance benchmarks from Meta and IBM, and deployment practices for large‑scale data warehouses.

Big Data · C++ · Data Warehouse

0 likes · 15 min read



Jun 20, 2024 · Big Data

How Meituan Scaled Spark with Vectorized Execution Using Gluten + Velox

This article details Meituan's production‑grade adoption of Spark vectorized execution via the open‑source Gluten and Velox stack, explaining SIMD fundamentals, performance motivations, the end‑to‑end integration workflow, staged rollout, encountered challenges, and the resulting resource savings and speedups.

Big Data · Gluten · ORC

0 likes · 33 min read



Jun 6, 2024 · Operations

How Uber Tuned GC to Boost Presto Cluster Stability

Uber runs over 20 Presto clusters serving more than 500,000 queries a day, but frequent full GCs and OOMs threatened stability. By analyzing G1GC behavior and adjusting IHOP (initiating heap occupancy percent), heap waste, free space, and young‑gen size on JDK 8 and JDK 11, they cut full GC occurrences by up to 80% and markedly improved overall reliability.

Cluster stability · G1GC · JDK11

0 likes · 13 min read
