Tagged articles

Spark

623 articles · Page 1 of 7
dbaplus Community
dbaplus Community
Jul 2, 2026 · Fundamentals

Production Hit by Silent Data Corruption: JDK 25 G1GC Bug Explained

A rare silent data‑corruption bug in JDK 25’s G1GC caused Parquet and ORC files written by Spark and Flink to become unreadable, prompting a multi‑stage investigation that traced the issue to an optional evacuation flaw affecting JNI‑pinned objects, which was later back‑ported and fixed in the OpenJDK community.

AI debuggingFlinkG1GC
0 likes · 20 min read
Production Hit by Silent Data Corruption: JDK 25 G1GC Bug Explained
DataFunTalk
DataFunTalk
Jun 30, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

Xiaohongshu, with over 3.5 billion monthly users and daily logs in the trillions, migrated 500 PB of data to Alibaba Cloud and iterated its data platform through four architecture generations—ClickHouse‑based ad‑hoc, Lambda, Lakehouse, and a unified incremental compute model—cutting resource, development, and storage costs to one‑third while delivering sub‑10‑second query latency at petabyte scale.

Big DataClickHouseData Architecture
0 likes · 22 min read
How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era
DataFunTalk
DataFunTalk
Jun 24, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu, with over 350 million monthly users and daily logs in the billions, migrated its data platform from AWS to Alibaba Cloud and iterated four times—from a ClickHouse‑based ad‑hoc layer to a Lambda architecture and finally a Lakehouse with incremental compute—cutting architecture complexity, resource cost and development effort each to about one‑third while delivering second‑level analytics on trillion‑scale data.

Big DataClickHouseData Architecture
0 likes · 22 min read
How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era
DataFunTalk
DataFunTalk
Jun 21, 2026 · Big Data

How Zhihu Optimized Spark Jobs with Gluten: A Practical Deep‑Dive

This article details Zhihu's end‑to‑end experience of migrating Spark SQL workloads to the open‑source Gluten framework, covering background performance benchmarks, the architecture of Gluten and Velox, consistency and performance challenges encountered during migration, the concrete fixes applied, and the resulting resource savings and future plans.

Big DataGlutenOptimization
0 likes · 22 min read
How Zhihu Optimized Spark Jobs with Gluten: A Practical Deep‑Dive
DataFunTalk
DataFunTalk
Jun 20, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's step‑by‑step migration from a simple ClickHouse‑based analytics stack to a Lambda‑style 2.0 architecture and finally to a Lakehouse‑based 3.0 design, highlighting concrete performance numbers, cost reductions, and the definition of a generic incremental‑compute model (SPOT) that underpins the evolution.

Big DataClickHouseData Architecture
0 likes · 22 min read
How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era
DataFunTalk
DataFunTalk
May 28, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis to a Lambda‑style architecture and finally to a lakehouse with generic incremental compute, cutting architecture complexity, resource and development costs by one‑third while delivering second‑level queries over trillions of rows.

Big DataClickHouseData Architecture
0 likes · 21 min read
How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era
Big Data Tech Team
Big Data Tech Team
May 24, 2026 · Big Data

Data Warehouse Interview Pitfall Guide 2.0: Avoid Common SQL, Modeling, and ETL Mistakes

This guide compiles the most frequent interview pitfalls for data warehouse roles, covering SQL join and aggregation errors, window function misuse, subquery versus CTE performance myths, dimensional modeling mistakes, SCD implementation traps, layered design issues, data quality handling, ETL traps, Hive and Spark performance questions, real‑time warehousing considerations, and effective interview strategies.

Big DataETLHive
0 likes · 3 min read
Data Warehouse Interview Pitfall Guide 2.0: Avoid Common SQL, Modeling, and ETL Mistakes
DataFunSummit
DataFunSummit
May 22, 2026 · Big Data

How OPPO Accelerates Multimodal Data & AI Fusion with Gravitino and Curvine

OPPO tackles explosive multimodal data growth by unifying metadata with Gravitino and boosting I/O performance using the open‑source Curvine cache, delivering a four‑layer data‑lake architecture that resolves data islands, metadata chaos, and bandwidth bottlenecks while achieving near‑commercial query speeds.

CurvineGravitinoLanceDB
0 likes · 11 min read
How OPPO Accelerates Multimodal Data & AI Fusion with Gravitino and Curvine
DataFunTalk
DataFunTalk
May 22, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Complexity and Cost by One‑Third in the Big AI Data Era

The article details Xiaohongshu's evolution from a simple ClickHouse‑based analytics layer to a Lambda‑enabled 2.0 stack and finally a Lakehouse‑based 3.0 architecture, showing how each iteration reduced infrastructure complexity, resource consumption and development effort by roughly one‑third while supporting trillions of daily events and AI‑driven use cases.

Big DataClickHouseData Architecture
0 likes · 21 min read
How Xiaohongshu Cut Data Architecture Complexity and Cost by One‑Third in the Big AI Data Era
DataFunTalk
DataFunTalk
May 11, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis to a Lambda‑style architecture and finally to a lakehouse built on Iceberg, StarRocks, Flink and Spark, cutting architecture complexity, resource and development costs by two‑thirds while supporting trillions of daily events with sub‑second query latency.

Big DataClickHouseFlink
0 likes · 22 min read
How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era
DataFunTalk
DataFunTalk
May 6, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's four‑stage data‑platform evolution—from a simple ClickHouse ad‑hoc setup to a Lambda‑based 2.0 design and finally a lakehouse‑driven 3.0 architecture—highlighting the adoption of general incremental compute, cost‑reduction to one‑third, performance gains of up to ten‑fold, and the SPOT standards that guide the new system.

Big DataClickHouseData Architecture
0 likes · 21 min read
How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era
DataFunTalk
DataFunTalk
Apr 29, 2026 · Big Data

How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack to a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort by roughly one‑third while supporting petabyte‑scale, sub‑second queries across its 350 million‑user app.

Big DataClickHouseData Architecture
0 likes · 22 min read
How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era
Lao Guo's Learning Space
Lao Guo's Learning Space
Apr 29, 2026 · Big Data

Designing a Full-Stack Credit Data System: From Ingestion to Real-Time Decision

The article dissects a credit data system architecture, detailing six logical layers—from multi-source data collection and feature engineering (including graph features and feature stores) to model training, real‑time stream processing, decision engine integration, and privacy‑preserving computation—while explaining the trade‑offs, tools, and performance targets needed for accurate, low‑latency risk assessment.

Credit ScoringFeature StoreFlink
0 likes · 16 min read
Designing a Full-Stack Credit Data System: From Ingestion to Real-Time Decision
Big Data Tech Team
Big Data Tech Team
Apr 8, 2026 · Interview Experience

Master Spark Tuning for Data Warehouse Interviews: Real Cases & Tips

Learn how to demonstrate real Spark optimization skills in data‑warehouse interviews by exploring two detailed case studies—small‑file merging in ODS‑to‑DWD ETL and shuffle‑skew mitigation in DWS aggregation—plus key interview questions and practical troubleshooting steps that separate theory from hands‑on expertise.

Data WarehouseInterview TipsPerformance Tuning
0 likes · 9 min read
Master Spark Tuning for Data Warehouse Interviews: Real Cases & Tips
Baidu Geek Talk
Baidu Geek Talk
Mar 23, 2026 · Databases

How Baidu’s MEG Platform Revamped ClickHouse with a Lakehouse Architecture

This article analyzes the challenges of scaling ClickHouse within Baidu’s MEG data platform and details a lake‑house solution that decouples storage and compute, integrates a meta‑service for transparent data access, optimizes query performance through caching, data roll‑up and layout tuning, and introduces a unified query gateway that gracefully falls back to Spark for complex workloads.

ClickHouseData PlatformLakehouse
0 likes · 25 min read
How Baidu’s MEG Platform Revamped ClickHouse with a Lakehouse Architecture
DeWu Technology
DeWu Technology
Mar 2, 2026 · Big Data

Mastering Spark UI: Deep Dive into Metrics, Tuning, and Real‑World Cases

This article provides a comprehensive guide to Spark UI, explaining each primary and secondary tab, the key metrics they expose, and how to interpret them for performance bottleneck detection, followed by two detailed case studies and practical tuning recommendations for Spark workloads.

Big DataCase StudyMetrics
0 likes · 19 min read
Mastering Spark UI: Deep Dive into Metrics, Tuning, and Real‑World Cases
Architect-Kip
Architect-Kip
Mar 2, 2026 · Big Data

How to Build a Scalable Tiered Archive & Query System for MySQL Data

This article presents a comprehensive design for a layered storage and unified scheduling platform that archives MySQL historical data, reduces storage costs, ensures high‑performance queries, and enables efficient data analysis through tiered hot, warm, and cold storage using big‑data technologies.

Data ArchivingDorisFlink
0 likes · 13 min read
How to Build a Scalable Tiered Archive & Query System for MySQL Data
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Feb 2, 2026 · Big Data

Real‑Time Analytics with Alibaba Cloud Serverless Spark & Paimon for Taobao Flash Sale

This article details how Alibaba Cloud EMR Serverless Spark combined with the Paimon lakehouse framework enables Taobao Flash Sale’s retail data team to achieve low‑latency, high‑throughput real‑time analytics, batch processing, and feature generation, outlining architecture evolution, performance gains, and practical Spark tuning techniques.

Big DataLakehousePaimon
0 likes · 18 min read
Real‑Time Analytics with Alibaba Cloud Serverless Spark & Paimon for Taobao Flash Sale
Big Data Technology Tribe
Big Data Technology Tribe
Jan 20, 2026 · Big Data

Extending Spark SQL with LanceSparkSessionExtensions: A Complete Guide

This article explains how to inject the LanceSpark plugin into Spark, covering the core LanceSparkSessionExtensions class, various ways to register extensions, the custom parser and planner strategy implementations, and the underlying Spark mechanisms such as injectParser, injectPlannerStrategy, and PredicateHelper.

DataSourceV2LanceSparkPlannerStrategy
0 likes · 14 min read
Extending Spark SQL with LanceSparkSessionExtensions: A Complete Guide
ITPUB
ITPUB
Jan 15, 2026 · Databases

How to Migrate ClickHouse Data to Doris: Three Practical Strategies Tested

Facing a ClickHouse cluster shutdown, the author explores three migration methods—using Doris’s ClickHouse catalog, exporting to files with Broker/Stream Load, and Spark—to transfer ~10 billion rows to Doris, evaluating each for simplicity, bugs, and performance, and sharing detailed steps, code snippets, and benchmark results.

ClickHouseData MigrationDoris
0 likes · 9 min read
How to Migrate ClickHouse Data to Doris: Three Practical Strategies Tested
Big Data Tech Team
Big Data Tech Team
Jan 5, 2026 · Big Data

Top 10 Data Warehouse Interview Questions Every 2026 Engineer Must Master

This article compiles the most frequently asked interview questions for 2026 data‑warehouse development engineers, covering core concepts, layer architecture, SQL optimization, window functions, Hive vs Spark, data skew solutions, modeling metrics, slowly changing dimensions, scheduling tools, data quality monitoring, and real project experience.

Data WarehouseHiveInterview Questions
0 likes · 8 min read
Top 10 Data Warehouse Interview Questions Every 2026 Engineer Must Master
DevOps Engineer
DevOps Engineer
Dec 27, 2025 · Artificial Intelligence

Demystifying GitHub AI: Models, Agents, Spaces, Spark, and More

This article explains GitHub's AI ecosystem—Models, Copilot, Agents, Spaces, Spark, Instructions, Skills, and the Model Context Protocol—clarifying each component, their relationships, and practical steps for developers to integrate them into their workflow.

AgentsCopilotGitHub AI
0 likes · 12 min read
Demystifying GitHub AI: Models, Agents, Spaces, Spark, and More
vivo Internet Technology
vivo Internet Technology
Dec 10, 2025 · Big Data

Vivo’s 800‑Day Journey Optimizing Celeborn Remote Shuffle Service at PB Scale

This technical report details how Vivo’s big‑data platform adopted Celeborn as its remote shuffle service, evaluated alternatives, tuned hardware and software configurations, implemented performance and stability enhancements, and outlines future operational and community‑driven improvements for handling petabyte‑scale shuffle workloads.

Big DataRemote Shuffle ServiceSpark
0 likes · 20 min read
Vivo’s 800‑Day Journey Optimizing Celeborn Remote Shuffle Service at PB Scale
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Nov 15, 2025 · Big Data

From a Decade-Long Big Data Journey to a Cloud‑Native Lakehouse

This article chronicles a ten‑year evolution of a self‑built big data platform—detailing early Hadoop clusters, successive migrations to Spark, Hive, Hudi, and StarRocks, the operational challenges encountered, and the comprehensive shift to Alibaba Cloud EMR Serverless that delivered significant cost, performance, and stability gains while outlining future intelligent‑ecosystem plans.

Big DataData LakeEMR Serverless
0 likes · 17 min read
From a Decade-Long Big Data Journey to a Cloud‑Native Lakehouse
Instant Consumer Technology Team
Instant Consumer Technology Team
Nov 10, 2025 · Big Data

Fixing Multi‑Version, Multi‑Cluster and HA with Apache Kyuubi for Spark/Flink

Apache Kyuubi, an enterprise‑grade multi‑tenant data gateway, replaces Livy and Flink SQL Gateway to support multiple engine versions, cross‑cluster elastic scheduling, high‑availability batch jobs, and traffic control, dramatically reducing deployment complexity, improving resource utilization, and accelerating release cycles for large‑scale Spark and Flink workloads.

Apache KyuubiBig DataData Gateway
0 likes · 18 min read
Fixing Multi‑Version, Multi‑Cluster and HA with Apache Kyuubi for Spark/Flink
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Oct 18, 2025 · Big Data

Alibaba Cloud EMR’s AI Evolution: Accelerating Big Data Performance

Since its 2016 launch, Alibaba Cloud EMR has transformed from a basic open‑source Hadoop service into a high‑performance, AI‑enabled big‑data platform, delivering optimized I/O, vectorized processing, and integrated AI functions such as natural‑language SQL, StarRocks and Spark enhancements, while supporting diverse industry workloads.

Cloud ComputingEMRSpark
0 likes · 9 min read
Alibaba Cloud EMR’s AI Evolution: Accelerating Big Data Performance
DataFunSummit
DataFunSummit
Sep 21, 2025 · Big Data

Breaking the CPU Wall: BIGO’s Gluten Engine Accelerates Spark and Flink

When big‑data workloads hit the CPU wall, BIGO’s adoption of the open‑source Gluten project delivers native‑engine execution for Spark and a roadmap for Flink, achieving up to 30% end‑to‑end speedup, 50% memory savings, and a scalable, cost‑effective data processing platform.

Big DataFlinkGluten
0 likes · 16 min read
Breaking the CPU Wall: BIGO’s Gluten Engine Accelerates Spark and Flink
Architect's Must-Have
Architect's Must-Have
Sep 15, 2025 · Big Data

Mastering Spark Streaming Rate Control: A Deep Dive into Backpressure

This article explains Spark Streaming's rate control mechanisms, covering static limits, the dynamic back‑pressure feature introduced in Spark 1.5, the PID‑based estimator, RPC communication, and how Guava's token‑bucket RateLimiter enforces the calculated thresholds to ensure stability and optimal throughput.

RateControlSparkStreaming
0 likes · 13 min read
Mastering Spark Streaming Rate Control: A Deep Dive into Backpressure
Big Data Tech Team
Big Data Tech Team
Aug 25, 2025 · Interview Experience

Essential Big Data Interview Questions for Data Warehouse Engineer Roles

A comprehensive list of interview topics covering self‑introduction, career moves, data‑warehouse design, team building, architecture comparisons, fact‑table classification, common dimensions, performance tuning, and data‑governance for aspiring big‑data engineers.

Big DataData GovernanceFlink
0 likes · 4 min read
Essential Big Data Interview Questions for Data Warehouse Engineer Roles
Big Data Tech Team
Big Data Tech Team
Aug 24, 2025 · Big Data

Top 18 Data Warehouse Engineer Interview Questions from Meituan and ByteDance

This article compiles 18 essential interview topics for data warehouse engineer roles, covering self‑introduction, architecture layering, dimensional modeling, HDFS operations, Spark vs MapReduce, join implementation, SQL challenges, OLAP selection, real‑time quality assurance, and job transition considerations.

Data WarehouseHDFSOLAP
0 likes · 3 min read
Top 18 Data Warehouse Engineer Interview Questions from Meituan and ByteDance
Su San Talks Tech
Su San Talks Tech
Jul 17, 2025 · Big Data

How to De‑Duplicate 1 Billion QQ Numbers Using Under 1 GB of Memory

This article explores multiple techniques—including bitmap indexing, Bloom filters, external sorting, Spark, and layered bitmap structures—to efficiently remove duplicate QQ numbers from a dataset of up to one billion entries while keeping memory usage below a gigabyte and maintaining high accuracy.

DeduplicationSparkbitmap
0 likes · 12 min read
How to De‑Duplicate 1 Billion QQ Numbers Using Under 1 GB of Memory
Architect
Architect
Jul 7, 2025 · Big Data

How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×

This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.

Data WarehouseETLParquet
0 likes · 21 min read
How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×
Big Data Technology & Architecture
Big Data Technology & Architecture
Jul 4, 2025 · Big Data

Spark 4.0: New Features, Performance Gains, and Why It Still Leads Big Data

Despite the hype around Flink and AI models, Spark 4.0’s release brings a lightweight Python client, Spark Connect GA, enhanced SQL optimization, vectorized execution, and AI integration, reaffirming its leading position in the big‑data ecosystem while hinting at future challenges and innovations.

Big DataData EngineeringPerformance Optimization
0 likes · 6 min read
Spark 4.0: New Features, Performance Gains, and Why It Still Leads Big Data
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Jun 10, 2025 · Big Data

Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Big DataCloud NativeData Lake
0 likes · 12 min read
Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark
Big Data Tech Team
Big Data Tech Team
Apr 17, 2025 · Big Data

Essential Spark Interview Q&A: Master Data Warehouse Engineer Questions

This article compiles a comprehensive set of Spark interview questions frequently asked by leading tech companies, providing detailed explanations of Spark’s performance mechanisms, architecture, RDD persistence, checkpointing, streaming, dependency types, HA setup, and practical coding examples to help data warehouse engineers prepare effectively.

Data WarehouseRDDSpark
0 likes · 21 min read
Essential Spark Interview Q&A: Master Data Warehouse Engineer Questions
vivo Internet Technology
vivo Internet Technology
Apr 16, 2025 · Big Data

Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management

The article explains how the vivo Internet Big Data team containerized offline Spark jobs and deployed them with the Spark Operator on a mixed online‑offline Kubernetes cluster, using elastic scheduling and resource‑over‑subscription to boost CPU utilization by 30‑40% and handle over 100,000 daily tasks.

Big DataResource ManagementSpark
0 likes · 36 min read
Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management
DataFunSummit
DataFunSummit
Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and future roadmap for data‑lake technologies.

Apache HudiBig DataData Lake
0 likes · 13 min read
Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD
iQIYI Technical Product Team
iQIYI Technical Product Team
Mar 27, 2025 · Big Data

Cost‑Effective Real‑Time Data Warehouse 2.0: Migrating from Kafka to Iceberg

iQIYI transformed its real‑time data warehouse by replacing a costly Kafka‑based Lambda stack with a unified stream‑batch Iceberg lake, cutting storage expenses by 90%, halving compute costs, extending data retention, and delivering minute‑level freshness for 90% of use cases while preserving second‑level processing where needed.

FlinkIcebergReal-Time Data Warehouse
0 likes · 11 min read
Cost‑Effective Real‑Time Data Warehouse 2.0: Migrating from Kafka to Iceberg
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Mar 20, 2025 · Big Data

How to Read and Write StarRocks Data with EMR Serverless Spark

This step‑by‑step guide explains how to use EMR Serverless Spark together with the StarRocks Spark Connector to create a workspace, upload the connector JAR, configure network connections, create databases and tables in StarRocks, and perform read/write operations via SQL sessions, Notebook sessions, or batch Spark jobs, complete with code examples and UI screenshots.

Big DataData IntegrationEMR Serverless
0 likes · 14 min read
How to Read and Write StarRocks Data with EMR Serverless Spark
Big Data Technology & Architecture
Big Data Technology & Architecture
Mar 3, 2025 · Big Data

The Turning Point for Data Development: From Traditional Data Engineering to AI Data Engineering

The article analyzes how the rapid rise of open‑source large‑model AI in 2025 is reshaping the data development profession, urging developers to transition from specialized data‑engineer roles to full‑stack AI data engineering skills such as distributed computing, lake‑house architectures, and model tuning.

AIBig DataData Engineering
0 likes · 7 min read
The Turning Point for Data Development: From Traditional Data Engineering to AI Data Engineering
DataFunSummit
DataFunSummit
Feb 22, 2025 · Big Data

Blaze Engine: A Rust‑Based Native Vectorized Execution Engine for Spark SQL

The article introduces Blaze, Kuaishou's Rust‑powered native execution engine that vectorizes Spark SQL workloads, explains its architecture and operation, presents benchmark results showing up to 50% latency reduction, and details internal deployments, industry case studies, community collaborations, and the 2025 roadmap.

Big DataPerformance OptimizationSpark
0 likes · 12 min read
Blaze Engine: A Rust‑Based Native Vectorized Execution Engine for Spark SQL
DataFunTalk
DataFunTalk
Feb 20, 2025 · Big Data

From Integrated Storage‑Compute to Decoupled Architecture: Practical Exploration of Kubernetes, Kyuubi, Celeborn, Blaze, and Hue in Big Data Platforms

This article analyzes the transition from a tightly coupled storage‑compute architecture to a decoupled model, detailing how Kubernetes, Kyuubi, Celeborn, Blaze, and Hue together solve resource inefficiencies, improve scalability, and boost query performance in modern big‑data environments.

Big DataBlazeKyuubi
0 likes · 16 min read
From Integrated Storage‑Compute to Decoupled Architecture: Practical Exploration of Kubernetes, Kyuubi, Celeborn, Blaze, and Hue in Big Data Platforms
21CTO
21CTO
Feb 4, 2025 · Big Data

Why Python Beats Java and Scala for Modern Data Engineering

The article compares Java, Scala, SQL, and Python for data‑engineering tasks, arguing that Python’s versatility, rich ecosystem, and ease of use make it the preferred language for both small‑scale and massive Spark workloads despite its performance trade‑offs.

Big DataData EngineeringSQL
0 likes · 7 min read
Why Python Beats Java and Scala for Modern Data Engineering
DataFunSummit
DataFunSummit
Feb 1, 2025 · Big Data

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.

Big DataCodegenEMR Serverless
0 likes · 12 min read
Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Jan 26, 2025 · Big Data

How a FinTech Scaled Its Data Platform with Alibaba Cloud EMR Serverless Spark

Weifin, a fintech innovator, tackled massive data‑scale challenges by adopting Alibaba Cloud EMR Serverless Spark, building a unified Spark‑based platform that supports data collection, lake ingestion, distributed machine‑learning training, and intelligent risk‑control applications, while achieving performance gains, cost reduction, and scalable automation.

FinTechSparkmachine learning
0 likes · 10 min read
How a FinTech Scaled Its Data Platform with Alibaba Cloud EMR Serverless Spark
Airbnb Technology Team
Airbnb Technology Team
Jan 24, 2025 · Artificial Intelligence

Chronon — An Open-Source Framework for Production-Level Feature Engineering in Machine Learning

Chronon is an open‑source framework that centralizes feature definitions to guarantee training‑inference consistency, eliminates complex ETL pipelines, and supports real‑time and batch processing across diverse data sources, cutting feature‑development cycles from months to under a week, as demonstrated by Airbnb’s 40,000‑feature deployment.

ChrononHiveReal-time Data
0 likes · 10 min read
Chronon — An Open-Source Framework for Production-Level Feature Engineering in Machine Learning
dbaplus Community
dbaplus Community
Jan 19, 2025 · Big Data

How to Write Elegant, High‑Performance SQL for Big Data Pipelines

This article shares practical techniques for writing clean, efficient SQL in large‑scale data environments, covering predicate pushdown, sub‑queries, deduplication strategies, bucket optimization, and automation with Python‑Spark integration to improve readability and execution speed.

HiveOptimizationSpark
0 likes · 14 min read
How to Write Elegant, High‑Performance SQL for Big Data Pipelines
DataFunSummit
DataFunSummit
Jan 16, 2025 · Big Data

Zhihu Big Data Cost‑Reduction Practices: FinOps, Erasure Coding, ZSTD Compression, Spark Auto‑Tuning, and Remote Shuffle Service

This article details Zhihu's comprehensive cost‑reduction and efficiency‑boosting initiatives for its big‑data platform, covering FinOps‑driven financial operations, hybrid‑cloud architecture, cost allocation models, operational monitoring, and technical optimizations such as erasure coding, ZSTD compression, Spark auto‑tuning, and a remote shuffle service.

Big DataCloud Cost ManagementFinOps
0 likes · 22 min read
Zhihu Big Data Cost‑Reduction Practices: FinOps, Erasure Coding, ZSTD Compression, Spark Auto‑Tuning, and Remote Shuffle Service
DataFunSummit
DataFunSummit
Jan 14, 2025 · Big Data

Tencent Real-Time Lakehouse Intelligent Optimization Practice

This presentation details Tencent's real‑time lakehouse architecture and the four key topics—lakehouse design, intelligent optimization services, scenario‑driven capabilities, and future outlook—covering components such as Spark, Flink, Iceberg, Auto‑Optimize Service, indexing, clustering, AutoEngine, and PyIceberg implementations.

Auto OptimizeBig DataFlink
0 likes · 12 min read
Tencent Real-Time Lakehouse Intelligent Optimization Practice
DataFunSummit
DataFunSummit
Jan 3, 2025 · Big Data

Tencent Real‑Time Lakehouse Intelligent Optimization Practices

This article presents Tencent's end‑to‑end real‑time lakehouse architecture, detailing its three‑layer design, the Auto Optimize Service modules such as compaction, indexing, clustering and engine acceleration, as well as scenario‑driven capabilities like multi‑stream joins, primary‑key tables, in‑place migration and PyIceberg support, and concludes with future optimization directions.

Big DataFlinkIceberg
0 likes · 11 min read
Tencent Real‑Time Lakehouse Intelligent Optimization Practices
Bilibili Tech
Bilibili Tech
Jan 3, 2025 · Big Data

Evolution and Production Practices of Apache Celeborn Remote Shuffle Service at Bilibili

Bilibili replaced Spark’s unstable External Shuffle Service with a push‑based approach, then deployed Apache Celeborn’s remote shuffle on Kubernetes using HA masters, tiered workers, extensive monitoring, history‑based routing, chaos testing, and seamless Spark, Flink, and MapReduce integration, while planning self‑healing, elastic scaling, and priority‑aware I/O enhancements.

Apache CelebornBig DataFlink
0 likes · 28 min read
Evolution and Production Practices of Apache Celeborn Remote Shuffle Service at Bilibili
Big Data Technology & Architecture
Big Data Technology & Architecture
Jan 2, 2025 · Big Data

Apache Paimon: Core Capabilities, Table Types, LSM Tree, Buckets, Merge Engines, and Operational Details

This article provides a comprehensive overview of Apache Paimon, covering its real‑time lake ingestion, unified stream‑batch processing, table types (primary‑key and append‑only), LSM‑tree storage, bucket mechanisms, merge‑engine options, compaction strategies, concurrency control, consumption methods, tag management, data cleanup, and system tables for big‑data workloads.

Apache PaimonBig DataFlink
0 likes · 25 min read
Apache Paimon: Core Capabilities, Table Types, LSM Tree, Buckets, Merge Engines, and Operational Details
Big Data Technology & Architecture
Big Data Technology & Architecture
Dec 31, 2024 · Big Data

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

This article explains how Spark ≥ 3.3 introduces Storage Partitioned Join (SPJ) to avoid costly shuffle operations when joining partitioned V2 source tables such as Apache Iceberg, detailing the required conditions, configuration settings, practical code examples, and various join scenarios including mismatched partitions and data skew.

BucketingData SkewSQL
0 likes · 15 min read
Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables
JD Tech
JD Tech
Dec 30, 2024 · Big Data

Techniques for Writing Elegant and Efficient SQL in Big Data Environments

The article shares practical methods and code examples for making SQL both readable and high‑performing in large‑scale data platforms, covering predicate push‑down with subqueries, deduplication strategies, bucket utilization, and Python‑driven job parameter handling.

Data EngineeringHiveSQL
0 likes · 14 min read
Techniques for Writing Elegant and Efficient SQL in Big Data Environments
DataFunSummit
DataFunSummit
Dec 27, 2024 · Big Data

Tencent Real-time Lakehouse Intelligent Optimization Practice

This presentation describes Tencent's real-time lakehouse architecture, including data lake compute, management, and storage layers, and details the intelligent optimization services—such as compaction, indexing, clustering, and auto-engine—designed to improve query performance, storage cost, and operational efficiency for large-scale data processing.

AutoEngineCompactionFlink
0 likes · 11 min read
Tencent Real-time Lakehouse Intelligent Optimization Practice
Bilibili Tech
Bilibili Tech
Dec 27, 2024 · Big Data

Consistency Architecture for Bilibili Recommendation Model Data Flow

The article outlines Bilibili’s revamped recommendation data‑flow architecture that eliminates timing and calculation inconsistencies by snapshotting online features, unifying feature computation in a single C++ library accessed via JNI, and orchestrating label‑join and sample extraction through near‑line Kafka/Flink pipelines, with further performance gains and Iceberg‑based future extensions.

Data ConsistencyFlinkIceberg
0 likes · 12 min read
Consistency Architecture for Bilibili Recommendation Model Data Flow
Past Memory Big Data
Past Memory Big Data
Dec 27, 2024 · Big Data

How Uber Cuts Storage Costs with ZSTD Compression in Apache Parquet

Uber’s data lake on Hadoop stores hundreds of petabytes in Parquet files and, by adopting ZSTD compression, column pruning, and column reordering, achieves up to 79% storage reduction and significant vCore savings, with detailed benchmarks guiding optimal compression levels and open‑source contributions.

Apache ParquetBig DataHadoop
0 likes · 14 min read
How Uber Cuts Storage Costs with ZSTD Compression in Apache Parquet
Past Memory Big Data
Past Memory Big Data
Dec 26, 2024 · Big Data

Eliminate Shuffle: Deep Dive into Spark’s Storage Partition Join (SPJ)

This article explains how Spark ≥ 3.3’s Storage Partition Join (SPJ) can avoid costly shuffle operations by using Iceberg tables, outlines the required table properties and Spark configurations, demonstrates the effect with code examples and execution plans, and explores several realistic join scenarios.

Apache IcebergBig DataSPJ
0 likes · 16 min read
Eliminate Shuffle: Deep Dive into Spark’s Storage Partition Join (SPJ)
dbaplus Community
dbaplus Community
Dec 24, 2024 · Big Data

How Bilibili Scaled Its Tag System for Massive Data and Real‑Time Accuracy

The article details Bilibili's comprehensive redesign of its tag system—including background challenges, architectural layers, technical upgrades like Iceberg integration and shard‑based ClickHouse writes, crowd selection methods, online service guarantees, performance metrics, and future plans—showcasing a data‑driven solution that boosts stability, speed, and business coverage.

ClickHouseData EngineeringDistributed Computing
0 likes · 24 min read
How Bilibili Scaled Its Tag System for Massive Data and Real‑Time Accuracy
Past Memory Big Data
Past Memory Big Data
Dec 24, 2024 · Big Data

Magnet: A Push‑Based Shuffle Service that Scales to Petabyte‑Level Data Processing

LinkedIn’s massive Spark workloads suffer from shuffle bottlenecks caused by tiny shuffle blocks, unreliable RPC connections, and data skew, so the authors design Magnet—a push‑merge shuffle service that merges blocks into large chunks, improves disk I/O, tolerates failures, and cuts end‑to‑end job time by nearly 30% regardless of hardware.

Disk I/O optimizationLarge‑scale data processingPush‑based service
0 likes · 56 min read
Magnet: A Push‑Based Shuffle Service that Scales to Petabyte‑Level Data Processing
dbaplus Community
dbaplus Community
Dec 14, 2024 · Databases

Why a Database‑First Operating System Could Replace Linux and Kubernetes

The article examines the DBOS concept—a database‑oriented operating system that places a distributed, transactional database at the core of OS services, tracing its roots from early database pioneers to modern cloud workloads and highlighting its potential advantages over traditional Linux‑Kubernetes stacks.

Cloud ComputingDBOSSpark
0 likes · 10 min read
Why a Database‑First Operating System Could Replace Linux and Kubernetes
Qunar Tech Salon
Qunar Tech Salon
Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big DataHiveMapReduce
0 likes · 23 min read
Understanding and Solving Small File Problems in Hive and Spark
Tongcheng Travel Technology Center
Tongcheng Travel Technology Center
Nov 27, 2024 · Big Data

Highlights of Tongcheng Travel’s 8th Big Data Technology Salon

The 8th Tongcheng Travel Big Data Technology Salon in Suzhou featured four expert talks covering Tencent Cloud’s Meson Spark engine, near‑line computing for travel itineraries, a Flink‑based real‑time risk control system, and Apache Paimon’s latest lake‑warehouse innovations, followed by a data‑driven business perspective session.

Apache PaimonBig DataData Lake
0 likes · 7 min read
Highlights of Tongcheng Travel’s 8th Big Data Technology Salon
Bilibili Tech
Bilibili Tech
Nov 12, 2024 · Big Data

Scalable Tag System Architecture and Optimization

The rebuilt tag system introduces a three‑layer architecture, standard pipelines, Iceberg‑backed storage and custom ClickHouse sharding, a DSL for crowd selection, and a stateless online service, achieving 99.9% success, sub‑5 ms latency, and supporting thousands of tags across dozens of business scenarios while planning real‑time processing and automated lifecycle management.

ClickHouseIcebergOnline Service
0 likes · 23 min read
Scalable Tag System Architecture and Optimization
Past Memory Big Data
Past Memory Big Data
Nov 8, 2024 · Big Data

How Spark on Kubernetes Transformed Duodian DMALL’s Big Data Platform

The article details Duodian DMALL’s migration from a traditional Hadoop stack to a cloud‑native Spark‑on‑Kubernetes architecture, explaining the motivations, design choices, component selections, operational challenges, and lessons learned through concrete examples and performance observations.

Apache CelebornBig DataCloud Native
0 likes · 21 min read
How Spark on Kubernetes Transformed Duodian DMALL’s Big Data Platform
Bilibili Tech
Bilibili Tech
Nov 1, 2024 · Big Data

Magnus: Intelligent Data Optimization Service for Iceberg Tables in Bilibili's Lakehouse Platform

Magnus is Bilibili’s self‑developed intelligent service that continuously optimizes Iceberg tables by scheduling snapshot expiration, orphan‑file cleanup, manifest rewriting, and multi‑dimensional data optimizations—including small‑file merging, sorting, distribution, and index creation—while automatically recommending configurations from real‑time query logs, delivering over 99.9% task success and up to 30% scan‑data reduction.

Data LakeIcebergIntelligent Recommendation
0 likes · 15 min read
Magnus: Intelligent Data Optimization Service for Iceberg Tables in Bilibili's Lakehouse Platform
Open Source Tech Hub
Open Source Tech Hub
Oct 31, 2024 · Big Data

How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark

Bilibili transformed its search indexing pipeline by replacing a manual, low‑throughput process with a distributed KV store (Taishan) and Spark‑based construction, achieving unified data ingestion, reduced resource consumption, faster full‑ and incremental builds, and a shift from daily to hourly indexing cycles.

Big DataDistributed storageIndexing
0 likes · 25 min read
How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark
DataFunSummit
DataFunSummit
Oct 24, 2024 · Big Data

Bilibili’s Large Language Model‑Based Intelligent Assistant for the Big Data Platform: Architecture, Principles, and Deployment

This article details Bilibili’s implementation of a large‑language‑model‑driven intelligent assistant for its massive big‑data platform, covering background, problem analysis, architectural design, knowledge‑base construction, precision and recall challenges, deployment across offline and real‑time Spark/Flink diagnostics, and future outlooks.

AgentBig DataFlink
0 likes · 23 min read
Bilibili’s Large Language Model‑Based Intelligent Assistant for the Big Data Platform: Architecture, Principles, and Deployment
Java Architecture Stack
Java Architecture Stack
Oct 18, 2024 · Big Data

How to Fix Spark OOM Errors: Practical Memory & Performance Tuning

This guide analyzes common Spark Out‑Of‑Memory scenarios—such as massive data volumes, data skew, and improper resource allocation—and provides step‑by‑step configurations, memory‑management tweaks, partitioning strategies, and shuffle optimizations to prevent OOM failures in production.

Big DataOOMPerformance Optimization
0 likes · 8 min read
How to Fix Spark OOM Errors: Practical Memory & Performance Tuning
DataFunSummit
DataFunSummit
Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache HudiBig DataSpark
0 likes · 10 min read
Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)
Architect
Architect
Sep 24, 2024 · Industry Insights

How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround

This article details Bilibili's transformation of its search offline indexing architecture—from a manual, low‑throughput MySQL‑centric process to a distributed, KV‑based, protobuf‑driven pipeline that leverages Taishan storage and Spark, cutting build cycles from days to hours while solving performance, consistency, and maintenance challenges.

Big DataIndexingSearch
0 likes · 24 min read
How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround
Kuaishou Tech
Kuaishou Tech
Sep 13, 2024 · Big Data

Blaze: Kuaishou’s Rust‑Based Vectorized Execution Engine for Spark SQL

Blaze is a Rust‑implemented, DataFusion‑based vectorized execution engine created by Kuaishou to accelerate Spark SQL queries, delivering up to 60% faster computation, 30% average compute‑power gains in production, and extensive architectural innovations such as native engine, protobuf protocol, JNI bridge, and Spark extension, while being open‑source and compatible with Spark 3.0‑3.5.

Big DataDataFusionSpark
0 likes · 11 min read
Blaze: Kuaishou’s Rust‑Based Vectorized Execution Engine for Spark SQL
dbaplus Community
dbaplus Community
Sep 4, 2024 · Big Data

How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn

This article details how Ctrip’s data platform evolved from a single‑IDC design to a multi‑IDC, tiered storage and scheduling architecture, covering the challenges of rapid data growth, the migration to Spark 3 via Kyuubi, the introduction of Celeborn shuffle service, and the resulting performance and reliability gains.

Big DataDistributed storageHDFS
0 likes · 23 min read
How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn
DataFunSummit
DataFunSummit
Aug 17, 2024 · Big Data

AnalyticDB Spark Architecture and Vectorized Engine Performance Overview

This article introduces the AnalyticDB Spark architecture, explains the need for Spark vectorization, surveys industry vectorized solutions, details ADB Spark's own vectorized implementation with Gluten and Velox, and presents performance test results showing a 6.98‑fold speedup over open‑source Spark.

AnalyticDBBig DataGluten
0 likes · 9 min read
AnalyticDB Spark Architecture and Vectorized Engine Performance Overview
Bilibili Tech
Bilibili Tech
Aug 13, 2024 · Big Data

How Bilibili Re‑engineered Its Search Indexing with Distributed Storage and Spark

This article details Bilibili's transformation of its search offline indexing pipeline, moving from manual MySQL‑based processes to a high‑capacity, distributed KV store and Spark‑driven builds, addressing performance, maintenance, and scalability challenges while improving resource efficiency and iteration speed.

Big DataBilibiliDistributed storage
0 likes · 24 min read
How Bilibili Re‑engineered Its Search Indexing with Distributed Storage and Spark
DataFunTalk
DataFunTalk
Jul 23, 2024 · Big Data

Practical Experience with Apache Kyuubi and Apache Celeborn in Big Data Platforms

This article shares detailed practical experiences from DingXiangYuan's big‑data platform on using Apache Kyuubi and Apache Celeborn, covering architecture, flexible configuration, AuthZ fine‑grained permissions, small‑file and Z‑Order optimizations, Arrow‑based large result transmission, and operational tips such as connection‑level issues and Netty cache handling.

Apache CelebornApache KyuubiArrow
0 likes · 17 min read
Practical Experience with Apache Kyuubi and Apache Celeborn in Big Data Platforms
Mike Chen's Internet Architecture
Mike Chen's Internet Architecture
Jul 15, 2024 · Big Data

Master Distributed Computing: Hadoop, Spark, and Flink Explained

This article introduces the fundamentals of distributed computing, compares major frameworks such as Hadoop, Spark, and Flink, and outlines their key components, performance characteristics, and typical application scenarios including big‑data analytics, cloud services, real‑time streaming, and scientific computing.

Big DataDistributed ComputingFlink
0 likes · 7 min read
Master Distributed Computing: Hadoop, Spark, and Flink Explained
DataFunSummit
DataFunSummit
Jul 13, 2024 · Big Data

Blaze: A Native Vectorized Execution Engine for Spark – Architecture, Production Optimizations, and Future Plans

Blaze is Kuaishou's self‑developed native execution engine that leverages Rust, DataFusion, and SIMD vectorization to accelerate Spark workloads, offering a 30%+ compute boost, detailed architectural components, deep production‑grade optimizations, and a roadmap for broader adoption.

Big DataDataFusionPerformance Optimization
0 likes · 13 min read
Blaze: A Native Vectorized Execution Engine for Spark – Architecture, Production Optimizations, and Future Plans
360 Smart Cloud
360 Smart Cloud
Jul 9, 2024 · Big Data

Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)

This article examines the critical role of shuffle in big‑data processing, compares Spark's native shuffle with the External Shuffle Service (ESS) and Remote Shuffle Service (RSS) solutions, introduces Uniffle's architecture and configuration, and shares practical deployment experiences and performance results within a 360 internal environment.

Big DataExternal Shuffle ServiceRemote Shuffle Service
0 likes · 15 min read
Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)
Baidu Geek Talk
Baidu Geek Talk
Jul 8, 2024 · Big Data

Evolution of Feed Data Warehouse Wide-Table Modeling at Baidu App

Baidu’s Mobile Ecology team transformed its Feed data warehouse through three progressive stages—hour‑level core tables, a real‑time wide table, and a unified day‑level multi‑version table—consolidating traffic, content, and user data into a single partitioned wide‑table architecture that resolves granularity inconsistencies, cuts processing cost, and delivers real‑time to daily latency for diverse analytics.

Real-timeSparkWide Table
0 likes · 10 min read
Evolution of Feed Data Warehouse Wide-Table Modeling at Baidu App
DataFunTalk
DataFunTalk
Jun 28, 2024 · Big Data

Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation

This article presents a comprehensive technical overview of using ClickHouse as a native backend to accelerate Spark SQL execution, covering Spark performance bottlenecks, ClickHouse's CPU‑level optimizations, the design and implementation of the Spark‑Native integration, and detailed TPC‑DS benchmark results demonstrating up to 3.5× speedup.

Big DataClickHousePerformance Optimization
0 likes · 33 min read
Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation
Baidu Geek Talk
Baidu Geek Talk
Jun 24, 2024 · Big Data

Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation

The paper presents a Spark acceleration framework that replaces Java‑based task operators with a ClickHouse native library, converting plans via Protobuf and JNI, leveraging columnar storage, SIMD and JIT to achieve up to 3× speed‑up on TPC‑DS workloads while providing fallback mechanisms to ensure no performance loss.

Big DataClickHouseNative Acceleration
0 likes · 31 min read
Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Jun 24, 2024 · Big Data

Boost Spark Performance with ClickHouse: Native Acceleration Techniques

This article presents a detailed technical overview of accelerating Spark's compute engine using ClickHouse as a native backend, covering Spark performance background, ClickHouse's advantages, the design and implementation of a Spark‑Native acceleration solution, and extensive performance evaluation results.

ClickHouseNative AccelerationPerformance Optimization
0 likes · 34 min read
Boost Spark Performance with ClickHouse: Native Acceleration Techniques
DataFunTalk
DataFunTalk
Jun 22, 2024 · Big Data

Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits

This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Celeborn, covering background problems, evaluation of open‑source implementations, deployment architecture, encountered issues, solutions, performance gains, and future plans.

Big DataRSSShuffle
0 likes · 19 min read
Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits
Past Memory Big Data
Past Memory Big Data
Jun 20, 2024 · Big Data

How Meituan Scaled Spark with Vectorized Execution Using Gluten + Velox

This article details Meituan's production‑grade adoption of Spark vectorized execution via the open‑source Gluten and Velox stack, explaining SIMD fundamentals, performance motivations, the end‑to‑end integration workflow, staged rollout, encountered challenges, and the resulting resource savings and speedups.

Big DataGlutenORC
0 likes · 33 min read
How Meituan Scaled Spark with Vectorized Execution Using Gluten + Velox