Tagged articles

Spark

623 articles · Page 1 of 7

Jul 2, 2026 · Fundamentals

Production Hit by Silent Data Corruption: JDK 25 G1GC Bug Explained

A rare silent data‑corruption bug in JDK 25’s G1GC caused Parquet and ORC files written by Spark and Flink to become unreadable, prompting a multi‑stage investigation that traced the issue to an Optional Evacuation flaw affecting JNI‑pinned objects; the flaw was later fixed by the OpenJDK community and the fix back‑ported.

AI debugging · Flink · G1GC

0 likes · 20 min read

DataFunTalk

Jun 30, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

Xiaohongshu, with over 350 million monthly users and daily logs in the trillions, migrated 500 PB of data to Alibaba Cloud and iterated its data platform through four architecture generations (ClickHouse‑based ad‑hoc, Lambda, Lakehouse, and a unified incremental compute model), cutting resource, development, and storage costs to one‑third while delivering sub‑10‑second query latency at petabyte scale.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

DataFunTalk

Jun 24, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu, with over 350 million monthly users and daily logs in the billions, migrated its data platform from AWS to Alibaba Cloud and iterated four times—from a ClickHouse‑based ad‑hoc layer to a Lambda architecture and finally a Lakehouse with incremental compute—cutting architecture complexity, resource cost and development effort each to about one‑third while delivering second‑level analytics on trillion‑scale data.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

DataFunTalk

Jun 21, 2026 · Big Data

How Zhihu Optimized Spark Jobs with Gluten: A Practical Deep‑Dive

This article details Zhihu's end‑to‑end experience of migrating Spark SQL workloads to the open‑source Gluten framework, covering background performance benchmarks, the architecture of Gluten and Velox, consistency and performance challenges encountered during migration, the concrete fixes applied, and the resulting resource savings and future plans.

Big Data · Gluten · Optimization

0 likes · 22 min read

DataFunTalk

Jun 20, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's step‑by‑step migration from a simple ClickHouse‑based analytics stack to a Lambda‑style 2.0 architecture and finally to a Lakehouse‑based 3.0 design, highlighting concrete performance numbers, cost reductions, and the definition of a generic incremental‑compute model (SPOT) that underpins the evolution.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

DataFunTalk

May 28, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis to a Lambda‑style architecture and finally to a lakehouse with generic incremental compute, cutting architecture complexity, resource and development costs by one‑third while delivering second‑level queries over trillions of rows.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read

Big Data Tech Team

May 24, 2026 · Big Data

Data Warehouse Interview Pitfall Guide 2.0: Avoid Common SQL, Modeling, and ETL Mistakes

This guide compiles the most frequent interview pitfalls for data warehouse roles, covering SQL join and aggregation errors, window function misuse, subquery versus CTE performance myths, dimensional modeling mistakes, SCD implementation traps, layered design issues, data quality handling, ETL traps, Hive and Spark performance questions, real‑time warehousing considerations, and effective interview strategies.

Big Data · ETL · Hive

0 likes · 3 min read

DataFunSummit

May 22, 2026 · Big Data

How OPPO Accelerates Multimodal Data & AI Fusion with Gravitino and Curvine

OPPO tackles explosive multimodal data growth by unifying metadata with Gravitino and boosting I/O performance using the open‑source Curvine cache, delivering a four‑layer data‑lake architecture that resolves data islands, metadata chaos, and bandwidth bottlenecks while achieving near‑commercial query speeds.

Curvine · Gravitino · LanceDB

0 likes · 11 min read

DataFunTalk

May 22, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Complexity and Cost by One‑Third in the Big AI Data Era

The article details Xiaohongshu's evolution from a simple ClickHouse‑based analytics layer to a Lambda‑enabled 2.0 stack and finally a Lakehouse‑based 3.0 architecture, showing how each iteration reduced infrastructure complexity, resource consumption and development effort by roughly one‑third while supporting trillions of daily events and AI‑driven use cases.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read

DataFunTalk

May 11, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis to a Lambda‑style architecture and finally to a lakehouse built on Iceberg, StarRocks, Flink and Spark, cutting architecture complexity, resource and development costs by two‑thirds while supporting trillions of daily events with sub‑second query latency.

Big Data · ClickHouse · Flink

0 likes · 22 min read

DataFunTalk

May 6, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's four‑stage data‑platform evolution, from a simple ClickHouse ad‑hoc setup through a Lambda‑based 2.0 design to a lakehouse‑driven 3.0 architecture, highlighting the adoption of general incremental compute, a cost reduction to one‑third, performance gains of up to ten‑fold, and the SPOT standards that guide the new system.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read

DataFunTalk

Apr 29, 2026 · Big Data

How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack to a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort by roughly one‑third while supporting petabyte‑scale, sub‑second queries across its 350 million‑user app.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

Lao Guo's Learning Space

Apr 29, 2026 · Big Data

Designing a Full-Stack Credit Data System: From Ingestion to Real-Time Decision

The article dissects a credit data system architecture, detailing six logical layers—from multi-source data collection and feature engineering (including graph features and feature stores) to model training, real‑time stream processing, decision engine integration, and privacy‑preserving computation—while explaining the trade‑offs, tools, and performance targets needed for accurate, low‑latency risk assessment.

Credit Scoring · Feature Store · Flink

0 likes · 16 min read

Big Data Tech Team

Apr 8, 2026 · Interview Experience

Master Spark Tuning for Data Warehouse Interviews: Real Cases & Tips

Learn how to demonstrate real Spark optimization skills in data‑warehouse interviews by exploring two detailed case studies—small‑file merging in ODS‑to‑DWD ETL and shuffle‑skew mitigation in DWS aggregation—plus key interview questions and practical troubleshooting steps that separate theory from hands‑on expertise.

Data Warehouse · Interview Tips · Performance Tuning

0 likes · 9 min read

Ctrip Technology

Apr 2, 2026 · Big Data

Why Upgrading to JDK 25 Broke Spark & Flink Data – Inside the G1GC Bug and Its Fix

During a gray release of JDK 25 on Ctrip's massive Spark and Flink clusters, silent data corruption appeared in Parquet and ORC files. It was traced to a G1GC Optional Evacuation bug that moved JNI‑pinned objects; the fix was later back‑ported and shipped in JDK 25.0.3.

Flink · G1GC · JDK

0 likes · 21 min read

Baidu Geek Talk

Mar 23, 2026 · Databases

How Baidu’s MEG Platform Revamped ClickHouse with a Lakehouse Architecture

This article analyzes the challenges of scaling ClickHouse within Baidu’s MEG data platform and details a lake‑house solution that decouples storage and compute, integrates a meta‑service for transparent data access, optimizes query performance through caching, data roll‑up and layout tuning, and introduces a unified query gateway that gracefully falls back to Spark for complex workloads.

ClickHouse · Data Platform · Lakehouse

0 likes · 25 min read

dbaplus Community

Mar 12, 2026 · Databases

How to Migrate 100 Billion ClickHouse Rows to Doris: Three Practical Approaches

This article walks through three concrete methods for moving massive ClickHouse datasets—up to 100 billion rows—to Doris, detailing catalog integration, file export with stream load, and Spark‑based pipelines, while sharing real‑world performance results and pitfalls.

Apache Doris · ClickHouse · Data Migration

0 likes · 8 min read

DeWu Technology

Mar 2, 2026 · Big Data

Mastering Spark UI: Deep Dive into Metrics, Tuning, and Real‑World Cases

This article provides a comprehensive guide to Spark UI, explaining each primary and secondary tab, the key metrics they expose, and how to interpret them for performance bottleneck detection, followed by two detailed case studies and practical tuning recommendations for Spark workloads.

Big Data · Case Study · Metrics

0 likes · 19 min read

Architect-Kip

Mar 2, 2026 · Big Data

How to Build a Scalable Tiered Archive & Query System for MySQL Data

This article presents a comprehensive design for a layered storage and unified scheduling platform that archives MySQL historical data, reduces storage costs, ensures high‑performance queries, and enables efficient data analysis through tiered hot, warm, and cold storage using big‑data technologies.

Data Archiving · Doris · Flink

0 likes · 13 min read

Amazon Cloud Developers

Feb 13, 2026 · Big Data

How EMR Serverless Storage Cuts Costs up to 55% for Shuffle‑Heavy Spark Jobs

A performance comparison of Amazon EMR Serverless Storage on a 3 TB TPC‑DS benchmark shows up to 55% cost reduction and 25% faster runtimes for shuffle‑intensive Spark jobs, while outlining usage limits and providing Python tools to analyze shuffle data from Spark event logs.

Cost Savings · EMR Serverless · Shuffle Storage

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Feb 2, 2026 · Big Data

Real‑Time Analytics with Alibaba Cloud Serverless Spark & Paimon for Taobao Flash Sale

This article details how Alibaba Cloud EMR Serverless Spark combined with the Paimon lakehouse framework enables Taobao Flash Sale’s retail data team to achieve low‑latency, high‑throughput real‑time analytics, batch processing, and feature generation, outlining architecture evolution, performance gains, and practical Spark tuning techniques.

Big Data · Lakehouse · Paimon

0 likes · 18 min read

Big Data Technology Tribe

Jan 20, 2026 · Big Data

Extending Spark SQL with LanceSparkSessionExtensions: A Complete Guide

This article explains how to inject the LanceSpark plugin into Spark, covering the core LanceSparkSessionExtensions class, various ways to register extensions, the custom parser and planner strategy implementations, and the underlying Spark mechanisms such as injectParser, injectPlannerStrategy, and PredicateHelper.

DataSourceV2 · LanceSpark · PlannerStrategy

0 likes · 14 min read

ITPUB

Jan 15, 2026 · Databases

How to Migrate ClickHouse Data to Doris: Three Practical Strategies Tested

Facing a ClickHouse cluster shutdown, the author explores three migration methods—using Doris’s ClickHouse catalog, exporting to files with Broker/Stream Load, and Spark—to transfer ~10 billion rows to Doris, evaluating each for simplicity, bugs, and performance, and sharing detailed steps, code snippets, and benchmark results.

ClickHouse · Data Migration · Doris

0 likes · 9 min read

Big Data Tech Team

Jan 5, 2026 · Big Data

Demystifying GitHub AI: Models, Agents, Spaces, Spark, and More

This article explains GitHub's AI ecosystem—Models, Copilot, Agents, Spaces, Spark, Instructions, Skills, and the Model Context Protocol—clarifying each component, their relationships, and practical steps for developers to integrate them into their workflow.

Agents · Copilot · GitHub AI

0 likes · 12 min read

vivo Internet Technology

Dec 10, 2025 · Big Data

Vivo’s 800‑Day Journey Optimizing Celeborn Remote Shuffle Service at PB Scale

This technical report details how Vivo’s big‑data platform adopted Celeborn as its remote shuffle service, evaluated alternatives, tuned hardware and software configurations, implemented performance and stability enhancements, and outlines future operational and community‑driven improvements for handling petabyte‑scale shuffle workloads.

Big Data · Remote Shuffle Service · Spark

0 likes · 20 min read

Alibaba Cloud Big Data AI Platform

Nov 15, 2025 · Big Data

From a Decade-Long Big Data Journey to a Cloud‑Native Lakehouse

This article chronicles a ten‑year evolution of a self‑built big data platform—detailing early Hadoop clusters, successive migrations to Spark, Hive, Hudi, and StarRocks, the operational challenges encountered, and the comprehensive shift to Alibaba Cloud EMR Serverless that delivered significant cost, performance, and stability gains while outlining future intelligent‑ecosystem plans.

Big Data · Data Lake · EMR Serverless

0 likes · 17 min read

Instant Consumer Technology Team

Nov 10, 2025 · Big Data

Fixing Multi‑Version, Multi‑Cluster and HA with Apache Kyuubi for Spark/Flink

Apache Kyuubi, an enterprise‑grade multi‑tenant data gateway, replaces Livy and Flink SQL Gateway to support multiple engine versions, cross‑cluster elastic scheduling, high‑availability batch jobs, and traffic control, dramatically reducing deployment complexity, improving resource utilization, and accelerating release cycles for large‑scale Spark and Flink workloads.

Apache Kyuubi · Big Data · Data Gateway

0 likes · 18 min read

Alibaba Cloud Big Data AI Platform

Oct 18, 2025 · Big Data

Alibaba Cloud EMR’s AI Evolution: Accelerating Big Data Performance

Since its 2016 launch, Alibaba Cloud EMR has transformed from a basic open‑source Hadoop service into a high‑performance, AI‑enabled big‑data platform, delivering optimized I/O, vectorized processing, and integrated AI functions such as natural‑language SQL, StarRocks and Spark enhancements, while supporting diverse industry workloads.

Cloud Computing · EMR · Spark

0 likes · 9 min read

Big Data Technology & Architecture

Sep 24, 2025 · Big Data

Avoid These 6 Common Paimon Data Loss Pitfalls in Flink and Spark

Learn the six typical scenarios that cause data loss when writing to Paimon—ranging from checkpoint failures and misconfigured partial‑update mode to incorrect sequence fields, snapshot retention issues, concurrent bucket writes, and outdated Spark versions—and how to prevent each problem.

Big Data · Checkpoint · Data loss

0 likes · 5 min read

DataFunSummit

Sep 21, 2025 · Big Data

Breaking the CPU Wall: BIGO’s Gluten Engine Accelerates Spark and Flink

When big‑data workloads hit the CPU wall, BIGO’s adoption of the open‑source Gluten project delivers native‑engine execution for Spark and a roadmap for Flink, achieving up to 30% end‑to‑end speedup, 50% memory savings, and a scalable, cost‑effective data processing platform.

Big Data · Flink · Gluten

0 likes · 16 min read

Architect's Must-Have

Sep 15, 2025 · Big Data

Mastering Spark Streaming Rate Control: A Deep Dive into Backpressure

This article explains Spark Streaming's rate control mechanisms, covering static limits, the dynamic back‑pressure feature introduced in Spark 1.5, the PID‑based estimator, RPC communication, and how Guava's token‑bucket RateLimiter enforces the calculated thresholds to ensure stability and optimal throughput.

RateControl · Spark · Streaming

0 likes · 13 min read

Big Data Tech Team

Aug 25, 2025 · Interview Experience

Essential Big Data Interview Questions for Data Warehouse Engineer Roles

A comprehensive list of interview topics covering self‑introduction, career moves, data‑warehouse design, team building, architecture comparisons, fact‑table classification, common dimensions, performance tuning, and data‑governance for aspiring big‑data engineers.

Big Data · Data Governance · Flink

0 likes · 4 min read

Big Data Tech Team

Aug 24, 2025 · Big Data

How to De‑Duplicate 1 Billion QQ Numbers Using Under 1 GB of Memory

This article explores multiple techniques—including bitmap indexing, Bloom filters, external sorting, Spark, and layered bitmap structures—to efficiently remove duplicate QQ numbers from a dataset of up to one billion entries while keeping memory usage below a gigabyte and maintaining high accuracy.

Deduplication · Spark · bitmap

0 likes · 12 min read

Architect

Jul 7, 2025 · Big Data

How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×

This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.

Data Warehouse · ETL · Parquet

0 likes · 21 min read

Big Data Technology & Architecture

Jul 4, 2025 · Big Data

Spark 4.0: New Features, Performance Gains, and Why It Still Leads Big Data

Despite the hype around Flink and AI models, Spark 4.0’s release brings a lightweight Python client, Spark Connect GA, enhanced SQL optimization, vectorized execution, and AI integration, reaffirming its leading position in the big‑data ecosystem while hinting at future challenges and innovations.

Big Data · Data Engineering · Performance Optimization

0 likes · 6 min read

Alibaba Cloud Big Data AI Platform

Jun 10, 2025 · Big Data

Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Big Data · Cloud Native · Data Lake

0 likes · 12 min read

Big Data Technology & Architecture

May 15, 2025 · Big Data

Interview Review: Spark Stage Logic, Data Warehouse Evaluation, and Flink Late‑Data Handling

This article reviews common interview questions for data development roles, covering Spark stage partitioning and optimization, criteria for evaluating data warehouses, Flink's handling of late data, and provides practical answers and resources to help candidates deliver standout responses.

Big Data · Data Quality · Data Warehouse

0 likes · 11 min read

StarRocks

May 8, 2025 · Backend Development

How Grab Supercharged Spark Observability 10× with StarRocks – Inside the Iris Architecture

Grab replaced its fragmented Grafana‑Superset stack with a StarRocks‑backed Iris platform, achieving over ten‑fold query speedups, 40% lower resource usage, and a unified real‑time and historical data store for Spark observability across its Southeast Asian super‑app ecosystem.

Data Platform · Materialized Views · Observability

0 likes · 16 min read

Big Data Technology & Architecture

Apr 28, 2025 · Big Data

Interview Insights on Spark Optimization, Flink Exactly-Once Semantics, and Paimon Asynchronous Merging

This article shares three high‑quality interview questions from a JD big‑data interview, covering practical Spark tuning, Flink's exactly‑once guarantees in production, and Paimon's asynchronous merge mechanism, and explains how to answer them with real‑world scenarios.

Big Data · Flink · Paimon

0 likes · 6 min read

Big Data Tech Team

Apr 17, 2025 · Big Data

Essential Spark Interview Q&A: Master Data Warehouse Engineer Questions

This article compiles a comprehensive set of Spark interview questions frequently asked by leading tech companies, providing detailed explanations of Spark’s performance mechanisms, architecture, RDD persistence, checkpointing, streaming, dependency types, HA setup, and practical coding examples to help data warehouse engineers prepare effectively.

Data Warehouse · RDD · Spark

0 likes · 21 min read

vivo Internet Technology

Apr 16, 2025 · Big Data

Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management

The article explains how the vivo Internet Big Data team containerized offline Spark jobs and deployed them with the Spark Operator on a mixed online‑offline Kubernetes cluster, using elastic scheduling and resource‑over‑subscription to boost CPU utilization by 30‑40% and handle over 100,000 daily tasks.

Big Data · Resource Management · Spark

0 likes · 36 min read

DataFunSummit

Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and future roadmap for data‑lake technologies.

Apache Hudi · Big Data · Data Lake

0 likes · 13 min read

iQIYI Technical Product Team

Mar 27, 2025 · Big Data

Cost‑Effective Real‑Time Data Warehouse 2.0: Migrating from Kafka to Iceberg

iQIYI transformed its real‑time data warehouse by replacing a costly Kafka‑based Lambda stack with a unified stream‑batch Iceberg lake, cutting storage expenses by 90%, halving compute costs, extending data retention, and delivering minute‑level freshness for 90% of use cases while preserving second‑level processing where needed.

Flink · Iceberg · Real-Time Data Warehouse

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Mar 20, 2025 · Big Data

How to Read and Write StarRocks Data with EMR Serverless Spark

This step‑by‑step guide explains how to use EMR Serverless Spark together with the StarRocks Spark Connector to create a workspace, upload the connector JAR, configure network connections, create databases and tables in StarRocks, and perform read/write operations via SQL sessions, Notebook sessions, or batch Spark jobs, complete with code examples and UI screenshots.

Big Data · Data Integration · EMR Serverless

0 likes · 14 min read

Big Data Technology & Architecture

Mar 3, 2025 · Big Data

The Turning Point for Data Development: From Traditional Data Engineering to AI Data Engineering

The article analyzes how the rapid rise of open‑source large‑model AI in 2025 is reshaping the data development profession, urging developers to transition from specialized data‑engineer roles to full‑stack AI data engineering skills such as distributed computing, lake‑house architectures, and model tuning.

AI · Big Data · Data Engineering

0 likes · 7 min read

DataFunSummit

Feb 22, 2025 · Big Data

Blaze Engine: A Rust‑Based Native Vectorized Execution Engine for Spark SQL

The article introduces Blaze, Kuaishou's Rust‑powered native execution engine that vectorizes Spark SQL workloads, explains its architecture and operation, presents benchmark results showing up to 50% latency reduction, and details internal deployments, industry case studies, community collaborations, and the 2025 roadmap.

Big Data · Performance Optimization · Spark

0 likes · 12 min read

Su San Talks Tech

Feb 21, 2025 · Databases

How to Migrate 1 Billion Records Efficiently: Strategies, Code, and Pitfalls

This article shares a step‑by‑step guide for migrating billions of rows safely and quickly, covering divide‑and‑conquer batching, dual‑write architectures, tool selection, shadow testing, and rollback plans, with concrete Java and Spark code examples and practical pitfalls to avoid.

Big Data · Data Migration · Databases

0 likes · 10 min read

DataFunTalk

Feb 20, 2025 · Big Data

From Integrated Storage‑Compute to Decoupled Architecture: Practical Exploration of Kubernetes, Kyuubi, Celeborn, Blaze, and Hue in Big Data Platforms

This article analyzes the transition from a tightly coupled storage‑compute architecture to a decoupled model, detailing how Kubernetes, Kyuubi, Celeborn, Blaze, and Hue together solve resource inefficiencies, improve scalability, and boost query performance in modern big‑data environments.

Big Data · Blaze · Kyuubi

0 likes · 16 min read

21CTO

Feb 4, 2025 · Big Data

Why Python Beats Java and Scala for Modern Data Engineering

The article compares Java, Scala, SQL, and Python for data‑engineering tasks, arguing that Python’s versatility, rich ecosystem, and ease of use make it the preferred language for both small‑scale and massive Spark workloads despite its performance trade‑offs.

Big Data · Data Engineering · SQL

0 likes · 7 min read

DataFunSummit

Feb 1, 2025 · Big Data

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.

Big Data · Codegen · EMR Serverless

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Jan 26, 2025 · Big Data

How a FinTech Scaled Its Data Platform with Alibaba Cloud EMR Serverless Spark

Weifin, a fintech innovator, tackled massive data‑scale challenges by adopting Alibaba Cloud EMR Serverless Spark, building a unified Spark‑based platform that supports data collection, lake ingestion, distributed machine‑learning training, and intelligent risk‑control applications, while achieving performance gains, cost reduction, and scalable automation.

FinTech · Spark · machine learning

0 likes · 10 min read

Airbnb Technology Team

Jan 24, 2025 · Artificial Intelligence

Chronon — An Open-Source Framework for Production-Level Feature Engineering in Machine Learning

Chronon is an open‑source framework that centralizes feature definitions to guarantee training‑inference consistency, eliminates complex ETL pipelines, and supports real‑time and batch processing across diverse data sources, cutting feature‑development cycles from months to under a week, as demonstrated by Airbnb’s 40,000‑feature deployment.

Chronon · Hive · Real-time Data

0 likes · 10 min read

dbaplus Community

Jan 19, 2025 · Big Data

How to Write Elegant, High‑Performance SQL for Big Data Pipelines

This article shares practical techniques for writing clean, efficient SQL in large‑scale data environments, covering predicate pushdown, sub‑queries, deduplication strategies, bucket optimization, and automation with Python‑Spark integration to improve readability and execution speed.

Hive · Optimization · Spark

0 likes · 14 min read

DataFunSummit

Jan 16, 2025 · Big Data

Zhihu Big Data Cost‑Reduction Practices: FinOps, Erasure Coding, ZSTD Compression, Spark Auto‑Tuning, and Remote Shuffle Service

This article details Zhihu's comprehensive cost‑reduction and efficiency‑boosting initiatives for its big‑data platform, covering FinOps‑driven financial operations, hybrid‑cloud architecture, cost allocation models, operational monitoring, and technical optimizations such as erasure coding, ZSTD compression, Spark auto‑tuning, and a remote shuffle service.

Big Data · Cloud Cost Management · FinOps

0 likes · 22 min read

DataFunSummit

Jan 14, 2025 · Big Data

Tencent Real-Time Lakehouse Intelligent Optimization Practice

This presentation details Tencent's real‑time lakehouse architecture and the four key topics—lakehouse design, intelligent optimization services, scenario‑driven capabilities, and future outlook—covering components such as Spark, Flink, Iceberg, Auto‑Optimize Service, indexing, clustering, AutoEngine, and PyIceberg implementations.

Auto Optimize · Big Data · Flink

0 likes · 12 min read

DataFunSummit

Jan 3, 2025 · Big Data

Tencent Real‑Time Lakehouse Intelligent Optimization Practices

This article presents Tencent's end‑to‑end real‑time lakehouse architecture, detailing its three‑layer design, the Auto Optimize Service modules such as compaction, indexing, clustering and engine acceleration, as well as scenario‑driven capabilities like multi‑stream joins, primary‑key tables, in‑place migration and PyIceberg support, and concludes with future optimization directions.

Big Data · Flink · Iceberg

0 likes · 11 min read

Bilibili Tech

Jan 3, 2025 · Big Data

Evolution and Production Practices of Apache Celeborn Remote Shuffle Service at Bilibili

Bilibili replaced Spark’s unstable External Shuffle Service with a push‑based approach, then deployed Apache Celeborn’s remote shuffle on Kubernetes using HA masters, tiered workers, extensive monitoring, history‑based routing, chaos testing, and seamless Spark, Flink, and MapReduce integration, while planning self‑healing, elastic scaling, and priority‑aware I/O enhancements.

Apache Celeborn · Big Data · Flink

0 likes · 28 min read

Big Data Technology & Architecture

Jan 2, 2025 · Big Data

Apache Paimon: Core Capabilities, Table Types, LSM Tree, Buckets, Merge Engines, and Operational Details

This article provides a comprehensive overview of Apache Paimon, covering its real‑time lake ingestion, unified stream‑batch processing, table types (primary‑key and append‑only), LSM‑tree storage, bucket mechanisms, merge‑engine options, compaction strategies, concurrency control, consumption methods, tag management, data cleanup, and system tables for big‑data workloads.

Apache Paimon · Big Data · Flink

0 likes · 25 min read

Big Data Technology & Architecture

Dec 31, 2024 · Big Data

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

This article explains how Storage Partitioned Join (SPJ), introduced in Spark 3.3, avoids costly shuffle operations when joining partitioned DataSource V2 tables such as Apache Iceberg, detailing the required conditions, configuration settings, practical code examples, and various join scenarios including mismatched partitions and data skew.

Bucketing · Data Skew · SQL

0 likes · 15 min read
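
As a rough illustration of the "configuration settings" this summary mentions, the sketch below gathers the session properties usually associated with SPJ on Iceberg and renders them as `spark-submit` flags. The property names are recalled from the Spark and Iceberg documentation rather than taken from the article, and `SPJ_CONF` and `to_submit_flags` are hypothetical names; check every key against the Spark and Iceberg versions in use.

```python
# Assumed settings for Storage Partitioned Join; verify against your versions.
SPJ_CONF = {
    # Let Spark use the partitioning reported by a DataSource V2 table (Spark 3.3+).
    "spark.sql.sources.v2.bucketing.enabled": "true",
    # Ask Iceberg to keep data grouped by partition when planning scans.
    "spark.sql.iceberg.planning.preserve-data-grouping": "true",
    # Handle partition values that exist on only one side of the join (Spark 3.4+).
    "spark.sql.sources.v2.bucketing.pushPartValues.enabled": "true",
}


def to_submit_flags(conf):
    """Render a config dict as a flat list of spark-submit arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags


if __name__ == "__main__":
    print(" ".join(to_submit_flags(SPJ_CONF)))
```

Keeping the settings in one dictionary makes it easy to toggle them together when comparing execution plans with and without the shuffle.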

JD Tech

Dec 30, 2024 · Big Data

Techniques for Writing Elegant and Efficient SQL in Big Data Environments

The article shares practical methods and code examples for making SQL both readable and high‑performing in large‑scale data platforms, covering predicate push‑down with subqueries, deduplication strategies, bucket utilization, and Python‑driven job parameter handling.

Data Engineering · Hive · SQL

0 likes · 14 min read

DataFunSummit

Dec 27, 2024 · Big Data

Tencent Real-time Lakehouse Intelligent Optimization Practice

This presentation describes Tencent's real-time lakehouse architecture, including data lake compute, management, and storage layers, and details the intelligent optimization services—such as compaction, indexing, clustering, and auto-engine—designed to improve query performance, storage cost, and operational efficiency for large-scale data processing.

AutoEngine · Compaction · Flink

0 likes · 11 min read

Bilibili Tech

Dec 27, 2024 · Big Data

Consistency Architecture for Bilibili Recommendation Model Data Flow

The article outlines Bilibili’s revamped recommendation data‑flow architecture that eliminates timing and calculation inconsistencies by snapshotting online features, unifying feature computation in a single C++ library accessed via JNI, and orchestrating label‑join and sample extraction through near‑line Kafka/Flink pipelines, with further performance gains and Iceberg‑based future extensions.

Data Consistency · Flink · Iceberg

0 likes · 12 min read

Past Memory Big Data

Dec 27, 2024 · Big Data

How Uber Cuts Storage Costs with ZSTD Compression in Apache Parquet

Uber’s data lake on Hadoop stores hundreds of petabytes in Parquet files and, by adopting ZSTD compression, column pruning, and column reordering, achieves up to 79% storage reduction and significant vCore savings, with detailed benchmarks guiding optimal compression levels and open‑source contributions.

Apache Parquet · Big Data · Hadoop

0 likes · 14 min read

Past Memory Big Data

Dec 26, 2024 · Big Data

Eliminate Shuffle: Deep Dive into Spark’s Storage Partition Join (SPJ)

This article explains how Storage Partitioned Join (SPJ), available in Spark 3.3 and later, can avoid costly shuffle operations when joining Iceberg tables, outlines the required table properties and Spark configurations, demonstrates the effect with code examples and execution plans, and explores several realistic join scenarios.

Apache Iceberg · Big Data · SPJ

0 likes · 16 min read

dbaplus Community

Dec 24, 2024 · Big Data

How Bilibili Scaled Its Tag System for Massive Data and Real‑Time Accuracy

The article details Bilibili's comprehensive redesign of its tag system—including background challenges, architectural layers, technical upgrades like Iceberg integration and shard‑based ClickHouse writes, crowd selection methods, online service guarantees, performance metrics, and future plans—showcasing a data‑driven solution that boosts stability, speed, and business coverage.

ClickHouse · Data Engineering · Distributed Computing

0 likes · 24 min read

Past Memory Big Data

Dec 24, 2024 · Big Data

Magnet: A Push‑Based Shuffle Service that Scales to Petabyte‑Level Data Processing

LinkedIn’s massive Spark workloads suffer from shuffle bottlenecks caused by tiny shuffle blocks, unreliable RPC connections, and data skew, so the authors design Magnet—a push‑merge shuffle service that merges blocks into large chunks, improves disk I/O, tolerates failures, and cuts end‑to‑end job time by nearly 30% regardless of hardware.

Disk I/O optimization · Large‑scale data processing · Push‑based service

0 likes · 56 min read

dbaplus Community

Dec 14, 2024 · Databases

Why a Database‑First Operating System Could Replace Linux and Kubernetes

The article examines the DBOS concept—a database‑oriented operating system that places a distributed, transactional database at the core of OS services, tracing its roots from early database pioneers to modern cloud workloads and highlighting its potential advantages over traditional Linux‑Kubernetes stacks.

Cloud Computing · DBOS · Spark

0 likes · 10 min read

Qunar Tech Salon

Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big Data · Hive · MapReduce

0 likes · 23 min read

Tongcheng Travel Technology Center

Nov 27, 2024 · Big Data

Highlights of Tongcheng Travel’s 8th Big Data Technology Salon

The 8th Tongcheng Travel Big Data Technology Salon in Suzhou featured four expert talks covering Tencent Cloud’s Meson Spark engine, near‑line computing for travel itineraries, a Flink‑based real‑time risk control system, and Apache Paimon’s latest lake‑warehouse innovations, followed by a data‑driven business perspective session.

Apache Paimon · Big Data · Data Lake

0 likes · 7 min read

Bilibili Tech

Nov 12, 2024 · Big Data

Scalable Tag System Architecture and Optimization

The rebuilt tag system introduces a three‑layer architecture, standard pipelines, Iceberg‑backed storage and custom ClickHouse sharding, a DSL for crowd selection, and a stateless online service, achieving 99.9% success, sub‑5 ms latency, and supporting thousands of tags across dozens of business scenarios while planning real‑time processing and automated lifecycle management.

ClickHouse · Iceberg · Online Service

0 likes · 23 min read

Past Memory Big Data

Nov 8, 2024 · Big Data

How Spark on Kubernetes Transformed Duodian DMALL’s Big Data Platform

The article details Duodian DMALL’s migration from a traditional Hadoop stack to a cloud‑native Spark‑on‑Kubernetes architecture, explaining the motivations, design choices, component selections, operational challenges, and lessons learned through concrete examples and performance observations.

Apache Celeborn · Big Data · Cloud Native

0 likes · 21 min read

Bilibili Tech

Nov 1, 2024 · Big Data

Magnus: Intelligent Data Optimization Service for Iceberg Tables in Bilibili's Lakehouse Platform

Magnus is Bilibili’s self‑developed intelligent service that continuously optimizes Iceberg tables by scheduling snapshot expiration, orphan‑file cleanup, manifest rewriting, and multi‑dimensional data optimizations—including small‑file merging, sorting, distribution, and index creation—while automatically recommending configurations from real‑time query logs, delivering over 99.9% task success and up to 30% scan‑data reduction.

Data Lake · Iceberg · Intelligent Recommendation

0 likes · 15 min read

Open Source Tech Hub

Oct 31, 2024 · Big Data

How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark

Bilibili transformed its search indexing pipeline by replacing a manual, low‑throughput process with a distributed KV store (Taishan) and Spark‑based construction, achieving unified data ingestion, reduced resource consumption, faster full‑ and incremental builds, and a shift from daily to hourly indexing cycles.

Big Data · Distributed storage · Indexing

0 likes · 25 min read

Baobao Algorithm Notes

Oct 25, 2024 · Artificial Intelligence

How Simhash and Minhash Power LLM Data Deduplication: Theory and Spark Code

This article explains document‑level, paragraph‑level, and sentence‑level deduplication for large‑scale LLM pre‑training, introduces the Simhash and Minhash algorithms with step‑by‑step Python examples, and shows how to implement efficient LSH‑based deduplication using Spark.

LLM · Minhash · Python

0 likes · 29 min read
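
To make the Minhash idea in this summary concrete, here is a minimal pure‑Python sketch, not the article's own code. It assumes word‑level shingles and uses salted `hashlib.blake2b` hashes in place of random permutations; `shingles`, `minhash_signature`, and `estimate_jaccard` are hypothetical helper names.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salt per simulated permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = "spark makes large scale data deduplication practical for llm corpora"
doc_b = "spark makes large scale data deduplication practical for training corpora"
sim = estimate_jaccard(minhash_signature(shingles(doc_a)),
                       minhash_signature(shingles(doc_b)))
print(f"estimated Jaccard similarity: {sim:.2f}")
```

In an LSH‑based pipeline, signatures like these are typically split into bands so that only documents sharing at least one band are compared; that step is omitted here.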

DataFunSummit

Oct 24, 2024 · Big Data

Bilibili’s Large Language Model‑Based Intelligent Assistant for the Big Data Platform: Architecture, Principles, and Deployment

This article details Bilibili’s implementation of a large‑language‑model‑driven intelligent assistant for its massive big‑data platform, covering background, problem analysis, architectural design, knowledge‑base construction, precision and recall challenges, deployment across offline and real‑time Spark/Flink diagnostics, and future outlook.

Agent · Big Data · Flink

0 likes · 23 min read

Java Architecture Stack

Oct 18, 2024 · Big Data

How to Fix Spark OOM Errors: Practical Memory & Performance Tuning

This guide analyzes common Spark Out‑Of‑Memory scenarios—such as massive data volumes, data skew, and improper resource allocation—and provides step‑by‑step configurations, memory‑management tweaks, partitioning strategies, and shuffle optimizations to prevent OOM failures in production.

Big Data · OOM · Performance Optimization

0 likes · 8 min read

JD Retail Technology

Oct 11, 2024 · Big Data

JD Retail Data Lake Architecture: Challenges, Optimizations, and Future Plans

This article presents JD Retail's data lake architecture overhaul, detailing the shortcomings of the Lambda model, the migration to Flink‑Hudi‑Spark pipelines, performance gains, storage savings, unified APIs, and upcoming improvements for resilience and automation.

Big Data · Data Lake · Flink

0 likes · 11 min read

DataFunSummit

Oct 8, 2024 · Big Data

Understanding Spark SQL Analyzer: Principles, Optimization Cases, and Rule‑Pruning in Spark 3.2+

This article explains the Spark SQL analysis layer, its core principles, how analysis rules such as ResolveRelations work, and the major pruning optimization introduced in Spark 3.2 that reduces unnecessary rule traversal, illustrated with concrete code examples and Q&A.

Big Data · Optimization · Rule Engine

0 likes · 20 min read

DataFunSummit

Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache Hudi · Big Data · Spark

0 likes · 10 min read

Big Data Technology & Architecture

Sep 25, 2024 · Big Data

Learning Strategies and Interview Preparation Insights from a Big Data Student

The article shares practical study habits, detailed note‑taking, proactive questioning, effective communication, and a comprehensive set of interview questions covering Hive, Spark, Kafka, Flink, and other big‑data technologies, illustrated with real examples from a diligent student’s experience.

Hive · Learning Strategies · Spark

0 likes · 7 min read

Architect

Sep 24, 2024 · Industry Insights

How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround

This article details Bilibili's transformation of its search offline indexing architecture—from a manual, low‑throughput MySQL‑centric process to a distributed, KV‑based, protobuf‑driven pipeline that leverages Taishan storage and Spark, cutting build cycles from days to hours while solving performance, consistency, and maintenance challenges.

Big Data · Indexing · Search

0 likes · 24 min read

Kuaishou Tech

Sep 13, 2024 · Big Data

Blaze: Kuaishou’s Rust‑Based Vectorized Execution Engine for Spark SQL

Blaze is a Rust‑implemented, DataFusion‑based vectorized execution engine created by Kuaishou to accelerate Spark SQL queries, delivering up to 60% faster computation and 30% average compute‑power gains in production. Its architecture combines a native engine, a protobuf protocol, a JNI bridge, and a Spark extension, and the project is open‑source and compatible with Spark 3.0‑3.5.

Big Data · DataFusion · Spark

0 likes · 11 min read

dbaplus Community

Sep 4, 2024 · Big Data

How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn

This article details how Ctrip’s data platform evolved from a single‑IDC design to a multi‑IDC, tiered storage and scheduling architecture, covering the challenges of rapid data growth, the migration to Spark 3 via Kyuubi, the introduction of Celeborn shuffle service, and the resulting performance and reliability gains.

Big Data · Distributed storage · HDFS

0 likes · 23 min read

DataFunSummit

Aug 17, 2024 · Big Data

AnalyticDB Spark Architecture and Vectorized Engine Performance Overview

This article introduces the AnalyticDB Spark architecture, explains the need for Spark vectorization, surveys industry vectorized solutions, details ADB Spark's own vectorized implementation with Gluten and Velox, and presents performance test results showing a 6.98‑fold speedup over open‑source Spark.

AnalyticDB · Big Data · Gluten

0 likes · 9 min read

Bilibili Tech

Aug 13, 2024 · Big Data

How Bilibili Re‑engineered Its Search Indexing with Distributed Storage and Spark

This article details Bilibili's transformation of its search offline indexing pipeline, moving from manual MySQL‑based processes to a high‑capacity, distributed KV store and Spark‑driven builds, addressing performance, maintenance, and scalability challenges while improving resource efficiency and iteration speed.

Big Data · Bilibili · Distributed storage

0 likes · 24 min read

DataFunSummit

Aug 3, 2024 · Big Data

Apache Hudi Write Process: From Zero to One – Part 3 (Understanding Write Flow and Operations)

This article explains the complete Apache Hudi write pipeline, detailing each step from client creation to commit, and describes the various write operations such as Upsert, Insert, Bulk Insert, Delete, Delete Partition, and Insert‑Overwrite, providing a comprehensive overview for data‑lake practitioners.

Apache Hudi · Big Data · Data Lake

0 likes · 12 min read

DataFunTalk

Jul 23, 2024 · Big Data

Practical Experience with Apache Kyuubi and Apache Celeborn in Big Data Platforms

This article shares detailed practical experiences from DingXiangYuan's big‑data platform on using Apache Kyuubi and Apache Celeborn, covering architecture, flexible configuration, AuthZ fine‑grained permissions, small‑file and Z‑Order optimizations, Arrow‑based large result transmission, and operational tips such as connection‑level issues and Netty cache handling.

Apache Celeborn · Apache Kyuubi · Arrow

0 likes · 17 min read

Mike Chen's Internet Architecture

Jul 15, 2024 · Big Data

Master Distributed Computing: Hadoop, Spark, and Flink Explained

This article introduces the fundamentals of distributed computing, compares major frameworks such as Hadoop, Spark, and Flink, and outlines their key components, performance characteristics, and typical application scenarios including big‑data analytics, cloud services, real‑time streaming, and scientific computing.

Big Data · Distributed Computing · Flink

0 likes · 7 min read

DataFunSummit

Jul 13, 2024 · Big Data

Blaze: A Native Vectorized Execution Engine for Spark – Architecture, Production Optimizations, and Future Plans

Blaze is Kuaishou's self‑developed native execution engine that leverages Rust, DataFusion, and SIMD vectorization to accelerate Spark workloads, offering a compute boost of more than 30%. The article covers its architectural components, production‑grade optimizations, and a roadmap for broader adoption.

Big Data · DataFusion · Performance Optimization

0 likes · 13 min read

360 Smart Cloud

Jul 9, 2024 · Big Data

Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)

This article examines the critical role of shuffle in big‑data processing, compares Spark's native shuffle with the External Shuffle Service (ESS) and Remote Shuffle Service (RSS) solutions, introduces Uniffle's architecture and configuration, and shares practical deployment experiences and performance results from 360's internal environment.

Big Data · External Shuffle Service · Remote Shuffle Service

0 likes · 15 min read

Baidu Geek Talk

Jul 8, 2024 · Big Data

Evolution of Feed Data Warehouse Wide-Table Modeling at Baidu App

Baidu’s Mobile Ecology team transformed its Feed data warehouse through three progressive stages—hour‑level core tables, a real‑time wide table, and a unified day‑level multi‑version table—consolidating traffic, content, and user data into a single partitioned wide‑table architecture that resolves granularity inconsistencies, cuts processing cost, and delivers real‑time to daily latency for diverse analytics.

Real-time · Spark · Wide Table

0 likes · 10 min read

Code Ape Tech Column

Jul 6, 2024 · Big Data

Learning Spark Operations with Java Stream Concepts: Map, FlatMap, GroupBy, Reduce Examples

This article demonstrates how Java Stream operations such as map, flatMap, groupBy, and reduce can be directly applied to Spark, providing step‑by‑step code examples, explanations of transformation versus action operators, and practical tips for handling exceptions in distributed data processing.

Java Stream · Spark · flatMap

0 likes · 25 min read

DataFunSummit

Jun 28, 2024 · Big Data

Apache Hudi from Zero to One – Part 2: Reading Process and Query Types (Spark Example)

This article explains how Apache Hudi integrates with Spark to read data, detailing the Spark‑SQL planning stages, the Spark‑Hudi read workflow, and the four main Hudi query types—snapshot, read‑optimized, time‑travel, and incremental—along with example SQL commands and code snippets.

Apache Hudi · Big Data · Data Lake

0 likes · 11 min read

DataFunTalk

Jun 28, 2024 · Big Data

Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation

This article presents a comprehensive technical overview of using ClickHouse as a native backend to accelerate Spark SQL execution, covering Spark performance bottlenecks, ClickHouse's CPU‑level optimizations, the design and implementation of the Spark‑Native integration, and detailed TPC‑DS benchmark results demonstrating up to 3.5× speedup.

Big Data · ClickHouse · Performance Optimization

0 likes · 33 min read

Baidu Geek Talk

Jun 24, 2024 · Big Data

Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation

The paper presents a Spark acceleration framework that replaces Java‑based task operators with a ClickHouse native library, converting plans via Protobuf and JNI, leveraging columnar storage, SIMD and JIT to achieve up to 3× speed‑up on TPC‑DS workloads while providing fallback mechanisms to ensure no performance loss.

Big Data · ClickHouse · Native Acceleration

0 likes · 31 min read

Baidu Intelligent Cloud Tech Hub

Jun 24, 2024 · Big Data

Boost Spark Performance with ClickHouse: Native Acceleration Techniques

This article presents a detailed technical overview of accelerating Spark's compute engine using ClickHouse as a native backend, covering Spark performance background, ClickHouse's advantages, the design and implementation of a Spark‑Native acceleration solution, and extensive performance evaluation results.

ClickHouse · Native Acceleration · Performance Optimization

0 likes · 34 min read

DataFunTalk

Jun 22, 2024 · Big Data

Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits

This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Celeborn, covering background problems, evaluation of open‑source implementations, deployment architecture, encountered issues, solutions, performance gains, and future plans.

Big Data · RSS · Shuffle

0 likes · 19 min read

Past Memory Big Data

Jun 20, 2024 · Big Data

How Meituan Scaled Spark with Vectorized Execution Using Gluten + Velox

This article details Meituan's production‑grade adoption of Spark vectorized execution via the open‑source Gluten and Velox stack, explaining SIMD fundamentals, performance motivations, the end‑to‑end integration workflow, staged rollout, encountered challenges, and the resulting resource savings and speedups.

Big Data · Gluten · ORC

0 likes · 33 min read
