Tagged articles
607 articles
DataFunTalk
May 11, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis setup to a Lambda‑style architecture and finally to a lakehouse built on Iceberg, StarRocks, Flink and Spark, cutting architectural complexity and resource and development costs by two‑thirds while supporting trillions of daily events with sub‑second query latency.

Big Data · ClickHouse · Flink
22 min read
DataFunTalk
May 6, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's four‑stage data‑platform evolution, from a simple ClickHouse ad‑hoc setup through a Lambda‑based 2.0 design to a lakehouse‑driven 3.0 architecture, highlighting the adoption of general incremental compute, costs cut to one‑third, performance gains of up to tenfold, and the SPOT standards that guide the new system.

Big Data · ClickHouse · Data Architecture
21 min read
DataFunTalk
Apr 29, 2026 · Big Data

How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack to a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort by roughly one‑third while supporting petabyte‑scale, sub‑second queries across its 350 million‑user app.

Big Data · ClickHouse · Data Architecture
22 min read
Lao Guo's Learning Space
Apr 29, 2026 · Big Data

Designing a Full-Stack Credit Data System: From Ingestion to Real-Time Decision

The article dissects a credit data system architecture, detailing six logical layers—from multi-source data collection and feature engineering (including graph features and feature stores) to model training, real‑time stream processing, decision engine integration, and privacy‑preserving computation—while explaining the trade‑offs, tools, and performance targets needed for accurate, low‑latency risk assessment.

Credit Scoring · Feature Store · Flink
16 min read
Big Data Tech Team
Apr 8, 2026 · Interview Experience

Master Spark Tuning for Data Warehouse Interviews: Real Cases & Tips

Learn how to demonstrate real Spark optimization skills in data‑warehouse interviews by exploring two detailed case studies—small‑file merging in ODS‑to‑DWD ETL and shuffle‑skew mitigation in DWS aggregation—plus key interview questions and practical troubleshooting steps that separate theory from hands‑on expertise.

Data Warehouse · Interview Tips · Spark
9 min read
Baidu Geek Talk
Mar 23, 2026 · Databases

How Baidu’s MEG Platform Revamped ClickHouse with a Lakehouse Architecture

This article analyzes the challenges of scaling ClickHouse within Baidu’s MEG data platform and details a lake‑house solution that decouples storage and compute, integrates a meta‑service for transparent data access, optimizes query performance through caching, data roll‑up and layout tuning, and introduces a unified query gateway that gracefully falls back to Spark for complex workloads.

ClickHouse · Data Platform · Lakehouse
25 min read
DeWu Technology
Mar 2, 2026 · Big Data

Mastering Spark UI: Deep Dive into Metrics, Tuning, and Real‑World Cases

This article provides a comprehensive guide to Spark UI, explaining each primary and secondary tab, the key metrics they expose, and how to interpret them for performance bottleneck detection, followed by two detailed case studies and practical tuning recommendations for Spark workloads.

Big Data · Case Study · Spark
19 min read
Architect-Kip
Mar 2, 2026 · Big Data

How to Build a Scalable Tiered Archive & Query System for MySQL Data

This article presents a comprehensive design for a layered storage and unified scheduling platform that archives MySQL historical data, reduces storage costs, ensures high‑performance queries, and enables efficient data analysis through tiered hot, warm, and cold storage using big‑data technologies.

Flink · Hive · Spark
13 min read
Alibaba Cloud Big Data AI Platform
Feb 2, 2026 · Big Data

Real‑Time Analytics with Alibaba Cloud Serverless Spark & Paimon for Taobao Flash Sale

This article details how Alibaba Cloud EMR Serverless Spark combined with the Paimon lakehouse framework enables Taobao Flash Sale’s retail data team to achieve low‑latency, high‑throughput real‑time analytics, batch processing, and feature generation, outlining architecture evolution, performance gains, and practical Spark tuning techniques.

Big Data · Lakehouse · Paimon
18 min read
Big Data Technology Tribe
Jan 20, 2026 · Big Data

Extending Spark SQL with LanceSparkSessionExtensions: A Complete Guide

This article explains how to inject the LanceSpark plugin into Spark, covering the core LanceSparkSessionExtensions class, various ways to register extensions, the custom parser and planner strategy implementations, and the underlying Spark mechanisms such as injectParser, injectPlannerStrategy, and PredicateHelper.

DataSourceV2 · LanceSpark · PlannerStrategy
14 min read
ITPUB
Jan 15, 2026 · Databases

How to Migrate ClickHouse Data to Doris: Three Practical Strategies Tested

Facing a ClickHouse cluster shutdown, the author explores three migration methods—using Doris’s ClickHouse catalog, exporting to files with Broker/Stream Load, and Spark—to transfer ~10 billion rows to Doris, evaluating each for simplicity, bugs, and performance, and sharing detailed steps, code snippets, and benchmark results.

ClickHouse · Data Migration · SQL
9 min read
Big Data Tech Team
Jan 5, 2026 · Big Data

Top 10 Data Warehouse Interview Questions Every 2026 Engineer Must Master

This article compiles the most frequently asked interview questions for 2026 data‑warehouse development engineers, covering core concepts, layer architecture, SQL optimization, window functions, Hive vs Spark, data skew solutions, modeling metrics, slowly changing dimensions, scheduling tools, data quality monitoring, and real project experience.

Data Warehouse · Hive · SQL Optimization
8 min read
DevOps Engineer
Dec 27, 2025 · Artificial Intelligence

Demystifying GitHub AI: Models, Agents, Spaces, Spark, and More

This article explains GitHub's AI ecosystem—Models, Copilot, Agents, Spaces, Spark, Instructions, Skills, and the Model Context Protocol—clarifying each component, their relationships, and practical steps for developers to integrate them into their workflow.

Copilot · GitHub AI · MCP
12 min read
vivo Internet Technology
Dec 10, 2025 · Big Data

Vivo’s 800‑Day Journey Optimizing Celeborn Remote Shuffle Service at PB Scale

This technical report details how Vivo’s big‑data platform adopted Celeborn as its remote shuffle service, evaluated alternatives, tuned hardware and software configurations, and implemented performance and stability enhancements; it also outlines future operational and community‑driven improvements for handling petabyte‑scale shuffle workloads.

Big Data · Kubernetes · Remote Shuffle Service
20 min read
Alibaba Cloud Big Data AI Platform
Nov 15, 2025 · Big Data

From a Decade-Long Big Data Journey to a Cloud‑Native Lakehouse

This article chronicles a ten‑year evolution of a self‑built big data platform—detailing early Hadoop clusters, successive migrations to Spark, Hive, Hudi, and StarRocks, the operational challenges encountered, and the comprehensive shift to Alibaba Cloud EMR Serverless that delivered significant cost, performance, and stability gains while outlining future intelligent‑ecosystem plans.

Big Data · Data Lake · Spark
17 min read
Instant Consumer Technology Team
Nov 10, 2025 · Big Data

Fixing Multi‑Version, Multi‑Cluster and HA with Apache Kyuubi for Spark/Flink

Apache Kyuubi, an enterprise‑grade multi‑tenant data gateway, replaces Livy and Flink SQL Gateway to support multiple engine versions, cross‑cluster elastic scheduling, high‑availability batch jobs, and traffic control, dramatically reducing deployment complexity, improving resource utilization, and accelerating release cycles for large‑scale Spark and Flink workloads.

Apache Kyuubi · Big Data · Data Gateway
18 min read
Alibaba Cloud Big Data AI Platform
Oct 18, 2025 · Big Data

Alibaba Cloud EMR’s AI Evolution: Accelerating Big Data Performance

Since its 2016 launch, Alibaba Cloud EMR has transformed from a basic open‑source Hadoop service into a high‑performance, AI‑enabled big‑data platform, delivering optimized I/O, vectorized processing, and integrated AI functions such as natural‑language SQL, StarRocks and Spark enhancements, while supporting diverse industry workloads.

EMR · Spark · StarRocks
9 min read
DataFunSummit
Sep 21, 2025 · Big Data

Breaking the CPU Wall: BIGO’s Gluten Engine Accelerates Spark and Flink

When big‑data workloads hit the CPU wall, BIGO’s adoption of the open‑source Gluten project delivers native‑engine execution for Spark and a roadmap for Flink, achieving up to 30% end‑to‑end speedup, 50% memory savings, and a scalable, cost‑effective data processing platform.

Big Data · Flink · Gluten
16 min read
Architect's Must-Have
Sep 15, 2025 · Big Data

Mastering Spark Streaming Rate Control: A Deep Dive into Backpressure

This article explains Spark Streaming's rate control mechanisms, covering static limits, the dynamic back‑pressure feature introduced in Spark 1.5, the PID‑based estimator, RPC communication, and how Guava's token‑bucket RateLimiter enforces the calculated thresholds to ensure stability and optimal throughput.
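As a rough illustration of the PID‑based estimator mentioned in the summary above, the sketch below derives a new ingestion rate from the statistics of the last completed batch. It only follows the general shape of a PID controller: the class name, field names, and default gains (proportional 1.0, integral 0.2, derivative 0.0, minimum rate 100 records/s) are assumptions chosen for illustration, not values quoted from the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDRateEstimator:
    """Hypothetical PID-style rate estimator (illustrative names and defaults)."""
    batch_interval_ms: float
    proportional: float = 1.0   # weight of the current rate error (assumed)
    integral: float = 0.2       # weight of the accumulated backlog (assumed)
    derivative: float = 0.0     # weight of the error trend (assumed)
    min_rate: float = 100.0     # never throttle below this many records/s
    latest_time_ms: float = -1.0
    latest_rate: float = -1.0
    latest_error: float = -1.0
    first_run: bool = True

    def compute(self, time_ms: float, num_elements: int,
                processing_delay_ms: float,
                scheduling_delay_ms: float) -> Optional[float]:
        """Return a new rate limit in records/s, or None if no update applies."""
        if (time_ms <= self.latest_time_ms or num_elements <= 0
                or processing_delay_ms <= 0):
            return None  # stale or empty batch: keep the previous limit

        delay_since_update_s = (time_ms - self.latest_time_ms) / 1000.0
        # Records per second the last batch was actually processed at.
        processing_rate = num_elements / processing_delay_ms * 1000.0
        # Proportional term: how far the limit is above what we can process.
        error = self.latest_rate - processing_rate
        # Integral term: backlog implied by the time the batch waited in queue.
        historical_error = (scheduling_delay_ms * processing_rate
                            / self.batch_interval_ms)
        # Derivative term: how quickly the error is changing.
        d_error = (error - self.latest_error) / delay_since_update_s

        new_rate = max(self.latest_rate
                       - self.proportional * error
                       - self.integral * historical_error
                       - self.derivative * d_error,
                       self.min_rate)
        self.latest_time_ms = time_ms
        if self.first_run:
            # The first batch only seeds the state; no limit is published yet.
            self.latest_rate = processing_rate
            self.latest_error = 0.0
            self.first_run = False
            return None
        self.latest_rate = new_rate
        self.latest_error = error
        return new_rate
```

In this sketch the returned value would be handed to a rate limiter (for example a token bucket) that caps how many records each receiver accepts per second; a batch that took longer than the interval or sat in the queue pulls the limit down.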

RateControl · Spark · Streaming
13 min read
Big Data Tech Team
Aug 25, 2025 · Interview Experience

Essential Big Data Interview Questions for Data Warehouse Engineer Roles

A comprehensive list of interview topics covering self‑introduction, career moves, data‑warehouse design, team building, architecture comparisons, fact‑table classification, common dimensions, performance tuning, and data‑governance for aspiring big‑data engineers.

Big Data · Data Governance · Flink
4 min read
Big Data Tech Team
Aug 24, 2025 · Big Data

Top 18 Data Warehouse Engineer Interview Questions from Meituan and ByteDance

This article compiles 18 essential interview topics for data warehouse engineer roles, covering self‑introduction, architecture layering, dimensional modeling, HDFS operations, Spark vs MapReduce, join implementation, SQL challenges, OLAP selection, real‑time quality assurance, and job transition considerations.

Data Warehouse · HDFS · OLAP
3 min read
Su San Talks Tech
Jul 17, 2025 · Big Data

How to De‑Duplicate 1 Billion QQ Numbers Using Under 1 GB of Memory

This article explores multiple techniques—including bitmap indexing, Bloom filters, external sorting, Spark, and layered bitmap structures—to efficiently remove duplicate QQ numbers from a dataset of up to one billion entries while keeping memory usage below a gigabyte and maintaining high accuracy.
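The bitmap technique named in the summary above can be sketched in a few lines. This is a minimal illustration, not the article's code: it assumes every number is a non‑negative integer no larger than a known `max_value` (for instance, that the IDs fit in 32 bits), and the function names are hypothetical.

```python
from typing import Iterable, Iterator


def bitmap_size_bytes(max_value: int) -> int:
    """Bytes needed to hold one bit for every value in 0..max_value."""
    return (max_value >> 3) + 1


def dedupe_with_bitmap(numbers: Iterable[int], max_value: int) -> Iterator[int]:
    """Yield each number the first time it appears, using one bit per value."""
    bitmap = bytearray(bitmap_size_bytes(max_value))
    for n in numbers:
        if not 0 <= n <= max_value:
            raise ValueError(f"{n} is outside 0..{max_value}")
        byte_index = n >> 3        # which byte holds this number's bit
        bit_mask = 1 << (n & 7)    # which bit inside that byte
        if not bitmap[byte_index] & bit_mask:
            bitmap[byte_index] |= bit_mask  # mark as seen
            yield n
```

Under the 32‑bit assumption, `bitmap_size_bytes(2**32 - 1)` is 2**29 bytes (512 MiB), which is how a plain bitmap stays under the 1 GB budget regardless of how many input records there are. For example, `list(dedupe_with_bitmap([5, 3, 5, 9, 3, 0], 9))` yields the first occurrences in input order.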

Bitmap · Distributed Systems · Spark
12 min read
Architect
Jul 7, 2025 · Big Data

How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×

This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.

Data Warehouse · ETL · Parquet
21 min read
Alibaba Cloud Big Data AI Platform
Jun 10, 2025 · Big Data

Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Big Data · Cloud Native · Data Lake
12 min read
Big Data Tech Team
Apr 17, 2025 · Big Data

Essential Spark Interview Q&A: Master Data Warehouse Engineer Questions

This article compiles a comprehensive set of Spark interview questions frequently asked by leading tech companies, providing detailed explanations of Spark’s performance mechanisms, architecture, RDD persistence, checkpointing, streaming, dependency types, HA setup, and practical coding examples to help data warehouse engineers prepare effectively.

Data Warehouse · RDD · Spark
21 min read
vivo Internet Technology
Apr 16, 2025 · Big Data

Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management

The article explains how the vivo Internet Big Data team containerized offline Spark jobs and deployed them with the Spark Operator on a mixed online‑offline Kubernetes cluster, using elastic scheduling and resource‑over‑subscription to boost CPU utilization by 30‑40% and handle over 100,000 daily tasks.

Big Data · Kubernetes · Resource Management
36 min read
DataFunSummit
Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and future roadmap for data‑lake technologies.

Apache Hudi · Big Data · Data Lake
13 min read
iQIYI Technical Product Team
Mar 27, 2025 · Big Data

Cost‑Effective Real‑Time Data Warehouse 2.0: Migrating from Kafka to Iceberg

iQIYI transformed its real‑time data warehouse by replacing a costly Kafka‑based Lambda stack with a unified stream‑batch Iceberg lake, cutting storage expenses by 90%, halving compute costs, extending data retention, and delivering minute‑level freshness for 90% of use cases while preserving second‑level processing where needed.

Cost Optimization · Flink · Iceberg
11 min read
Alibaba Cloud Big Data AI Platform
Mar 20, 2025 · Big Data

How to Read and Write StarRocks Data with EMR Serverless Spark

This step‑by‑step guide explains how to use EMR Serverless Spark together with the StarRocks Spark Connector to create a workspace, upload the connector JAR, configure network connections, create databases and tables in StarRocks, and perform read/write operations via SQL sessions, Notebook sessions, or batch Spark jobs, complete with code examples and UI screenshots.

Big Data · Data Integration · Spark
14 min read
Big Data Technology & Architecture
Mar 3, 2025 · Big Data

The Turning Point for Data Development: From Traditional Data Engineering to AI Data Engineering

The article analyzes how the rapid rise of open‑source large‑model AI in 2025 is reshaping the data development profession, urging developers to transition from specialized data‑engineer roles to full‑stack AI data engineering skills such as distributed computing, lake‑house architectures, and model tuning.

AI · Big Data · Flink
7 min read
DataFunSummit
Feb 22, 2025 · Big Data

Blaze Engine: A Rust‑Based Native Vectorized Execution Engine for Spark SQL

The article introduces Blaze, Kuaishou's Rust‑powered native execution engine that vectorizes Spark SQL workloads, explains its architecture and operation, presents benchmark results showing up to 50% latency reduction, and details internal deployments, industry case studies, community collaborations, and the 2025 roadmap.

Big Data · Native Execution · Performance Optimization
12 min read
DataFunTalk
Feb 20, 2025 · Big Data

From Integrated Storage‑Compute to Decoupled Architecture: Practical Exploration of Kubernetes, Kyuubi, Celeborn, Blaze, and Hue in Big Data Platforms

This article analyzes the transition from a tightly coupled storage‑compute architecture to a decoupled model, detailing how Kubernetes, Kyuubi, Celeborn, Blaze, and Hue together solve resource inefficiencies, improve scalability, and boost query performance in modern big‑data environments.

Big Data · Blaze · Kubernetes
16 min read
21CTO
Feb 4, 2025 · Big Data

Why Python Beats Java and Scala for Modern Data Engineering

The article compares Java, Scala, SQL, and Python for data‑engineering tasks, arguing that Python’s versatility, rich ecosystem, and ease of use make it the preferred language for both small‑scale and massive Spark workloads despite its performance trade‑offs.

Big Data · SQL · Scala
7 min read
DataFunSummit
Feb 1, 2025 · Big Data

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.

Big Data · Codegen · Remote Shuffle
12 min read
Alibaba Cloud Big Data AI Platform
Jan 26, 2025 · Big Data

How a FinTech Scaled Its Data Platform with Alibaba Cloud EMR Serverless Spark

Weifin, a fintech innovator, tackled massive data‑scale challenges by adopting Alibaba Cloud EMR Serverless Spark, building a unified Spark‑based platform that supports data collection, lake ingestion, distributed machine‑learning training, and intelligent risk‑control applications, while achieving performance gains, cost reduction, and scalable automation.

FinTech · Spark · machine learning
10 min read
Airbnb Technology Team
Jan 24, 2025 · Artificial Intelligence

Chronon — An Open-Source Framework for Production-Level Feature Engineering in Machine Learning

Chronon is an open‑source framework that centralizes feature definitions to guarantee training‑inference consistency, eliminates complex ETL pipelines, and supports real‑time and batch processing across diverse data sources, cutting feature‑development cycles from months to under a week, as demonstrated by Airbnb’s 40,000‑feature deployment.

Chronon · Hive · Spark
10 min read
dbaplus Community
Jan 19, 2025 · Big Data

How to Write Elegant, High‑Performance SQL for Big Data Pipelines

This article shares practical techniques for writing clean, efficient SQL in large‑scale data environments, covering predicate pushdown, sub‑queries, deduplication strategies, bucket optimization, and automation with Python‑Spark integration to improve readability and execution speed.

Hive · Spark · optimization
14 min read
DataFunSummit
Jan 16, 2025 · Big Data

Zhihu Big Data Cost‑Reduction Practices: FinOps, Erasure Coding, ZSTD Compression, Spark Auto‑Tuning, and Remote Shuffle Service

This article details Zhihu's comprehensive cost‑reduction and efficiency‑boosting initiatives for its big‑data platform, covering FinOps‑driven financial operations, hybrid‑cloud architecture, cost allocation models, operational monitoring, and technical optimizations such as erasure coding, ZSTD compression, Spark auto‑tuning, and a remote shuffle service.

Big Data · Cloud Cost Management · Cost Optimization
22 min read
DataFunSummit
Jan 14, 2025 · Big Data

Tencent Real-Time Lakehouse Intelligent Optimization Practice

This presentation details Tencent's real‑time lakehouse architecture and the four key topics—lakehouse design, intelligent optimization services, scenario‑driven capabilities, and future outlook—covering components such as Spark, Flink, Iceberg, Auto‑Optimize Service, indexing, clustering, AutoEngine, and PyIceberg implementations.

Auto Optimize · Big Data · Flink
12 min read
DataFunSummit
Jan 3, 2025 · Big Data

Tencent Real‑Time Lakehouse Intelligent Optimization Practices

This article presents Tencent's end‑to‑end real‑time lakehouse architecture, detailing its three‑layer design, the Auto Optimize Service modules such as compaction, indexing, clustering and engine acceleration, as well as scenario‑driven capabilities like multi‑stream joins, primary‑key tables, in‑place migration and PyIceberg support, and concludes with future optimization directions.

Big Data · Flink · Iceberg
11 min read
Bilibili Tech
Jan 3, 2025 · Big Data

Evolution and Production Practices of Apache Celeborn Remote Shuffle Service at Bilibili

Bilibili replaced Spark’s unstable External Shuffle Service with a push‑based approach, then deployed Apache Celeborn’s remote shuffle on Kubernetes using HA masters, tiered workers, extensive monitoring, history‑based routing, chaos testing, and seamless Spark, Flink, and MapReduce integration, while planning self‑healing, elastic scaling, and priority‑aware I/O enhancements.

Apache Celeborn · Big Data · Flink
28 min read
Big Data Technology & Architecture
Jan 2, 2025 · Big Data

Apache Paimon: Core Capabilities, Table Types, LSM Tree, Buckets, Merge Engines, and Operational Details

This article provides a comprehensive overview of Apache Paimon, covering its real‑time lake ingestion, unified stream‑batch processing, table types (primary‑key and append‑only), LSM‑tree storage, bucket mechanisms, merge‑engine options, compaction strategies, concurrency control, consumption methods, tag management, data cleanup, and system tables for big‑data workloads.

Apache Paimon · Big Data · Flink
25 min read
Big Data Technology & Architecture
Dec 31, 2024 · Big Data

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

This article explains how Spark ≥ 3.3 introduces Storage Partitioned Join (SPJ) to avoid costly shuffle operations when joining partitioned V2 source tables such as Apache Iceberg, detailing the required conditions, configuration settings, practical code examples, and various join scenarios including mismatched partitions and data skew.

Bucketing · Data Skew · SQL
15 min read
JD Tech
Dec 30, 2024 · Big Data

Techniques for Writing Elegant and Efficient SQL in Big Data Environments

The article shares practical methods and code examples for making SQL both readable and high‑performing in large‑scale data platforms, covering predicate push‑down with subqueries, deduplication strategies, bucket utilization, and Python‑driven job parameter handling.

Hive · SQL · Spark
14 min read
DataFunSummit
Dec 27, 2024 · Big Data

Tencent Real-time Lakehouse Intelligent Optimization Practice

This presentation describes Tencent's real-time lakehouse architecture, including data lake compute, management, and storage layers, and details the intelligent optimization services—such as compaction, indexing, clustering, and auto-engine—designed to improve query performance, storage cost, and operational efficiency for large-scale data processing.

AutoEngine · Flink · Iceberg
11 min read
Bilibili Tech
Dec 27, 2024 · Big Data

Consistency Architecture for Bilibili Recommendation Model Data Flow

The article outlines Bilibili’s revamped recommendation data‑flow architecture that eliminates timing and calculation inconsistencies by snapshotting online features, unifying feature computation in a single C++ library accessed via JNI, and orchestrating label‑join and sample extraction through near‑line Kafka/Flink pipelines, with further performance gains and Iceberg‑based future extensions.

Data Consistency · Flink · Iceberg
12 min read
dbaplus Community
Dec 24, 2024 · Big Data

How Bilibili Scaled Its Tag System for Massive Data and Real‑Time Accuracy

The article details Bilibili's comprehensive redesign of its tag system—including background challenges, architectural layers, technical upgrades like Iceberg integration and shard‑based ClickHouse writes, crowd selection methods, online service guarantees, performance metrics, and future plans—showcasing a data‑driven solution that boosts stability, speed, and business coverage.

ClickHouse · Iceberg · Online Service
24 min read
dbaplus Community
Dec 14, 2024 · Databases

Why a Database‑First Operating System Could Replace Linux and Kubernetes

The article examines the DBOS concept—a database‑oriented operating system that places a distributed, transactional database at the core of OS services, tracing its roots from early database pioneers to modern cloud workloads and highlighting its potential advantages over traditional Linux‑Kubernetes stacks.

DBOS · Operating System · Spark
10 min read
Qunar Tech Salon
Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big Data · Hive · MapReduce
23 min read
Tongcheng Travel Technology Center
Nov 27, 2024 · Big Data

Highlights of Tongcheng Travel’s 8th Big Data Technology Salon

The 8th Tongcheng Travel Big Data Technology Salon in Suzhou featured four expert talks covering Tencent Cloud’s Meson Spark engine, near‑line computing for travel itineraries, a Flink‑based real‑time risk control system, and Apache Paimon’s latest lake‑warehouse innovations, followed by a data‑driven business perspective session.

Apache Paimon · Big Data · Data Lake
7 min read
Bilibili Tech
Nov 12, 2024 · Big Data

Scalable Tag System Architecture and Optimization

The rebuilt tag system introduces a three‑layer architecture, standard pipelines, Iceberg‑backed storage and custom ClickHouse sharding, a DSL for crowd selection, and a stateless online service, achieving 99.9% success, sub‑5 ms latency, and supporting thousands of tags across dozens of business scenarios while planning real‑time processing and automated lifecycle management.

ClickHouse · Iceberg · Online Service
23 min read
Bilibili Tech
Nov 1, 2024 · Big Data

Magnus: Intelligent Data Optimization Service for Iceberg Tables in Bilibili's Lakehouse Platform

Magnus is Bilibili’s self‑developed intelligent service that continuously optimizes Iceberg tables by scheduling snapshot expiration, orphan‑file cleanup, manifest rewriting, and multi‑dimensional data optimizations—including small‑file merging, sorting, distribution, and index creation—while automatically recommending configurations from real‑time query logs, delivering over 99.9% task success and up to 30% scan‑data reduction.

Data Lake · Iceberg · Intelligent Recommendation
15 min read
Open Source Tech Hub
Oct 31, 2024 · Big Data

How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark

Bilibili transformed its search indexing pipeline by replacing a manual, low‑throughput process with a distributed KV store (Taishan) and Spark‑based construction, achieving unified data ingestion, reduced resource consumption, faster full‑ and incremental builds, and a shift from daily to hourly indexing cycles.

Big Data · KV Store · Protobuf
0 likes · 25 min read
DataFunSummit
Oct 24, 2024 · Big Data

Bilibili’s Large Language Model‑Based Intelligent Assistant for the Big Data Platform: Architecture, Principles, and Deployment

This article details Bilibili’s implementation of a large‑language‑model‑driven intelligent assistant for its massive big‑data platform, covering background, problem analysis, architectural design, knowledge‑base construction, precision and recall challenges, deployment across offline and real‑time Spark/Flink diagnostics, and the future outlook.

Agent · Big Data · Flink
0 likes · 23 min read
Java Architecture Stack
Oct 18, 2024 · Big Data

How to Fix Spark OOM Errors: Practical Memory & Performance Tuning

This guide analyzes common Spark Out‑Of‑Memory scenarios—such as massive data volumes, data skew, and improper resource allocation—and provides step‑by‑step configurations, memory‑management tweaks, partitioning strategies, and shuffle optimizations to prevent OOM failures in production.

Big Data · Memory Tuning · OOM
0 likes · 8 min read
DataFunSummit
Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache Hudi · Big Data · Spark
0 likes · 10 min read
Architect
Sep 24, 2024 · Industry Insights

How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround

This article details Bilibili's transformation of its search offline indexing architecture—from a manual, low‑throughput MySQL‑centric process to a distributed, KV‑based, protobuf‑driven pipeline that leverages Taishan storage and Spark, cutting build cycles from days to hours while solving performance, consistency, and maintenance challenges.

Big Data · Distributed Systems · Protobuf
0 likes · 24 min read
Kuaishou Tech
Sep 13, 2024 · Big Data

Blaze: Kuaishou’s Rust‑Based Vectorized Execution Engine for Spark SQL

Blaze is a Rust‑implemented, DataFusion‑based vectorized execution engine created by Kuaishou to accelerate Spark SQL queries. It delivers up to 60% faster computation and a 30% average compute‑power gain in production, and is built from a native engine, a protobuf protocol, a JNI bridge, and a Spark extension. It is open source and compatible with Spark 3.0 through 3.5.

Big Data · DataFusion · Rust
0 likes · 11 min read
dbaplus Community
Sep 4, 2024 · Big Data

How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn

This article details how Ctrip’s data platform evolved from a single‑IDC design to a multi‑IDC, tiered storage and scheduling architecture, covering the challenges of rapid data growth, the migration to Spark 3 via Kyuubi, the introduction of Celeborn shuffle service, and the resulting performance and reliability gains.

Big Data · HDFS · Kyuubi
0 likes · 23 min read
DataFunSummit
Aug 17, 2024 · Big Data

AnalyticDB Spark Architecture and Vectorized Engine Performance Overview

This article introduces the AnalyticDB Spark architecture, explains the need for Spark vectorization, surveys industry vectorized solutions, details ADB Spark's own vectorized implementation with Gluten and Velox, and presents performance test results showing a 6.98‑fold speedup over open‑source Spark.

AnalyticDB · Big Data · Gluten
0 likes · 9 min read
Bilibili Tech
Aug 13, 2024 · Big Data

How Bilibili Re‑engineered Its Search Indexing with Distributed Storage and Spark

This article details Bilibili's transformation of its search offline indexing pipeline, moving from manual MySQL‑based processes to a high‑capacity, distributed KV store and Spark‑driven builds, addressing performance, maintenance, and scalability challenges while improving resource efficiency and iteration speed.

Big Data · Bilibili · KV Store
0 likes · 24 min read
DataFunTalk
Jul 23, 2024 · Big Data

Practical Experience with Apache Kyuubi and Apache Celeborn in Big Data Platforms

This article shares detailed practical experiences from DingXiangYuan's big‑data platform on using Apache Kyuubi and Apache Celeborn, covering architecture, flexible configuration, AuthZ fine‑grained permissions, small‑file and Z‑Order optimizations, Arrow‑based large result transmission, and operational tips such as connection‑level issues and Netty cache handling.

Apache Celeborn · Apache Kyuubi · Arrow
0 likes · 17 min read
360 Smart Cloud
Jul 9, 2024 · Big Data

Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)

This article examines the critical role of shuffle in big‑data processing, compares Spark's native shuffle with the External Shuffle Service (ESS) and Remote Shuffle Service (RSS) solutions, introduces Uniffle's architecture and configuration, and shares practical deployment experiences and performance results from 360's internal environment.

Big Data · External Shuffle Service · Remote Shuffle Service
0 likes · 15 min read
Baidu Geek Talk
Jul 8, 2024 · Big Data

Evolution of Feed Data Warehouse Wide-Table Modeling at Baidu App

Baidu’s Mobile Ecology team transformed its Feed data warehouse through three progressive stages (hour‑level core tables, a real‑time wide table, and a unified day‑level multi‑version table), consolidating traffic, content, and user data into a single partitioned wide‑table architecture that resolves granularity inconsistencies, cuts processing cost, and serves diverse analytics at latencies ranging from real time to daily.

Real-Time · Spark · Wide Table
0 likes · 10 min read
DataFunTalk
Jun 28, 2024 · Big Data

Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation

This article presents a comprehensive technical overview of using ClickHouse as a native backend to accelerate Spark SQL execution, covering Spark performance bottlenecks, ClickHouse's CPU‑level optimizations, the design and implementation of the Spark‑Native integration, and detailed TPC‑DS benchmark results demonstrating up to 3.5× speedup.

Big Data · ClickHouse · Native Execution
0 likes · 33 min read
Baidu Geek Talk
Jun 24, 2024 · Big Data

Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation

The paper presents a Spark acceleration framework that replaces Java‑based task operators with a ClickHouse native library, converting plans via Protobuf and JNI, leveraging columnar storage, SIMD and JIT to achieve up to 3× speed‑up on TPC‑DS workloads while providing fallback mechanisms to ensure no performance loss.

Big Data · ClickHouse · Native Acceleration
0 likes · 31 min read
Baidu Intelligent Cloud Tech Hub
Jun 24, 2024 · Big Data

Boost Spark Performance with ClickHouse: Native Acceleration Techniques

This article presents a detailed technical overview of accelerating Spark's compute engine using ClickHouse as a native backend, covering Spark performance background, ClickHouse's advantages, the design and implementation of a Spark‑Native acceleration solution, and extensive performance evaluation results.

ClickHouse · Native Acceleration · Performance Optimization
0 likes · 34 min read
DataFunTalk
Jun 22, 2024 · Big Data

Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits

This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Celeborn, covering background problems, evaluation of open‑source implementations, deployment architecture, encountered issues, solutions, performance gains, and future plans.

Big Data · RSS · Shuffle
0 likes · 19 min read
Tencent Cloud Developer
May 29, 2024 · Artificial Intelligence

Distributed Network Embedding Algorithm for Billion‑Scale Graph Data in Tencent Games

Tencent’s Game Social Algorithm Team presents a Spark‑based distributed network embedding framework that recursively partitions hundred‑billion‑edge game graphs into manageable subgraphs, runs node2vec locally, and fuses results, enabling efficient link prediction and node classification across multiple games within hours.

Game Analytics · Spark · Distributed Computing
0 likes · 7 min read
DataFunSummit
May 27, 2024 · Big Data

Design and Optimization of Zhihu's Bridge Platform for DMP/CDP: Architecture, Challenges, and Solutions

This article presents a comprehensive case study of Zhihu's Bridge platform, detailing its background, five core modules, unified architecture built on Spark and Flink, bitmap‑based tagging, and performance optimizations that address query speed, write latency, and high‑QPS online checks while outlining future directions with Doris 2.0 and large language models.

CDP · DMP · Data Platform
0 likes · 27 min read
Big Data Technology & Architecture
May 27, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a comprehensive one‑stop data development and governance platform that integrates data integration, development, analysis, and services, offering offline, real‑time, and AI pipelines, modular architecture, extensive monitoring, and cost‑optimisation to empower thousands of users across the company.

Airflow · Big Data · Data Platform
0 likes · 26 min read
DataFunTalk
May 26, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform for Sparkle Thinking

The article details how Sparkle Thinking built the Athena Data Factory—a comprehensive, self‑service data development and governance platform that integrates data integration, ETL, real‑time processing, monitoring, and analytics, describing its architecture, key technologies, implementation timeline, operational practices, performance gains, and future directions.

Airflow · ETL · Flink
0 likes · 26 min read
Big Data Technology & Architecture
May 13, 2024 · Big Data

Apache Paimon 0.8 Release: Deletion Vectors, File Index, Performance Boosts, and Flink/Spark Integration Enhancements

The article introduces Apache Paimon 0.8, highlighting new Deletion Vectors, a universal file index, memory and I/O optimizations, record‑level TTL, and integration improvements with Flink and Spark, while also discussing broader lake‑house performance trends and future directions.

Apache Paimon · Big Data · Deletion Vectors
0 likes · 8 min read
DataFunSummit
Apr 25, 2024 · Big Data

Paimon Project Overview: Recent Developments, Core Capabilities, and Future Roadmap

This article presents a comprehensive overview of the Apache‑incubated Paimon project, covering its evolution from Flink Table Store, the current features of primary‑key and log tables, management tools such as snapshots, tags and branches, performance optimizations for Flink and Spark, and a detailed roadmap of upcoming functionalities.

Big Data · Data Management · Flink
0 likes · 23 min read
StarRocks
Mar 26, 2024 · Big Data

How Replacing Spark with StarRocks Cut Data Refresh Time by 90% and Saved 99% Cost

The article details how the Xiaohongshu data warehouse team integrated StarRocks into their offline processing pipeline, replacing Spark for heavy Cube calculations, which reduced job execution from hours to minutes, cut resource consumption by over 90%, advanced daily data output by 1.5 hours, and lowered refresh cost by more than 99%.

Big Data · OLAP · Performance Optimization
0 likes · 18 min read
DataFunSummit
Mar 20, 2024 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's large‑scale evolution of Spark Shuffle to a cloud‑native architecture, describing background, stability and mixed‑resource scenarios, challenges such as CPU and I/O limits, custom ESS enhancements, shuffle throttling, spill‑split mechanisms, and the Cloud Shuffle Service with its push‑based design and performance gains.

Big Data · Kubernetes · Performance Optimization
0 likes · 21 min read
DataFunSummit
Feb 29, 2024 · Big Data

Trino at Xiaomi: Architecture, Practices, and Future Plans

This article details Xiaomi’s practical deployment of Trino, covering its architectural role, core and extended capabilities, performance comparisons, integration with Iceberg and Spark, operational enhancements, multi‑cluster and ad‑hoc query scenarios, future cloud‑storage plans, and a Q&A session.

Big Data · Iceberg · OLAP
0 likes · 20 min read
Baidu Tech Salon
Feb 28, 2024 · Big Data

Design, Optimization, and Practice of Baidu's Fusion Compute Engine for Data Warehouse

Baidu’s Fusion Compute Engine, built on Spark with a one‑layer wide‑table model, combines data‑skipping, push‑down, code‑generation, vectorization and extensive tuning to cut ad‑hoc query latency to seconds, shrink storage by about 30%, and accelerate ETL workloads while maintaining stability for massive data‑warehouse workloads.

Baidu · Big Data · Fusion Compute Engine
0 likes · 10 min read
Baidu Geek Talk
Feb 28, 2024 · Big Data

How Baidu’s Fusion Compute Engine Cuts Query Time to Seconds on Petabyte‑Scale Data

This article analyzes Baidu's fusion compute engine for its data warehouse, detailing its architecture, optimization techniques such as data skipping, Parquet column indexing, ProjectLimit and CodeGen, and demonstrates how these innovations reduce query latency to seconds while cutting storage costs by about 30% on multi‑petabyte workloads.

Baidu · Big Data · Data Warehouse
0 likes · 12 min read