Tagged articles
37 articles
Page 1 of 1
Big Data Technology Tribe
Big Data Technology Tribe
Aug 5, 2025 · Big Data

How Spark’s Catalyst Optimizer Transforms SQL Queries: Trees, Rules, and Code Generation

This article explains Spark SQL’s Catalyst optimizer, describing its extensible design, tree‑based representation, rule‑driven transformations, batch execution to a fixed point, and how Scala’s pattern matching and quasiquotes enable efficient analysis, logical optimization, physical planning, and code generation.

Big DataCatalyst OptimizerCode Generation
0 likes · 18 min read
How Spark’s Catalyst Optimizer Transforms SQL Queries: Trees, Rules, and Code Generation
Sohu Tech Products
Sohu Tech Products
Jun 11, 2025 · Big Data

How We Transformed a Microservice Finance System into a Scalable Big Data Warehouse

This article details the evolution of a fast‑growing finance reporting system from a monolithic microservice architecture plagued by data inconsistency, low efficiency, and scalability limits to a robust, high‑performance big‑data warehouse built with layered data models, SparkSQL processing, and unified scheduling, highlighting design decisions, technical trade‑offs, and measurable performance gains.

Data WarehouseMicroservicesSpark SQL
0 likes · 23 min read
How We Transformed a Microservice Finance System into a Scalable Big Data Warehouse
Zhuanzhuan Tech
Zhuanzhuan Tech
May 21, 2025 · Big Data

How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse

This article details the evolution of a fast‑growing e‑commerce finance platform from a monolithic microservice architecture plagued by data inconsistency, low processing efficiency, and scalability limits to a robust, distributed big‑data warehouse using SparkSQL, layered data models, and optimized scheduling, achieving ten‑fold performance gains and near‑zero failure rates.

Big DataData WarehouseETL
0 likes · 21 min read
How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse
DataFunSummit
DataFunSummit
Mar 12, 2025 · Big Data

Principles and Common Optimization Techniques of the Spark SQL Optimizer

This article explains the underlying principles of the Spark SQL optimizer and presents three classic optimization paradigms—push‑down optimization, operator elimination/merging, and expression elimination/replacement—illustrating each with concrete rule implementations and code examples.

Big DataSpark SQLoptimizer
0 likes · 12 min read
Principles and Common Optimization Techniques of the Spark SQL Optimizer
DataFunSummit
DataFunSummit
Jan 9, 2025 · Big Data

Spark SQL Window Function Optimizations: Concepts, Techniques, and Q&A

This article explains Spark SQL's window function fundamentals, introduces two key optimizations—Offset Window Frame and Infer Window Group Limit—and provides a detailed Q&A covering implementation details, execution plan impacts, and underlying architecture.

Apache SparkBig DataSQL Performance
0 likes · 13 min read
Spark SQL Window Function Optimizations: Concepts, Techniques, and Q&A
DataFunSummit
DataFunSummit
Dec 9, 2024 · Big Data

Spark SQL Expression Optimizations: LIKE ALL/ANY, TRIM Function Improvements, and Constant Folding

This article examines Spark SQL expression-level optimizations, focusing on redesigning LIKE ALL and LIKE ANY to reduce memory and stack usage, refactoring the TRIM function for better code reuse and performance, and implementing constant folding to cache computed constant expressions, thereby enhancing query efficiency in big-data workloads.

Big DataExpression OptimizationSpark SQL
0 likes · 16 min read
Spark SQL Expression Optimizations: LIKE ALL/ANY, TRIM Function Improvements, and Constant Folding
DataFunSummit
DataFunSummit
Nov 11, 2024 · Big Data

Understanding Spark SQL Parsing Layer and Its Optimizations

This talk, the third in a Spark series, introduces the Spark SQL parsing layer, explains its architecture and integration with ANTLR4, details core implementation classes, and presents a real‑world optimization case that reduces code complexity and improves maintainability.

Antlr4Big DataScala
0 likes · 15 min read
Understanding Spark SQL Parsing Layer and Its Optimizations
DataFunSummit
DataFunSummit
Aug 1, 2024 · Big Data

Deep Dive into Apache Spark SQL: Concepts, Core Components, and API

This article provides a comprehensive overview of Apache Spark SQL, covering its fundamental concepts such as TreeNode, AST, and QueryPlan, the distinction between logical and physical plans, the rule‑execution framework, core components like SparkSqlParser and Analyzer, as well as the Spark Session, Dataset/DataFrame, and various writer APIs, supplemented by a detailed Q&A session.

Apache SparkBig DataSQL Optimization
0 likes · 19 min read
Deep Dive into Apache Spark SQL: Concepts, Core Components, and API
DataFunTalk
DataFunTalk
Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache SparkBig DataData Migration
0 likes · 14 min read
Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform
DataFunSummit
DataFunSummit
Jun 16, 2023 · Big Data

Apache Kyuubi Practices and Service Evolution at iQIYI

This article details iQIYI's implementation of Apache Kyuubi for Spark Thrift Server, covering the evolution from native Spark Thrift to Kyuubi 0.7 and 1.x, multi‑tenant architecture, tag‑based configurations, SQL auditing, lineage collection, service monitoring, small‑file and Z‑order optimizations, and a brief Q&A.

Apache KyuubiData PlatformSpark SQL
0 likes · 15 min read
Apache Kyuubi Practices and Service Evolution at iQIYI
iQIYI Technical Product Team
iQIYI Technical Product Team
Jun 9, 2023 · Big Data

Accelerating iQIYI Big Data Platform: Migrating from Hive to Spark SQL

iQIYI accelerated its big‑data platform by migrating the OLAP layer from Hive to Spark SQL, achieving a 67 % speedup, 50 % CPU reduction and 44 % memory savings, while automating the conversion of tens of thousands of tasks and delivering faster analytics for advertising, BI, membership and user‑growth services.

Data MigrationHivePerformance Optimization
0 likes · 18 min read
Accelerating iQIYI Big Data Platform: Migrating from Hive to Spark SQL
DataFunSummit
DataFunSummit
Jan 22, 2023 · Big Data

Applying Spark SQL at Ping An Insurance: Business Background, Deployment Choices, Migration Process, and Lessons Learned

This article details how Ping An Insurance migrated its offline Hive SQL workloads to Spark SQL, covering business background, deployment mode selection, migration workflow, typical challenges, optimization measures, and the resulting performance and resource utilization improvements.

Big DataCluster MigrationDeployment Modes
0 likes · 16 min read
Applying Spark SQL at Ping An Insurance: Business Background, Deployment Choices, Migration Process, and Lessons Learned
DataFunTalk
DataFunTalk
Jul 16, 2022 · Big Data

Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

Apache HudiBig DataData Lake
0 likes · 15 min read
Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements
vivo Internet Technology
vivo Internet Technology
Apr 20, 2022 · Big Data

Implementing Field Lineage in Spark SQL: A Technical Deep Dive

The article details how to add field‑lineage tracking to Spark SQL by creating a custom SparkSessionExtension that injects a check‑analysis rule and a parser, which capture INSERT statements, analyze the physical plan, and generate a JSON mapping of source‑to‑target fields for data governance.

Data GovernanceData QualityField Lineage
0 likes · 9 min read
Implementing Field Lineage in Spark SQL: A Technical Deep Dive
Big Data Technology & Architecture
Big Data Technology & Architecture
Dec 28, 2021 · Big Data

Comprehensive Guide to Spark SQL: Concepts, DataSet/DataFrame, Functions, Optimization and Common Pitfalls

This article provides an in‑depth overview of Spark SQL, covering its architecture, DataSet/DataFrame creation, DSL and SQL usage, integration with Hive, custom UDF/UDAF/Aggregator implementations, handling of small files, Cartesian product detection, and a catalog of useful built‑in functions and window operations.

Big DataDatasetHive
0 likes · 29 min read
Comprehensive Guide to Spark SQL: Concepts, DataSet/DataFrame, Functions, Optimization and Common Pitfalls
Big Data Technology & Architecture
Big Data Technology & Architecture
Aug 15, 2021 · Big Data

Spark SQL Interview Guide: Concepts, APIs, Optimization and Common Pitfalls

This article provides a comprehensive overview of Spark SQL, covering its architecture, DataSet/DataFrame APIs, code examples for creating and querying datasets, join strategy selection, handling Hive tables, small‑file issues, inefficient NOT‑IN subqueries, Cartesian products, and a catalog of useful built‑in functions.

DatasetHive IntegrationPerformance Optimization
0 likes · 40 min read
Spark SQL Interview Guide: Concepts, APIs, Optimization and Common Pitfalls
Big Data Technology & Architecture
Big Data Technology & Architecture
Jul 4, 2021 · Big Data

Comprehensive Guide to Learning Apache Spark: Background, Core Concepts, Modules, Resources, and Optimization

This article provides a thorough learning roadmap for Apache Spark, covering its background papers, core concepts such as RDD and fault tolerance, module breakdown, recommended books and repositories, source‑code reading tips, hands‑on projects, and interview‑oriented optimization guidance.

Apache SparkLearning PathPerformance Optimization
0 likes · 15 min read
Comprehensive Guide to Learning Apache Spark: Background, Core Concepts, Modules, Resources, and Optimization
Big Data Technology Architecture
Big Data Technology Architecture
Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big DataHadoopSmall Files
0 likes · 12 min read
Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions
Big Data Technology & Architecture
Big Data Technology & Architecture
Jan 5, 2021 · Big Data

Improving Spark Job Parallelism on YARN: Diagnosis, Configuration, and Performance Gains

This article details a real‑world investigation of Spark SQL job latency on a YARN cluster, explains how switching the scheduler to FAIR mode, creating resource pools, and consolidating small Parquet files dramatically reduced scheduler delay and cut execution time from over 100 seconds to under 20 seconds.

ParquetPerformance OptimizationScheduler
0 likes · 13 min read
Improving Spark Job Parallelism on YARN: Diagnosis, Configuration, and Performance Gains
Big Data Technology Architecture
Big Data Technology Architecture
Aug 5, 2020 · Big Data

Understanding Join Execution in Spark SQL

This article explains how Spark SQL processes joins—including inner, outer, semi, and anti joins—by describing the overall query planning flow, the three physical join strategies (sort‑merge, broadcast, and hash), and the specific implementation details for each join type.

DataFramesJOINSQL Optimization
0 likes · 10 min read
Understanding Join Execution in Spark SQL
DataFunTalk
DataFunTalk
Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares comprehensive optimizations for Spark SQL, covering architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements, illustrating practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Big DataMaterialized ColumnsShuffle Optimization
0 likes · 20 min read
ByteDance’s Core Optimization Practices on Spark SQL
Big Data Technology Architecture
Big Data Technology Architecture
Jul 10, 2019 · Big Data

Introduction to Apache Spark and Its Core Components

Apache Spark, an open‑source unified analytics engine from UC Berkeley’s AMP Lab, is the leading platform for large‑scale batch and streaming data processing, featuring components such as Spark SQL, Streaming, GraphX, MLlib, and core modules like DAGScheduler, TaskScheduler and BlockManager.

Apache SparkBlockManagerDAGScheduler
0 likes · 4 min read
Introduction to Apache Spark and Its Core Components
dbaplus Community
dbaplus Community
Sep 26, 2017 · Big Data

How to Avoid Common Spark SQL Pitfalls and Boost Performance

This article shares a comprehensive set of practical tips and solutions for common Spark SQL issues—including out‑of‑memory errors, UDF‑induced GC, thread blocking, system‑property initialization, speculation side‑effects, accumulator traps, concurrent job scheduling, and excessive logging—helping engineers improve stability and efficiency of their Spark‑based financial systems.

AccumulatorMemory ManagementSpark
0 likes · 15 min read
How to Avoid Common Spark SQL Pitfalls and Boost Performance
ITPUB
ITPUB
Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big DataMapReduceRDD
0 likes · 25 min read
Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution
Architecture Digest
Architecture Digest
May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article provides a comprehensive guide on tackling Spark performance bottlenecks by diagnosing data skew, locating the offending stages and operators, and applying a range of practical solutions—including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies—followed by an in‑depth discussion of shuffle manager evolution and key configuration parameters for fine‑tuning.

Big DataData SkewShuffle Optimization
0 likes · 35 min read
Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning