Tagged articles

Spark SQL

37 articles

Nov 18, 2025 · Big Data

Master Spark SQL: From DataFrames to Catalyst Optimization and Real-World Use Cases

This comprehensive guide walks you through Spark SQL fundamentals—including DataFrame and Dataset APIs—delves into the Catalyst optimizer and Tungsten engine, presents practical Java examples, and shares concrete tuning techniques and real-world ETL scenarios for handling large‑scale data.

Catalyst · ETL · Optimization

0 likes · 8 min read

Instant Consumer Technology Team

Oct 14, 2025 · Big Data

How to Boost Spark SQL DAG Efficiency with Regex‑Driven Temporary Views

This article explains how to reduce intermediate tables, simplify dependencies, and improve execution efficiency in Spark SQL pipelines by using session‑level temporary views and regex‑based SQL parsing to automatically merge and rewrite DAG tasks in large‑scale data platforms.

Big Data · DAG Optimization · ETL

0 likes · 13 min read

Big Data Technology Tribe

Aug 5, 2025 · Big Data

How Spark’s Catalyst Optimizer Transforms SQL Queries: Trees, Rules, and Code Generation

This article explains Spark SQL’s Catalyst optimizer, describing its extensible design, tree‑based representation, rule‑driven transformations, batch execution to a fixed point, and how Scala’s pattern matching and quasiquotes enable efficient analysis, logical optimization, physical planning, and code generation.

Big Data · Catalyst Optimizer · Query Optimization

0 likes · 18 min read

Sohu Tech Products

Jun 11, 2025 · Big Data

How We Transformed a Microservice Finance System into a Scalable Big Data Warehouse

This article details the evolution of a fast‑growing finance reporting system from a microservice architecture plagued by data inconsistency, low efficiency, and scalability limits to a robust, high‑performance big‑data warehouse built with layered data models, SparkSQL processing, and unified scheduling, highlighting design decisions, technical trade‑offs, and measurable performance gains.

Data Warehouse · Microservices · Spark SQL

0 likes · 23 min read

Zhuanzhuan Tech

May 21, 2025 · Big Data

How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse

This article details the evolution of a fast‑growing e‑commerce finance platform from a microservice architecture plagued by data inconsistency, low processing efficiency, and scalability limits to a robust, distributed big‑data warehouse using SparkSQL, layered data models, and optimized scheduling, achieving ten‑fold performance gains and near‑zero failure rates.

Big Data · Data Warehouse · ETL

0 likes · 21 min read

DataFunSummit

Mar 12, 2025 · Big Data

Principles and Common Optimization Techniques of the Spark SQL Optimizer

This article explains the underlying principles of the Spark SQL optimizer and presents three classic optimization paradigms—push‑down optimization, operator elimination/merging, and expression elimination/replacement—illustrating each with concrete rule implementations and code examples.

Big Data · Query Optimization · Rule Engine

0 likes · 12 min read

DataFunSummit

Jan 9, 2025 · Big Data

Spark SQL Window Function Optimizations: Concepts, Techniques, and Q&A

This article explains Spark SQL's window function fundamentals, introduces two key optimizations—Offset Window Frame and Infer Window Group Limit—and provides a detailed Q&A covering implementation details, execution plan impacts, and underlying architecture.

Apache Spark · Big Data · Optimization

0 likes · 13 min read

DataFunSummit

Dec 9, 2024 · Big Data

Spark SQL Expression Optimizations: LIKE ALL/ANY, TRIM Function Improvements, and Constant Folding

This article examines Spark SQL expression-level optimizations, focusing on redesigning LIKE ALL and LIKE ANY to reduce memory and stack usage, refactoring the TRIM function for better code reuse and performance, and implementing constant folding to cache computed constant expressions, thereby enhancing query efficiency in big-data workloads.

Big Data · Expression Optimization · Performance Tuning

0 likes · 16 min read

DataFunSummit

Nov 11, 2024 · Big Data

Understanding Spark SQL Parsing Layer and Its Optimizations

This talk, the third in a Spark series, introduces the Spark SQL parsing layer, explains its architecture and integration with ANTLR4, details core implementation classes, and presents a real‑world optimization case that reduces code complexity and improves maintainability.

Antlr4 · Big Data · Optimization

0 likes · 15 min read

DataFunSummit

Aug 1, 2024 · Big Data

Deep Dive into Apache Spark SQL: Concepts, Core Components, and API

This article provides a comprehensive overview of Apache Spark SQL, covering its fundamental concepts such as TreeNode, AST, and QueryPlan, the distinction between logical and physical plans, the rule‑execution framework, core components like SparkSqlParser and Analyzer, as well as SparkSession, Dataset/DataFrame, and the various writer APIs, supplemented by a detailed Q&A session.

Apache Spark · Big Data · Distributed Computing

0 likes · 19 min read

DataFunTalk

Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache Spark · Big Data · Data Migration

0 likes · 14 min read

Big Data Technology & Architecture

Jul 17, 2023 · Big Data

Incremental Query of Hudi Tables Using Hive, Spark SQL, and Flink SQL

This guide explains how to perform incremental queries on Hudi tables by configuring Hive synchronization, using Spark SQL both programmatically and via pure SQL, and leveraging Flink SQL in batch and streaming modes, with detailed parameter settings and code examples.

Big Data · Flink SQL · Hive

0 likes · 20 min read

DataFunSummit

Jun 16, 2023 · Big Data

Apache Kyuubi Practices and Service Evolution at iQIYI

This article details how iQIYI adopted Apache Kyuubi as its Spark Thrift service, covering the evolution from the native Spark Thrift Server to Kyuubi 0.7 and 1.x, multi‑tenant architecture, tag‑based configurations, SQL auditing, lineage collection, service monitoring, small‑file and Z‑order optimizations, and a brief Q&A.

Apache Kyuubi · Data Platform · Spark SQL

0 likes · 15 min read

iQIYI Technical Product Team

Jun 9, 2023 · Big Data

Accelerating iQIYI Big Data Platform: Migrating from Hive to Spark SQL

iQIYI accelerated its big‑data platform by migrating the OLAP layer from Hive to Spark SQL, achieving a 67% speedup, a 50% CPU reduction, and 44% memory savings, while automating the conversion of tens of thousands of tasks and delivering faster analytics for advertising, BI, membership, and user‑growth services.

Data Migration · Hive · Performance Optimization

0 likes · 18 min read

DataFunSummit

Jan 22, 2023 · Big Data

Applying Spark SQL at Ping An Insurance: Business Background, Deployment Choices, Migration Process, and Lessons Learned

This article details how Ping An Insurance migrated its offline Hive SQL workloads to Spark SQL, covering business background, deployment mode selection, migration workflow, typical challenges, optimization measures, and the resulting performance and resource utilization improvements.

Big Data · Cluster Migration · Deployment Modes

0 likes · 16 min read

DataFunTalk

Jul 16, 2022 · Big Data

Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

Apache Hudi · Big Data · Data Lake

0 likes · 15 min read

Snowball Engineer Team

Apr 21, 2022 · Big Data

Migrating from Hive3 on Tez to Spark SQL: Practices, Challenges, and Performance Evaluation

This article details the Snowball data team's migration from Hive3 on Tez to Spark SQL, covering the motivations, comparative performance tests, encountered compatibility issues, configuration work‑arounds, and future plans for consolidating ETL workloads on Spark.

Big Data · Data Warehouse · ETL

0 likes · 13 min read

vivo Internet Technology

Apr 20, 2022 · Big Data

Implementing Field Lineage in Spark SQL: A Technical Deep Dive

The article details how to add field‑lineage tracking to Spark SQL by creating a custom SparkSessionExtension that injects a check‑analysis rule and a parser, which capture INSERT statements, analyze the physical plan, and generate a JSON mapping of source‑to‑target fields for data governance.

Data Governance · Data Quality · Field Lineage

0 likes · 9 min read

Big Data Technology & Architecture

Feb 28, 2022 · Big Data

Integrating Apache Hudi with Hive, Presto, and Spark SQL: Installation, Operations, and Query Examples

This article provides a step‑by‑step guide on integrating Apache Hudi with Hive and Presto, demonstrates core Hudi operations such as insert, upsert, delete, query, and Hive synchronization using Scala code, and shows how to manage Hudi tables through Spark SQL DDL/DML commands.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

Big Data Technology & Architecture

Dec 28, 2021 · Big Data

Comprehensive Guide to Spark SQL: Concepts, DataSet/DataFrame, Functions, Optimization and Common Pitfalls

This article provides an in‑depth overview of Spark SQL, covering its architecture, DataSet/DataFrame creation, DSL and SQL usage, integration with Hive, custom UDF/UDAF/Aggregator implementations, handling of small files, Cartesian product detection, and a catalog of useful built‑in functions and window operations.

Big Data · Hive · Spark SQL

0 likes · 29 min read

Big Data Technology & Architecture

Aug 15, 2021 · Big Data

Spark SQL Interview Guide: Concepts, APIs, Optimization and Common Pitfalls

This article provides a comprehensive overview of Spark SQL, covering its architecture, DataSet/DataFrame APIs, code examples for creating and querying datasets, join strategy selection, handling Hive tables, small‑file issues, inefficient NOT‑IN subqueries, Cartesian products, and a catalog of useful built‑in functions.

Hive Integration · Performance Optimization · Spark SQL

0 likes · 40 min read

Big Data Technology & Architecture

Jul 4, 2021 · Big Data

Comprehensive Guide to Learning Apache Spark: Background, Core Concepts, Modules, Resources, and Optimization

This article provides a thorough learning roadmap for Apache Spark, covering its background papers, core concepts such as RDD and fault tolerance, module breakdown, recommended books and repositories, source‑code reading tips, hands‑on projects, and interview‑oriented optimization guidance.

Apache Spark · Performance Optimization · RDD

0 likes · 15 min read

Big Data Technology Architecture

May 6, 2021 · Big Data

Using Spark SQL to Operate on Apache Hudi Tables – Step‑by‑Step Guide

This tutorial demonstrates how to use Spark SQL to create, insert, update, delete, merge, and drop Apache Hudi tables, covering environment setup, Spark‑SQL launch, configuration, and a series of SQL commands with example outputs.

Apache Hudi · SQL · Spark SQL

0 likes · 7 min read

Big Data Technology Architecture

Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Small Files

0 likes · 12 min read

Big Data Technology & Architecture

Jan 5, 2021 · Big Data

Improving Spark Job Parallelism on YARN: Diagnosis, Configuration, and Performance Gains

This article details a real‑world investigation of Spark SQL job latency on a YARN cluster and explains how switching the scheduler to FAIR mode, creating resource pools, and consolidating small Parquet files reduced scheduler delay and cut execution time from over 100 seconds to under 20 seconds.

Parquet · Performance Optimization · Scheduler

0 likes · 13 min read

Bitu Technology

Dec 16, 2020 · Big Data

Customizing Spark SQL with Macro‑Based Extensions for Column Exclusion and JSON Path Support

This article explains how Tubi customizes Spark SQL using lightweight macro‑based extensions to simplify column exclusion, JSON path queries, and other complex operations without modifying Spark's source code, detailing the two‑stage processing, example macros, and benefits for big‑data workloads.

Big Data · Custom SQL · Macros

0 likes · 9 min read

Big Data Technology & Architecture

Aug 31, 2020 · Big Data

Integration Methods of Hive and Spark SQL (Potential Interview Topics)

This article provides a comprehensive guide on integrating Hive with Spark SQL, covering Hive‑on‑Spark and Spark‑on‑Hive setups, spark‑shell and spark‑sql usage, HiveServer2 with Beeline, Scala scripts for reading and writing Hive tables, and partition handling for aggregated results.

Big Data · Data Integration · Hive

0 likes · 7 min read

Architects Research Society

Aug 6, 2020 · Big Data

Differences Between Spark SQL and Presto: A Comparative Overview

This article compares Spark SQL and Presto, explaining their architectures, key differences, performance characteristics, supported connectors, installation requirements, and typical use cases, while providing head‑to‑head tables and examples of federated queries.

Comparison · SQL Engines · Spark SQL

0 likes · 10 min read

Big Data Technology Architecture

Aug 5, 2020 · Big Data

Understanding Join Execution in Spark SQL

This article explains how Spark SQL processes joins, including inner, outer, semi, and anti joins, by describing the overall query planning flow, the three physical join strategies (sort‑merge join, broadcast hash join, and shuffle hash join), and the specific implementation details for each join type.

DataFrames · Distributed Computing · JOIN

0 likes · 10 min read

Big Data Technology & Architecture

Apr 20, 2020 · Big Data

How Spark SQL Chooses Join Strategies: Broadcast, Shuffle Hash, and Sort Merge

The article explains Spark SQL's Catalyst optimizer rules for selecting among Broadcast hash join, Shuffle hash join, and Sort‑merge join, covering build‑side determination, size thresholds, broadcast hints, local hash‑map construction, and fallback strategies for non‑equi joins.

Big Data · Broadcast Join · Shuffle Hash Join

0 likes · 10 min read

DataFunTalk

Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares comprehensive optimizations for Spark SQL, covering architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements, illustrating practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Big Data · Materialized Columns · Shuffle Optimization

0 likes · 20 min read

Big Data Technology Architecture

Jul 10, 2019 · Big Data

Introduction to Apache Spark and Its Core Components

Apache Spark, an open‑source unified analytics engine that originated at UC Berkeley’s AMPLab, is a widely used platform for large‑scale batch and streaming data processing. This introduction covers its components (Spark SQL, Spark Streaming, GraphX, and MLlib) and core modules such as DAGScheduler, TaskScheduler, and BlockManager.

Apache Spark · BlockManager · DAGScheduler

0 likes · 4 min read

Big Data Technology Architecture

Jun 9, 2019 · Big Data

An Introduction to Apache Parquet: Architecture, Data Model, File Format, and Basic Operations

This article provides a comprehensive overview of Apache Parquet, covering its purpose, architectural components, nested data model, file structure, practical Hive commands for creating and inspecting Parquet tables, and a brief introduction to the TPC‑DS benchmark for performance testing.

Columnar Storage · Hive · Parquet

0 likes · 8 min read

Beike Product & Technology

Jan 10, 2019 · Big Data

Accelerating QueryEngine with Alluxio in Spark SQL: Architecture, Features, and Performance Evaluation

This article presents the integration of Alluxio as an in‑memory caching layer for QueryEngine's Spark SQL engine, detailing Alluxio's architecture, key features, deployment practice, performance testing methodology, results, and lessons learned for large‑scale ad‑hoc query acceleration.

Alluxio · Spark SQL · Performance

0 likes · 13 min read

dbaplus Community

Sep 26, 2017 · Big Data

How to Avoid Common Spark SQL Pitfalls and Boost Performance

This article shares a comprehensive set of practical tips and solutions for common Spark SQL issues—including out‑of‑memory errors, UDF‑induced GC, thread blocking, system‑property initialization, speculation side‑effects, accumulator traps, concurrent job scheduling, and excessive logging—helping engineers improve stability and efficiency of their Spark‑based financial systems.

Accumulator · Memory Management · Performance Tuning

0 likes · 15 min read

ITPUB

Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · Distributed Computing · MapReduce

0 likes · 25 min read

Architecture Digest

May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article provides a comprehensive guide on tackling Spark performance bottlenecks by diagnosing data skew, locating the offending stages and operators, and applying a range of practical solutions—including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies—followed by an in‑depth discussion of shuffle manager evolution and key configuration parameters for fine‑tuning.

Big Data · Data Skew · Performance Tuning

0 likes · 35 min read
