Tag

spark sql


Sohu Tech Products
Jun 11, 2025 · Big Data

How We Transformed a Microservice Finance System into a Scalable Big Data Warehouse

This article details the evolution of a fast‑growing finance reporting system from a microservice architecture plagued by data inconsistency, low efficiency, and limited scalability into a high‑performance big‑data warehouse built on layered data models, Spark SQL processing, and unified scheduling. It highlights the design decisions, technical trade‑offs, and measured performance gains.

Big Data · Data Warehouse · Microservices
0 likes · 23 min read
DataFunSummit
Mar 12, 2025 · Big Data

Principles and Common Optimization Techniques of the Spark SQL Optimizer

This article explains the underlying principles of the Spark SQL optimizer and presents three classic optimization paradigms—push‑down optimization, operator elimination/merging, and expression elimination/replacement—illustrating each with concrete rule implementations and code examples.

Big Data · Query Optimization · optimizer
0 likes · 12 min read
DataFunSummit
Jan 9, 2025 · Big Data

Spark SQL Window Function Optimizations: Concepts, Techniques, and Q&A

This article explains Spark SQL's window function fundamentals, introduces two key optimizations—Offset Window Frame and Infer Window Group Limit—and provides a detailed Q&A covering implementation details, execution plan impacts, and underlying architecture.

Apache Spark · Big Data · SQL performance
0 likes · 13 min read
DataFunSummit
Dec 9, 2024 · Big Data

Spark SQL Expression Optimizations: LIKE ALL/ANY, TRIM Function Improvements, and Constant Folding

This article examines Spark SQL expression-level optimizations, focusing on redesigning LIKE ALL and LIKE ANY to reduce memory and stack usage, refactoring the TRIM function for better code reuse and performance, and implementing constant folding to cache computed constant expressions, thereby enhancing query efficiency in big-data workloads.

Big Data · Constant Folding · Expression Optimization
0 likes · 16 min read
DataFunSummit
Nov 11, 2024 · Big Data

Understanding Spark SQL Parsing Layer and Its Optimizations

This talk, the third in a Spark series, introduces the Spark SQL parsing layer, explains its architecture and integration with ANTLR4, details core implementation classes, and presents a real‑world optimization case that reduces code complexity and improves maintainability.

ANTLR4 · Big Data · Parsing
0 likes · 15 min read
DataFunSummit
Aug 1, 2024 · Big Data

Deep Dive into Apache Spark SQL: Concepts, Core Components, and API

This article gives an overview of Apache Spark SQL. It covers fundamental concepts such as TreeNode, AST, and QueryPlan; the distinction between logical and physical plans; the rule‑execution framework; core components such as SparkSqlParser and Analyzer; and the SparkSession, Dataset/DataFrame, and various writer APIs. A detailed Q&A session follows.

Apache Spark · Big Data · Data Processing
0 likes · 19 min read
DataFunTalk
Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions in building a new Spark 3.1‑based data platform, covering the Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and a future roadmap for vectorized execution.

Apache Spark · Big Data · Hive
0 likes · 14 min read
DataFunSummit
Jun 16, 2023 · Big Data

Apache Kyuubi Practices and Service Evolution at iQIYI

This article details iQIYI's implementation of Apache Kyuubi for Spark Thrift Server, covering the evolution from native Spark Thrift to Kyuubi 0.7 and 1.x, multi‑tenant architecture, tag‑based configurations, SQL auditing, lineage collection, service monitoring, small‑file and Z‑order optimizations, and a brief Q&A.

Apache Kyuubi · Big Data · Data Platform
0 likes · 15 min read
iQIYI Technical Product Team
Jun 9, 2023 · Big Data

Accelerating iQIYI Big Data Platform: Migrating from Hive to Spark SQL

iQIYI accelerated its big‑data platform by migrating the OLAP layer from Hive to Spark SQL, achieving a 67% speedup, a 50% CPU reduction, and 44% memory savings, while automating the conversion of tens of thousands of tasks and delivering faster analytics for advertising, BI, membership, and user‑growth services.

Automation · Big Data · Hive
0 likes · 18 min read
DataFunSummit
Jan 22, 2023 · Big Data

Applying Spark SQL at Ping An Insurance: Business Background, Deployment Choices, Migration Process, and Lessons Learned

This article details how Ping An Insurance migrated its offline Hive SQL workloads to Spark SQL, covering business background, deployment mode selection, migration workflow, typical challenges, optimization measures, and the resulting performance and resource utilization improvements.

Big Data · Cluster Migration · Deployment Modes
0 likes · 16 min read
DataFunTalk
Jul 16, 2022 · Big Data

Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

Apache Hudi · Big Data · Data Lake
0 likes · 15 min read
vivo Internet Technology
Apr 20, 2022 · Big Data

Implementing Field Lineage in Spark SQL: A Technical Deep Dive

The article details how to add field‑lineage tracking to Spark SQL by creating a custom SparkSessionExtension that injects a check‑analysis rule and a parser, which capture INSERT statements, analyze the physical plan, and generate a JSON mapping of source‑to‑target fields for data governance.

Data Governance · Data Transformation · Field Lineage
0 likes · 9 min read
Big Data Technology Architecture
May 6, 2021 · Big Data

Using Spark SQL to Operate on Apache Hudi Tables – Step‑by‑Step Guide

This tutorial demonstrates how to use Spark SQL to create, insert, update, delete, merge, and drop Apache Hudi tables, covering environment setup, Spark‑SQL launch, configuration, and a series of SQL commands with example outputs.

Apache Hudi · Big Data · Data Lake
0 likes · 7 min read
Big Data Technology Architecture
Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Performance
0 likes · 12 min read
Bitu Technology
Dec 16, 2020 · Big Data

Customizing Spark SQL with Macro‑Based Extensions for Column Exclusion and JSON Path Support

This article explains how Tubi customizes Spark SQL using lightweight macro‑based extensions to simplify column exclusion, JSON path queries, and other complex operations without modifying Spark's source code, detailing the two‑stage processing, example macros, and benefits for big‑data workloads.

Big Data · Custom SQL · Macros
0 likes · 9 min read
Architects Research Society
Aug 6, 2020 · Big Data

Differences Between Spark SQL and Presto: A Comparative Overview

This article compares Spark SQL and Presto, explaining their architectures, key differences, performance characteristics, supported connectors, installation requirements, and typical use cases, while providing head‑to‑head tables and examples of federated queries.

Big Data · Comparison · SQL Engines
0 likes · 10 min read
Big Data Technology Architecture
Aug 5, 2020 · Big Data

Understanding Join Execution in Spark SQL

This article explains how Spark SQL processes joins, including inner, outer, semi, and anti joins, by describing the overall query planning flow, the three physical join strategies (sort‑merge join, broadcast hash join, and shuffled hash join), and the implementation details for each join type.

Big Data · DataFrames · JOIN
0 likes · 10 min read
DataFunTalk
Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares comprehensive optimizations for Spark SQL, covering architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements, illustrating practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Big Data · Data Warehouse · Materialized Columns
0 likes · 20 min read
Big Data Technology Architecture
Jul 10, 2019 · Big Data

Introduction to Apache Spark and Its Core Components

Apache Spark is an open‑source unified analytics engine that originated at UC Berkeley’s AMPLab and is widely used for large‑scale batch and streaming data processing. Its components include Spark SQL, Spark Streaming, GraphX, and MLlib, along with core modules such as DAGScheduler, TaskScheduler, and BlockManager.

Apache Spark · BlockManager · DAGScheduler
0 likes · 4 min read