Tag

SparkSQL


DeWu Technology
Mar 5, 2025 · Big Data

Using ANTLR4 for SQL Parsing, Completion, and Validation in SparkSQL-based Data IDE

The article explains how a large‑scale data‑development IDE uses ANTLR4 to build a custom SparkSQL parser that provides real‑time syntax checking, auto‑completion, and validation. It covers AST generation, listener‑based context collection, performance optimization, and possible future integration with large language models.

ANTLR · Big Data · Parsing
0 likes · 24 min read
DataFunTalk
Jun 2, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of the Iceberg data lake, covering its OLAP architecture, reasons for a lake, Iceberg table format advantages over Hive, platform construction, extensive performance optimizations, and real‑world business use cases such as ad‑flow unification, log analysis, audit, and CDC pipelines.

Big Data · Data Lake · Iceberg
0 likes · 18 min read
NetEase Media Technology Team
Sep 15, 2022 · Big Data

SparkSQL on Kubernetes: NetEase Media's Cloud-Native Big Data Infrastructure Practice

NetEase Media migrated SparkSQL to Kubernetes in 2021, using storage‑compute decoupling, hybrid deployment, custom scripts, Kyuubi failover, and extensive monitoring and resource governance, which cut cluster size by over 30% while keeping CPU utilization above 80% and GC throughput above 95%.

Big Data · K8S Migration · Kubernetes
0 likes · 13 min read
政采云技术
Jul 12, 2022 · Big Data

Understanding Spark SQL Physical Execution Plans and Optimization Techniques

This article explains Spark SQL's physical execution plan, detailing each operator, how to interpret the plan, and practical optimization tips for data warehouse developers to improve SQL performance and resource utilization.

BigData · DataWarehouse · ExecutionPlan
0 likes · 10 min read
ByteDance Data Platform
May 11, 2022 · Big Data

How to Build a High‑Performance SparkSQL Server with Hive JDBC Compatibility

This article explains how to design and implement a SparkSQL server that lowers usage barriers and boosts efficiency by supporting standard JDBC interfaces, integrating Hive Server2 protocols, handling multi‑tenant authentication, managing Spark job lifecycles, and providing high availability through ZooKeeper coordination.

Big Data · Hive · JDBC
0 likes · 15 min read
ByteDance Data Platform
Feb 25, 2022 · Big Data

Optimizing SparkSQL: ByteDance EMR’s Data Lake Integration and Multi‑Tenant Server

ByteDance’s EMR team details how they integrated data‑lake engines such as Hudi and Iceberg into SparkSQL, streamlined jar management, built a custom Spark SQL Server with Hive compatibility, multi‑tenant support, engine pre‑warming, and transaction capabilities, dramatically improving performance and resource efficiency for enterprise workloads.

Data Lake · EMR · Hudi
0 likes · 11 min read
ByteDance Data Platform
Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · Data Warehouse · Hive
0 likes · 19 min read
DataFunSummit
Dec 6, 2021 · Big Data

Design and Performance Optimization of a Real‑Time Billion‑Scale Data Processing Pipeline

This article reviews the background, architecture, and a series of performance‑optimizing techniques—including consumption, batch, storage, and execution‑engine tweaks—applied to a real‑time pipeline that processes hundreds of billions of records daily, and presents the resulting resource savings and latency improvements.

Big Data · Data Pipeline · Kafka
0 likes · 9 min read
Big Data Technology Architecture
Mar 23, 2021 · Big Data

Overview of SparkSQL Join Execution Process and Implementations

This article explains SparkSQL's overall workflow, introduces the basic elements of joins, and details the physical execution processes for various join types—including sort‑merge, broadcast, and hash joins—along with their implementation conditions and optimization considerations.

Big Data · Data Processing · JOIN
0 likes · 12 min read
Didi Tech
Jan 25, 2021 · Big Data

Migrating Hive SQL to Spark SQL: Design, Implementation, and Performance Evaluation at DiDi

DiDi migrated over 10,000 Hive SQL tasks to Spark SQL using a lightweight dual‑run pipeline that extracts, rewrites, compares, and switches tasks, fixing syntax and UDF differences while adding features such as small‑file merging and enhanced partition pruning, resulting in Spark handling 85 % of workloads with 40 % faster execution, 21 % lower CPU and 49 % lower memory usage.

BigData · DataMigration · Hive
0 likes · 18 min read
Big Data Technology Architecture
Jun 4, 2020 · Big Data

58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration

This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.

Big Data · Cluster Scaling · Hadoop
0 likes · 23 min read
Youzan Coder
Dec 6, 2019 · Big Data

Improving SparkSQL Stability and Performance at Youzan: Thrift Server Enhancements, Metric Collection, and Lessons Learned

Youzan’s big‑data team boosted SparkSQL stability and performance by reinforcing the Thrift Server, implementing AB gray‑release testing, collecting real‑time metrics, adding an engine‑selection service, and completing a second migration that raised SparkSQL’s workload share to 91 %, while documenting key pitfalls and tuning lessons.

AB testing · Big Data · Metric Collection
0 likes · 15 min read
58 Tech
Mar 15, 2019 · Big Data

Optimizing Spark Join Operations in Spark Core and Spark SQL

This article explains how to improve Spark join performance by reducing shuffle, using appropriate partitioners, applying broadcast hash joins for small tables, and selecting the optimal join strategy (broadcast, shuffle hash, or sort‑merge) in both Spark Core and Spark SQL.

Big Data · Broadcast · JOIN
0 likes · 6 min read
vivo Internet Technology
Feb 12, 2018 · Big Data

Predicate Pushdown Rules in SparkSQL Outer Join Queries – Detailed Analysis

The article examines SparkSQL’s predicate‑pushdown behavior for left outer joins, detailing four rules that show when pushing join‑condition filters to the left or right tables yields correct, faster results and when it produces incorrect outcomes, highlighting both performance gains and subtle errors.

Big Data · Outer Join · Predicate Pushdown
0 likes · 7 min read
Architecture Digest
Apr 18, 2016 · Big Data

Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL

This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.

Big Data · RDD · Spark
0 likes · 14 min read
Architecture Digest
Mar 25, 2016 · Big Data

Design, Evolution, and Performance Evaluation of the PINGO Distributed Interactive Query Platform

This article details the motivation, architectural iterations, caching strategies, SparkSQL enhancements, and performance benchmarks of Baidu's PINGO platform, illustrating how it transformed from a Hive‑based QueryEngine into a high‑performance, Spark‑driven interactive query system for large‑scale data analysis.

Big Data · Caching · Distributed Query
0 likes · 14 min read
Qunar Tech Salon
Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine‑learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib
0 likes · 5 min read