Tagged articles

SparkSQL

37 articles · Page 1 of 1

Mar 5, 2025 · Big Data

Using ANTLR4 for SQL Parsing, Completion, and Validation in SparkSQL-based Data IDE

The article explains how a large‑scale data‑development IDE uses ANTLR4 to build a custom SparkSQL parser that generates ASTs and uses listeners to gather context, providing real‑time syntax checking, auto‑completion, and validation; it also covers performance optimization and possible future integration with large language models.

ANTLR · Big Data · Parsing

0 likes · 24 min read

Big Data Technology & Architecture

Jul 21, 2023 · Big Data

Big Data Interview Experience Summary: Topics, Weightings, and Key Takeaways

The article shares a detailed interview experience for big‑data roles, outlining the proportion of problem‑solving, project, fundamentals, and open‑question segments, and highlights the technical depth expected in areas such as Flink, Hudi, SparkSQL, and OLAP.

Career Advice · Data Engineering · Flink

0 likes · 5 min read

DataFunTalk

Jun 2, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of the Iceberg data lake, covering its OLAP architecture, the reasons for adopting a data lake, the advantages of the Iceberg table format over Hive, platform construction, extensive performance optimizations, and real‑world business use cases such as ad‑flow unification, log analysis, audit, and CDC pipelines.

Big Data · Data Lake · Flink

0 likes · 18 min read

Big Data Technology & Architecture

Mar 20, 2023 · Big Data

Using SparkSQL to Connect and Operate with Apache Hudi: Configuration, Table Creation, Data Manipulation, and Deletion

This guide demonstrates how to configure Hive metastore, connect SparkSQL to Apache Hudi, create COW and MOR tables, perform insert, update, merge, delete, and insert‑overwrite operations, and illustrates each step with executable code snippets and sample results.

Apache Hudi · Big Data · Data Lake

0 likes · 14 min read

NetEase Media Technology Team

Sep 15, 2022 · Big Data

SparkSQL on Kubernetes: NetEase Media's Cloud-Native Big Data Infrastructure Practice

NetEase Media migrated SparkSQL to Kubernetes in 2021, using storage‑compute decoupling, hybrid deployment, custom scripts, Kyuubi failover, and extensive monitoring and resource governance, which cut cluster size by over 30% while keeping CPU utilization above 80% and GC throughput above 95%.

Big Data · Cloud Native · K8s migration

0 likes · 13 min read

政采云技术

Jul 12, 2022 · Big Data

Understanding Spark SQL Physical Execution Plans and Optimization Techniques

This article explains Spark SQL's physical execution plan, detailing each operator, how to interpret the plan, and practical optimization tips for data warehouse developers to improve SQL performance and resource utilization.

DataWarehouse · ExecutionPlan · PerformanceOptimization

0 likes · 10 min read

Architect

May 17, 2022 · Big Data

Design and Architecture of an Integrated BI Platform Using Apache Kylin for Large‑Scale OLAP

The article explains the challenges of big‑data analytics, introduces pre‑computation OLAP concepts, and details how Apache Kylin together with Spark, Flink, Presto and other components can be integrated into a BI platform to achieve near‑real‑time query performance on massive datasets.

Apache Kylin · BI · Data Warehouse

0 likes · 11 min read

ByteDance Data Platform

May 11, 2022 · Big Data

How to Build a High‑Performance SparkSQL Server with Hive JDBC Compatibility

This article explains how to design and implement a SparkSQL server that lowers usage barriers and boosts efficiency by supporting standard JDBC interfaces, integrating Hive Server2 protocols, handling multi‑tenant authentication, managing Spark job lifecycles, and providing high‑availability through Zookeeper coordination.

Hive · JDBC · Server Architecture

0 likes · 15 min read

ByteDance Data Platform

Feb 25, 2022 · Big Data

Optimizing SparkSQL: ByteDance EMR’s Data Lake Integration and Multi‑Tenant Server

ByteDance’s EMR team details how they integrated data‑lake engines such as Hudi and Iceberg into SparkSQL, streamlined jar management, built a custom Spark SQL Server with Hive compatibility, multi‑tenant support, engine pre‑warming, and transaction capabilities, dramatically improving performance and resource efficiency for enterprise workloads.

EMR · Hudi · Iceberg

0 likes · 11 min read

ByteDance Data Platform

Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · Data Warehouse · ETL

0 likes · 19 min read

Big Data Technology & Architecture

Dec 15, 2021 · Big Data

Understanding Spark DataFrames: Creation Methods, Optimizations, and Common Operations

This article explains the origins of Spark DataFrames, compares them with RDDs, describes how Spark SQL optimizes DataFrame execution, and provides detailed examples of creating DataFrames from RDDs, files, and JDBC sources along with common DataFrame operations and code snippets.

Big Data · SQL · Scala

0 likes · 10 min read

DataFunSummit

Dec 6, 2021 · Big Data

Design and Performance Optimization of a Real‑Time Billion‑Scale Data Processing Pipeline

This article reviews the background, architecture, and a series of performance‑optimizing techniques—including consumption, batch, storage, and execution‑engine tweaks—applied to a real‑time pipeline that processes hundreds of billions of records daily, and presents the resulting resource savings and latency improvements.

Performance Optimization · Real-time Processing · SparkSQL

0 likes · 9 min read

Big Data Technology & Architecture

Jul 22, 2021 · Big Data

Comprehensive Overview of SparkSQL: History, Architecture, Execution Process, and Optimization Techniques

This article provides a detailed exploration of SparkSQL, covering its evolution from Shark, core components, execution workflow, Catalyst optimizer, various optimization strategies, and practical configuration tips for achieving high performance in big‑data processing.

Adaptive Query Execution · Catalyst Optimizer · DataFrames

0 likes · 19 min read

Big Data Technology & Architecture

Jul 8, 2021 · Big Data

Using Flink CDC to Write Data into Apache Hudi and Query with Hive and Spark SQL

This guide walks through preparing the environment, creating a MySQL source table, configuring Flink CDC to ingest data into an Apache Hudi table, and then querying the Hudi data using both Hive and Spark‑SQL, including handling of partitions, realtime input formats, and required configuration settings.

CDC · DataPipeline · Flink

0 likes · 10 min read

Big Data Technology Architecture

Mar 23, 2021 · Big Data

Overview of SparkSQL Join Execution Process and Implementations

This article explains SparkSQL's overall workflow, introduces the basic elements of joins, and details the physical execution processes for various join types—including sort‑merge, broadcast, and hash joins—along with their implementation conditions and optimization considerations.

Distributed Computing · JOIN · SQL

0 likes · 12 min read

Didi Tech

Jan 25, 2021 · Big Data

Migrating Hive SQL to Spark SQL: Design, Implementation, and Performance Evaluation at DiDi

DiDi migrated over 10,000 Hive SQL tasks to Spark SQL using a lightweight dual‑run pipeline that extracts, rewrites, compares, and switches tasks, fixing syntax and UDF differences while adding features such as small‑file merging and enhanced partition pruning, resulting in Spark handling 85 % of workloads with 40 % faster execution, 21 % lower CPU and 49 % lower memory usage.

DataMigration · Hive · SQLOptimization

0 likes · 18 min read

Big Data Technology & Architecture

Dec 28, 2020 · Big Data

Optimizing OLAP Data Source Integration with SparkSQL: Cluster and Node Tuning, Profiling, and GC

This article details the end‑to‑end process of connecting an OLAP data source to SparkSQL and presents a comprehensive performance‑tuning guide covering cluster‑level resource allocation, single‑node On‑CPU/Off‑CPU analysis, flame‑graph profiling, Java Flight Recorder usage, and garbage‑collection optimization.

Cluster Optimization · GC · OLAP

0 likes · 16 min read

Big Data Technology & Architecture

Aug 25, 2020 · Big Data

Understanding Spark SQL Query Execution: From Parsing to Physical Plan

This article explains how Spark SQL processes a SELECT query—detailing parsing, binding, optimization, planning, and execution steps—including the roles of SQLContext, HiveContext, Catalyst optimizer, logical and physical plans, and provides code excerpts from the Spark source.

Catalyst · HiveContext · QueryExecution

0 likes · 13 min read

Big Data Technology Architecture

Jun 4, 2020 · Big Data

58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration

This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.

Big Data · Data Migration · Hadoop

0 likes · 23 min read

Big Data Technology & Architecture

May 29, 2020 · Big Data

SparkSQL Logical Plan, Analyzer, and Optimizer: An In‑Depth Overview

This article provides a comprehensive overview of SparkSQL's logical plan architecture, detailing the stages of logical plan creation, analysis, rule‑based optimization, and the underlying catalog and rule systems that transform SQL queries into efficient execution plans.

LogicalPlan · Scala · SparkSQL

0 likes · 12 min read

Big Data Technology & Architecture

Apr 25, 2020 · Big Data

Integrating SparkSQL with Hive: Configuration, MetaStore Setup, and Example Scala Code

This article explains the differences between Spark on Hive and Hive on Spark, then provides step‑by‑step instructions for configuring Hive MetaStore, setting up SparkSQL to use Hive, and demonstrates a complete Scala program that creates a Hive table, loads data, and queries it.

Big Data · Data Integration · Hive

0 likes · 7 min read

DataFunTalk

Apr 9, 2020 · Big Data

Scaling and Optimizing 58.com’s Hadoop‑Based Offline Computing Platform: Architecture, Challenges, and Solutions

This article details how 58.com built a massive Hadoop‑based offline computing platform with over 4,000 servers and hundreds of petabytes of storage, addressing scaling, stability, GC, YARN scheduling, SparkSQL migration, storage operations, and a large‑scale cross‑datacenter migration.

Big Data · Data Migration · Hadoop

0 likes · 24 min read

Youzan Coder

Dec 6, 2019 · Big Data

Improving SparkSQL Stability and Performance at Youzan: Thrift Server Enhancements, Metric Collection, and Lessons Learned

Youzan’s big‑data team boosted SparkSQL stability and performance by reinforcing the Thrift Server, implementing AB gray‑release testing, collecting real‑time metrics, adding an engine‑selection service, and completing a second migration that raised SparkSQL’s workload share to 91 %, while documenting key pitfalls and tuning lessons.

AB testing · Performance Optimization · SparkSQL

0 likes · 15 min read

Big Data Technology & Architecture

Nov 16, 2019 · Big Data

Understanding SparkSQL Join Algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join

This article explains SparkSQL's three join strategies—Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join—detailing their mechanisms, when to use each based on table size, and their relative performance costs in distributed big‑data environments.

Big Data · Broadcast Join · Hash Join

0 likes · 5 min read

Big Data Technology & Architecture

Jul 6, 2019 · Big Data

Understanding Broadcast, Shuffle, and Sort‑Merge Joins in Spark SQL

This article explains the principles, use cases, and performance considerations of Spark SQL's three join implementations—Broadcast Hash Join, Shuffle Hash Join, and Sort‑Merge Join—illustrating how table size and distribution affect the choice of algorithm for efficient large‑scale data processing.

Big Data · Broadcast Join · Join Algorithms

0 likes · 11 min read

Big Data Technology & Architecture

Jun 17, 2019 · Big Data

Understanding Spark SQL: Concepts, Queries, Data Sources, and Practical Examples

This article introduces Spark SQL fundamentals, including its architecture, DataFrame and Dataset abstractions, query methods, interoperability with RDD, user-defined functions, integration with Hive, data source handling, and provides step‑by‑step Scala code examples for loading data, performing aggregations, and solving common analytical tasks.

DataFrames · Hive · SQL

0 likes · 15 min read

Big Data Technology & Architecture

Jun 10, 2019 · Big Data

Understanding Spark SQL: Origin, Features, and Columnar Storage

This article explains the evolution of Spark SQL from Shark, describes its key features such as SchemaRDD and in‑memory columnar storage, compares row‑based and column‑based storage, and provides practical Scala code examples for creating DataFrames and loading data from various sources.

Big Data · JDBC · Parquet

0 likes · 16 min read

58 Tech

Mar 15, 2019 · Big Data

Optimizing Spark Join Operations in Spark Core and Spark SQL

This article explains how to improve Spark join performance by reducing shuffle, using appropriate partitioners, applying broadcast hash joins for small tables, and selecting the optimal join strategy (broadcast, shuffle hash, or sort‑merge) in both Spark Core and Spark SQL.

JOIN · Optimization · Shuffle

0 likes · 6 min read

Youzan Coder

Jan 9, 2019 · Big Data

How Youzan Scaled 5,000 Daily SparkSQL Jobs: Migration Lessons from Hive

This article details Youzan's transition from Hive to SparkSQL, covering platform architecture, usability and performance enhancements, migration strategies, automated engine selection, and future plans that together reduced resource consumption by up to 67% while handling thousands of daily jobs.

Big Data · Data Platform · Hive Migration

0 likes · 13 min read

vivo Internet Technology

Feb 12, 2018 · Big Data

Predicate Pushdown Rules in SparkSQL Outer Join Queries – Detailed Analysis

The article examines SparkSQL’s predicate‑pushdown behavior for left outer joins, detailing four rules that show when pushing join‑condition filters to the left or right tables yields correct, faster results and when it produces incorrect outcomes, highlighting both performance gains and subtle errors.

Outer Join · Predicate Pushdown · Query Optimization

0 likes · 7 min read

Hulu Beijing

Dec 20, 2016 · Big Data

How Hulu Supercharges OLAP Queries with CarbonData: Real‑World Optimizations

This article describes Hulu’s real‑world OLAP query optimization, covering the fundamentals of OLAP, comparisons of row‑ and column‑based storage formats, detailed indexing mechanisms of Parquet, ORC and CarbonData, and the specific schema, shuffle, block size, speculation and GC tuning techniques that enabled CarbonData to dramatically accelerate wide‑table queries on SparkSQL.

Big Data · CarbonData · Columnar Storage

0 likes · 17 min read

21CTO

Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

Distributed Computing · RDD · Spark

0 likes · 17 min read

Architecture Digest

Apr 18, 2016 · Big Data

Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL

This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.

Big Data · Distributed Computing · RDD

0 likes · 14 min read

21CTO

Mar 30, 2016 · Big Data

Unveiling Spark on YARN: From RDD Basics to Cluster Execution

This article explains Apache Spark’s core concepts, the RDD programming model, how Spark runs on YARN with driver and executor nodes, the distinction between transformations and actions, partitioning strategies, and an overview of SparkSQL processing.

Apache Spark · RDD · SparkSQL

0 likes · 18 min read

Architecture Digest

Mar 25, 2016 · Big Data

Design, Evolution, and Performance Evaluation of the PINGO Distributed Interactive Query Platform

This article details the motivation, architectural iterations, caching strategies, SparkSQL enhancements, and performance benchmarks of Baidu's PINGO platform, illustrating how it transformed from a Hive‑based QueryEngine into a high‑performance, Spark‑driven interactive query system for large‑scale data analysis.

Caching · Distributed Query · PINGO

0 likes · 14 min read

ITPUB

Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · Distributed Computing · Hive

0 likes · 13 min read

Qunar Tech Salon

Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine‑learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib

0 likes · 5 min read
