Tagged articles
37 articles
DataFunTalk
Jun 2, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of the Iceberg data lake, covering its OLAP architecture, the reasons for moving to a data lake, the advantages of the Iceberg table format over Hive, platform construction, extensive performance optimizations, and real‑world business use cases such as ad‑flow unification, log analysis, audit, and CDC pipelines.

Big Data · Data Lake · Flink
0 likes · 18 min read
ByteDance Data Platform
May 11, 2022 · Big Data

How to Build a High‑Performance SparkSQL Server with Hive JDBC Compatibility

This article explains how to design and implement a SparkSQL server that lowers usage barriers and boosts efficiency by supporting standard JDBC interfaces, integrating the HiveServer2 protocol, handling multi‑tenant authentication, managing Spark job lifecycles, and providing high availability through ZooKeeper coordination.

Hive · JDBC · Server Architecture
0 likes · 15 min read
ByteDance Data Platform
Feb 25, 2022 · Big Data

Optimizing SparkSQL: ByteDance EMR’s Data Lake Integration and Multi‑Tenant Server

ByteDance’s EMR team details how they integrated data‑lake engines such as Hudi and Iceberg into SparkSQL, streamlined jar management, built a custom Spark SQL Server with Hive compatibility, multi‑tenant support, engine pre‑warming, and transaction capabilities, dramatically improving performance and resource efficiency for enterprise workloads.

EMR · Hudi · Iceberg
0 likes · 11 min read
ByteDance Data Platform
Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · Data Warehouse · ETL
0 likes · 19 min read
DataFunSummit
Dec 6, 2021 · Big Data

Design and Performance Optimization of a Real‑Time Billion‑Scale Data Processing Pipeline

This article reviews the background, architecture, and a series of performance‑optimizing techniques—including consumption, batch, storage, and execution‑engine tweaks—applied to a real‑time pipeline that processes hundreds of billions of records daily, and presents the resulting resource savings and latency improvements.

Kafka · Performance Optimization · Real-time Processing
0 likes · 9 min read
Big Data Technology & Architecture
Jul 22, 2021 · Big Data

Comprehensive Overview of SparkSQL: History, Architecture, Execution Process, and Optimization Techniques

This article provides a detailed exploration of SparkSQL, covering its evolution from Shark, core components, execution workflow, Catalyst optimizer, various optimization strategies, and practical configuration tips for achieving high performance in big‑data processing.

Adaptive Query Execution · Catalyst Optimizer · DataFrames
0 likes · 19 min read
Didi Tech
Jan 25, 2021 · Big Data

Migrating Hive SQL to Spark SQL: Design, Implementation, and Performance Evaluation at DiDi

DiDi migrated over 10,000 Hive SQL tasks to Spark SQL using a lightweight dual‑run pipeline that extracts, rewrites, compares, and switches tasks, fixing syntax and UDF differences while adding features such as small‑file merging and enhanced partition pruning, resulting in Spark handling 85 % of workloads with 40 % faster execution, 21 % lower CPU and 49 % lower memory usage.

DataMigration · HiveSQL · Optimization
0 likes · 18 min read
Big Data Technology & Architecture
Dec 28, 2020 · Big Data

Optimizing OLAP Data Source Integration with SparkSQL: Cluster and Node Tuning, Profiling, and GC

This article details the end‑to‑end process of connecting an OLAP data source to SparkSQL and presents a comprehensive performance‑tuning guide covering cluster‑level resource allocation, single‑node On‑CPU/Off‑CPU analysis, flame‑graph profiling, Java Flight Recorder usage, and garbage‑collection optimization.

Cluster Optimization · OLAP · Profiling
0 likes · 16 min read
Big Data Technology Architecture
Jun 4, 2020 · Big Data

58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration

This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.

Big Data · Data Migration · Hadoop
0 likes · 23 min read
Youzan Coder
Dec 6, 2019 · Big Data

Improving SparkSQL Stability and Performance at Youzan: Thrift Server Enhancements, Metric Collection, and Lessons Learned

Youzan’s big‑data team boosted SparkSQL stability and performance by reinforcing the Thrift Server, implementing AB gray‑release testing, collecting real‑time metrics, adding an engine‑selection service, and completing a second migration that raised SparkSQL’s workload share to 91 %, while documenting key pitfalls and tuning lessons.

AB testing · Performance Optimization · SparkSQL
0 likes · 15 min read
Big Data Technology & Architecture
Jun 17, 2019 · Big Data

Understanding Spark SQL: Concepts, Queries, Data Sources, and Practical Examples

This article introduces Spark SQL fundamentals, including its architecture, the DataFrame and Dataset abstractions, query methods, interoperability with RDDs, user-defined functions, Hive integration, and data source handling, and provides step‑by‑step Scala code examples for loading data, performing aggregations, and solving common analytical tasks.

DataFrames · Hive · SQL
0 likes · 15 min read
58 Tech
Mar 15, 2019 · Big Data

Optimizing Spark Join Operations in Spark Core and Spark SQL

This article explains how to improve Spark join performance by reducing shuffle, using appropriate partitioners, applying broadcast hash joins for small tables, and selecting the optimal join strategy (broadcast, shuffle hash, or sort‑merge) in both Spark Core and Spark SQL.

JOIN · Shuffle · Spark
0 likes · 6 min read
Youzan Coder
Jan 9, 2019 · Big Data

How Youzan Scaled 5,000 Daily SparkSQL Jobs: Migration Lessons from Hive

This article details Youzan's transition from Hive to SparkSQL, covering platform architecture, usability and performance enhancements, migration strategies, automated engine selection, and future plans that together reduced resource consumption by up to 67% while handling thousands of daily jobs.

Availability · Big Data · Data Platform
0 likes · 13 min read
Hulu Beijing
Dec 20, 2016 · Big Data

How Hulu Supercharges OLAP Queries with CarbonData: Real‑World Optimizations

This article describes Hulu’s real‑world OLAP query optimization, covering the fundamentals of OLAP, comparisons of row‑ and column‑based storage formats, detailed indexing mechanisms of Parquet, ORC and CarbonData, and the specific schema, shuffle, block size, speculation and GC tuning techniques that enabled CarbonData to dramatically accelerate wide‑table queries on SparkSQL.

Big Data · CarbonData · Columnar Storage
0 likes · 17 min read
21CTO
Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

RDD · Spark · SparkSQL
0 likes · 17 min read
21CTO
Mar 30, 2016 · Big Data

Unveiling Spark on YARN: From RDD Basics to Cluster Execution

This article explains Apache Spark’s core concepts, the RDD programming model, how Spark runs on YARN with driver and executor nodes, the distinction between transformations and actions, partitioning strategies, and an overview of SparkSQL processing.

Apache Spark · RDD · SparkSQL
0 likes · 18 min read
Architecture Digest
Mar 25, 2016 · Big Data

Design, Evolution, and Performance Evaluation of the PINGO Distributed Interactive Query Platform

This article details the motivation, architectural iterations, caching strategies, SparkSQL enhancements, and performance benchmarks of Baidu's PINGO platform, illustrating how it transformed from a Hive‑based QueryEngine into a high‑performance, Spark‑driven interactive query system for large‑scale data analysis.

Distributed Query · PINGO · SparkSQL
0 likes · 14 min read
ITPUB
Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · Hive · SparkSQL
0 likes · 13 min read
Qunar Tech Salon
Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib
0 likes · 5 min read