
Improving SparkSQL Stability and Performance at Youzan: Thrift Server Enhancements, Metric Collection, and Lessons Learned

Youzan’s big‑data team boosted SparkSQL stability and performance by reinforcing the Thrift Server, implementing AB gray‑release testing, collecting real‑time metrics, adding an engine‑selection service, and completing a second migration that raised SparkSQL’s workload share to 91%, while documenting key pitfalls and tuning lessons.

Youzan Coder

In January 2019 the Youzan big‑data team published a blog titled “SparkSQL Practice at Youzan”, describing early optimizations and migration from Hive to SparkSQL. This article continues that story, detailing subsequent improvements that raised SparkSQL’s share to 91% and sharing practical lessons and pitfalls.

The main topics covered are:

Thrift Server stability construction

SparkSQL second migration

Various pitfalls and experience

Thrift Server stability construction

After the first Hive‑to‑SparkSQL migration, SparkSQL tasks accounted for 60% of the workload, and the larger load brought a surge of failures and periods of service unavailability. The team runs SparkSQL via a Thrift Server deployed in YARN client mode, providing JDBC access for both offline Airflow jobs (via beeline) and ad‑hoc query services. The Thrift Server model uses a single driver to manage a pool of executors, achieving higher executor utilization than launching separate spark‑sql processes, which suffer from driver OOM risk and lower YARN resource efficiency (the overhead of starting YARN applications and executor churn). Because most SQL tasks finish within 3 minutes, the Thrift Server approach is deemed appropriate.
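Offline jobs reach the Thrift Server over its JDBC endpoint; a typical beeline invocation looks like this (the host, port, user, and table names are illustrative placeholders, not Youzan's actual endpoints):

```shell
# Connect to the shared Thrift Server and run one SQL statement.
# thrift-server.example.com:10000 is a placeholder endpoint.
beeline -u "jdbc:hive2://thrift-server.example.com:10000/default" \
        -n etl_user \
        -e "SELECT dt, COUNT(*) FROM dw.orders GROUP BY dt"
```

Because every such session shares the one long-lived driver and executor pool, no YARN application has to be launched per query.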

Stability work includes:

AB testing gray‑release functionality

Spark metric collection, analysis, and processing

SQL‑engine selection service that blocks unsuitable queries

2.1 AB gray‑release testing

Frequent yz‑spark version and configuration changes prompted the need for a custom AB testing solution to reduce release risk for large or impactful changes. The solution routes low‑priority SQL tasks to a gray‑release group during off‑peak hours, based on configurable priority, time window, and traffic ratio parameters. It is implemented by extending Apache Airflow with additional routing rules.
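The routing decision can be sketched as a small predicate over the three parameters. This is a minimal sketch: the parameter values, the priority encoding, and the hash‑based traffic split are assumptions, not Youzan's actual implementation.

```python
import hashlib
from datetime import time

# Assumed gray-release parameters; Youzan's real values are not public.
GRAY_CONFIG = {
    "max_priority": "P4",                # only low-priority tasks (P4, P5) are eligible
    "window": (time(1, 0), time(6, 0)),  # off-peak hours
    "traffic_ratio": 0.10,               # fraction of eligible tasks routed to gray
}

def route_to_gray(task_id: str, priority: str, now: time, cfg=GRAY_CONFIG) -> bool:
    """Decide whether a SQL task should run on the gray-release Thrift Server."""
    # 1. Priority gate: higher number means lower priority (P5 < P4 < ... < P1),
    #    so anything "below" P4 lexicographically (P1-P3) is excluded.
    if priority < cfg["max_priority"]:
        return False
    # 2. Time-window gate: only route during the configured off-peak window.
    start, end = cfg["window"]
    if not (start <= now <= end):
        return False
    # 3. Traffic-ratio gate: a stable hash keeps a task in the same group
    #    across runs, so results are comparable between releases.
    bucket = int(hashlib.md5(task_id.encode()).hexdigest(), 16) % 100
    return bucket < cfg["traffic_ratio"] * 100
```

The stable hash matters: a task that lands in the gray group stays there for the whole observation period, so its behavior on the new version can be compared against its own history.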

2.2 Spark metric collection

Spark records metrics for jobs, stages, tasks, and executors. The team uses Spark’s REST API (which requires the Spark UI to be enabled) and the EventLog (persisted to HDFS) to gather metric data. By combining the REST API with EventLog replay, they built a spark‑monitor application that reads events in near‑real time, aggregates job‑level information, and stores it in HBase. The collected data serves three main purposes:

Real‑time alerting and intervention (e.g., detecting long‑running or high‑task‑count jobs and triggering alarms or job kills)

Offline analysis (e.g., identifying failed tasks, high memory usage, or data skew for further optimization)

Historical job data preservation for post‑mortem debugging, since the Spark UI only keeps recent job information in memory.
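The EventLog side of this pipeline can be sketched as a small replay function. The JSON field names below follow Spark's standard event-log format; the aggregation and the in-memory dict standing in for the HBase sink are simplifications of the spark‑monitor application, not its actual code.

```python
import json

def replay_event_log(lines):
    """Aggregate job-level info from Spark EventLog JSON lines.

    Returns {job_id: {"submitted": ts, "completed": ts, "succeeded": bool}}.
    In the real spark-monitor pipeline this aggregate would be written to HBase.
    """
    jobs = {}
    for line in lines:
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerJobStart":
            jobs[event["Job ID"]] = {"submitted": event["Submission Time"]}
        elif kind == "SparkListenerJobEnd":
            info = jobs.setdefault(event["Job ID"], {})
            info["completed"] = event["Completion Time"]
            info["succeeded"] = event["Job Result"]["Result"] == "JobSucceeded"
    return jobs
```

With submission and completion timestamps per job, the real-time alerting described above reduces to a threshold check on `completed - submitted` (or on the current time for jobs that have started but not finished).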

2.3 SQL engine selection

Youzan’s offline platform offers three engines—Presto, SparkSQL, and Hive. A custom engine‑selection service parses the SQL, performs syntax checks, matches rules, and considers each engine’s resource load to recommend the most suitable engine. For SparkSQL, custom rules block queries such as Not in Subquery and CrossJoin that are known to cause instability.
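The blocking rules can be sketched as simple pattern matching over the SQL text. This is a minimal sketch of only the rule-matching step: the real service also parses the SQL, checks syntax, and weighs each engine's load, and the rule set here is a hypothetical subset.

```python
import re

# Hypothetical rule set; Youzan's production rules are more extensive
# and operate on a parsed query plan rather than raw text.
SPARKSQL_BLOCK_RULES = {
    "not_in_subquery": re.compile(r"\bnot\s+in\s*\(\s*select\b", re.IGNORECASE),
    "cross_join": re.compile(r"\bcross\s+join\b", re.IGNORECASE),
}

def select_engine(sql: str) -> str:
    """Pick an engine for a query; fall back to Hive for patterns
    known to destabilize the shared SparkSQL Thrift Server."""
    for rule_name, pattern in SPARKSQL_BLOCK_RULES.items():
        if pattern.search(sql):
            return "hive"      # blocked for SparkSQL: known instability
    return "sparksql"          # default for typical batch queries
```

Routing a blocked query to a separate engine is cheaper than letting it take down the shared driver that every other Thrift Server session depends on.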

SparkSQL second migration

The first migration covered low‑priority tasks (P4, P5). After validating stability improvements, the team proceeded with a large‑scale migration of high‑priority tasks (P1‑P3). SparkSQL’s share of engine‑selected jobs now exceeds 91%. Benefits include:

Significant reduction in offline cluster resource cost: TPC‑DS benchmarks show SparkSQL can be 2 to 10 times faster than Hive on comparable resources, with typical speed‑ups around 2x.

More rational resource allocation: Spark’s built‑in scheduler allows priority‑based resource pools (e.g., FairScheduler) so high‑priority jobs receive proportionally more executor capacity.
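Such priority pools are declared in Spark's fairscheduler.xml. The pool names, weights, and minimum shares below are illustrative, not Youzan's production values:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- High-priority pool: larger weight and a guaranteed minimum share -->
  <pool name="high">
    <schedulingMode>FAIR</schedulingMode>
    <weight>4</weight>
    <minShare>20</minShare>
  </pool>
  <!-- Low-priority pool: scheduled fairly from whatever capacity remains -->
  <pool name="low">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

A job then opts into a pool by setting spark.scheduler.pool on its session, so high‑priority Airflow tasks and ad‑hoc queries can share one Thrift Server without starving each other.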

Pitfalls and experience

Several practical issues were encountered and addressed:

spark.sql.autoBroadcastJoinThreshold: The default threshold is 10 MB, but actual memory consumption can reach 1 GB once file size, column pruning, and compression are accounted for, leading to OOM risk.

spark.blacklist.enabled: Enabling the blacklist prevents repeated retries on a failing executor or host, improving fault tolerance.

spark.scheduler.pool: Using the FairScheduler with weighted pools mitigates resource starvation for high‑priority jobs.

spark.sql.adaptive.enabled: Adaptive execution automatically adjusts the number of downstream tasks based on upstream shuffle size, helping with the “small file” problem in Hive tables.

SPARK‑24809: A correctness bug in broadcast join produced incorrect results due to repeated serialization of LongToUnsafeRowMap objects.

SPARK‑26604: An external shuffle memory‑leak bug could increase shuffle fetch time by up to 30% in production.

Not in Subquery: This pattern forces a BroadcastNestedLoopJoinExec, which is highly inefficient and can cause driver OOM when the subquery result is large.
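The configuration-related items above translate into spark-defaults.conf entries. The keys are standard Spark 2.x settings; the values and the allocation-file path are illustrative, not Youzan's production configuration:

```properties
# Cap auto-broadcast size (default 10 MB); the in-memory footprint after
# decompression can be far larger than the on-disk file size.
spark.sql.autoBroadcastJoinThreshold   10485760

# Blacklist repeatedly failing executors/hosts instead of retrying on them.
spark.blacklist.enabled                true

# Use the fair scheduler so weighted pools protect high-priority jobs.
spark.scheduler.mode                   FAIR
spark.scheduler.allocation.file        /etc/spark/fairscheduler.xml

# Adaptive execution: merge small shuffle partitions downstream.
spark.sql.adaptive.enabled             true
```

On a shared Thrift Server these settings apply to every session, which is why conservative broadcast and blacklist defaults matter more here than in per-job spark‑sql deployments.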

Conclusion

To date, Youzan’s big‑data offline platform has achieved a milestone by migrating from Hive to SparkSQL. Although SparkSQL’s stability still lags behind Hive in some memory‑management aspects, its clear performance advantages make it a compelling choice, and the team continues to tune SparkSQL toward Hive‑level reliability.

Written by Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.