Migrating from Hive3 on Tez to Spark SQL: Practices, Challenges, and Performance Evaluation
This article details the Snowball data team's migration from Hive3 on Tez to Spark SQL, covering the motivations, comparative performance tests, encountered compatibility issues, configuration work‑arounds, and future plans for consolidating ETL workloads on Spark.
Background
Due to business needs, the Snowball data team built a new HDP 3.1.5 cluster (Hadoop 3.1.1 + Hive 3.1.0 + Tez 0.9.1) and initially used Hive3 on Tez as the ETL engine. They encountered frequent SQL failures, incorrect CBO plans, timezone bugs, inaccurate results for complex multi‑join queries, and other stability problems.
Because most of the industry still runs Hive2 and Hive/Tez development has slowed, the team evaluated Hive3 on Tez, Hive3 on MR, Hive3 on Spark2, and Spark SQL, ultimately choosing Spark SQL for its accuracy, stability, and efficiency.
Migration Process
Following Facebook's approach, the team added shadow execution: production Hive SQL logs are replayed as Spark SQL, writing to shadow tables for a side‑by‑side comparison of correctness and performance. Snowball built similar tooling, leveraging its Yarn‑based scheduler to capture execution time and resource usage, plus a Trino‑based correctness checker.
Two‑stage testing showed Hive3 on Tez achieved ~50% correctness, Hive3 on MR ~70%, Hive3 on Spark2 (with CBO disabled) 100%, and Spark SQL 100%.
Correctness (shadow tests):

| Engine | Correct results |
| --- | --- |
| Hive3 on MR | 70% |
| Hive3 on Tez | 50% |
| Hive3 on Spark2 (CBO off) | 100% |
| Spark SQL | 100% |

Performance:

| Hive3 on Spark2 (sec) | Spark SQL (sec) | Reduction |
| --- | --- | --- |
| 1957 | 423 | 88% |
In real‑world workloads, Spark SQL matched Hive3 on Tez in execution time but used only one‑third of the resources, while Hive3 on Spark2 suffered from data skew.
After careful evaluation, the team first switched to Hive3 on Spark2 as an interim solution, then adopted Spark 3.2.1 for long‑term support, gradually migrating Hive SQL to Spark SQL.
Encountered Issues and Solutions
1. Recursive Sub‑directory and Self‑Read Problems
Hive supports set hive.mapred.supports.subdirectories=true and set mapreduce.input.fileinputformat.input.dir.recursive=true, but Spark SQL does not honor them. Setting spark.sql.hive.convertMetastoreOrc=false forces Spark to read the table through Hive's serde, which handles recursive directories.
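The contrast can be shown as a session-level sketch; the table name below is illustrative, not from the original setup:

```sql
-- In Hive, recursive sub-directory reads are a session-level toggle:
SET hive.mapred.supports.subdirectories=true;
SET mapreduce.input.fileinputformat.input.dir.recursive=true;

-- In Spark SQL, falling back to Hive's ORC reader restores the same
-- behaviour for Hive ORC tables:
SET spark.sql.hive.convertMetastoreOrc=false;
SELECT count(*) FROM dw.events_with_subdirs;
```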
When reading and overwriting the same Hive table, Spark throws "Cannot overwrite a path that is also being read from". Falling back to Hive's serde resolves this, because Hive writes the output to a temporary location before replacing the target.
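A minimal sketch of the failing pattern and its work-around, with an illustrative table name:

```sql
-- With Spark's native reader/writer this statement fails:
--   AnalysisException: Cannot overwrite a path that is also being read from
-- Falling back to Hive's serde makes Spark stage the output in a temporary
-- location before replacing the target, so the self-overwrite succeeds:
SET spark.sql.hive.convertMetastoreOrc=false;
INSERT OVERWRITE TABLE dw.daily_stats
SELECT dt, count(*) AS pv FROM dw.daily_stats GROUP BY dt;
```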
2. Hive ORC Parsing Issues
Empty ORC files cause array‑index or null‑pointer errors. Setting hive.exec.orc.split.strategy=BI avoids null pointers, and disabling vectorized execution with hive.vectorized.execution.enabled=false prevents array‑index errors. For Spark 3.x, also set hive.metastore.dml.events=false to avoid write‑time failures.
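Collected in one place, the work-arounds above amount to three settings (names as given in the text; whether they can be applied per session or must go into the job submission depends on the deployment):

```sql
SET hive.exec.orc.split.strategy=BI;           -- avoid NPEs on empty ORC files
SET hive.vectorized.execution.enabled=false;   -- avoid array-index errors
SET hive.metastore.dml.events=false;           -- Spark 3.x write-time failures
```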
3. spark.sql.sources.schema Property
The presence of spark.sql.sources.schema.part in Hive table metadata can cause schema mismatches after DDL changes. Deleting the property or recreating the table resolves the issue; during mixed‑engine periods, DDL should be performed only via Hive.
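A sketch of the clean-up, assuming a hypothetical table dw.user_profile; ALTER TABLE ... UNSET TBLPROPERTIES is standard Hive DDL:

```sql
-- Inspect the Spark-written schema properties:
SHOW TBLPROPERTIES dw.user_profile;

-- Drop the stale cached schema so subsequent DDL changes are honoured
-- (run via Hive during the mixed-engine period):
ALTER TABLE dw.user_profile UNSET TBLPROPERTIES IF EXISTS (
  'spark.sql.sources.schema.numParts',
  'spark.sql.sources.schema.part.0'
);
```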
4. Spark Permission and Auditing
Snowball integrated Apache Kyuubi (with a custom PasswdAuthenticationProvider) for authentication and used Submarine (when available) for Ranger‑based authorization. They patched Spark 3.2.1 to work with Ranger and added auditing via Kyuubi’s event system.
5. Subtle Migration Pitfalls
Differences in date handling, implicit type conversion (requiring spark.sql.storeAssignmentPolicy=LEGACY), and grouping syntax required query rewrites, e.g., select t1.a from t1 join t2 group by t1.a, and proper use of grouping sets.
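A sketch of the rewrites, with illustrative table and column names:

```sql
-- Qualify grouped columns explicitly, as in the rewrite from the text:
SELECT t1.a
FROM t1 JOIN t2 ON t1.id = t2.id
GROUP BY t1.a;

-- Spell out grouping sets rather than relying on Hive-specific shorthand:
SELECT a, b, count(*) AS cnt
FROM t1
GROUP BY a, b GROUPING SETS ((a, b), (a), ());

-- Restore Hive-style implicit casts on INSERT:
SET spark.sql.storeAssignmentPolicy=LEGACY;
```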
6. Dynamic Resources and Multi‑Version Compatibility
To run Spark2 and Spark3 concurrently, Yarn’s auxiliary service class‑loader isolation and Spark’s configurable shuffle service ports were used, but XML configs had to be packaged into JARs for recognition.
7. Small File Problem
Spark SQL does not support Hive's small-file merging via hive.merge.sparkfiles=true. Enabling Adaptive Query Execution (AQE) with the following tuned parameters capped output at 200 files of roughly 55 MB each:
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.advisoryPartitionSizeInBytes=262144000 \
--conf spark.sql.adaptive.maxNumPostShufflePartitions=200 \
--conf spark.sql.adaptive.forceApply=true \
--conf spark.sql.adaptive.coalescePartitions.parallelismFirst=false \
--conf spark.sql.adaptive.coalescePartitions.minPartitionSize=52428800

Future Plans
With over 300 daily Spark SQL tasks running stably, the team plans to unify all ETL engines on Spark SQL, extend its use to interactive analytics alongside Trino, and adopt lakehouse technologies such as Apache Hudi for real‑time data needs.
Snowball Engineer Team
Proactivity, efficiency, professionalism, and empathy are the core values of the Snowball Engineer Team; curiosity, passion, and sharing of technology drive their continuous progress.