Migrating from Hive3 on Tez to Spark SQL: Practices, Challenges, and Performance Evaluation
This article details the Snowball data team's migration from Hive3 on Tez to Spark SQL, covering the motivations, comparative performance tests, encountered compatibility issues, configuration work‑arounds, and future plans for consolidating ETL workloads on Spark.
Background
Due to business needs, the Snowball data team built a new HDP 3.1.5 cluster (Hadoop 3.1.1 + Hive 3.1.0 + Tez 0.9.1) and initially used Hive3 on Tez as the ETL engine. They encountered frequent SQL failures, incorrect CBO plans, timezone bugs, inaccurate results for complex multi‑join queries, and other stability problems.
Because most of the industry still runs Hive2 and Hive/Tez development has slowed, the team evaluated Hive3 on Tez, Hive3 on MR, Hive3 on Spark2, and Spark SQL, ultimately choosing Spark SQL for its accuracy, stability, and efficiency.
Migration Process
Following Facebook's approach, the team added shadow execution: production Hive SQL logs are replayed as Spark SQL, writing to shadow tables for a side‑by‑side comparison of correctness and performance. Snowball built similar tooling, leveraging its Yarn‑based scheduler to capture execution time and resource usage, plus a Trino‑based correctness checker.
Two‑stage testing showed Hive3 on Tez achieved ~50% correctness, Hive3 on MR ~70%, Hive3 on Spark2 (with CBO disabled) 100%, and Spark SQL 100%.
Correctness (shadow tests):

| Engine | Correct results |
| --- | --- |
| Hive3 on MR | 70% |
| Hive3 on Tez | 50% |
| Hive3 on Spark2 (CBO off) | 100% |
| Spark SQL | 100% |

Performance:

| Hive3 on Spark2 (sec) | Spark SQL (sec) | Reduction |
| --- | --- | --- |
| 1957 | 423 | 88% |
In real‑world workloads, Spark SQL matched Hive3 on Tez in execution time but used only one‑third of the resources, while Hive3 on Spark2 suffered from data skew.
After careful evaluation, the team first switched to Hive3 on Spark2 as an interim solution, then adopted Spark 3.2.1 for long‑term support, gradually migrating Hive SQL to Spark SQL.
Encountered Issues and Solutions
1. Recursive Sub‑directory and Self‑Read Problems
Hive supports set hive.mapred.supports.subdirectories=true and set mapreduce.input.fileinputformat.input.dir.recursive=true, but Spark SQL does not honor them. Setting spark.sql.hive.convertMetastoreOrc=false forces Spark to read the table through Hive's serde, which handles recursive directories.
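The contrast can be shown as a session-level sketch; the table name below is illustrative, not from the original setup:

```sql
-- In Hive, recursive sub-directory reads are a session-level toggle:
SET hive.mapred.supports.subdirectories=true;
SET mapreduce.input.fileinputformat.input.dir.recursive=true;

-- In Spark SQL, falling back to Hive's ORC reader restores the same
-- behaviour for Hive ORC tables:
SET spark.sql.hive.convertMetastoreOrc=false;
SELECT count(*) FROM dw.events_with_subdirs;
```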
When reading and overwriting the same Hive table, Spark throws "Cannot overwrite a path that is also being read from". Falling back to Hive's serde resolves this, because Hive writes the output to a temporary location before replacing the target.
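A minimal sketch of the failing pattern and its work-around, with an illustrative table name:

```sql
-- With Spark's native reader/writer this statement fails:
--   AnalysisException: Cannot overwrite a path that is also being read from
-- Falling back to Hive's serde makes Spark stage the output in a temporary
-- location before replacing the target, so the self-overwrite succeeds:
SET spark.sql.hive.convertMetastoreOrc=false;
INSERT OVERWRITE TABLE dw.daily_stats
SELECT dt, count(*) AS pv FROM dw.daily_stats GROUP BY dt;
```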
2. Hive ORC Parsing Issues
Empty ORC files cause array‑index or null‑pointer errors. Setting hive.exec.orc.split.strategy=BI avoids null pointers, and disabling vectorized execution with hive.vectorized.execution.enabled=false prevents array‑index errors. For Spark 3.x, also set hive.metastore.dml.events=false to avoid write‑time failures.
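Collected in one place, the work-arounds above amount to three settings (names as given in the text; whether they can be applied per session or must go into the job submission depends on the deployment):

```sql
SET hive.exec.orc.split.strategy=BI;           -- avoid NPEs on empty ORC files
SET hive.vectorized.execution.enabled=false;   -- avoid array-index errors
SET hive.metastore.dml.events=false;           -- Spark 3.x write-time failures
```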
3. spark.sql.sources.schema Property
The presence of spark.sql.sources.schema.part in Hive table metadata can cause schema mismatches after DDL changes. Deleting the property or recreating the table resolves the issue; during mixed‑engine periods, DDL should be performed only via Hive.
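A sketch of the clean-up, assuming a hypothetical table dw.user_profile; ALTER TABLE ... UNSET TBLPROPERTIES is standard Hive DDL:

```sql
-- Inspect the Spark-written schema properties:
SHOW TBLPROPERTIES dw.user_profile;

-- Drop the stale cached schema so subsequent DDL changes are honoured
-- (run via Hive during the mixed-engine period):
ALTER TABLE dw.user_profile UNSET TBLPROPERTIES IF EXISTS (
  'spark.sql.sources.schema.numParts',
  'spark.sql.sources.schema.part.0'
);
```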
4. Spark Permission and Auditing
Snowball integrated Apache Kyuubi (with a custom PasswdAuthenticationProvider) for authentication and used Submarine (when available) for Ranger‑based authorization. They patched Spark 3.2.1 to work with Ranger and added auditing via Kyuubi’s event system.
5. Subtle Migration Pitfalls
Differences in date handling, implicit type conversion (requiring spark.sql.storeAssignmentPolicy=LEGACY), and grouping syntax required query rewrites, e.g., select t1.a from t1 join t2 group by t1.a, and proper use of grouping sets.
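A sketch of the rewrites, with illustrative table and column names:

```sql
-- Qualify grouped columns explicitly, as in the rewrite from the text:
SELECT t1.a
FROM t1 JOIN t2 ON t1.id = t2.id
GROUP BY t1.a;

-- Spell out grouping sets rather than relying on Hive-specific shorthand:
SELECT a, b, count(*) AS cnt
FROM t1
GROUP BY a, b GROUPING SETS ((a, b), (a), ());

-- Restore Hive-style implicit casts on INSERT:
SET spark.sql.storeAssignmentPolicy=LEGACY;
```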
6. Dynamic Resources and Multi‑Version Compatibility
To run Spark2 and Spark3 concurrently, Yarn’s auxiliary service class‑loader isolation and Spark’s configurable shuffle service ports were used, but XML configs had to be packaged into JARs for recognition.
7. Small File Problem
Spark SQL does not support Hive's small-file merging via hive.merge.sparkfiles=true. Enabling Adaptive Query Execution (AQE) with the following tuned parameters capped output at 200 files of roughly 55 MB each:
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.advisoryPartitionSizeInBytes=262144000 \
--conf spark.sql.adaptive.maxNumPostShufflePartitions=200 \
--conf spark.sql.adaptive.forceApply=true \
--conf spark.sql.adaptive.coalescePartitions.parallelismFirst=false \
--conf spark.sql.adaptive.coalescePartitions.minPartitionSize=52428800

Future Plans
With over 300 daily Spark SQL tasks running stably, the team plans to unify all ETL engines on Spark SQL, extend its use to interactive analytics alongside Trino, and adopt lakehouse technologies such as Apache Hudi for real‑time data needs.
Snowball Engineer Team
Proactivity, efficiency, professionalism, and empathy are the core values of the Snowball Engineer Team; curiosity, passion, and sharing of technology drive their continuous progress.