Big Data 14 min read

Integrating Apache Tez with Remote Shuffle Service via Uniffle: HuoLala’s Experience

Facing exploding data volumes and rising cluster costs, HuoLala adopted Apache Tez’s Remote Shuffle Service built on Apache Uniffle, redesigning the Tez client to operate without source modifications, detailing architecture, implementation challenges, testing, stability measures, and future plans to enhance big‑data shuffle performance and cost efficiency.

Huolala Tech

Mar 7, 2024

Integrating Apache Tez with Remote Shuffle Service via Uniffle: HuoLala’s Experience

Background

Apache Tez

is an open‑source compute framework that can combine multiple dependent jobs into a single job, improving performance. At HuoLala, Tez serves as the primary offline compute engine. As data volume grew and cluster costs surged, the company introduced the Remote Shuffle Service (RSS) to achieve cost reduction and support cloud‑native, storage‑compute separation.

Industry Research

Since Uber open‑sourced its RSS solution in 2020, many mature designs have emerged. After extensive evaluation of universality, performance, stability, and community activity, HuoLala selected the Apache Uniffle RSS architecture and built a Tez client on top of it.

Uniffle requires no invasive changes to Tez source: its partition‑block ordering simplifies reduce‑side sorting, whereas Alibaba's RSS would need extensive Tez code modifications.

The Uniffle community is highly active, offering fast feature iteration and bi‑weekly meetings. In addition to a Spark client, it provides an MR client, which is useful for developing a Tez client.

More details: Uniffle: Cloud‑Native Shuffle Era

Apache Uniffle in HuoLala

Uniffle already provides Shuffle clients for Spark and MR, but not for Tez. HuoLala designed and implemented a Tez client together with partners (e.g., Shein), contributing several patches to improve the Uniffle server.

Tez’s Shuffle implementation lacks a pluggable ShuffleManager interface and its DAG model (multiple Map and Reduce stages) is more complex than the simple Map‑Reduce model, making Shuffle management harder.

To address this, HuoLala compared Tez, MR, and Spark local Shuffle implementations and designed the Tez client around three modules: ApplicationMaster, Shuffle Write, and Shuffle Read.

+----------------+----------------------+----------------------+----------------------+
|                | Application Master   | Shuffle Write        | Shuffle Read         |
+----------------+----------------------+----------------------+----------------------+
| Tez            | - Handles multiple   | - Shuffle class is   | - No abstract Shuffle |
|                |   SQL sessions via   |   tightly coupled to |   interface; uses     |
|                |   DAG                |   Tez events         |   RPC to fetch data  |
+----------------+----------------------+----------------------+----------------------+
| MR             | - Handles single SQL | - Configurable Shuffle| - Configurable Shuffle|
+----------------+----------------------+----------------------+----------------------+
| Spark          | - Handles multiple   | - Plugin‑based Shuffle| - Plugin‑based Shuffle|
|                |   SQL sessions via   |   (only PartitionId) |   with global ordering|
+----------------+----------------------+----------------------+----------------------+

Tez Client Architecture

ApplicationMaster (AM)

Module Function: Interacts with the Uniffle Coordinator to register the application and request worker resources; assigns workers to Map/Reduce tasks for Shuffle data write and read.

Technical Challenges:

Achieve zero modifications to Tez source code.

Pass allocated worker information to Map/Reduce tasks.

Solution:

Introduce RssDAGAppMaster extending Tez’s AM class to handle interaction with the Uniffle Coordinator. By adjusting parameters, replace the hard‑coded DAGAppMaster with RssDAGAppMaster as a regular argument.

Start a new RPC service inside RssDAGAppMaster to serve worker‑allocation requests from Map/Reduce tasks, passing the RPC address via overloaded DAG initialization events.

Shuffle Write

Module Function: Write Shuffle data to Uniffle Worker servers.

Technical Challenges: Tez Shuffle code is tightly coupled with other Tez components, making it hard to redirect output without source changes.

Implementation: Three Write classes were created— RssOrderedPartitionedKVOutput, RssUnorderedPartitionedKVOutput, and RssUnorderedKVOutput —all extending AbstractLogicalOutput and implementing the required start and getWriter methods. The writer buffers K‑V pairs, converts them into Uniffle Blocks when a threshold is reached, and sends them to the server.

Shuffle Read

Module Function: Request worker addresses from the Uniffle Coordinator, fetch Shuffle data from workers or remote HDFS, sort as needed, and deliver to Reducers.

Technical Challenges:

High coupling with Tez code; need minimal intrusion while reusing existing event, reader, and merge mechanisms.

Tez Reduce may pull data from multiple upstream partitions simultaneously.

Maintain compatibility with existing modules.

Solution: Add a new RPC interface on the AM side for Shuffle Read to obtain worker addresses. Implement RssOrderedGroupedKVInput and RssUnorderedKVInput that use a custom RssRunShuffleCallable to fetch data at the partition level, integrating with existing event handling and merge logic.

Stability Assurance

Tez Client Testing

Functional testing ensured tasks complete without hangs or OOM.

Data‑quality testing used a dual‑run tool ( Remote Shuffle VS Local Shuffle) to compare query results.

Performance testing showed no significant slowdown compared to local Shuffle.

Uniffle Service Stability

Monitoring and alerting were added for off‑heap memory growth and long‑running RPC calls.

Chaos testing simulated worker/coordinator failures, leading to fast‑fail mechanisms and asynchronous metric reporting.

Gradual gray‑deployment was performed, progressing from non‑core to core queues.

Effect

During gray‑deployment, Tez on RSS showed no stability, functionality, or data‑quality issues, and task speed remained stable.

RSS service disk usage increased while compute‑node disk usage decreased, allowing cost reduction by using nodes with lower disk capacity.

RSS can improve HQL efficiency; future work will focus on performance bottlenecks and further cost savings.

Community Contribution

The Tez‑on‑Uniffle code, after passing quality checks, was submitted to the Uniffle community, along with patches for fast‑fail, DAG‑level data deletion, and enhanced monitoring.

Future Plans

Gradually replace all local Tez Shuffle with Remote Shuffle, including core queues.

Write Shuffle data to object storage.

Enable stage‑failure recovery without re‑execution.

Continue active collaboration with the Uniffle community and maintain the Tez client module.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Engineering Big Data Shuffle Remote Shuffle Service Uniffle Apache Tez

Written by

Huolala Tech

Technology reshapes logistics

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.