How Voidbox Bridges Docker and YARN for Scalable Big Data Workloads
Voidbox integrates Docker containers with YARN so that any Docker‑packaged application can run on a Hadoop cluster. It simplifies distributed application development and deployment, raises cluster utilization, and adds fault tolerance, a DAG‑based programming model, and two running modes, all under YARN's unified resource management.
Voidbox Motivation
YARN, the distributed resource manager introduced in Hadoop 2.0, schedules cluster resources for applications such as MapReduce and Spark. These frameworks, however, assume a specific runtime environment on the cluster nodes. Docker removes that constraint by packaging an application together with its dependencies into an isolated container.
Voidbox was created by Hulu’s engineering team to combine Docker’s advantages with YARN, allowing any Docker‑encapsulated application to run on a YARN cluster alongside MapReduce and Spark.
Ease of creating distributed applications: Voidbox handles common issues like cluster discovery, elastic resource allocation, task coordination, and disaster recovery, offering a simple interface for developers.
Simplified deployment: Instead of maintaining large VM images, Voidbox allocates resources on demand, eliminating extra maintenance work.
Improved cluster efficiency: Co‑running Spark/MR and Voidbox applications maximizes cluster utilization.
Voidbox also supports Docker‑based DAG tasks and integrates with Jenkins, GitLab, and private Docker registries to automate testing, packaging, and release.
Voidbox Architecture
YARN Architecture Overview
YARN enables multiple applications to share cluster resources dynamically.
A client submits a job to the Resource Manager, which schedules resources. The Application Master handles task scheduling and execution.
Resource Manager: cluster‑wide resource management and scheduling.
NodeManager: runs on compute nodes, executes tasks, sends heartbeats.
Application Master: requests resources from YARN and launches containers.
Container: abstraction of memory, CPU, disk, network, etc.
HDFS: distributed file system.
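To make the container abstraction concrete, here is a minimal sketch of a resource manager placing container requests on nodes. It uses a simple first‑fit rule, which is an assumption chosen for brevity and not YARN's actual scheduling algorithm; the names `Node` and `allocate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Free capacity on one NodeManager host (hypothetical model)."""
    name: str
    free_memory_mb: int
    free_vcores: int


def allocate(nodes: Dict[str, Node], memory_mb: int, vcores: int) -> Optional[str]:
    """First-fit placement: reserve the request on the first node with room.

    Returns the chosen node name, or None if no node can host the container.
    """
    for node in nodes.values():
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            return node.name
    return None


# Two nodes with different free capacity (illustrative numbers only).
cluster = {
    "node-a": Node("node-a", free_memory_mb=2048, free_vcores=2),
    "node-b": Node("node-b", free_memory_mb=8192, free_vcores=8),
}

# Three identical container requests of 4096 MB and 2 vcores each.
first = allocate(cluster, memory_mb=4096, vcores=2)
second = allocate(cluster, memory_mb=4096, vcores=2)
third = allocate(cluster, memory_mb=4096, vcores=2)
```

In this toy cluster the first two requests fit on `node-b` and the third cannot be placed, which is the situation in which an Application Master would have to wait for resources to be released.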
Voidbox Architecture Design
YARN manages cluster resources, Docker provides the execution engine, and Voidbox translates user code into Docker‑based DAG tasks, requests resources, and manages execution.
Voidbox Modules:
Client: submits, stops, and manages Voidbox applications.
Master: YARN Application Master that requests resources for Docker tasks.
Driver: schedules DAG tasks and runs user code.
Proxy: bridges YARN and Docker engine, handling start/stop commands.
State Server: tracks Docker engine health and available machines.
Docker Modules:
Registry: stores Docker images.
Engine: runs containers from images.
Jenkins: automates testing, packaging, and image publishing.
Running Mode
Voidbox offers two modes:
yarn‑cluster mode: control and resource components run inside the YARN cluster; the client can exit after submission, suitable for production.
yarn‑client mode: control runs on the client, providing detailed logs; exiting the client stops the application, useful for debugging.
Running Procedure
Develop a Voidbox application with the SDK, package it as a JAR, and submit via Voidbox Client.
Resource Manager allocates resources for Voidbox Master and launches it.
Voidbox Master starts the Driver, which decomposes the application into Docker jobs and launches tasks.
Voidbox Master requests YARN containers; Voidbox Proxy communicates with Docker Engine to start containers.
Docker tasks run inside containers; logs are written locally and viewable via the YARN Web Portal.
After completion, logs are aggregated to HDFS for historical access.
Docker Integration with YARN Resource Management
YARN provides uniform resource management, while Docker also manages resources at the container level. Voidbox introduces a Proxy layer so YARN can control Docker containers, preventing resource leaks and ensuring unified scheduling.
When a Voidbox application is killed, YARN sends a kill signal; the Proxy intercepts it and stops the Docker container so that its resources are reclaimed.
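The kill path can be sketched as a small process that translates a termination signal into a `docker stop` call. This is an illustration under stated assumptions, not Voidbox's actual Proxy: the class `ProxySketch`, the container name, and the choice of SIGTERM as the kill signal are all hypothetical. The demo injects a fake runner so that no Docker engine is needed.

```python
import signal
import subprocess
from typing import Callable, List


def docker_stop_command(container_id: str, grace_seconds: int = 10) -> List[str]:
    """Build the `docker stop` argv; -t is the grace period before a forced kill."""
    return ["docker", "stop", "-t", str(grace_seconds), container_id]


class ProxySketch:
    """Hypothetical stand-in for a proxy that owns one Docker container."""

    def __init__(self, container_id: str, run: Callable = subprocess.run):
        self.container_id = container_id
        self._run = run          # injectable so the sketch can run without Docker
        self.stopped = False

    def handle_kill(self, signum, frame) -> None:
        """Signal handler: stop the container once, ignore repeated signals."""
        if not self.stopped:
            self._run(docker_stop_command(self.container_id), check=False)
            self.stopped = True

    def install(self) -> None:
        """Register the handler for SIGTERM (assumed to be the kill signal)."""
        signal.signal(signal.SIGTERM, self.handle_kill)


# Demo with a fake runner that only records the command it was given.
calls = []
proxy = ProxySketch("voidbox-task-1", run=lambda argv, check=False: calls.append(argv))
proxy.handle_kill(signal.SIGTERM, None)
proxy.handle_kill(signal.SIGTERM, None)  # second signal is a no-op
```

Routing the stop through one handler keeps the container's lifetime tied to the YARN container that requested it, which is the property the Proxy layer is described as providing.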
Fault Tolerance
Master fault tolerance: Resource Manager restarts the Master if it crashes.
Proxy fault tolerance: Master recycles Docker containers if the Proxy fails.
Docker container fault tolerance: Applications can set retry limits; the Master handles container exit codes.
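The container‑level policy can be illustrated with a short retry loop driven by exit codes. This is a sketch under assumptions: the function `run_with_retries` and the rule "retry on any non‑zero exit code" are hypothetical, since the source does not specify how the Master interprets individual codes.

```python
from typing import Callable, List


def run_with_retries(run_once: Callable[[], int], max_retries: int) -> List[int]:
    """Run a task until it exits 0 or the retry budget is used up.

    Returns the exit code of every attempt, in order.
    """
    exit_codes: List[int] = []
    for _ in range(max_retries + 1):  # first attempt plus up to max_retries retries
        code = run_once()
        exit_codes.append(code)
        if code == 0:
            break
    return exit_codes


# Simulated container that fails twice (exit 137, then 1) before succeeding.
outcomes = iter([137, 1, 0])
history = run_with_retries(lambda: next(outcomes), max_retries=3)
```

A real implementation would likely distinguish retryable failures (for example, a lost node) from application errors that will fail again on every attempt.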
Programming Model
DAG Programming Model
Voidbox provides a Docker‑based DAG programming model.
[Figure: example DAG of four Docker jobs]
In the example, four jobs are defined; each can specify its CPU, memory, image, and parallelism. Job3 starts only after Job1 and Job2 complete, and user code can be inserted before Job4 runs.
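The four‑job example can be modelled as a dependency graph and resolved into execution stages. The sketch below is not the Voidbox SDK: the `DockerJob` class, its field names, the resource numbers, and the assumption that Job4 depends on Job3 are all illustrative. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List


@dataclass
class DockerJob:
    """Hypothetical job description: resources, image, and upstream jobs."""
    name: str
    image: str
    cpu_shares: int
    memory_mb: int
    parallelism: int = 1
    depends_on: List[str] = field(default_factory=list)


jobs = [
    DockerJob("job1", "centos", cpu_shares=2, memory_mb=1000, parallelism=2),
    DockerJob("job2", "centos", cpu_shares=2, memory_mb=1000),
    DockerJob("job3", "centos", cpu_shares=4, memory_mb=2000,
              depends_on=["job1", "job2"]),
    # Assumed edge: the source only says user code can run before job4.
    DockerJob("job4", "centos", cpu_shares=2, memory_mb=1000,
              depends_on=["job3"]),
]


def execution_stages(jobs: List[DockerJob]) -> List[List[str]]:
    """Group jobs into stages; every job in a stage has all dependencies met."""
    sorter = TopologicalSorter({job.name: job.depends_on for job in jobs})
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    stages: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(jobs)
```

Jobs that land in the same stage have no dependency on each other, so a driver could launch them concurrently; user code would run between stages.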
Shell Mode to Submit One Task
For single‑task execution without programming, Voidbox offers a shell mode:
docker-submit.sh \
    -docker_image centos \
    -shell_command "echo Hello Voidbox" \
    -container_memory 1000 \
    -cpu_shares 2
This submits a task that runs “echo Hello Voidbox” in the centos image with 1000 MB memory and 2 CPU shares.
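For intuition, the submission parameters correspond naturally to options of the standard `docker run` command. The mapping below is an assumption made for illustration; the source does not document how Voidbox actually invokes the Docker engine, and the function `docker_run_argv` is hypothetical. The `--memory` and `--cpu-shares` options themselves are standard Docker CLI flags.

```python
import shlex
from typing import List


def docker_run_argv(image: str, shell_command: str,
                    memory_mb: int, cpu_shares: int) -> List[str]:
    """Translate shell-mode parameters into an equivalent `docker run` argv."""
    return [
        "docker", "run", "--rm",
        "--memory", f"{memory_mb}m",      # hard memory limit in megabytes
        "--cpu-shares", str(cpu_shares),  # relative CPU weight
        image,
        "sh", "-c", shell_command,        # run the command through a shell
    ]


argv = docker_run_argv("centos", "echo Hello Voidbox", memory_mb=1000, cpu_shares=2)
command_line = shlex.join(argv)  # shell-quoted form, for display or logging
```

Note that Docker CPU shares are a relative weight rather than a core count, so the value only matters when containers compete for CPU.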
Voidbox in Action
Voidbox runs Docker applications alongside MapReduce, Spark, and other workloads on Hulu’s YARN cluster, supporting automated testing, complex parallel tasks, and workflow building.
Automated testing: Jenkins, GitLab, and a private Docker registry automate build, test, and release.
Complex parallel tasks: Multi‑layer Docker images and scheduled Voidbox applications handle dependencies and parallelism.
Workflow building: Container‑based programming model manages dependent steps such as loading user behavior data before analysis.
Differences from DockerContainerExecutor in YARN 2.6.0
DockerContainerExecutor is an alpha feature and is difficult to run alongside other container executors.
Voidbox offers a DAG model, configurable fault tolerance, multiple running modes, shared YARN resources, and a graphical log view.
Future Work
Support additional YARN versions.
Persist metadata to reduce retry costs for Master failures.
Provide a permanent Voidbox Master service for streaming tasks.
Enable long‑running services without downtime impact.
Hulu Beijing
