Design and Architecture of Jarvis: A DAG‑Based Big Data Scheduling Platform
The article describes the design goals, architecture, and key components of Jarvis, an internal DAG‑driven job scheduling platform for big‑data pipelines, covering timed‑shard and workflow schedulers, high‑availability mechanisms, task development for Hive and data‑transfer jobs, dependency handling, APIs, monitoring, and future enhancements.
Jarvis is an internal big‑data scheduling platform designed to bridge the gap between users and Hadoop‑ecosystem components such as Hive, Spark, and HBase, providing a flexible DAG‑based workflow scheduler.
The system distinguishes two main scheduler types: timed‑shard schedulers (e.g., TBSchedule, SchedulerX, Elastic‑job, Saturn) that focus on evenly distributing job shards, and DAG workflow schedulers (e.g., Oozie, Azkaban, Chronos, Zeus, Lhotse) that emphasize correct handling of complex job dependencies.
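The defining constraint of the second category can be illustrated with a toy executor: a DAG workflow scheduler must run jobs in dependency order, which reduces to a topological sort. This is a minimal sketch of that idea, not Jarvis's actual implementation (which the source does not show).

```python
from collections import deque

def topo_order(deps):
    """Return a valid run order for jobs, given deps: {job: [upstream_jobs]}.

    A DAG workflow scheduler (Oozie, Azkaban, Jarvis) must honor this
    ordering; a timed-shard scheduler has no such constraint.
    """
    jobs = set(deps) | {u for ups in deps.values() for u in ups}
    indegree = {j: 0 for j in jobs}
    downstream = {j: [] for j in jobs}
    for job, ups in deps.items():
        for u in ups:
            indegree[job] += 1
            downstream[u].append(job)
    # Start from jobs with no unfinished upstreams (Kahn's algorithm).
    ready = deque(sorted(j for j in jobs if indegree[j] == 0))
    order = []
    while ready:
        j = ready.popleft()
        order.append(j)
        for d in downstream[j]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(jobs):
        raise ValueError("cycle detected: not a DAG")
    return order
```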
Design goals include high availability, modular componentization, separation of scheduling and service layers, multi‑cycle scheduling, rich job types, extensible dependency models, open APIs, streamlined operations, and comprehensive monitoring and alerting.
Jarvis’s core architecture consists of five parts: Quartz for job triggering; a workflow job context that stores execution metadata; a workflow project handler that manages the job lifecycle; a workflow node handler that interfaces with underlying big‑data services via gRPC; and the big‑data services themselves, which expose RESTful APIs.
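A toy walkthrough of how these parts fit together is sketched below. All class and method names are illustrative assumptions; the source names the components but not their interfaces, and the real node handler would call the big‑data services over gRPC rather than being stubbed.

```python
class WorkflowJobContext:
    """Holds execution metadata for one workflow run."""
    def __init__(self, project_id, bizdate):
        self.project_id = project_id
        self.bizdate = bizdate
        self.node_states = {}

class WorkflowNodeHandler:
    """Would call the underlying big-data service via gRPC; stubbed here."""
    def execute(self, ctx, node):
        ctx.node_states[node] = "success"  # real handler: gRPC call + status poll

class WorkflowProjectHandler:
    """Drives the job lifecycle: runs each node of the project in order."""
    def __init__(self, node_handler):
        self.node_handler = node_handler

    def run(self, ctx, nodes):
        for node in nodes:
            self.node_handler.execute(ctx, node)
        return ctx.node_states
```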
The platform supports both periodic and ad‑hoc (non‑periodic) jobs, with periodic jobs managed by Quartz on hourly, daily, weekly, or monthly schedules, and allows tasks to be paused or prioritized based on business rules.
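Quartz triggers fire on cron expressions, so the hourly/daily/weekly/monthly cycles above can be thought of as a small mapping onto Quartz‑style cron strings (seconds, minutes, hours, day‑of‑month, month, day‑of‑week). The defaults below (2 a.m. runs, Monday for weekly, the 1st for monthly) are assumptions for illustration, not Jarvis's own configuration.

```python
def cycle_to_cron(cycle, minute=0, hour=2):
    """Translate a schedule cycle into a Quartz-style cron expression.

    Quartz requires '?' in exactly one of the day-of-month / day-of-week
    fields; the defaults here are illustrative assumptions.
    """
    table = {
        "hourly":  f"0 {minute} * * * ?",
        "daily":   f"0 {minute} {hour} * * ?",
        "weekly":  f"0 {minute} {hour} ? * MON",  # assumed: Monday
        "monthly": f"0 {minute} {hour} 1 * ?",    # assumed: 1st of month
    }
    try:
        return table[cycle]
    except KeyError:
        raise ValueError(f"unsupported cycle: {cycle}") from None
```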
Task development examples include Hive jobs—where scripts are version‑controlled, validated, and support time variables—and data‑transfer jobs built on an internal XDATA framework inspired by Alibaba DataX, offering configurable readers and writers.
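The time‑variable support for Hive jobs can be sketched as a substitution pass over the script before submission. The `${yyyyMMdd±n}` syntax below is an assumption for illustration; the source only says Hive scripts support time variables.

```python
import datetime
import re

def render_hive_script(sql, bizdate):
    """Replace time variables such as ${yyyyMMdd} or ${yyyyMMdd-1} with
    dates derived from the business date (the variable syntax is assumed).
    """
    def repl(m):
        fmt, offset = m.group(1), int(m.group(2) or 0)
        day = bizdate + datetime.timedelta(days=offset)
        # Map the Java-style date pattern onto strftime codes.
        return day.strftime(fmt.replace("yyyy", "%Y")
                               .replace("MM", "%m")
                               .replace("dd", "%d"))
    return re.sub(r"\$\{(yyyyMMdd)((?:\+|-)\d+)?\}", repl, sql)
```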
Templates, implemented with Velocity scripts, enable rapid creation of repetitive tasks by abstracting common parameters.
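The templating idea can be sketched as follows. Jarvis uses Velocity (a Java template engine); this stand‑in uses Python's `string.Template`, and the specific template body and parameters are invented for illustration.

```python
from string import Template

# Stand-in for a Velocity task template: common parameters are abstracted
# so repetitive tasks differ only in the values substituted in.
HIVE_TEMPLATE = Template(
    "INSERT OVERWRITE TABLE ${target} PARTITION (dt='${bizdate}')\n"
    "SELECT * FROM ${source} WHERE dt='${bizdate}';"
)

def render_task(target, source, bizdate):
    """Instantiate a repetitive task from the template with concrete values."""
    return HIVE_TEMPLATE.substitute(target=target, source=source, bizdate=bizdate)
```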
Dependency handling relies on instance IDs generated from project ID, schedule type, and business date; the system supports three instance types (periodic, manual, back‑fill) and provides various failure‑recovery strategies such as manual rerun, selective retry, custom execution, and batch operations.
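The instance‑ID scheme can be sketched directly from its three ingredients. The concatenation format below is an assumption; the source lists only the components (project ID, schedule type, business date) and the three instance types.

```python
from enum import Enum

class ScheduleType(Enum):
    """The three instance types described in the text."""
    PERIODIC = "P"
    MANUAL = "M"
    BACKFILL = "B"

def make_instance_id(project_id, schedule_type, bizdate):
    """Build an instance ID from project ID, schedule type, and business
    date (yyyyMMdd string). The separator and field order are assumed.
    """
    return f"{project_id}_{schedule_type.value}_{bizdate}"
```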
Workflow dependencies are managed through project‑level permissions, timeout settings, soft‑dependency skipping, and priority‑based alerting (must, high, medium, low) via phone, SMS, WeChat, or email.
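The interplay of timeouts and soft‑dependency skipping can be sketched as a gating check. The policy below (hard dependencies must succeed; a soft dependency that is unfinished or failed is skipped once the wait timeout expires) is an assumption built on the source's brief description.

```python
def can_run(upstream_states, soft_deps, timeout_expired):
    """Decide whether a downstream instance may start.

    upstream_states: {dep_name: 'success' | 'failed' | 'running'}
    soft_deps: set of dependency names that may be skipped
    timeout_expired: whether the configured wait timeout has elapsed
    """
    for dep, state in upstream_states.items():
        if state == "success":
            continue
        if dep in soft_deps and timeout_expired:
            continue  # soft dependency: skip once the timeout expires
        return False  # a hard dependency is unfinished or failed
    return True
```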
Jarvis exposes APIs for project management, instance control, and log retrieval, and integrates with monitoring stacks (Prometheus, dashboards) to track metrics like running tasks, failures, and resource usage.
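A thin client for those APIs might look like the sketch below. The endpoint paths are hypothetical; the source names only the three API areas (project management, instance control, log retrieval), not the actual routes.

```python
class JarvisClient:
    """Sketch of a REST client for the platform's open APIs.

    Only URL construction is shown; a real client would issue HTTP
    requests against these (assumed) endpoints.
    """
    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def project_url(self, project_id):
        return f"{self.base_url}/api/projects/{project_id}"

    def rerun_url(self, instance_id):
        return f"{self.base_url}/api/instances/{instance_id}/rerun"

    def log_url(self, instance_id):
        return f"{self.base_url}/api/instances/{instance_id}/logs"
```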
Future plans include enhancing dependency logic with field‑level lineage, adding shell and Python plugins, and improving health‑check granularity.
Overall, Jarvis draws inspiration from Alibaba DataWorks, combines open‑source components with custom development, and demonstrates that a well‑designed, scenario‑driven scheduling architecture is essential for reliable big‑data pipeline execution.
Tongcheng Travel Technology Center