How We Built the Mars Big Data Platform to Boost Development Efficiency

This article explains why Weidian needed a new big data development platform, outlines the functional features of the Mars system, describes its architecture, scheduling mechanisms, and task execution flow, and discusses remaining challenges and planned enhancements.

Weidian Tech Team
“To do a good job, one must first sharpen one's tools.” This classic saying reminds us that proper preparation is essential for success. This article shares the background, features, and architecture of Weidian's Mars big data development platform.

1. Why a big data development platform is needed

Before April 2016, developers logged into a gateway machine with Hive and Hadoop clients, wrote scripts, scheduled them via crontab, and manually synchronized code to Git. This approach suffered from low efficiency, lack of version control, poor testing, no permission management, unreliable scheduling, no failure alerts, and a single point of failure on the gateway.

2. Required functional features

Version control for easy rollback.

Standardized development, testing, and deployment processes.

Permission control so only owners or admins can operate tasks.

Dependency scheduling that automatically triggers a task after all its upstream tasks succeed.

Failure notifications to task owners for manual intervention.

Manual recovery with automatic downstream re‑execution.

Task dependency graph with color‑coded success/failure.

Distributed storage so a single machine failure does not affect the platform.

Input/output validation to ensure tables are ready and results are complete.

Controlled Hadoop resource usage per team queue.

3. Composition of the Mars platform

The platform builds on HDFS, YARN, and HiveMeta services, supports data discovery via Hive and Kylin, and will integrate ES, Redis, Hades, etc. It currently handles Shell, Hive, MapReduce, and Spark task types.

4. System architecture design

All Mars nodes run Hadoop and Hive clients. Each node has a master and worker role. The master manages jobs and assigns them to workers, while workers execute jobs and report logs.

5. Distributed system architecture

Master/Worker election – Nodes compete to become master by inserting a record in the database; the master periodically updates this record. Workers monitor the timestamp and elect a new master if the current one becomes stale.
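The election described above can be sketched with a single-row lock table. This is only an illustration under stated assumptions, not the Mars implementation: the table name `master_lock`, the function names, the 30-second staleness threshold, and the use of SQLite are all hypothetical stand-ins for whatever database and schema the platform actually uses.

```python
import sqlite3

# Seconds without a heartbeat before the master record counts as stale
# (assumed value; the article does not state the real threshold).
STALE_AFTER = 30.0

def init(conn):
    # A single-row table: the CHECK constraint guarantees at most one master record.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS master_lock ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "node TEXT NOT NULL, heartbeat REAL NOT NULL)"
    )

def try_become_master(conn, node, now):
    """Return True if `node` holds the master record after this call."""
    with conn:  # one transaction: commit on success, roll back on error
        # The first node to insert the row wins the initial election.
        conn.execute(
            "INSERT OR IGNORE INTO master_lock (id, node, heartbeat) VALUES (1, ?, ?)",
            (node, now),
        )
        # The current master refreshes its heartbeat; any other node may
        # take over only when the existing heartbeat is stale.
        conn.execute(
            "UPDATE master_lock SET node = ?, heartbeat = ? "
            "WHERE id = 1 AND (node = ? OR heartbeat < ?)",
            (node, now, node, now - STALE_AFTER),
        )
        row = conn.execute("SELECT node FROM master_lock WHERE id = 1").fetchone()
    return row[0] == node
```

Every node would call `try_become_master` on a timer. For the current master the call acts as a heartbeat, and for everyone else it is a takeover attempt that succeeds only once the record has gone stale.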

Master‑worker failover – Communication uses protobuf over Netty. The master regularly checks worker connections; if a worker fails, its tasks are reassigned. If the master fails, a worker takes over, aborts running jobs, and redistributes them.
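The worker-failure half of this failover can be sketched as follows. The article says the master checks worker connections over Netty; the sketch below substitutes a plain last-seen timestamp for that connection check, and the `Worker` class, the `reassign_from_dead_workers` function, and the 15-second timeout are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    last_seen: float                      # time of the last successful health check
    tasks: list = field(default_factory=list)

def reassign_from_dead_workers(workers, now, timeout=15.0):
    """Move tasks off workers not seen within `timeout` seconds.

    Returns a list of (task, new_worker_name) pairs describing what moved.
    """
    alive = [w for w in workers if now - w.last_seen <= timeout]
    dead = [w for w in workers if now - w.last_seen > timeout]
    if not alive:
        return []  # nobody can take the work; leave tasks where they are
    moved = []
    for w in dead:
        while w.tasks:
            task = w.tasks.pop(0)
            # Hand each orphaned task to the live worker with the fewest tasks.
            target = min(alive, key=lambda x: len(x.tasks))
            target.tasks.append(task)
            moved.append((task, target.name))
    return moved
```

Picking the least-loaded live worker is one simple policy. The article does not say which policy Mars uses when it redistributes tasks.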

6. Timed and dependency scheduling

Timed scheduling – Implemented with Quartz, a Java-based job scheduler. Developers implement the org.quartz.Job interface and put the business logic in its execute method.

Dependency scheduling – A job is triggered only after all its dependent jobs have successfully completed. The platform tracks ready and pending dependencies and fires the job when all prerequisites are met.
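A minimal sketch of this ready/pending bookkeeping is shown below. The class name `DependencyScheduler` and its methods are hypothetical, and resetting the pending set after a job fires is an assumption about how a recurring schedule might be handled, not something the article states.

```python
class DependencyScheduler:
    def __init__(self, deps):
        # deps maps each job to the set of upstream jobs it waits for.
        self.deps = {job: set(ups) for job, ups in deps.items()}
        # pending holds the upstream jobs that have not yet succeeded this cycle.
        self.pending = {job: set(ups) for job, ups in deps.items()}
        # Reverse index: upstream job -> jobs that depend on it.
        self.downstream = {}
        for job, ups in self.deps.items():
            for up in ups:
                self.downstream.setdefault(up, []).append(job)

    def on_success(self, finished):
        """Record that `finished` succeeded; return jobs that became runnable."""
        fired = []
        for job in self.downstream.get(finished, []):
            self.pending[job].discard(finished)
            if not self.pending[job]:
                fired.append(job)
                # Assumed behavior: start waiting again for the next cycle.
                self.pending[job] = set(self.deps[job])
        return fired
```

A failed upstream job never reports success, so its downstream jobs stay pending until someone reruns it. This fits the manual recovery with automatic downstream re-execution listed among the required features.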

7. What happens during task execution

Backend

User initiates execution from the development or scheduling center.

Worker performs pre‑checks (permissions, parameter substitution).

Worker sends execution request to the master.

Master queues the task.

Master selects an appropriate worker (load balancing) to run the task.

Preparation steps such as resource download and data table monitoring.

Task execution.

Post‑processing including data drift checks, old partition cleanup, and success flagging.

If result files are generated, they are uploaded to Hadoop for later download.
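The pre-check step in the backend flow above (permissions, parameter substitution) might look like the sketch below. The `${name}` placeholder syntax, the shape of the task dictionary, and both function names are assumptions made for illustration, since the article does not describe Mars's actual parameter format.

```python
import re

def substitute_params(script, params):
    """Replace ${name} placeholders; fail loudly if a value is missing."""
    def repl(match):
        key = match.group(1)
        if key not in params:
            raise KeyError(f"unresolved parameter: {key}")
        return str(params[key])
    return re.sub(r"\$\{(\w+)\}", repl, script)

def pre_check(user, task, params):
    """Run the worker-side checks and return the script ready for submission."""
    # Only the owner or an admin may operate the task.
    if user != task["owner"] and user not in task["admins"]:
        raise PermissionError(f"{user} may not run this task")
    return substitute_params(task["script"], params)
```

Rejecting unresolved placeholders before the request reaches the master keeps a half-substituted script from ever entering the queue.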

Frontend

User triggers the task.

Frontend periodically polls logs and updates the UI.

Task completion notification.

If results exist, the frontend fetches them from Hadoop.

Data visualization.
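The log-polling step of the frontend flow can be sketched as an offset-based loop. A real frontend would run in the browser; the logic is written in Python here only to show its shape. The `fetch` and `render` callables, the status strings, and the two-second interval are hypothetical.

```python
import time

def poll_logs(fetch, render, interval=2.0, sleep=time.sleep):
    """Poll until the task reaches a terminal state, then return that state.

    `fetch(offset)` is assumed to return (new_log_text, status), where
    new_log_text is everything written since `offset`.
    """
    offset = 0
    while True:
        chunk, status = fetch(offset)
        if chunk:
            render(chunk)          # append only the new text to the UI
            offset += len(chunk)   # next request asks for what follows
        if status in ("SUCCESS", "FAILED"):
            return status
        sleep(interval)
```

Requesting only the text after the last known offset keeps each poll small even when a task produces a large log.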

8. Legacy issues and future directions

Legacy issues:

Tasks that should have been triggered while the master was down or being redeployed can go undetected.

Redeployment can cancel and reassign running tasks, which is costly for long-running jobs.

Data quality checks are limited to data size; deeper content validation is needed.

Future development:

Resource billing to standardize Hadoop usage.

Data catalog for easier data discovery.

Lineage tracking to trace data origins.

Data flow integration for seamless data exchange.

We welcome passionate engineers to join the Weidian team (Hangzhou: join‑[email protected], Beijing: join‑[email protected]).
