Big Data 19 min read

How Berserker’s Big Data Platform Solved Scheduling, State and Scaling Challenges

This article details the architecture, evolution, and technical solutions of the Berserker big‑data platform—including component design, state‑management problems, release strategies, two‑phase commit, RPC handling, routing, message queuing, containerized execution, dependency model redesign, and future roadmap—demonstrating how the system achieved high availability, low latency, and scalable operations.

ITPUB
ITPUB
ITPUB
How Berserker’s Big Data Platform Solved Scheduling, State and Scaling Challenges

1. Platform Overview

The Berserker (project code "Berserker – 狂战士") platform is a one‑stop data development and governance solution built on big‑data ecosystem components, covering data collection, transmission, storage, querying, development, analysis, mining, testing, execution, and operations. It serves various internal roles such as analysts, product owners, and operators, providing data maps, ad‑hoc queries, ETL, reporting tools, data integration, quality, asset management, and API services.

2. Data Development Architecture

The data‑development subsystem includes offline batch scheduling, real‑time stream computing, ETL, ad‑hoc queries, user APIs, and an operations center. It currently runs >40 micro‑services, handling >150k offline tasks, >250k daily routine tasks, >10k task chains, and >4k streaming tasks.

The internal scheduler, code‑named Archer , consists of:

CN (Control Node) : handles timing, dependencies, rate‑limiting, routing, submission, and cluster management; also the client side of communication.

EN (Execute Node) : receives tasks from CN, runs them locally or on a cluster, and reports status back to CN; server side of communication.

API : web‑level task management and external API layer.

SqIScan : SQL parsing and compilation service.

DataManager : manages task execution clusters and cross‑datacenter data replication.

Blackhole : unified Kerberos authentication.

Admin : UI for configuring rate‑limiting, routing policies, and EN management.

3. Key Issues and Solutions

3.1 State Management Problems

Both CN and EN maintain state; crashes or restarts often caused tasks to hang, duplicate execution, or miss submission (Misfire). Early design used Zookeeper for CN leader election and stored CN state in Redis, with bidirectional heartbeats for CN‑EN status. Problems included Zookeeper instability, network partitions, inconsistent Redis state, and stateless NIO heartbeats.

Solution: replace Zookeeper with an embedded Raft library, moving CN state into a Raft state machine. Raft provides strong consistency, eliminating split‑brain scenarios. Recovery time is now <3.5 s (Raft election timeout + log apply + DB restore). Raft logs store task instance state and scheduling info, while other data remain in the database. EN‑CN status communication switched to the internal bilibili‑remoting framework, which embeds node‑status management and removes third‑party health‑check dependencies.

3.2 EN Release Issues

Previously EN releases killed running tasks, forcing CN to resubmit them—wasting resources and delaying long‑running jobs. The new smooth‑release flow:

Release system notifies CN to stop sending tasks to the EN being upgraded.

Running tasks on the EN are polled until they finish (may take a long time).

After all tasks complete, the EN is upgraded and CN resumes task submission.

3.3 Two‑Phase Commit for Tasks

During Raft leader switches, a task could be marked DISPATCHED in the state machine while the RPC to EN had not completed, leading to lost submissions. The new protocol records START_DISPATCH before RPC, switches to END_DISPATCH after successful RPC, and includes a recovery step that queries EN for START_DISPATCH tasks to retry if needed.

3.4 RPC Duplicate Submissions

Duplicate RPC submissions were mitigated by requiring EN ACKs. On timeout, CN queries EN for task status; repeated queries continue until a definitive answer or EN crash is detected.

3.5 Routing and Gray‑Release

Routing rules now support >50 attribute combinations per task, allowing fine‑grained control over machine/cluster selection, execution image, and gray‑release policies (e.g., first run uses gray strategy, fallback on failure).

3.6 Message Overload (SmartQueue)

EN reports task progress frequently, overwhelming CN’s database ingestion. A SmartQueue with high/low watermarks and mergeable messages was introduced. Mergeable messages (same JobHistory) overwrite previous progress, reducing storage and processing load.

3.7 Execution Management via Docker

Tasks are now executed either as local workers or Docker containers. DockerD, EN, and a LogAgent form a three‑component stack: EN launches containers based on task metadata, mounts a context directory, and streams logs to LogAgent, which forwards them to Elasticsearch. Kerberos tickets are passed via mounted files for secure access.

3.8 Dependency Model Refactor (Project → Task)

The original project‑based model limited cross‑project table dependencies. Introducing explicit root and end nodes allowed conversion of project dependencies into task dependencies, enabling zero‑risk migration and more accurate DAG construction.

3.9 Big‑Data Operations

When upstream ODS data quality degrades, the platform provides both real‑time (night‑time) and post‑run (day‑time) remediation workflows, including data blocking, targeted repairs, and downstream task re‑execution. Query capabilities support multi‑dimensional filtering, and one‑click actions enable retry, block, or backfill operations.

4. Future Work

Planned improvements include:

EN Stateless Design : Remove EN state entirely, cache only, and source status from DockerD to shorten release windows.

Kubernetes Integration : Migrate CN/EN functionalities to K8s, allowing EN workers to join a shared cluster pool for better resource utilization.

Unified Real‑Time & Batch Platform : Consolidate streaming and batch execution under a common scheduling, resource‑management, and SQL‑parsing layer, exposing a single web entry point and unified execution protocol.

These efforts aim to further improve availability (target 4 9’s), reduce scheduling latency (<5 s 99th percentile), and enhance scalability and cost efficiency across the platform.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

DockerState ManagementKubernetestask schedulingData PlatformRaft
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.