
Berserker Big Data Platform: Architecture, Development Practices, and Operational Enhancements

This article gives a comprehensive overview of the Berserker big-data platform. It covers the platform's overall design and data-development components; the key architectural challenges the team addressed, including state management, release workflows, two-phase commit, RPC deduplication, task routing, message handling, execution isolation, and a dependency-model redesign; and planned future work on stateless execution nodes, Kubernetes integration, and unified stream-batch processing.

DataFunTalk

Platform Overview

The Berserker platform is a one‑stop data development and governance system built on big‑data ecosystem components, supporting data collection, transmission, storage, querying, development, analysis, mining, testing, execution, and operations for various internal roles.

Data Development

Core functionalities include offline batch scheduling, real-time stream computing, ETL development, ad-hoc queries, user APIs, and an operations center. The platform comprises more than 40 microservices, with the Kratos framework at the microservice layer.

Architecture and Core Components

The scheduling system (project code-named Archer) consists of a Control Node (CN) for scheduling control, Execute Nodes (EN) for task execution, API services, SqlScan for SQL parsing, DataManager for IDC management, Blackhole for Kerberos authentication, and an Admin console for configuration.

Key Challenges and Solutions

State Issues: Transitioned from Zookeeper and Redis to Raft for strong consistency, eliminating split‑brain and state loss problems.

EN Release Problems: Implemented a smooth release workflow that pauses task submission, waits for running tasks to finish, then resumes submission.
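
The pause/drain/resume cycle described above can be sketched in a few lines. This is an illustrative model only; the class and method names here (`ExecuteNode`, `drain`, etc.) are assumptions, not the platform's real API:

```python
import threading

class ExecuteNode:
    """Sketch of a smooth-release workflow: pause new submissions,
    wait for in-flight tasks to finish, then resume."""

    def __init__(self):
        self._paused = False
        self._running = 0
        self._cond = threading.Condition()

    def submit(self, task):
        with self._cond:
            if self._paused:
                return False  # the CN should dispatch this task elsewhere
            self._running += 1
        try:
            task()  # simplified: run synchronously
        finally:
            with self._cond:
                self._running -= 1
                self._cond.notify_all()
        return True

    def drain(self):
        """Pause submission and block until running tasks complete."""
        with self._cond:
            self._paused = True
            while self._running:
                self._cond.wait()
        # safe point: the EN process can now be replaced and restarted

    def resume(self):
        with self._cond:
            self._paused = False
```

The key property is that `drain()` returns only at a quiescent point, so a release never kills a task mid-flight.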

Two‑Phase Commit: Added explicit START_DISPATCH and END_DISPATCH states in Raft to ensure atomicity between state changes and RPC calls.
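
A minimal sketch of that two-phase pattern, with the replicated Raft state machine stood in for by a plain dict (the function names and the `probe_en` recovery hook are illustrative assumptions):

```python
from enum import Enum

class DispatchState(Enum):
    PENDING = "PENDING"
    START_DISPATCH = "START_DISPATCH"  # logged before the RPC
    END_DISPATCH = "END_DISPATCH"      # logged after the EN accepts

def dispatch(task_id, log, send_rpc):
    """Replicate the state change first, perform the side effect second."""
    log[task_id] = DispatchState.START_DISPATCH
    ok = send_rpc(task_id)
    if ok:
        log[task_id] = DispatchState.END_DISPATCH
    return ok

def recover(log, probe_en):
    """After a CN failover, a task stuck in START_DISPATCH is ambiguous:
    the RPC may or may not have landed. Probe the EN before resubmitting
    so recovery never creates a duplicate task."""
    resubmit = []
    for task_id, state in log.items():
        if state is DispatchState.START_DISPATCH and not probe_en(task_id):
            resubmit.append(task_id)
    return resubmit
```

Because `START_DISPATCH` is committed before the RPC is sent, a crash between the two steps leaves an explicit marker to drive recovery, rather than a silent inconsistency.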

RPC Duplication: Introduced EN ACK mechanism and timeout‑based re‑checks to avoid duplicate task submissions.
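
The ACK-plus-recheck idea can be sketched from the CN's side as follows; the helper names (`send`, `check_status`) and the status strings are hypothetical placeholders:

```python
def submit_with_ack(task_id, send, check_status, retries=3):
    """Submit a task and wait for the EN's ACK. If the ACK is lost
    (timeout), re-check whether the EN actually accepted the task
    before retrying, so a dropped reply never causes a duplicate
    submission."""
    for _ in range(retries):
        if send(task_id) == "ACK":
            return True
        # No ACK within the timeout, but the RPC may still have landed:
        if check_status(task_id) == "RUNNING":
            return True  # the EN has it; do not resubmit
    return False
```

The re-check step is what breaks the "retry on timeout therefore duplicate" failure mode: a retry happens only once the CN has confirmed the EN does not hold the task.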

Task Routing & Gray‑Release: Developed a rule‑engine‑driven routing system supporting 50+ attribute combinations, tag‑based machine/cluster selection, and gray‑release fallback.
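
The first-match-wins shape of such a rule engine, with the gray-release fallback as the default pool, can be sketched briefly (the rule schema and pool names are invented for illustration):

```python
def route(task, rules, default_pool):
    """Pick an execution pool for a task: each rule lists required
    attribute values and a target pool; the first rule whose attributes
    all match wins, otherwise fall back to the default pool."""
    for rule in rules:
        if all(task.get(k) == v for k, v in rule["match"].items()):
            return rule["pool"]
    return default_pool
```

Ordering the rules from most to least specific lets a narrow gray-release rule capture only the intended slice of traffic while everything else keeps flowing to stable pools.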

Message Overload: Designed a SmartQueue with high/low watermarks and merge‑able messages to handle massive task‑status streams.
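
A toy version of the watermark-and-merge idea, assuming (as the text suggests) that status messages for the same task supersede one another so only the latest needs to survive; the class name and thresholds are illustrative:

```python
from collections import OrderedDict

class SmartQueue:
    """Status messages for the same task are merged (latest wins), and
    the queue signals backpressure once the high watermark is reached,
    clearing it after draining below the low watermark."""

    def __init__(self, high=1000, low=200):
        self._buf = OrderedDict()  # task_id -> latest status
        self._high, self._low = high, low
        self.backpressure = False

    def put(self, task_id, status):
        self._buf.pop(task_id, None)  # merge: drop the stale status
        self._buf[task_id] = status
        if len(self._buf) >= self._high:
            self.backpressure = True  # ask producers to slow down

    def get(self):
        task_id, status = self._buf.popitem(last=False)  # FIFO
        if self.backpressure and len(self._buf) <= self._low:
            self.backpressure = False
        return task_id, status

    def __len__(self):
        return len(self._buf)
```

Using separate high and low watermarks gives hysteresis, so the backpressure signal does not flap at the boundary.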

Execution Management: Adopted Docker containers with Dockerd and LogAgent for isolation, resource control, and unified logging.

Dependency Model Refactor: Replaced project‑level dependencies with root/end node task dependencies, enabling zero‑risk migration.
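
Under the refactored model, a project's root tasks depend directly on the upstream project's end tasks, so readiness reduces to ordinary task-level dependency resolution. A minimal sketch of that check (the task names are invented):

```python
def ready_tasks(deps, finished):
    """`deps` maps each task to the upstream tasks it waits on.
    A task is ready once every one of its upstream tasks has finished."""
    return sorted(t for t, ups in deps.items()
                  if t not in finished and all(u in finished for u in ups))
```

For example, with project B's root task depending on project A's end task, B becomes schedulable the moment A's end task finishes, without any project-level coupling.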

Big‑Data Operations: Built lightweight, fast‑response tools for incident handling, data repair, and batch re‑runs, supporting both real‑time and post‑mortem scenarios.

Future Work

Stateless EN: Move EN state into the Dockerd cache, simplifying releases.

Kubernetes Support: Migrate CN/EN functionalities to K8s for better resource utilization.

Unified Stream‑Batch Platform: Consolidate offline and real‑time processing under a common scheduling and execution framework.

Tags: Docker, Big Data, distributed scheduling, task management, data platform, Raft
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
