How ByteDance Built a Cloud‑Native Big Data Ops Platform for Unified Logging & Alerts
ByteDance’s cloud‑native big data operations platform consolidates logging, monitoring, and alerting across heterogeneous environments. It combines unified log collection (intrusive collectors and Filebeat), dynamic alert rules, customizable notification plugins, and scalable monitoring pipelines to reduce operational complexity, shield users from infrastructure differences, and improve multi‑tenant efficiency.
Unified Logging, Monitoring, and Alerting
Cloud‑native big data is the next‑generation architecture for data platforms. As ByteDance’s internal services grew rapidly, traditional big‑data ops platforms showed drawbacks such as numerous components, complex installation, tight coupling with underlying environments, and lack of out‑of‑the‑box logging, monitoring, and alerting for business teams.
To address this, a series of cloud‑native big‑data ops practices were implemented, aiming to reduce business‑side state awareness, hide environment differences, and provide a consistent experience across environments.
Logging
Logging is a major source of portability challenges. A unified log‑collection pipeline was built to achieve business isolation, efficient collection, fair resource allocation, and reliable, secure delivery.
Two collection methods are supported:
Intrusive collection: SDK‑based collectors are provided for Java and Python services; components that prefer writing logs to files fall back to Filebeat.
Filebeat collection: in containerized scenarios, Filebeat collects logs according to rules defined in a custom CRD. It is deployed as a DaemonSet when node information is available, and injected as a sidecar when it is not.
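The deployment decision above can be sketched in a few lines. Note that `CollectionRule` and its fields are hypothetical stand‑ins for the platform's custom CRD, not its actual schema:

```python
from dataclasses import dataclass

@dataclass
class CollectionRule:
    """Hypothetical stand-in for the custom CRD describing a log source."""
    workload: str
    log_path: str
    node_info_available: bool  # can we see which node the pod runs on?

def choose_deployment(rule: CollectionRule) -> str:
    """Pick a Filebeat deployment mode: DaemonSet when node (host)
    information is available, sidecar injection otherwise."""
    return "DaemonSet" if rule.node_info_available else "Sidecar"
```

A DaemonSet amortizes one Filebeat per node across all pods there, while a sidecar trades that efficiency for working in environments where node placement is opaque.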
Alerting
The alerting system is built on the open‑source Nightingale project, using Prometheus for metric storage and a database for alert business data. Its two core components are WebApi and Server: WebApi handles user interactions such as rule CRUD and metric queries, while Server loads rules, generates alert events, and sends notifications. Customizations focus on these two components.
Process Overview
Users create alert rules via WebApi, which persists them to the database. Server loads the rules into memory, uses consistent hashing to distribute rule evaluation across its instances, and runs the associated metric queries to detect alert conditions.
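The consistent‑hashing step can be illustrated with a minimal hash ring, so each rule is evaluated by exactly one Server instance and adding or removing an instance only remaps a fraction of the rules. This is a generic sketch, not Nightingale's actual implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: each Server instance owns arcs of the
    hash space, so each alert rule has exactly one owner."""

    def __init__(self, nodes, replicas=100):
        # Virtual nodes (replicas) smooth out the load distribution.
        self._ring = []
        for node in nodes:
            for i in range(replicas):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, rule_id: str) -> str:
        """Return the instance responsible for evaluating this rule."""
        idx = bisect.bisect(self._keys, self._hash(rule_id)) % len(self._keys)
        return self._ring[idx][1]
```

Each Server instance calls `owner(rule_id)` for every loaded rule and evaluates only the rules it owns, so no rule is evaluated twice and no coordinator is needed.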
When an alert fires, the system invokes the appropriate notification module and records the event in the database. Optimizations include a unified user system with groups and duty rosters, incremental rule loading, and dynamic message templates supporting channels such as DingTalk, WeChat, Feishu, and SMS.
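Incremental rule loading means Server only pulls rules changed since its last sync instead of reloading everything. A minimal sketch, where `fetch_updated_since` is a hypothetical stand‑in for a database query filtered on an `update_at` column:

```python
import time

class RuleCache:
    """Sketch of incremental rule loading: only rules updated since the
    last sync are fetched, instead of a full reload every cycle."""

    def __init__(self):
        self.rules = {}       # rule_id -> rule dict
        self.last_sync = 0.0  # timestamp of the previous sync

    def sync(self, fetch_updated_since):
        """fetch_updated_since(ts) stands in for a query like
        SELECT * FROM alert_rule WHERE update_at > ts."""
        now = time.time()
        for rule in fetch_updated_since(self.last_sync):
            if rule.get("deleted"):
                self.rules.pop(rule["id"], None)  # drop removed rules
            else:
                self.rules[rule["id"]] = rule     # add or refresh
        self.last_sync = now
```

With thousands of rules per tenant, this keeps each sync proportional to the number of changes rather than the total rule count.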
Dynamic Message Templates
Templates can reference alert event information to assemble rich contextual messages.
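The idea can be shown with Python's standard `string.Template`; the event field names below are illustrative, not the platform's actual schema:

```python
from string import Template

# Hypothetical alert event fields; real field names will differ.
event = {
    "rule_name": "HDFS DataNode down",
    "severity": "P1",
    "target": "datanode-03",
    "trigger_value": "0",
    "trigger_time": "2023-05-12 10:30:00",
}

# The template references event fields to build a contextual message.
template = Template(
    "[$severity] $rule_name\n"
    "target: $target\n"
    "value: $trigger_value at $trigger_time"
)

# safe_substitute leaves unknown placeholders intact instead of raising.
message = template.safe_substitute(event)
```

Because templates are data rather than code, each channel (Feishu card, SMS text, etc.) can define its own layout over the same event without touching the alerting pipeline.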
Notification methods are implemented as plugins, allowing environment‑specific extensions while keeping core workflow unchanged.
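One common way to structure such a plugin mechanism is a registry keyed by channel name; this is a generic sketch of the pattern, with made‑up sender functions, not the platform's actual API:

```python
from typing import Callable, Dict, List

# Registry mapping channel name -> sender function. New environments
# register their own channels without touching the core workflow.
SENDERS: Dict[str, Callable[[str, str], str]] = {}

def register(channel: str):
    """Decorator that registers a sender under a channel name."""
    def wrap(fn):
        SENDERS[channel] = fn
        return fn
    return wrap

@register("feishu")
def send_feishu(user: str, msg: str) -> str:
    # A real plugin would call the Feishu webhook API here.
    return f"feishu->{user}: {msg}"

@register("sms")
def send_sms(user: str, msg: str) -> str:
    # A real plugin would call an SMS gateway here.
    return f"sms->{user}: {msg}"

def notify(channels: List[str], user: str, msg: str) -> List[str]:
    """Fan a message out to every registered channel in the list."""
    return [SENDERS[c](user, msg) for c in channels if c in SENDERS]
```

Unknown channels are silently skipped here; a production system would log or alert on them instead.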
Monitoring
Each cluster runs a Prometheus agent that scrapes component metrics and remote‑writes them to a centralized monitoring system. The backend storage can be a public‑cloud monitoring service, S3, or a custom big‑data store, with a query service providing visualization and front‑end interaction.
Optimizations include horizontal query splitting for large time ranges, caching immutable monitoring data, and pre‑aggregation and down‑sampling to improve performance. The system supports one‑click metric collection across environments, integrates with logging, alerting, and tracing, and offers strong horizontal scalability.
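Two of these optimizations, horizontal query splitting and down‑sampling, are easy to sketch. Assumptions: time ranges are in seconds, sub‑ranges are queried independently (and, once fully in the past, cached as immutable), and down‑sampling averages raw samples into fixed buckets:

```python
def split_range(start: int, end: int, chunk: int):
    """Split a large [start, end) query window into sub-ranges that can
    be fetched in parallel and cached independently once immutable."""
    t = start
    while t < end:
        yield (t, min(t + chunk, end))
        t += chunk

def downsample(points, step):
    """Average raw (timestamp, value) samples into buckets of `step`
    seconds, shrinking the data returned for large time ranges."""
    buckets = {}
    for ts, v in points:
        bucket = ts - ts % step
        buckets.setdefault(bucket, []).append(v)
    return [(b, sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]
```

Splitting bounds the work of any single backend query, while pre‑aggregated, down‑sampled series keep dashboard queries over weeks of data cheap.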
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and hope to grow with you throughout your operations career.