How ByteDance Built a Cloud‑Native Big Data Ops Platform for Unified Logging & Alerts
ByteDance’s cloud‑native big data operations platform consolidates logging, monitoring, and alerting across heterogeneous environments. It combines unified log collection (intrusive collectors and Filebeat), dynamic alert rules, customizable notification plugins, and scalable monitoring pipelines to reduce operational complexity, shield users from infrastructure differences, and improve multi‑tenant efficiency.
Unified Logging, Monitoring, and Alerting
Cloud‑native big data is the next‑generation architecture for data platforms. As ByteDance’s internal services grew rapidly, traditional big‑data ops platforms showed drawbacks such as numerous components, complex installation, tight coupling with underlying environments, and lack of out‑of‑the‑box logging, monitoring, and alerting for business teams.
To address this, a series of cloud‑native big‑data ops practices were implemented, aiming to reduce business‑side state awareness, hide environment differences, and provide a consistent experience across environments.
Logging
Logging is a major source of portability challenges. A unified log‑collection pipeline was built to achieve business isolation, efficient collection, fair resource allocation, and reliable, secure delivery.
Two collection methods are supported:
Intrusive collection: SDK‑based collectors are provided for Java and Python services; components that prefer writing logs to files fall back to Filebeat.
Filebeat collection: in containerized scenarios, Filebeat collects logs according to rules defined in a custom CRD. It is deployed as a DaemonSet when node information is available, and injected as a sidecar when it is not.
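The deployment decision above can be sketched in a few lines. Note that `CollectionRule` and its fields are hypothetical stand‑ins for the platform's custom CRD, not its actual schema:

```python
from dataclasses import dataclass

@dataclass
class CollectionRule:
    """Hypothetical stand-in for the custom CRD describing a log source."""
    workload: str
    log_path: str
    node_info_available: bool  # can we see which node the pod runs on?

def choose_deployment(rule: CollectionRule) -> str:
    """Pick a Filebeat deployment mode: DaemonSet when node (host)
    information is available, sidecar injection otherwise."""
    return "DaemonSet" if rule.node_info_available else "Sidecar"
```

A DaemonSet amortizes one Filebeat per node across all pods there, while a sidecar trades that efficiency for working in environments where node placement is opaque.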
Alerting
The alerting system is built on the open‑source Nightingale project, using Prometheus for metric storage and a database for alert business data. Its two core components are WebApi and Server: WebApi handles user interactions such as rule CRUD and metric queries, while Server loads rules, generates alert events, and sends notifications. Customizations focus on these two components.
Process Overview
Users create alert rules via WebApi, which persists them to the database. Server loads the rules into memory, uses consistent hashing to distribute rule evaluation across its instances, and runs the associated metric queries to detect alert conditions.
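The consistent‑hashing step can be illustrated with a minimal hash ring, so each rule is evaluated by exactly one Server instance and adding or removing an instance only remaps a fraction of the rules. This is a generic sketch, not Nightingale's actual implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: each Server instance owns arcs of the
    hash space, so each alert rule has exactly one owner."""

    def __init__(self, nodes, replicas=100):
        # Virtual nodes (replicas) smooth out the load distribution.
        self._ring = []
        for node in nodes:
            for i in range(replicas):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, rule_id: str) -> str:
        """Return the instance responsible for evaluating this rule."""
        idx = bisect.bisect(self._keys, self._hash(rule_id)) % len(self._keys)
        return self._ring[idx][1]
```

Each Server instance calls `owner(rule_id)` for every loaded rule and evaluates only the rules it owns, so no rule is evaluated twice and no coordinator is needed.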
When an alert fires, the system invokes the appropriate notification module and records the event in the database. Optimizations include a unified user system with groups and duty rosters, incremental rule loading, and dynamic message templates supporting channels such as DingTalk, WeChat, Feishu, and SMS.
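Incremental rule loading means Server only pulls rules changed since its last sync instead of reloading everything. A minimal sketch, where `fetch_updated_since` is a hypothetical stand‑in for a database query filtered on an `update_at` column:

```python
import time

class RuleCache:
    """Sketch of incremental rule loading: only rules updated since the
    last sync are fetched, instead of a full reload every cycle."""

    def __init__(self):
        self.rules = {}       # rule_id -> rule dict
        self.last_sync = 0.0  # timestamp of the previous sync

    def sync(self, fetch_updated_since):
        """fetch_updated_since(ts) stands in for a query like
        SELECT * FROM alert_rule WHERE update_at > ts."""
        now = time.time()
        for rule in fetch_updated_since(self.last_sync):
            if rule.get("deleted"):
                self.rules.pop(rule["id"], None)  # drop removed rules
            else:
                self.rules[rule["id"]] = rule     # add or refresh
        self.last_sync = now
```

With thousands of rules per tenant, this keeps each sync proportional to the number of changes rather than the total rule count.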
Dynamic Message Templates
Templates can reference alert event information to assemble rich contextual messages.
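The idea can be shown with Python's standard `string.Template`; the event field names below are illustrative, not the platform's actual schema:

```python
from string import Template

# Hypothetical alert event fields; real field names will differ.
event = {
    "rule_name": "HDFS DataNode down",
    "severity": "P1",
    "target": "datanode-03",
    "trigger_value": "0",
    "trigger_time": "2023-05-12 10:30:00",
}

# The template references event fields to build a contextual message.
template = Template(
    "[$severity] $rule_name\n"
    "target: $target\n"
    "value: $trigger_value at $trigger_time"
)

# safe_substitute leaves unknown placeholders intact instead of raising.
message = template.safe_substitute(event)
```

Because templates are data rather than code, each channel (Feishu card, SMS text, etc.) can define its own layout over the same event without touching the alerting pipeline.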
Notification methods are implemented as plugins, allowing environment‑specific extensions while keeping core workflow unchanged.
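One common way to structure such a plugin mechanism is a registry keyed by channel name; this is a generic sketch of the pattern, with made‑up sender functions, not the platform's actual API:

```python
from typing import Callable, Dict, List

# Registry mapping channel name -> sender function. New environments
# register their own channels without touching the core workflow.
SENDERS: Dict[str, Callable[[str, str], str]] = {}

def register(channel: str):
    """Decorator that registers a sender under a channel name."""
    def wrap(fn):
        SENDERS[channel] = fn
        return fn
    return wrap

@register("feishu")
def send_feishu(user: str, msg: str) -> str:
    # A real plugin would call the Feishu webhook API here.
    return f"feishu->{user}: {msg}"

@register("sms")
def send_sms(user: str, msg: str) -> str:
    # A real plugin would call an SMS gateway here.
    return f"sms->{user}: {msg}"

def notify(channels: List[str], user: str, msg: str) -> List[str]:
    """Fan a message out to every registered channel in the list."""
    return [SENDERS[c](user, msg) for c in channels if c in SENDERS]
```

Unknown channels are silently skipped here; a production system would log or alert on them instead.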
Monitoring
Each cluster runs a Prometheus agent that scrapes component metrics and remote‑writes them to a centralized monitoring system. The backend storage can be a public‑cloud monitoring service, S3, or a custom big‑data store, with a query service providing visualization and front‑end interaction.
Optimizations include horizontal query splitting for large time ranges, caching immutable monitoring data, and pre‑aggregation and down‑sampling to improve performance. The system supports one‑click metric collection across environments, integrates with logging, alerting, and tracing, and offers strong horizontal scalability.
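Two of these optimizations, horizontal query splitting and down‑sampling, are easy to sketch. Assumptions: time ranges are in seconds, sub‑ranges are queried independently (and, once fully in the past, cached as immutable), and down‑sampling averages raw samples into fixed buckets:

```python
def split_range(start: int, end: int, chunk: int):
    """Split a large [start, end) query window into sub-ranges that can
    be fetched in parallel and cached independently once immutable."""
    t = start
    while t < end:
        yield (t, min(t + chunk, end))
        t += chunk

def downsample(points, step):
    """Average raw (timestamp, value) samples into buckets of `step`
    seconds, shrinking the data returned for large time ranges."""
    buckets = {}
    for ts, v in points:
        bucket = ts - ts % step
        buckets.setdefault(bucket, []).append(v)
    return [(b, sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]
```

Splitting bounds the work of any single backend query, while pre‑aggregated, down‑sampled series keep dashboard queries over weeks of data cheap.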
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and hope to grow with you throughout your operations career.