How EyesTSDB Evolved into a Cloud‑Native, Second‑Level Monitoring Platform
This article details the evolution of NetEase's self‑built time‑series database EyesTSDB into a cloud‑native, second‑level monitoring solution, covering its architecture, core features, integration with VictoriaMetrics, custom plugin workflow, CMDB linkage, real‑world use cases, and future challenges.
1. Overview of the Self‑Developed Time‑Series Database EyesTSDB
EyesTSDB was built to support global, second‑level monitoring for NetEase's gaming services, spanning more than ten countries and twenty regions across both public and private cloud environments. It monitors over 1.5 billion aggregated metrics from 15 million entities, including physical machines and container workloads.
The system provides multiple monitoring modes (basic, network, process, and custom business metrics) through 1,000+ agent plugins and SDKs, and ingests data over HTTP or other network protocols.
Architecture Design
Monitoring Center : UI, alarm interface, scheduling, data bus, storage, and plugin repository.
Monitoring Region : Region nodes and monitored machines equipped with agents.
Arbiter : Scheduling core that generates machine configurations from the CMDB and keeps Regions healthy through an active‑standby mode.
Data Bus : Collects metrics and forwards them via Kafka.
Data Storage : Implements hot‑cold separation; hot data resides in Redis for six hours, cold data is persisted to MongoDB.
Plugin Repository : GitLab‑backed custom plugins allow users to implement any monitoring logic.
Agent : Collects metrics from plugins and reports to the Region.
Arbiter subscribes to CMDB updates. When the CMDB changes, Arbiter generates new configurations and pushes them to the Regions; Agents then query the appropriate Publisher based on the updated configuration.
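The configuration flow above can be sketched roughly as follows. This is a minimal illustration, not the Arbiter's actual code: the record fields (hostname, region, service, plugins) and both function names are assumptions made for the example.

```python
from collections import defaultdict


def build_region_configs(cmdb_hosts):
    """Group CMDB host records into one configuration per Region.

    The record layout is hypothetical; the real CMDB schema is not
    described in the article.
    """
    configs = defaultdict(dict)
    for host in cmdb_hosts:
        configs[host["region"]][host["hostname"]] = {
            "service": host["service"],
            "plugins": sorted(host["plugins"]),
        }
    return dict(configs)


def changed_regions(old, new):
    """Return the Regions whose configuration differs and needs a push."""
    return sorted(r for r in old.keys() | new.keys() if old.get(r) != new.get(r))


before = build_region_configs([
    {"hostname": "h1", "region": "eu", "service": "game-a", "plugins": ["cpu"]},
    {"hostname": "h2", "region": "us", "service": "game-b", "plugins": ["cpu"]},
])
after = build_region_configs([
    {"hostname": "h1", "region": "eu", "service": "game-a", "plugins": ["cpu", "net"]},
    {"hostname": "h2", "region": "us", "service": "game-b", "plugins": ["cpu"]},
])
print(changed_regions(before, after))  # ['eu']
```

Comparing the old and new configuration per Region means an unrelated CMDB change does not trigger a push to every Region.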
Core Features
Monitoring Object Design : Entities are abstracted with EntityType and Tag, enabling flexible aggregation and filtering across 500+ entity types and 15 billion time‑series.
Custom Plugin Monitoring (Python) : Users create a plugin via the UI, which auto‑creates a GitLab repository, builds a pip package, and distributes configuration to agents for execution.
Data Bus Flow : Metrics flow into Kafka, undergo cleaning and pre‑aggregation, then are routed to storage, alerting, or other consumers.
Data Storage : Hot storage in Redis (6 h) and cold storage in MongoDB with tiered retention (1 min → 7 days, 5 min → 30 days, etc.). Cold data is periodically archived to HDFS for AI and anomaly‑detection workloads.
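To make the hot‑cold split concrete, the sketch below picks the tier that would serve a data point of a given age. It only encodes the tiers named above (Redis for 6 hours, 1‑minute data for 7 days, 5‑minute data for 30 days); the tiers hidden behind "etc." are deliberately left out, and the tier labels and the function name are assumptions for illustration.

```python
from datetime import timedelta

# (maximum age, tier label), ordered from freshest to oldest.
# Labels are illustrative; tiers beyond 30 days are not listed in the text.
TIERS = [
    (timedelta(hours=6), "redis (hot)"),
    (timedelta(days=7), "mongodb 1min"),
    (timedelta(days=30), "mongodb 5min"),
]


def pick_tier(age):
    """Return the first tier whose retention still covers data of this age."""
    for max_age, label in TIERS:
        if age <= max_age:
            return label
    return None  # older than every tier listed here


print(pick_tier(timedelta(hours=1)))
print(pick_tier(timedelta(days=3)))
print(pick_tier(timedelta(days=20)))
```

Ordering the tiers from freshest to oldest lets a query fall through to the coarsest resolution that still retains the requested time range.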
Application Scenarios
Machine View : Shows CPU, memory, I/O, and supports grouping by service and cluster.
Process View : Hierarchical tag‑based navigation of processes, highlighting top‑CPU and memory usage.
Custom Dashboards : Users configure entity‑type queries with flexible boolean tag expressions (e.g., (a|b)&c) and aggregation intervals.
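A boolean tag expression such as (a|b)&c can be evaluated with a small recursive‑descent parser. The sketch below is one possible way to do it, assuming that & binds tighter than | and that an entity's tags are a plain set of strings; the article does not specify the actual grammar or implementation, and the function name is hypothetical.

```python
import re


def matches(expr, tags):
    """Evaluate a tag expression like "(a|b)&c" against a set of tags."""
    tokens = re.findall(r"[()&|]|[^\s()&|]+", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side to consume tokens
            value = value or rhs
        return value

    def parse_and():
        value = parse_atom()
        while peek() == "&":
            take()
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if tok is None or tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token: {tok!r}")
        return tok in tags

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result


print(matches("(a|b)&c", {"b", "c"}))  # True
print(matches("(a|b)&c", {"a"}))       # False
```

Filtering a dashboard query then reduces to keeping the entities of the chosen entity type whose tag set satisfies the expression.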
Problems & Challenges
Minute‑level granularity cannot capture high‑frequency events such as database operation spikes.
Increasing demand for second‑level granularity in cloud‑native environments.
Python‑based core modules struggle with massive data volumes; partial Go refactoring is ongoing.
2. Achieving Second‑Level Monitoring with Open‑Source Metrics
After evaluating Thanos, native Prometheus, and the limitations of the existing EyesTSDB, the team selected VictoriaMetrics as a cluster‑deployable remote‑write storage that supports second‑level data and multi‑tenant isolation.
Key Characteristics of VictoriaMetrics
Open‑source time‑series database.
High performance and horizontal scalability.
Supports Prometheus remote‑write, InfluxDB line protocol, and multi‑tenant isolation.
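As a rough illustration of the last point, the sketch below builds a per‑tenant ingestion URL and an InfluxDB line protocol record. The URL follows the cluster version's /insert/<accountID>/<suffix> layout; the host name vminsert:8480, the tenant ID 42, and the helper names are placeholders chosen for the example, not values from the article.

```python
def insert_url(vminsert, account_id, suffix="influx/write"):
    """Build a tenant-scoped vminsert URL.

    For Prometheus remote write the suffix would be
    "prometheus/api/v1/write" instead.
    """
    return f"http://{vminsert}/insert/{account_id}/{suffix}"


def _escape(text, chars):
    """Backslash-escape the characters that line protocol treats as separators."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def influx_line(measurement, tags, fields, ts_ns):
    """Format one sample as: measurement,tag=value field=value timestamp."""
    name = _escape(measurement, (",", " "))
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={float(v)}" for k, v in sorted(fields.items())
    )
    return f"{name}{tag_part} {field_part} {ts_ns}"


print(insert_url("vminsert:8480", 42))
print(influx_line("cpu", {"host": "h1"}, {"usage": 0.5}, 1700000000000000000))
```

Because the tenant is part of the URL path, the same payload can be written to different tenants without changing the metric labels.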
Architecture & Data Model
VictoriaMetrics separates data and index directories. The data folder contains big (high‑compression, historical) and small (low‑compression, recent) subfolders. The indexdb stores per‑tenant indices with retention periods ranging from minutes to days.
Each series is identified by AccountID, ProjectID, MetricGroupID, JobID, InstanceID, and MetricID, where MetricID is a monotonically increasing value derived from a timestamp.
Integration with Existing System
The new architecture embeds VictoriaMetrics at the core storage layer. Agents write metrics via Remote Write to a transfer service, which forwards data to Kafka and then to a proxy for preprocessing before it reaches the alerting system and storage.
Cold data is sharded by metric identifiers, aggregated by vmshuffle, and down‑sampled by vmdownsample into 5‑minute, 30‑minute, and daily resolutions, preserving compatibility with the legacy EyesTSDB.
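The down‑sampling step can be pictured as bucketing raw points into fixed windows. The sketch below is not vmdownsample itself: it assumes each window is summarized by its mean, which the article does not confirm, and the function name is hypothetical.

```python
from collections import defaultdict


def downsample(points, step_seconds):
    """Average (timestamp, value) points into fixed windows of step_seconds.

    Each output timestamp is the start of its window. Using the mean is an
    assumption; min, max, or last would follow the same pattern.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % step_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


raw = [(0, 1.0), (60, 3.0), (299, 5.0), (300, 10.0)]
print(downsample(raw, 300))  # [(0, 3.0), (300, 10.0)]
```

Running the same function with a step of 1800 or 86400 seconds would give the 30‑minute and daily resolutions mentioned above.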
CMDB Tag Integration
Tags from the internal CMDB are stored separately but indexed alongside metric labels using a dedicated namespace in VictoriaMetrics, enabling PromQL queries that filter on both standard labels and CMDB‑derived tags.
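A query that mixes ordinary labels with CMDB‑derived tags might be assembled as sketched below. The cmdb_ prefix stands in for the dedicated namespace, which the article does not name, and the metric and label names are invented for the example.

```python
def promql_selector(metric, labels, cmdb_tags, prefix="cmdb_"):
    """Build a PromQL series selector from metric labels and CMDB tags.

    The prefix is a hypothetical stand-in for the CMDB tag namespace.
    """
    merged = dict(labels)
    merged.update({prefix + key: value for key, value in cmdb_tags.items()})

    def quote(value):
        # Escape backslashes and double quotes inside the label value.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    body = ",".join(f"{key}={quote(value)}" for key, value in sorted(merged.items()))
    return f"{metric}{{{body}}}"


print(promql_selector("cpu_usage", {"host": "h1"}, {"service": "game-a"}))
# cpu_usage{cmdb_service="game-a",host="h1"}
```

Keeping CMDB tags under their own prefix avoids collisions with labels that the metrics already carry.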
3. Future Outlook
Data Quality : Reduce noise by focusing on the most valuable 20‑30 % of reported metrics.
Business Diversification : Support varied cluster requirements, including deduplication, CMDB linkage, and SaaS offerings.
Agent & Scheduler Performance : Continue refactoring Python components to Go for better scalability.
Open‑Source Contributions : Release stable extensions (e.g., vmshuffle, vmdownsample) and explore automated cluster provisioning and SaaS‑style management.
Overall, the transition from a monolithic Python‑based TSDB to a cloud‑native, VictoriaMetrics‑powered stack enables second‑level observability, multi‑tenant isolation, and seamless integration with NetEase’s CMDB, positioning the platform for future scalability and open‑source collaboration.
dbaplus Community
