How EyesTSDB Evolved into a Cloud‑Native, Second‑Level Monitoring Platform

This article details the evolution of NetEase's self‑built time‑series database EyesTSDB into a cloud‑native, second‑level monitoring solution, covering its architecture, core features, integration with VictoriaMetrics, custom plugin workflow, CMDB linkage, real‑world use cases, and future challenges.

dbaplus Community
1. Overview of the Self‑Developed Time‑Series Database EyesTSDB

EyesTSDB was built to support global, second‑level monitoring for NetEase's gaming services, covering more than ten countries and twenty regions, and handling both public and private cloud environments. It monitors over 1.5 billion aggregated metrics from 15 million entities, including 150 million physical machines and container workloads.

The system provides multiple monitoring modes—basic, network, process, and custom business metrics—through 1,000+ agent plugins and SDKs, with data ingestion via HTTP or other network protocols.

Architecture Design

Monitoring Center : UI, alarm interface, scheduling, data bus, storage, and plugin repository.

Monitoring Region : Region nodes and monitored machines equipped with agents.

Arbiter : Scheduling core that generates machine configurations from the CMDB and monitors Region health; it is deployed in active‑standby mode.

Data Bus : Collects metrics and forwards them via Kafka.

Data Storage : Implements hot‑cold separation; hot data resides in Redis for six hours, cold data is persisted to MongoDB.

Plugin Repository : GitLab‑backed custom plugins allow users to implement any monitoring logic.

Agent : Collects metrics from plugins and reports to the Region.

Arbiter subscribes to CMDB updates. When the CMDB changes, it generates new configurations and pushes them to the Regions. Agents then query the appropriate Publisher based on the updated configuration.
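The change flow above can be sketched roughly as follows. This is a minimal illustration, not Arbiter's actual code: the `Machine` record, its field names, the configuration layout, and the `push` callback are all hypothetical stand-ins for whatever the CMDB and Regions really exchange.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """One CMDB record; every field name here is illustrative."""
    hostname: str
    region: str
    plugins: tuple


def generate_region_configs(machines):
    """Group CMDB machines by region into per-region agent configurations."""
    configs = defaultdict(dict)
    for machine in machines:
        configs[machine.region][machine.hostname] = {
            "plugins": sorted(machine.plugins)
        }
    return dict(configs)


def on_cmdb_change(machines, push):
    """Regenerate every region's configuration and hand it to `push`."""
    for region, config in sorted(generate_region_configs(machines).items()):
        push(region, config)
```

In a real deployment `push` would deliver the configuration to a Region node; here it is any callable taking `(region, config)`, which keeps the generation step easy to test in isolation.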

Core Features

Monitoring Object Design : Entities are abstracted with EntityType and Tag, enabling flexible aggregation and filtering across 500+ entity types and 15 billion time‑series.

Custom Plugin Monitoring (Python) : Users create a plugin via the UI, which auto‑creates a GitLab repository, builds a pip package, and distributes configuration to agents for execution.

Data Bus Flow : Metrics flow into Kafka, undergo cleaning and pre‑aggregation, then are routed to storage, alerting, or other consumers.

Data Storage : Hot storage in Redis (6 h) and cold storage in MongoDB with tiered retention (1 min → 7 days, 5 min → 30 days, etc.). Cold data is periodically archived to HDFS for AI and anomaly‑detection workloads.
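As a rough illustration of the hot‑cold split, the sketch below picks a store for a data point of a given age. It assumes inclusive boundaries and only encodes the two cold tiers named above; the coarser tiers are not specified in the text, so they are left out rather than guessed. The function and constant names are hypothetical.

```python
from datetime import timedelta

HOT_WINDOW = timedelta(hours=6)
# (resolution, retention) pairs named in the text; coarser tiers are elided
# there ("etc."), so they are deliberately not filled in here.
COLD_TIERS = [
    (timedelta(minutes=1), timedelta(days=7)),
    (timedelta(minutes=5), timedelta(days=30)),
]


def pick_store(age):
    """Return (store, resolution) for a data point of the given age."""
    if age < timedelta(0):
        raise ValueError("age must be non-negative")
    if age <= HOT_WINDOW:
        # Hot tier; its resolution is not stated in the text, hence None.
        return ("redis", None)
    for resolution, retention in COLD_TIERS:
        if age <= retention:
            return ("mongodb", resolution)
    # Older than any tier listed in the text.
    return (None, None)
```

For example, `pick_store(timedelta(days=3))` selects the 1‑minute MongoDB tier, while a 20‑day‑old point falls through to the 5‑minute tier.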

Application Scenarios

Machine View : Shows CPU, memory, I/O, and supports grouping by service and cluster.

Process View : Hierarchical tag‑based navigation of processes, highlighting top‑CPU and memory usage.

Custom Dashboards : Users configure entity‑type queries with flexible boolean tag expressions (e.g., (a|b)&c) and aggregation intervals.
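To make the tag expressions concrete, here is a small sketch of an evaluator for expressions such as (a|b)&c. It is not the platform's parser: it assumes `&` binds tighter than `|`, treats each tag as a plain string, and supports only the operators shown in the example above.

```python
import re

# An operator/parenthesis, or a run of characters that is neither.
_TOKEN = re.compile(r"\s*([()&|]|[^\s()&|]+)")


def tokenize(expression):
    """Split a tag expression into parentheses, operators, and tag names."""
    expression = expression.rstrip()
    pos, tokens = 0, []
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def matches(expression, tags):
    """Evaluate a tag expression such as "(a|b)&c" against a set of tags."""
    tokens = tokenize(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # parse first so short-circuiting skips nothing
            value = value or rhs
        return value

    def parse_and():
        value = parse_atom()
        while peek() == "&":
            take()
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token is None or token in {")", "&", "|"}:
            raise ValueError(f"unexpected token: {token!r}")
        return token in tags

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result
```

With this sketch, an entity tagged `{"b", "c"}` satisfies `(a|b)&c`, while one tagged only `{"a"}` does not.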

Problems & Challenges

Minute‑level granularity cannot capture high‑frequency events such as database operation spikes.

Increasing demand for second‑level granularity in cloud‑native environments.

Python‑based core modules struggle with massive data volumes; partial Go refactoring is ongoing.

2. Achieving Second‑Level Monitoring with Open‑Source Metrics

After evaluating Thanos, native Prometheus, and the limitations of the existing EyesTSDB, the team selected VictoriaMetrics as a cluster‑deployable remote‑write storage that supports second‑level data and multi‑tenant isolation.

Key Characteristics of VictoriaMetrics

Open‑source time‑series database.

High performance and horizontal scalability.

Supports Prometheus remote‑write, InfluxDB line protocol, and multi‑tenant isolation.
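For a feel of what protocol support and tenant isolation look like from a client, the sketch below renders one point in InfluxDB line protocol and builds a per‑tenant write URL in the form the VictoriaMetrics cluster documentation describes (`/insert/<accountID>/influx/write` on vminsert, default port 8480). The host name and helper names are hypothetical, and only finite float fields are handled.

```python
def _escape(value, chars):
    """Backslash-escape every character of `chars` that occurs in `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one point as an InfluxDB line-protocol line with float fields."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key, value in sorted(tags.items()):
        head += "," + _escape(key, ",= ") + "=" + _escape(value, ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + repr(float(value))
        for key, value in sorted(fields.items())
    )
    return f"{head} {body} {timestamp_ns}"


def tenant_write_url(vminsert_host, account_id):
    """Build the per-tenant InfluxDB write URL for a vminsert node."""
    return f"http://{vminsert_host}:8480/insert/{account_id}/influx/write"
```

Because the tenant ID lives in the URL path, the same payload can be written to different tenants without changing the metric labels.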

Architecture & Data Model

VictoriaMetrics separates data and index directories. The data folder contains big (high‑compression, historical) and small (low‑compression, recent) subfolders. The indexdb stores per‑tenant indices with retention periods ranging from minutes to days.

Each time series is identified by a key consisting of AccountID, ProjectID, MetricGroupID, JobID, InstanceID, and a monotonically increasing MetricID.
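One way to picture this key is as an ordered tuple. The sketch below is a Python model, not VictoriaMetrics code (which is written in Go); the class name is hypothetical and the field order simply follows the list above. Ordering by the tenant fields first means that sorting keys keeps each tenant's series adjacent.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class SeriesKey:
    """Illustrative series key; comparison follows field declaration order."""
    account_id: int
    project_id: int
    metric_group_id: int
    job_id: int
    instance_id: int
    metric_id: int


keys = sorted([
    SeriesKey(2, 0, 10, 1, 1, 500),
    SeriesKey(1, 0, 11, 1, 1, 502),
    SeriesKey(1, 0, 10, 1, 1, 501),
])
# After sorting, both tenant-1 keys precede the tenant-2 key.
```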

Integration with Existing System

The new architecture embeds VictoriaMetrics at the core storage layer. Agents send metrics via Remote Write to a transfer service, which forwards them to Kafka. A proxy then preprocesses the data before it reaches the alerting system and storage.

Cold data is sharded by metric identifiers, aggregated by vmshuffle, and down‑sampled by vmdownsample into 5‑minute, 30‑minute, and daily resolutions, preserving compatibility with the legacy EyesTSDB.
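The down‑sampling step can be illustrated with a few lines of Python. This is not vmdownsample itself, whose internals the text does not describe; it assumes mean aggregation over buckets aligned to the step size, and the function name is hypothetical.

```python
from collections import defaultdict


def downsample(points, step_seconds):
    """Average (unix_ts, value) points into aligned buckets of step_seconds."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % step_seconds].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

Calling `downsample(points, 300)` yields 5‑minute resolution; the same function with 1800 or 86400 gives the 30‑minute and daily series. A production job would likely keep min, max, and count alongside the mean so that spikes survive the roll‑up.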

CMDB Tag Integration

Tags from the internal CMDB are stored separately but indexed alongside metric labels using a dedicated namespace in VictoriaMetrics, enabling PromQL queries that filter on both standard labels and CMDB‑derived tags.
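A query that mixes both kinds of filter might be assembled as in the sketch below. The `cmdb_` prefix stands in for the dedicated namespace, whose real name the text does not give; the metric, label, and tag names are likewise made up for illustration.

```python
def _quote(value):
    """Quote a label value for PromQL, escaping backslash, quote, newline."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def build_selector(metric, labels, cmdb_tags, prefix="cmdb_"):
    """Build a PromQL selector from standard labels plus prefixed CMDB tags."""
    matchers = [f"{key}={_quote(value)}" for key, value in sorted(labels.items())]
    matchers += [
        f"{prefix}{key}={_quote(value)}" for key, value in sorted(cmdb_tags.items())
    ]
    return metric + "{" + ",".join(matchers) + "}"


selector = build_selector("cpu_usage", {"job": "node"}, {"service": "game-a"})
# selector == 'cpu_usage{job="node",cmdb_service="game-a"}'
```

Keeping CMDB tags under their own prefix avoids collisions with labels that exporters already emit.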

3. Future Outlook

Data Quality : Reduce noise by focusing on the most valuable 20‑30 % of reported metrics.

Business Diversification : Support varied cluster requirements, including deduplication, CMDB linkage, and SaaS offerings.

Agent & Scheduler Performance : Continue refactoring Python components to Go for better scalability.

Open‑Source Contributions : Release stable extensions (e.g., vmshuffle, vmdownsample) and explore automated cluster provisioning and SaaS‑style management.

Overall, the transition from a monolithic Python‑based TSDB to a cloud‑native, VictoriaMetrics‑powered stack enables second‑level observability, multi‑tenant isolation, and seamless integration with NetEase’s CMDB, positioning the platform for future scalability and open‑source collaboration.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
