
Scaling Ele.me’s Monitoring: From StatsD to a Unified LinDB‑Powered Platform

This article recounts Huang Jie’s presentation on the evolution of Ele.me’s monitoring system, detailing its three development stages, the challenges faced, the layered monitoring architecture, the design of a unified platform supporting both PC and mobile, and the underlying LinDB time‑series database.

dbaplus Community

Background

Ele.me’s monitoring platform has gone through three major generations:

Stage 1 (pre‑2015): Business metrics flowed through StatsD → Graphite → Grafana, full‑link tracing was handled by ETrace, server health by Zabbix, and log search by ELog.

Stage 2 (multi‑IDC active‑active): To support a geographically distributed architecture, the team built a custom time‑series store, LinDB. Most other components were replaced by ESM, InfluxDB, Grafana, and the ELK stack for logs.

Stage 3 (consolidation): All previous stacks were unified into a single platform, EMonitor + LinDB. Log storage migrated to Alibaba Cloud SLS.

Problems Encountered

Running several independent monitoring stacks required engineers to switch contexts between tools, which slowed incident diagnosis and increased mean‑time‑to‑resolution. The team needed a single system that could be adopted quickly by any engineer and that emphasized fast problem discovery and pinpointing.

Scenario‑Based Architecture

The monitoring solution is organized into four logical layers:

Business layer: metrics directly related to product features.

Application layer: service‑level metrics and full‑link tracing.

PaaS layer: platform‑as‑a‑service components (e.g., middleware, container orchestration).

IaaS layer: infrastructure metrics (CPU, memory, network, storage).

Each layer provides a distinct perspective, and tracing data links the layers together, enabling a one‑stop view for both PC and mobile dashboards. Alerts and deployment events are automatically overlaid on the charts, allowing immediate rollback when a change causes an anomaly.
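The value of overlaying deployment events on charts can be illustrated with a small sketch: given an anomaly timestamp, list the deployments that happened shortly before it as rollback candidates. This is only an illustration, not EMonitor's implementation; the `DeployEvent` type, the `suspects` function, the 15‑minute lookback, and the sample service names are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DeployEvent:
    # Hypothetical event shape; the article does not describe the real schema.
    app: str
    version: str
    at: datetime


def suspects(events, anomaly_at, lookback=timedelta(minutes=15)):
    """Return deployments within `lookback` before the anomaly, newest first."""
    window_start = anomaly_at - lookback
    hits = [e for e in events if window_start <= e.at <= anomaly_at]
    return sorted(hits, key=lambda e: e.at, reverse=True)


events = [
    DeployEvent("order-service", "v41", datetime(2020, 1, 1, 9, 0)),
    DeployEvent("order-service", "v42", datetime(2020, 1, 1, 11, 55)),
    DeployEvent("pay-service", "v7", datetime(2020, 1, 1, 11, 58)),
]
anomaly_at = datetime(2020, 1, 1, 12, 0)

for e in suspects(events, anomaly_at):
    print(e.app, e.version, e.at.isoformat())
```

The most recent deployment is listed first because it is usually the first rollback candidate to check.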

System Design

The architecture evolved from a classic pipeline model to a Lambda‑style design that can be deployed in multiple IDC regions. Key design points include:

Full‑volume log ingestion with a metric‑plus‑sampling strategy to keep storage costs manageable while preserving detail for anomaly detection.

Support for Java, Go, Python, PHP, C++, and Node.js agents.

All metric calculations are performed in 10‑second windows.

Combination of self‑developed and open‑source components for scalability and reliability.
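A rough sketch of how a metric‑plus‑sampling strategy and 10‑second windows could fit together: every event contributes to an aggregated metric, while only a subset of raw events is kept for drill‑down. The article does not say how samples are chosen, so the rule used here (keep every failed event plus one in every `sample_every` events) is an assumption, as are the function names and the event tuple layout.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # the article states metrics are computed in 10-second windows


def window_start(ts: float) -> int:
    """Align a timestamp (in seconds) to the start of its 10-second window."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS


def aggregate(events, sample_every=100):
    """Aggregate events into per-window metrics and keep a sample of raw events.

    `events` is an iterable of (timestamp_seconds, name, duration_ms, ok).
    The sampling rule is an assumption made for illustration.
    """
    metrics = defaultdict(lambda: {"count": 0, "errors": 0, "sum_ms": 0.0})
    samples = []
    for i, (ts, name, duration_ms, ok) in enumerate(events):
        m = metrics[(window_start(ts), name)]
        m["count"] += 1
        m["sum_ms"] += duration_ms
        if not ok:
            m["errors"] += 1
        # Keep all failures, plus every `sample_every`-th event overall.
        if not ok or i % sample_every == 0:
            samples.append((ts, name, duration_ms, ok))
    return dict(metrics), samples


events = [
    (100.2, "checkout", 12.0, True),
    (103.9, "checkout", 30.0, False),
    (109.99, "checkout", 18.0, True),
    (110.0, "checkout", 9.0, True),
]
metrics, samples = aggregate(events)
```

With this split, storage for the metrics grows with the number of windows and series rather than with raw event volume, while the retained samples still include every failure.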

Shaka – Real‑time Calculation Platform

Shaka is a custom stream‑processing engine that performs data cleaning, enrichment, and aggregation before persisting results to LinDB. Its main features are:

Complex Event Processing (CEP) based on Esper to provide SQL‑like queries over streaming data.

Conversion of unstructured log lines into structured records using user‑defined functions (UDFs).

UDFs for anomaly analysis, sampling, and custom metric derivation.
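To make the log‑structuring step concrete, here is a minimal Python sketch of the kind of work such a UDF might do. Shaka itself is built on Esper and its UDF interface is not shown in the article, so this is not Shaka code: the log format, the `parse_line` and `is_slow_or_failed` functions, and the 500 ms threshold are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical log format, invented for this example:
# <timestamp> <LEVEL> [<app>] <METHOD> <path> status=<code> cost=<n>ms
LINE = re.compile(
    r"(?P<ts>\S+) (?P<level>[A-Z]+) \[(?P<app>[^\]]+)\] "
    r"(?P<method>[A-Z]+) (?P<path>\S+) "
    r"status=(?P<status>\d{3}) cost=(?P<cost_ms>\d+)ms"
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one unstructured log line into a structured record, or None."""
    m = LINE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["cost_ms"] = int(rec["cost_ms"])
    return rec


def is_slow_or_failed(rec: dict, slow_ms: int = 500) -> bool:
    """Toy anomaly predicate: server errors or requests slower than `slow_ms`."""
    return rec["status"] >= 500 or rec["cost_ms"] >= slow_ms


lines = [
    "2020-01-01T12:00:00Z INFO [order-service] GET /orders/1 status=200 cost=35ms",
    "2020-01-01T12:00:01Z ERROR [order-service] POST /orders status=503 cost=1200ms",
    "not a structured line",
]
records = [r for r in map(parse_line, lines) if r is not None]
flagged = [r for r in records if is_slow_or_failed(r)]
```

Lines that do not match the pattern are dropped here for brevity; a real pipeline would more likely count them or route them elsewhere.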

LinDB – Distributed Time‑Series Database

LinDB stores monitoring data using a Metric + Tags + Fields model and provides the following capabilities:

Series sharding for horizontal scaling across dozens of nodes.

Automatic roll‑up from seconds → minutes → hours → days.

Multi‑replica high availability with cross‑IDC replication.

Self‑monitoring and data‑governance modules that track storage health and schema evolution.

Columnar LSM‑based storage optimized for time‑series workloads.

Inverted indexes for fast tag‑based queries.
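The following toy sketch shows how a Metric + Tags model, series sharding, and an inverted index relate to one another. It does not reflect LinDB's actual data structures or APIs: the CRC32 hash, the in‑memory sets used as posting lists, and all names (`series_key`, `shard_for`, `InvertedIndex`) are assumptions chosen for brevity.

```python
import zlib
from collections import defaultdict


def series_key(metric: str, tags: dict) -> str:
    """A series is identified by its metric name plus its sorted tag pairs."""
    return metric + "," + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))


def shard_for(key: str, num_shards: int) -> int:
    """Map a series to a shard with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class InvertedIndex:
    """Maps each (tag key, tag value) pair to the set of series IDs carrying it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._ids = {}

    def add(self, key: str, tags: dict) -> int:
        # Assign the next sequential ID the first time a series key is seen.
        sid = self._ids.setdefault(key, len(self._ids))
        for kv in tags.items():
            self._postings[kv].add(sid)
        return sid

    def find(self, **conditions) -> set:
        """Return IDs of series matching all tag conditions (logical AND)."""
        sets = [self._postings.get(kv, set()) for kv in conditions.items()]
        return set.intersection(*sets) if sets else set(self._ids.values())


index = InvertedIndex()
for tags in (
    {"host": "h1", "idc": "sh"},
    {"host": "h2", "idc": "sh"},
    {"host": "h3", "idc": "bj"},
):
    key = series_key("http_requests", tags)
    sid = index.add(key, tags)
    print(sid, key, "-> shard", shard_for(key, 4))

print(index.find(idc="sh"))
print(index.find(idc="sh", host="h2"))
```

Because a tag query is answered by intersecting posting sets, its cost depends on the number of matching series rather than on the total volume of stored data points.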

Operational Metrics (as of the latest release)

36 servers distributed across several clusters.

~140 TB of daily write volume.

Peak ingestion rate of roughly 7.5 million data points per second (DPS).

10‑second windows retained for 30 days; raw data retained for over two years.

Compressed storage usage ≈50 TB (≈60× compression).

Query P99 latency between 500 ms and 1 s.

Open‑Source Release

The monitoring platform and LinDB are released under an open‑source license. The LinDB source repository and project site are available at:

https://github.com/lindb/lindb

https://lindb.io/
