
Prometheus-Based Monitoring Solution for the 58 Cloud Search Platform

This article describes the challenges of scaling the 58 Cloud Search service, explains why Prometheus was selected as the monitoring stack, and details the architecture, data collection, storage, alerting, visualization, and future enhancements of the resulting cloud‑native monitoring system.

58 Tech

Cloud Search is a search‑service platform launched by the 58 Group's search technology department to provide mature search capabilities for internal vertical businesses. It has run stably for two years and now hosts hundreds of search instances. As business load and cluster size grew, monitoring containers and Kubernetes resources became increasingly complex, which prompted the design of a Prometheus‑based monitoring solution.

The monitoring challenges include dynamic and unpredictable resource objects, a wide and heterogeneous monitoring scope, complex inter‑instance call relationships, the need for high reliability with backup mechanisms, and the requirement for rapid container deployment and horizontal scaling in a cloud‑native environment.

After evaluating options, the open‑source Prometheus stack was chosen because of its flexible data model with label‑based metrics, architecture that fits Kubernetes (e.g., Node‑exporter as a DaemonSet), powerful PromQL query language, extensive component ecosystem (Alertmanager, Node‑exporter, etc.), and a mature community.

The overall physical architecture consists of five modules: data collection, data storage, alert distribution, alert reporting, and data visualization (see Figure 1).

Figure 1. Cloud Search monitoring system physical architecture

Data collection module: uses three methods – (1) Prometheus Node‑exporter to gather server metrics (CPU, disk, memory, network, etc.), (2) Heapster with cAdvisor to collect Kubernetes resource metrics, and (3) a Crontab‑based approach to monitor binary‑deployed components and enable remote recovery.
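The third method, a scheduled check with remote recovery, can be sketched as follows. This is a minimal illustration and not the article's actual script: the function names, the retry count, and the use of `pgrep` to detect the process are all assumptions.

```python
import subprocess
from typing import Callable


def process_running(name: str) -> bool:
    """Return True if a process with exactly this name exists (relies on pgrep)."""
    result = subprocess.run(["pgrep", "-x", name], capture_output=True)
    return result.returncode == 0


def check_and_recover(is_alive: Callable[[], bool],
                      restart: Callable[[], None],
                      max_attempts: int = 3) -> str:
    """Check a component once and try to restart it if it is down.

    Returns "ok", "recovered", or "failed" so the caller can decide
    whether an alert should be raised. All names here are hypothetical.
    """
    if is_alive():
        return "ok"
    for _ in range(max_attempts):
        restart()
        if is_alive():
            return "recovered"
    return "failed"
```

A cron entry would run a small script that calls `check_and_recover` with `process_running` bound to the component's process name and a restart callable suited to how that component is launched.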

Data storage module: stores collected metrics in InfluxDB, while remaining compatible with the existing MySQL database.
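The article does not say how samples reach InfluxDB. As one possible illustration, the sketch below formats a sample in InfluxDB line protocol (`measurement,tags fields timestamp`), which is what an InfluxDB write accepts. The function name and the example measurement, tags, and fields are assumptions.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character in `chars` wherever it occurs in `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Render one sample as a single InfluxDB line protocol line."""
    # Measurement names escape commas and spaces; tag and field keys also escape "=".
    head = _escape(measurement, ", ")
    tag_part = "".join(
        "," + _escape(str(k), ",= ") + "=" + _escape(str(v), ",= ")
        for k, v in sorted(tags.items())
    )
    rendered = []
    for key, value in fields.items():
        if isinstance(value, bool):          # check bool first: bool is a subclass of int
            text = "true" if value else "false"
        elif isinstance(value, int):
            text = f"{value}i"               # integers carry an "i" suffix
        elif isinstance(value, float):
            text = repr(value)
        else:                                # strings are double-quoted
            text = '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'
        rendered.append(_escape(str(key), ",= ") + "=" + text)
    return f"{head}{tag_part} {','.join(rendered)} {timestamp_ns}"
```

Sorting the tags by key keeps the output stable, which makes the lines easier to compare and test.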

Alert reporting module: replaces low‑visibility email alerts with a Webhook that pushes notifications to a WeChat alert platform for immediate response.
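A webhook receiver of this kind has to translate the JSON that Alertmanager posts into whatever the alert platform expects. The sketch below reads the `status`, `alerts`, `labels`, and `annotations` fields of an Alertmanager webhook payload. The output shape (`title` and `content`) is purely hypothetical, since the WeChat alert platform's API is not described in the article.

```python
def build_notification(payload: dict) -> dict:
    """Turn an Alertmanager webhook payload into a short text notification.

    The returned dict layout is an assumption made for illustration only.
    """
    status = str(payload.get("status", "unknown")).upper()
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        line = f"{labels.get('alertname', 'unnamed')} on {labels.get('instance', 'unknown')}"
        summary = annotations.get("summary")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return {"title": f"{status}: {len(lines)} alert(s)", "content": "\n".join(lines)}
```

Keeping the translation in a pure function like this separates it from the HTTP handling, so the message format can be tested without a running server.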

Data visualization module: employs Grafana to present cluster‑level dashboards, supporting customizable charts and multiple data sources.

Implementation details include namespace isolation between monitoring and search instances, automatic service discovery for new nodes or objects, multi‑instance deployment for high availability, full containerization of all components, and one‑click deployment via Kubernetes manifests.

Multi‑dimensional monitoring: monitors cluster base components (node failures, network issues), resource objects (pods, containers), and service availability using liveness/readiness probes.

Cluster resource monitoring combines Prometheus and Heapster (see Figure 2) to gather both hardware and container metrics, while remote component monitoring adds pull‑based data collection (Figures 3‑4).
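For pull-based collection, a remote component needs to expose its state in a form Prometheus can scrape. The following sketch renders a gauge in the Prometheus text exposition format using only the standard library. The metric name and labels in the usage example are hypothetical, and a real deployment would more likely use an official client library.

```python
def _escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: list) -> str:
    """Render a gauge as Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{_escape_label(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

For example, `render_gauge("component_up", "Whether the component is running.", [({"component": "merger", "host": "node1"}, 1)])` produces a three-line document that a scrape of the component's metrics endpoint could return.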

Alert merging is achieved by defining custom Prometheus metrics (Counter, Gauge, Histogram, Summary) and filtering rules (see Figure 5) to combine multiple similar alerts into a single JSON payload, reducing alert storms (Figures 6‑7).
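The merging idea can be illustrated with a simple grouping step. Note that this sketch groups alerts by shared label values, which is a simplification chosen for illustration and not the article's metric-and-filter-rule mechanism. The grouping keys and the output fields are assumptions.

```python
import json
from collections import defaultdict


def merge_alerts(alerts: list, group_by: tuple = ("alertname", "namespace")) -> str:
    """Combine alerts that share the `group_by` label values into one JSON payload."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)

    merged = []
    for key, members in groups.items():
        merged.append({
            "group": dict(zip(group_by, key)),
            "count": len(members),
            # De-duplicate and sort instances so the payload is stable.
            "instances": sorted({a["labels"].get("instance", "") for a in members}),
        })
    return json.dumps(merged, sort_keys=True)
```

With this approach, a burst of identical pod alerts in one namespace becomes a single entry carrying a count and the list of affected instances, which is the effect that reduces alert storms.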

Service availability monitoring relies on Kubernetes probes to detect pod and container health, triggering immediate alerts on failures.
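On the service side, an HTTP liveness or readiness probe needs an endpoint to call. Kubernetes treats any HTTP status from 200 up to, but not including, 400 as success. The sketch below shows one way such an endpoint might look. The `/healthz` path and the check names are assumptions, and the article does not describe how the search instances implement their probes.

```python
from http.server import BaseHTTPRequestHandler


def evaluate_health(checks: dict) -> tuple:
    """Run named check callables; return (HTTP status code, body text)."""
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        return 503, "unhealthy: " + ", ".join(sorted(failed))
    return 200, "ok"


def make_handler(checks: dict):
    """Build a request handler class that serves the hypothetical /healthz path."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            code, body = evaluate_health(checks)
            data = body.encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return HealthHandler
```

The handler can be served with `http.server.HTTPServer(("", 8080), make_handler(checks)).serve_forever()`, where the port is arbitrary and must match the one configured in the pod's probe.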

Custom monitoring presentation uses Grafana with PromDash for rapid configuration of monitoring items via ConfigMap hot‑loading (see Figure 8).

Summary and outlook: The monitoring system now offers high availability, monitoring across three dimensions (cluster base components, resource objects, and service availability), and dynamic scaling. Future work includes adding log monitoring and distributed tracing with ELK, accelerating alert response with self‑healing mechanisms, and adopting Helm for simplified deployment.
