
How SHANGFU Transforms Prometheus Management for Scalable Cloud‑Native Monitoring

This article explains Prometheus fundamentals, compares long‑term storage options, details Huolala's challenges with multiple Prometheus clusters, and introduces SHANGFU—a three‑module system that streamlines configuration, collection, and query handling to boost observability, performance, and reliability in cloud‑native environments.

Huolala Tech

Mar 9, 2023

What is Prometheus?

Prometheus is an open-source monitoring system built around a time-series database. It is widely used in cloud-native environments and is supported by the Kubernetes ecosystem, cloud-provider integrations, and numerous metric collection plugins, which has made it a standard solution for application performance monitoring (APM).

Applications expose metrics via the Prometheus client SDK or exporters. Prometheus scrapes these endpoints, can apply relabel rules to enrich or filter the samples, stores the data locally, and optionally remote-writes it to external storage. Alerting rules are written in PromQL and evaluated by Prometheus, and the resulting alerts are handled by Alertmanager.
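To make the scrape model concrete, here is a minimal sketch of what an application's metrics endpoint returns: plain text in the Prometheus exposition format. The metric name, labels, and values are hypothetical, and a real service would normally use an official client SDK rather than hand-rolled rendering.

```python
def escape_label_value(value):
    # The text format requires escaping backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and label values, for illustration only.
body = render_counter(
    "orders_created_total",
    "Orders created since process start.",
    [
        ({"city": "shenzhen", "env": "prod"}, 1027),
        ({"city": "hongkong", "env": "prod"}, 311),
    ],
)
print(body)
```

Prometheus would fetch this body over HTTP on every scrape interval and attach target labels such as job and instance before storing the samples.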

Data Persistence Options

Prometheus was not originally designed for long-term storage, and the capacity of a single instance is limited by its hardware. Common long-term solutions include Cortex, InfluxDB, M3, Thanos, and VictoriaMetrics. This article focuses on VictoriaMetrics.

VictoriaMetrics separates collection, query, storage, and alerting into independent components and remains compatible with the Prometheus APIs. It is reported to use less memory, CPU, and disk IOPS than Prometheus for the same workload, and it requires no sidecar or extra dependencies, which makes it a strong choice for long-term metric storage.

Prometheus Usage at Huolala

Huolala's Choice

Huolala, a logistics unicorn, adopts Prometheus as the foundation for application monitoring and selects VictoriaMetrics for persistent storage. The logical architecture is as follows.

Multiple Prometheus instances are deployed per language stack, with dedicated clusters for business units such as risk control and security. All instances write to VictoriaMetrics via Remote Write, and queries retrieve data through the vm‑select component.

Challenges

Complex configuration and migration – Dozens of Prometheus clusters require manual configuration checks during upgrades, leading to errors and high operational overhead.

Abnormal queries overload storage – Large‑range queries with many labels cause VictoriaMetrics instability; lack of query throttling exacerbates the issue.

High memory and CPU pressure – Legacy middleware emits redundant or complex metrics, increasing load; SDK version differences create fragmented relabel rules.

Need for a collection proxy – A proxy layer is needed to hide multi-version, multi-language differences and to take over relabel maintenance.

Our Solution: SHANGFU

SHANGFU (Chinese “尚付”) is an in-house system that provides Prometheus configuration management, collection management, and query enhancement.

“Shangfu” originates from a mythological bird described in the Classic of Mountains and Seas.

The system consists of three modules: a collection proxy, a query proxy, and a Prometheus cluster management module.

Logical Architecture

SHANGFU Server provides the three services; SHANGFU UI offers a visual interface. Both run statelessly in Kubernetes behind a domain name.

Physical Deployment

Servers are stateless; a dual‑write mechanism ensures data collection continues if one server fails. Kubernetes health checks enable rapid recovery and scaling.

High Availability

Domain-level health probing redirects traffic when a node fails; if all instances are down, operators can manually fall back to the original Prometheus/VictoriaMetrics configuration. The impact of each module's failure is as follows:

Query proxy failure: data queries are blocked (affects business).

Collection proxy failure: data is lost (affects business).

Cluster management failure: no direct business impact, but configuration updates must be made manually.

Detailed Module Description

Collection Proxy

The collection proxy uses Prometheus relabel_configs to rewrite scrape targets, then parses the incoming metrics, removes obsolete labels, normalizes label values, aggregates values, and forwards both the original and the enhanced metrics to Prometheus.

It also works around Prometheus's body_size_limit by filtering and compacting data before forwarding, so that large scrape responses stay under the limit.
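The enhancement step can be sketched roughly as below. This is an illustration under stated assumptions, not SHANGFU's implementation: the label names in OBSOLETE_LABELS, the alias table, and the sample data are hypothetical, and summing is only a valid aggregation for additive metrics such as counters.

```python
from collections import defaultdict

# Hypothetical rules; a real deployment would load these from managed config.
OBSOLETE_LABELS = {"sdk_version", "pod_ip"}
VALUE_ALIASES = {"env": {"production": "prod", "prd": "prod"}}


def normalize(labels):
    """Drop obsolete labels and map known value variants to one canonical form."""
    cleaned = {}
    for key, value in labels.items():
        if key in OBSOLETE_LABELS:
            continue
        value = value.strip()
        if key in VALUE_ALIASES:
            lowered = value.lower()
            value = VALUE_ALIASES[key].get(lowered, lowered)
        cleaned[key] = value
    return cleaned


def enhance(samples):
    """Aggregate (name, labels, value) samples that collapse to the same series."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        series_key = (name, tuple(sorted(normalize(labels).items())))
        totals[series_key] += value
    return [(name, dict(labels), value) for (name, labels), value in sorted(totals.items())]


raw = [
    ("http_requests_total", {"env": "Production", "path": "/order", "pod_ip": "10.0.0.1"}, 5),
    ("http_requests_total", {"env": "prd", "path": "/order", "pod_ip": "10.0.0.2"}, 7),
]
enhanced = enhance(raw)
```

Because per-pod labels are dropped before aggregation, the two raw series collapse into one, which is also how such a proxy can shrink a scrape body. The sketch returns only the enhanced series; forwarding the originals alongside them is left out.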

Query Proxy

Provides customized query functions for VictoriaMetrics while remaining compatible with native Prometheus queries. Features include:

Rate limiting and blocking based on IP, metric name, or request headers.

A proxy domain for data queries that hides the real service addresses.

Query degradation based on time range or metric type.

Metadata pre‑loading for large metrics.

Default limits on label values and series count, applied by adding MetricsQL's topk_max where needed.

Query templates that decouple front‑end from collection logic.

Query caching and aggregation to accelerate repeated requests.

Automatic metric conversion from collection proxy output.

Health checks and internal management UI.
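Two of the features above, per-client rate limiting and degradation by time range, can be sketched as follows. The article does not describe the actual algorithms, so this assumes a token bucket keyed by client IP and assumes that degradation means widening the query step for long ranges; the class names and default limits are hypothetical.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class QueryGuard:
    """Hypothetical front door for range queries."""

    def __init__(self, rate=5, burst=10, max_points=1000, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.max_points = max_points
        self.clock = clock
        self.buckets = {}  # one token bucket per client IP

    def admit(self, client_ip):
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = self.buckets[client_ip] = TokenBucket(self.rate, self.burst, self.clock)
        return bucket.allow()

    def degrade_step(self, start, end, step):
        """Widen `step` (seconds) so the range spans at most `max_points` steps."""
        span = max(0, end - start)
        needed = -(-span // self.max_points)  # ceiling division
        return max(step, needed)
```

A proxy built this way would reject or queue a request when admit returns False, and otherwise forward the query with the step returned by degrade_step. The same bucket structure could be keyed by metric name or a request header instead of IP.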

In practice, automatic metric conversion reduced the number of scanned data points by 83% and cut query time from 3.8 s to 203 ms, a reduction of about 95%.
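One plausible reading of automatic metric conversion is that the query proxy rewrites a query on a raw metric to target the pre-aggregated series emitted by the collection proxy, so far fewer points are scanned. The sketch below assumes exactly that; the conversion table and the aggregated metric name are hypothetical, and the rewrite works on tokens rather than a full PromQL parse.

```python
import re

# Hypothetical mapping from a raw metric to its pre-aggregated counterpart.
CONVERSIONS = {"http_requests_total": "http_requests_total:agg"}

# Match string literals first so metric-like text inside them is left alone.
TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'          # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"         # single-quoted string literal
    r"|[a-zA-Z_:][a-zA-Z0-9_:]*"  # identifier: metric, function, or label name
)


def convert_query(query):
    """Replace known raw metric names with their aggregated equivalents."""

    def swap(match):
        token = match.group(0)
        if token[0] in "\"'":
            return token  # string literal: keep unchanged
        return CONVERSIONS.get(token, token)

    return TOKEN.sub(swap, query)


rewritten = convert_query('sum(rate(http_requests_total{path="/order"}[5m])) by (env)')
```

A token-level rewrite like this cannot tell a metric name from a label name with the same spelling, so a production version would more likely operate on a parsed query.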

Prometheus Cluster Management

Offers an SSO‑authenticated UI for viewing cluster overviews and editing configuration items (conf.yaml) such as scrape settings, service discovery, authentication, remote read/write, and more.

Common settings are abstracted into single entries; SHANGFU generates complete configuration files and distributes them to all Prometheus instances, enabling one‑click updates across the fleet.
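The idea of abstracting common settings and generating a complete file per instance can be sketched as a deep merge of shared defaults with per-cluster overrides. The cluster names, the remote-write URL, and the override values are hypothetical; the keys (global, scrape_interval, external_labels, remote_write, scrape_configs) are standard Prometheus configuration fields. Serializing to YAML and distributing the files are left out.

```python
import copy

# Shared settings maintained once; the URL is a hypothetical placeholder.
COMMON = {
    "global": {"scrape_interval": "15s", "external_labels": {}},
    "remote_write": [{"url": "http://metrics-store.example.internal/api/v1/write"}],
}

# Per-cluster overrides (hypothetical clusters).
CLUSTERS = {
    "java": {
        "global": {"external_labels": {"cluster": "java"}},
        "scrape_configs": [{"job_name": "java-apps"}],
    },
    "risk-control": {
        "global": {"scrape_interval": "30s", "external_labels": {"cluster": "risk-control"}},
    },
}


def deep_merge(base, override):
    """Return a new dict where override wins; nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def build_configs(common, clusters):
    """Generate one complete configuration per Prometheus cluster."""
    return {name: deep_merge(common, overrides) for name, overrides in clusters.items()}


configs = build_configs(COMMON, CLUSTERS)
```

Changing a value in COMMON and regenerating would update every cluster's configuration at once, which is the effect of a one-click fleet-wide update.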

Import/export, migration, and rollback operations are performed through the graphical UI, reducing manual file edits and errors. Audit logs are available for tracking changes.

Conclusion

SHANGFU's three modules (collection proxy, query proxy, and cluster management) address day-to-day monitoring and operational pain points, improving system stability and operational efficiency.

Future Outlook

SHANGFU has been in production for one year. Next, the team plans deeper CMDB integration, metric tenancy, streaming compute for faster queries and alerts, further metric standardization to simplify PromQL, and an eventual open-source release to benefit the broader community.
