How SHANGFU Transforms Prometheus Management for Scalable Cloud‑Native Monitoring
This article explains Prometheus fundamentals, compares long‑term storage options, details Huolala's challenges with multiple Prometheus clusters, and introduces SHANGFU—a three‑module system that streamlines configuration, collection, and query handling to boost observability, performance, and reliability in cloud‑native environments.
What is Prometheus?
Prometheus is an open-source monitoring system built around a time-series database. It is widely used in cloud-native environments and is supported by the Kubernetes ecosystem, cloud-provider integrations, and numerous metric collection plugins, which has made it a standard solution for application performance monitoring (APM).
Applications expose metrics via the Prometheus client SDK or exporters. Prometheus scrapes these endpoints, stores the data locally, can apply relabel rules to enrich or rewrite labels, and optionally remote-writes samples to external storage. Alerting rules are written in PromQL, and the resulting alerts are handled by Alertmanager.
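To make the scrape model concrete, here is a minimal sketch of the text exposition format that a scraped endpoint returns. It uses only the Python standard library rather than an official client SDK, and the metric name, help text, and labels are made up for illustration.

```python
def escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric, for illustration only.
EXAMPLE = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "POST"), ("status", "500")): 3,
    },
)
```

Each sample line has the shape name{label="value",...} number, which is what Prometheus parses on every scrape. In practice an application would normally rely on a client SDK to produce this output.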
Data Persistence Options
Prometheus was not originally designed for long‑term storage; its single‑instance capacity is limited by hardware. Common long‑term solutions include Cortex, InfluxDB, M3, Thanos, and VictoriaMetrics. VictoriaMetrics is highlighted here.
VictoriaMetrics separates collection, query, storage, and alerting into independent components and remains compatible with Prometheus APIs. In Huolala's assessment it uses less memory, CPU, and disk IOPS than Prometheus, and it needs no sidecar or extra dependencies, which made it their preferred choice for long-term metric storage.
Prometheus Usage at Huolala
Huolala's Choice
Huolala, a logistics unicorn, adopts Prometheus as the foundation for application monitoring and selects VictoriaMetrics for persistent storage. Its logical architecture is shown below.
Multiple Prometheus instances are deployed per language stack, with dedicated clusters for business units such as risk control and security. All instances write to VictoriaMetrics via Remote Write, and queries retrieve data through the vm‑select component.
Challenges
Complex configuration and migration – Dozens of Prometheus clusters require manual configuration checks during upgrades, leading to errors and high operational overhead.
Abnormal queries overload storage – Large‑range queries with many labels cause VictoriaMetrics instability; lack of query throttling exacerbates the issue.
High memory and CPU pressure – Legacy middleware emits redundant or complex metrics, increasing load; SDK version differences create fragmented relabel rules.
Need for a collection proxy – A proxy in the collection path is needed to hide multi-version, multi-language differences from Prometheus and to take over relabel maintenance.
Our Solution: SHANGFU
SHANGFU (Chinese “尚付”) is a system built in-house that provides Prometheus configuration management, collection management, and query enhancement.
“Shangfu” originates from a mythological bird described in the Classic of Mountains and Seas.
The system consists of three modules: a collection proxy, a query proxy, and a Prometheus cluster management module.
Logical Architecture
SHANGFU Server provides the three services; SHANGFU UI offers a visual interface. Both run statelessly in Kubernetes behind a domain name.
Physical Deployment
Servers are stateless; a dual‑write mechanism ensures data collection continues if one server fails. Kubernetes health checks enable rapid recovery and scaling.
High Availability
Domain-level health probing redirects traffic when a node fails. If all instances are down, operators can manually fall back to the original Prometheus/VictoriaMetrics configuration. The impact of each module's failure is summarized as follows:
Query proxy failure – Data queries are blocked; business is affected.
Collection proxy failure – Data is lost; business is affected.
Cluster management failure – Business is not directly affected, but configuration updates must be made manually.
Detailed Module Description
Collection Proxy
The collection proxy uses Prometheus relabel_configs to rewrite scrape targets, parses the incoming metrics, removes obsolete labels, normalizes label values, aggregates values, and forwards both the original and the enhanced metrics to Prometheus.
Because it filters and compacts data before forwarding, it also helps scrape responses stay within Prometheus's body_size_limit.
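The transformation steps above can be sketched roughly as follows. This is not SHANGFU's implementation: the label names, the normalization rule, and the choice to sum collapsed series are assumptions for illustration, and the sketch returns only the enhanced series, whereas the real proxy also forwards the originals.

```python
import re
from collections import defaultdict

# Matches `name{labels} value`. Lines carrying timestamps, and label values
# containing escaped quotes, are not handled by this sketch.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

DROP_LABELS = {"sdk_version"}    # hypothetical obsolete label
NORMALIZE = {"env": str.lower}   # hypothetical normalization rule


def enhance(exposition: str) -> str:
    """Drop obsolete labels, normalize label values, and sum series that collapse."""
    totals = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported line shape, ignored here
        name, raw_labels, value = match.groups()
        labels = {}
        for key, val in LABEL_RE.findall(raw_labels or ""):
            if key in DROP_LABELS:
                continue
            labels[key] = NORMALIZE.get(key, lambda v: v)(val)
        # Series that become identical after rewriting are summed together.
        totals[(name, tuple(sorted(labels.items())))] += float(value)

    out = []
    for (name, labels), total in sorted(totals.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        out.append(f"{name}{{{label_str}}} {total}" if label_str else f"{name} {total}")
    return "\n".join(out) + "\n"
```

Summing is a sensible aggregation for counters; gauges or histograms would need a different rule. Dropping a high-cardinality label before the data reaches Prometheus is also what shrinks the response body.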
Query Proxy
Provides customized query functions for VictoriaMetrics while remaining compatible with native Prometheus queries. Features include:
Rate limiting and blocking based on IP, metric name, or request headers.
A proxy domain for data access that hides the real service addresses.
Query degradation based on time range or metric type.
Metadata pre‑loading for large metrics.
Default limits on label values and series count, applying the MetricsQL topk_max function where needed.
Query templates that decouple front‑end from collection logic.
Query caching and aggregation to accelerate repeated requests.
Automatic metric conversion from collection proxy output.
Health checks and internal management UI.
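As a rough illustration of the first and third features in this list, the sketch below combines a per-client token bucket with a simple degradation rule. The class names, thresholds, and the idea of degrading by widening the query step are assumptions; the article does not describe how SHANGFU implements either feature.

```python
import math


class TokenBucket:
    """Classic token bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class QueryGuard:
    """Hypothetical guard placed in front of a range-query endpoint."""

    def __init__(self, rate=5.0, burst=10.0, max_range_s=7 * 24 * 3600, max_points=1000):
        self.rate = rate
        self.burst = burst
        self.max_range_s = max_range_s
        self.max_points = max_points
        self.buckets = {}

    def check(self, client_key: str, start: float, end: float, step: int, now: float):
        """Return (allowed, effective_step) for one range query."""
        bucket = self.buckets.setdefault(client_key, TokenBucket(self.rate, self.burst, now))
        if not bucket.allow(now):
            return False, step  # rate limited; a proxy would answer HTTP 429
        if end - start > self.max_range_s:
            return False, step  # block overly long time ranges
        # Degrade: widen the step so one series yields at most max_points samples.
        min_step = math.ceil((end - start) / self.max_points)
        return True, max(step, min_step)
```

The client key could be an IP address, a metric name, or a header value, matching the dimensions listed above. The current time is passed in explicitly so that the behavior is deterministic and easy to test.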
In practice, automatic metric conversion reduced scanned data points by 83% and cut query time from 3.8 s to 203 ms, a reduction of roughly 95%.
Prometheus Cluster Management
Offers an SSO‑authenticated UI for viewing cluster overviews and editing configuration items (conf.yaml) such as scrape settings, service discovery, authentication, remote read/write, and more.
Common settings are abstracted into single entries; SHANGFU generates complete configuration files and distributes them to all Prometheus instances, enabling one‑click updates across the fleet.
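One way to picture this generation step is as shared settings merged with per-cluster overrides. The merge strategy, cluster names, and remote-write URL below are hypothetical, and the output is serialized as JSON only to stay within the standard library; a real deployment would presumably emit YAML files.

```python
import copy
import json


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins; nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_configs(common: dict, clusters: dict) -> dict:
    """Build one complete config document per cluster from shared settings."""
    return {
        name: json.dumps(deep_merge(common, override), indent=2, sort_keys=True)
        for name, override in clusters.items()
    }


# Hypothetical shared settings and per-cluster overrides.
COMMON = {
    "global": {"scrape_interval": "15s", "external_labels": {"env": "prod"}},
    "remote_write": [{"url": "http://vminsert.example/api/v1/write"}],
}
CLUSTERS = {
    "java": {"global": {"external_labels": {"cluster": "java"}}},
    "risk-control": {
        "global": {"scrape_interval": "30s", "external_labels": {"cluster": "risk-control"}}
    },
}
```

Calling render_configs(COMMON, CLUSTERS) yields one document per cluster. Changing a value in COMMON and re-rendering updates every cluster at once, which is the effect the one-click update describes.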
Import/export, migration, and rollback operations are performed through the graphical UI, reducing manual file edits and errors. Audit logs are available for tracking changes.
Conclusion
SHANGFU’s three modules—collection proxy, query proxy, and cluster management—address daily monitoring and operational pain points, greatly improving system stability and efficiency.
Future Outlook
SHANGFU has been in production for one year. The team plans deeper CMDB integration, metric tenancy, streaming computation for faster queries and alerts, further metric standardization to simplify PromQL, and an eventual open-source release to benefit the broader community.
