Understanding Prometheus Agent Mode and Remote Write
This article explains the design, benefits, and practical usage of Prometheus' new Agent mode and remote‑write capabilities, covering its pull‑model origins, global‑view challenges, federation alternatives, and how the lightweight Agent improves efficiency and scalability for cloud‑native monitoring.
Bartek Plotka, a principal software engineer at Red Hat, Prometheus maintainer, co-author of CNCF Thanos, and author of "Efficient Go", shares insights on the evolution of Prometheus.
Prometheus is praised as a pragmatic, reliable, and cost-effective monitoring system. Its stable API, powerful query language, and protocols such as remote write and OpenMetrics have fostered a thriving cloud-native observability ecosystem.
The community provides a wide range of exporters, covering everything from containers and eBPF to Minecraft and even horticulture. Each one exposes metrics through a simple HTTP endpoint such as /metrics, an approach that was originally internal to Google.
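As a rough illustration of how little an exporter needs, the sketch below renders samples in the Prometheus text exposition format and answers requests on /metrics using only the Python standard library. The names render_metrics, MetricsHandler, and SAMPLES are made up for this example; a real exporter would normally rely on an official client library instead.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical sample data: (metric name, labels, value).
SAMPLES = [
    ("http_requests_total", {"method": "get", "code": "200"}, 1027),
    ("up", {}, 1),
]


def render_metrics(samples):
    """Render (name, labels, value) tuples as text exposition lines."""
    lines = []
    for name, labels, value in samples:
        if labels:
            parts = []
            for key, val in sorted(labels.items()):
                # Label values escape backslash, double quote, and newline.
                escaped = (
                    val.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                )
                parts.append(key + '="' + escaped + '"')
            lines.append(name + "{" + ",".join(parts) + "} " + str(value))
        else:
            lines.append(name + " " + str(value))
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered samples on GET /metrics and 404 elsewhere."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Passing MetricsHandler to http.server.HTTPServer would make the endpoint reachable for a scraper; the server is left out here so the sketch stays side-effect free.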
This shift toward metric‑driven observability has changed how SREs and developers improve system resilience, troubleshooting, and data‑driven decision making.
Today, it is rare to find a Kubernetes cluster without Prometheus running, and the ecosystem has expanded with projects like Thanos and with managed offerings from vendors such as Amazon, Google, and Grafana Labs (Grafana Cloud).
This article introduces the new "Agent" mode, which disables some server capabilities (querying, alerting, and local storage) to optimise for remote write, enabling a new usage pattern.
Historically, Prometheus follows a pull model inspired by Google's Borgmon: each application exposes its metrics on an HTTP endpoint, and a Prometheus server periodically scrapes that endpoint, so applications do not need to implement a push mechanism of their own.
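The scraping side of the pull model can be sketched in a few lines of Python. This is a simplified illustration, not how Prometheus itself parses scrapes: the function names parse_scrape and scrape are hypothetical, and the parser assumes plain samples without timestamps or escaped characters in label values.

```python
import re
import urllib.request

# Matches `name value` or `name{labels} value` on a single line.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# Matches one `key="value"` pair inside the braces.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_scrape(body):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in body.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # Unsupported syntax is ignored in this sketch.
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url, timeout=5.0):
    """Pull one target: fetch its metrics endpoint and parse the body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_scrape(response.read().decode("utf-8"))
```

Because the collector initiates every request, a failed scrape is itself a signal that the target is unhealthy, which is one reason the pull model is valued.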
As cloud‑native environments evolve—e.g., managed Kubernetes, edge clusters, and resource‑constrained nodes—there is a need for global‑view aggregation, which can be achieved via federation, remote read, or remote write, each with trade‑offs.
Remote write allows users to forward selected or all metrics to external storage APIs; the protocol is being standardised and already supported by Cortex, Thanos, OpenTelemetry, and major cloud vendors.
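Forwarding "selected or all" metrics amounts to filtering samples before they are sent. The sketch below mimics a keep rule on the metric name, in the spirit of Prometheus' write relabelling, where the regular expression must match the whole name. The function select_for_remote_write is a hypothetical helper for illustration only; actual remote write additionally encodes batches as snappy-compressed protobuf over HTTP, which is omitted here.

```python
import re


def select_for_remote_write(samples, keep_regex=".*"):
    """Return only samples whose metric name fully matches keep_regex.

    `samples` is a list of (name, labels, value) tuples. The default
    pattern keeps everything, i.e. forwards all metrics.
    """
    pattern = re.compile(keep_regex)
    return [sample for sample in samples if pattern.fullmatch(sample[0])]


scraped = [
    ("up", {"job": "api"}, 1.0),
    ("http_requests_total", {"job": "api"}, 1027.0),
    ("go_goroutines", {"job": "api"}, 12.0),
]

# Forward only the health signal and HTTP metrics; drop runtime internals.
selected = select_for_remote_write(scraped, r"up|http_.*")
```

Filtering at the sender keeps bandwidth and remote storage costs proportional to the metrics that are actually needed centrally.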
Prometheus provides compliance tests for remote‑write implementations, helping developers verify correct protocol handling.
While remote write offers a powerful way to centralise data, pushing directly from applications can be risky due to loss of visibility and authentication challenges.
Agent mode, available from Prometheus v2.32.0, is enabled with the flag --enable-feature=agent. It retains the scraping, service discovery, and remote-write logic but removes querying, alerting, and local storage. In place of the full TSDB it uses a customised TSDB write-ahead log (WAL) that deletes data once it has been forwarded successfully.
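The "keep until forwarded, then delete" behaviour can be illustrated with a small in-memory buffer. This is only a conceptual sketch under that assumption: the class ForwardingBuffer and its methods are invented for this example, and the real agent uses an on-disk WAL with its own segment and truncation logic.

```python
class ForwardingBuffer:
    """Holds samples until a sender confirms they were delivered."""

    def __init__(self):
        self._entries = []   # (sequence number, sample), oldest first
        self._next_seq = 0

    def append(self, sample):
        """Record a scraped sample and return its sequence number."""
        seq = self._next_seq
        self._entries.append((seq, sample))
        self._next_seq += 1
        return seq

    def pending(self):
        """Samples that have not been acknowledged yet."""
        return [sample for _, sample in self._entries]

    def flush(self, send):
        """Send everything pending; delete it only if send() succeeds.

        `send` is a callable that takes a list of samples and returns
        True on success. Returns the number of samples deleted.
        """
        if not self._entries:
            return 0
        batch = list(self._entries)
        if not send([sample for _, sample in batch]):
            return 0  # Keep the data so it can be retried later.
        last_seq = batch[-1][0]
        self._entries = [(s, v) for s, v in self._entries if s > last_seq]
        return len(batch)
```

If the remote endpoint is unreachable, flush leaves the buffer untouched, so samples survive a temporary outage and are retried on the next attempt.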
This design reduces memory and CPU usage, making it ideal for edge or resource‑limited environments, and simplifies horizontal scaling by treating agents as essentially stateless collectors.
The approach was proven at scale by Grafana Labs before the code was contributed to the main Prometheus codebase, where it was introduced as an experimental feature.
To use the feature, start Prometheus with --enable-feature=agent. Running prometheus --help lists the agent-specific flags, such as --storage.agent.path="data-agent/". In Agent mode the Web UI's query functionality is disabled, but configuration, scrape targets, and service discovery remain visible.
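For scripted deployments, the flags above can be assembled programmatically. The helper below is a hypothetical convenience function, not part of Prometheus; it assumes a prometheus binary on the PATH and a configuration file that already defines remote write.

```python
import shlex


def agent_command(config_file, wal_dir="data-agent/", binary="prometheus"):
    """Build the argument list for starting Prometheus in Agent mode."""
    return [
        binary,
        "--enable-feature=agent",             # switch to Agent mode
        f"--config.file={config_file}",       # scrape and remote-write config
        f"--storage.agent.path={wal_dir}",    # where the agent WAL is kept
    ]


command = agent_command("prometheus.yml")
# Shell-safe rendering of the command, e.g. for logging.
command_line = shlex.join(command)
```

The resulting list can be handed to subprocess.run(command, check=True) or written into a service definition.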
Hands‑on tutorials, such as the Katacoda Prometheus remote‑write/Thanos lab, allow users to try the Agent mode and explore its remote‑write capabilities.
