Service Discovery in Envoy: Types, Consistency Models, and Health‑Check Routing
This article explains Envoy’s service discovery mechanisms—including static, strict DNS, logical DNS, original‑destination, and Service Discovery Service—detailing how they work, their consistency models, and how health‑checking influences routing decisions in production environments.
When an upstream cluster is defined in the configuration, Envoy must know how to resolve the members of that cluster; this resolution process is called service discovery.
Supported Service Discovery Types
Static
Static is the simplest type; the configuration explicitly lists each upstream host’s resolved network name (IP/port, Unix socket, etc.).
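As a minimal sketch in Envoy's v3 YAML configuration (the cluster name, address, and port below are illustrative, not taken from the article), a static cluster simply enumerates its hosts:

```yaml
clusters:
- name: local_service          # illustrative name
  type: STATIC
  connect_timeout: 0.25s
  load_assignment:
    cluster_name: local_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1   # each host is listed explicitly
              port_value: 8080
```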
Strict DNS
With strict DNS, Envoy continuously and asynchronously resolves the specified DNS target. Each IP address returned is treated as an explicit host in the upstream cluster, and hosts are added or removed based on DNS results. Envoy never performs synchronous DNS resolution in the forwarding path, accepting eventual consistency.
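A hedged v3-style sketch of a strict DNS cluster (hostname, port, and refresh rate are illustrative) looks like the following; every IP the query returns becomes a distinct upstream host:

```yaml
clusters:
- name: backend
  type: STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s         # re-resolve asynchronously on this interval
  load_assignment:
    cluster_name: backend
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: backend.internal.example.com  # illustrative DNS target
              port_value: 8080
```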
Logical DNS
Logical DNS uses the same asynchronous mechanism as strict DNS but assumes the first IP address returned represents the whole upstream cluster, keeping a single connection pool that can serve many physical hosts. This avoids connection churn when interacting with large web services that return many IPs per query.
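Configuration-wise, a logical DNS cluster is a sketch away from the strict DNS one above: only the `type` changes (names here are again illustrative). Envoy resolves asynchronously in the same way but keeps a single logical host backed by whichever address was returned first:

```yaml
clusters:
- name: large_web_service
  type: LOGICAL_DNS            # only the type differs from STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s
  load_assignment:
    cluster_name: large_web_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: api.example.com   # many IPs per query; first one is used
              port_value: 443
```

This is the design choice that avoids connection churn: the connection pool is keyed to the logical hostname, not to individual IPs.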
Original Destination
When inbound connections are redirected to Envoy via an iptables REDIRECT rule or the PROXY protocol, the original destination cluster can be used. Requests are forwarded to the upstream host identified by the original destination address recovered from the connection, without any explicit host configuration. Hosts that go unused are cleaned up after a configurable idle interval.
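A minimal v3-style sketch of such a cluster (the name and intervals are illustrative) requires no endpoint list at all, since hosts are discovered from connection metadata:

```yaml
clusters:
- name: passthrough
  type: ORIGINAL_DST
  lb_policy: CLUSTER_PROVIDED  # load balancing is driven by the restored destination
  connect_timeout: 1s
  cleanup_interval: 60s        # drop hosts that have been idle this long
```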
Service Discovery Service (SDS)
SDS is a generic REST API that Envoy uses to obtain cluster members. Lyft’s reference implementation uses AWS DynamoDB as a backend, but the API is simple enough to be implemented over various stores. Envoy periodically polls the SDS for members, making it the preferred mechanism because it provides per‑host insight and additional attributes such as load‑balancing weight, canary status, and region.
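The REST-based SDS described here is the ancestor of what current Envoy versions expose as the Endpoint Discovery Service (EDS) within the xDS API suite. A hedged v3-style sketch of an EDS-backed cluster (cluster names are illustrative; `xds_cluster` is assumed to point at a management server defined elsewhere) looks like:

```yaml
clusters:
- name: dynamic_service
  type: EDS
  connect_timeout: 0.5s
  eds_cluster_config:
    eds_config:
      resource_api_version: V3
      api_config_source:
        api_type: GRPC
        transport_api_version: V3
        grpc_services:
        - envoy_grpc:
            cluster_name: xds_cluster   # assumed management-server cluster
```

The discovery server can then attach per-host attributes such as load-balancing weight, canary status, and locality to each endpoint it returns.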
Typically, active health checking is combined with eventually consistent service discovery data to drive load‑balancing and routing decisions.
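As a sketch, active health checking is configured per cluster; the path, intervals, and thresholds below are illustrative assumptions, not values from the article:

```yaml
# Fragment placed inside a cluster definition.
health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 3     # consecutive failures before marking unhealthy
  healthy_threshold: 2       # consecutive passes before marking healthy again
  http_health_check:
    path: /healthcheck       # illustrative endpoint on the upstream host
```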
Eventually Consistent Service Discovery
Many RPC systems require strongly consistent service discovery backed by systems such as ZooKeeper, etcd, or Consul, which are operationally painful to run at scale. Envoy is designed for eventually consistent discovery: it assumes hosts join and leave the mesh in an eventually consistent manner and relies on active health checks to determine cluster health.
All health decisions are fully distributed, allowing graceful handling of network partitions. Envoy uses a 2×2 matrix to decide routing based on discovery status and health‑check result:
Discovery Status   HC OK    HC Failed
Discovered         Route    Don't Route
Absent             Route    Don't Route / Delete
Host discovered & health‑check OK – Envoy routes to the target host.
Host absent & health‑check OK – Envoy still routes to the target host, allowing existing hosts to continue serving while new hosts cannot be added until discovery data returns.
Host discovered & health‑check FAIL – Envoy does not route to the target host, assuming health‑check data is more accurate.
Host absent & health‑check FAIL – Envoy does not route and deletes the target host; this is the only state where Envoy clears host data.