How Netflix’s Data Gateway Simplifies Distributed Database Access
This article explains how Netflix built the Data Gateway platform to abstract and protect complex distributed databases, detailing its motivation, architecture, component overview, declarative runtime and deployment configurations, and real‑world case studies such as key‑value services, secure RDS, and seamless data migration.
Motivation
Netflix operates many open‑source data stores (Cassandra, EVCache, OpenSearch, etc.) and provides client libraries for developers. While this reduces engineering effort, coupling applications to multiple evolving APIs raises long‑term maintenance cost, the risk of misuse, and the likelihood of product outages. Developers repeatedly reinvent common patterns (e.g., caching) and must integrate service discovery, RPC resilience, authentication, and authorization for each database, which is labor‑intensive and error‑prone.
Data Gateway Overview
The Data Gateway platform provides a stable online data access layer (DAL) that exposes custom gRPC and HTTP APIs, abstracts the underlying distributed databases, prevents anti‑patterns, and enhances security, reliability, and scalability.
Component Overview
EC2 instances – high‑performance Linux VMs tuned for low latency.
Data‑gateway proxy – sidecar process that launches container images and manages service registration.
Container runtime – standard OCI runtime that runs, monitors, restarts, and connects proxy and DAL containers.
Envoy proxy – service‑mesh sidecar acting as a reverse proxy.
Data Abstraction Layer (DAL) – containerized application code exposing HTTP/gRPC data‑access services.
Declarative configuration – concise specs describing target clusters and instance state.
Applications connect via Netflix discovery or AWS load balancers; Envoy terminates TLS, authorizes connections, and forwards requests to the appropriate DAL container.
Declarative Configuration
Runtime Configuration
# Configure proxy listeners
proxy_config:
  public_listeners:
    secure_grpc: {mode: grpc, tls_creds: metatron, authz: gandalf, path: 8980}

# Define DAL containers
container_dals:
  cql:
    container_listeners: {secure_grpc: 8980}
    image: "dgw-kv"
  thrift:
    container_listeners: {secure_grpc: 8980}
    image: "dgw-kv"
    env:
      STORAGE_ENGINE: "thrift"

# Advanced wiring
wiring:
  thrift: {mode: shadow, target: cql}

This configuration creates two key‑value DAL containers (cql and thrift) from the dgw-kv image, exposes a secure gRPC listener on port 8980 with mTLS (Metatron) and authorization (Gandalf), and uses the wiring section to shadow traffic between the thrift and cql containers.
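One way to read the wiring section is as a map from a DAL container to the containers that receive a copy of its traffic. The sketch below is a hypothetical interpretation, not the Data Gateway's actual resolver: the `Route` type, the `resolve_routes` function, and the assumption that the wiring key is the source and `target` is the shadow recipient are all illustrative.

```python
from typing import Dict, List, NamedTuple


class Route(NamedTuple):
    primary: str          # container that serves the live response
    shadows: List[str]    # containers that receive a copy of each request


def resolve_routes(config: dict) -> Dict[str, Route]:
    """Build one Route per DAL container from a runtime config dict."""
    containers = config.get("container_dals", {})
    routes = {name: Route(primary=name, shadows=[]) for name in containers}
    for source, rule in config.get("wiring", {}).items():
        target = rule["target"]
        # Assumption: both ends of a wiring rule must be defined containers.
        if source not in containers or target not in containers:
            raise ValueError(f"wiring references unknown container: {source} -> {target}")
        if rule["mode"] != "shadow":
            raise ValueError(f"unsupported wiring mode: {rule['mode']}")
        routes[source].shadows.append(target)
    return routes


# The runtime configuration from the article, written as a plain dict.
runtime_config = {
    "container_dals": {
        "cql": {"image": "dgw-kv", "container_listeners": {"secure_grpc": 8980}},
        "thrift": {
            "image": "dgw-kv",
            "container_listeners": {"secure_grpc": 8980},
            "env": {"STORAGE_ENGINE": "thrift"},
        },
    },
    "wiring": {"thrift": {"mode": "shadow", "target": "cql"}},
}
routes = resolve_routes(runtime_config)
```

Validating the wiring up front means a typo in a container name fails at configuration time rather than silently dropping shadow traffic.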
Deployment Desires
deploy_desires:
  capacity:
    model_name: org.netflix.key-value
    query_pattern:
      access_pattern: latency
      estimated_read_per_second: {low: 2000, mid: 20000, high: 200000}
      estimated_write_per_second: {low: 2000, mid: 20000, high: 200000}
    data_shape:
      estimated_state_size_gib: {low: 20, mid: 200, high: 2000}
      reserved_instance_app_mem_gib: 20
  service_tier: 0
  version_set:
    artifacts:
      dals/dgw-kv: {kind: branch, value: main}
      configs/main: {kind: branch, sha: ${DGW_CONFIG_VERSION}}
  locations:
    - account: prod
      regions: [us-east-2, us-east-1, eu-west-1, us-west-2]
    - account: prod
      regions: [us-east-1]
      stack: leader
  owners:
    - {type: google-group, value: [email protected]}
    - {type: pager, value: our-cool-pagerduty-service}
  consumers:
    - {type: account-app, value: prod-api, group: read-write}
    - {type: account-app, value: studio_prod-ui, group: read-only}

The desires drive capacity planning, automatic instance selection, and staged rollouts. They specify expected read/write rates, data size, service tier, artifact versions, deployment locations, owners, and consuming applications.
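The low/mid/high estimates in the capacity block lend themselves to a simple sanity check before any planning runs. The following is a minimal sketch under stated assumptions: the `Interval` class and `parse_capacity` function are hypothetical names, and the rule that estimates must be non-negative and ordered is an assumption rather than documented platform behavior.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Interval:
    """A low/mid/high estimate, as used in the capacity desires."""
    low: float
    mid: float
    high: float

    def __post_init__(self) -> None:
        # Assumption: a usable estimate is non-negative and ordered.
        if not (0 <= self.low <= self.mid <= self.high):
            raise ValueError(f"estimate must satisfy 0 <= low <= mid <= high: {self}")


def parse_capacity(capacity: dict) -> Dict[str, Interval]:
    """Extract and validate the interval estimates from a capacity block."""
    query = capacity["query_pattern"]
    shape = capacity["data_shape"]
    return {
        "reads_per_second": Interval(**query["estimated_read_per_second"]),
        "writes_per_second": Interval(**query["estimated_write_per_second"]),
        "state_size_gib": Interval(**shape["estimated_state_size_gib"]),
    }


# The capacity block from the article, written as a plain dict.
capacity = {
    "model_name": "org.netflix.key-value",
    "query_pattern": {
        "access_pattern": "latency",
        "estimated_read_per_second": {"low": 2000, "mid": 20000, "high": 200000},
        "estimated_write_per_second": {"low": 2000, "mid": 20000, "high": 200000},
    },
    "data_shape": {
        "estimated_state_size_gib": {"low": 20, "mid": 200, "high": 2000},
        "reserved_instance_app_mem_gib": 20,
    },
}
estimates = parse_capacity(capacity)
```

Expressing each estimate as a range rather than a single number lets a planner size for the expected case while still checking headroom for the high case.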
Case Study: Key‑Value Service
proxy_config:
  public_listeners:
    secure_grpc: {authz: gandalf, mode: grpc, path: "8980", tls_creds: metatron}
    secure_http: {authz: gandalf, mode: http, path: "8443", tls_creds: metatron}

container_dals:
  kv:
    container_cmd: /apps/dgw-kv/start.sh
    container_listeners: {http: "8080", secure_grpc: "8980", secure_http: "8443"}
    env:
      MEMORY: 8000m
      spring.app.property: property_value
    healthcheck:
      test:
        - CMD-SHELL
        - /usr/bin/curl -f -s --connect-timeout 0.500 --max-time 2 http://envoy:8080/admin/health
    image: "dgw-kv"

registrations:
  - address: shard.dgwkvgrpc,shard.dgwkv
    mode: nflx-discovery

The KV DAL runs a Spring Boot application exposing gRPC and HTTP interfaces. It combines multiple storage engines and implements hedging, side‑cache, large‑data chunking, adaptive paging, and circuit breaking. Envoy creates listeners for the public ports, registers the shard via Netflix discovery, and routes traffic to the KV container.
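Of the features listed above, large-data chunking is the easiest to illustrate in isolation. The sketch below shows the general technique only: the function names and the 1024-byte chunk size are assumptions, and the article does not describe how the KV service actually sizes or stores its chunks.

```python
from typing import Iterable, List


def split_into_chunks(value: bytes, chunk_size: int) -> List[bytes]:
    """Split a large value into fixed-size chunks; the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]
    # An empty value still yields one (empty) chunk so it can be stored and read back.
    return chunks or [b""]


def reassemble(chunks: Iterable[bytes]) -> bytes:
    """Join chunks, in order, back into the original value."""
    return b"".join(chunks)


payload = b"x" * 2500
chunks = split_into_chunks(payload, chunk_size=1024)  # hypothetical chunk size
restored = reassemble(chunks)
```

Keeping each stored item under a bounded size avoids sending very large single writes to the storage engine, at the cost of reading several items back to reconstruct the value.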
Case Study: Secure RDS
proxy_config:
  public_listeners:
    secure_postgres: {mode: tcp, path: "5432", tls_creds: metatron, authz: gandalf}

container_dals: {}
network_dals:
  rds:
    listeners:
      secure_postgres: postgresql://rds-db.example.com:5432
    mode: logical_dns

Secure RDS uses the gateway as a transparent pass‑through for PostgreSQL/MySQL. Envoy terminates mTLS, then forwards traffic to the backend RDS cluster. Clients run a local forward‑proxy that discovers the gateway, listens on localhost:5432, and tunnels traffic through mTLS to the gateway, which then forwards to the RDS instance.
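The client-side forward proxy can be pictured as a small TCP relay that accepts local connections and opens a TLS connection to the gateway for each one. This is a rough sketch using Python's asyncio, not Netflix's client: `start_forward_proxy` is a hypothetical name, gateway discovery is replaced by an explicit host and port, and the mTLS material (normally issued by Metatron) is assumed to arrive as a caller-supplied `ssl.SSLContext`.

```python
import asyncio
import ssl
from typing import Optional


async def _pump(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def start_forward_proxy(
    listen_port: int,
    gateway_host: str,
    gateway_port: int,
    tls: Optional[ssl.SSLContext] = None,
) -> asyncio.AbstractServer:
    """Listen on localhost and relay each connection to the gateway.

    Pass an SSLContext loaded with a client certificate to tunnel over mTLS;
    with tls=None the relay uses plain TCP, which is only useful for local tests.
    """

    async def handle(client_reader: asyncio.StreamReader,
                     client_writer: asyncio.StreamWriter) -> None:
        try:
            up_reader, up_writer = await asyncio.open_connection(
                gateway_host, gateway_port, ssl=tls
            )
        except OSError:
            client_writer.close()
            return
        # Relay both directions until either side closes.
        await asyncio.gather(
            _pump(client_reader, up_writer),
            _pump(up_reader, client_writer),
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

With such a relay listening on port 5432, an unmodified database driver can connect to localhost while the relay carries the bytes to the gateway.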
Case Study: Seamless Data Migration
proxy_config:
  public_listeners:
    secure_grpc: {mode: grpc, path: 8980}

container_dals:
  cql:
    container_listeners:
      secure_grpc: 8980
  thrift:
    container_listeners:
      secure_grpc: 8980

wiring:
  thrift: {mode: shadow, target: cql}

The platform supports traffic shadowing to migrate data stores. Two DAL containers (primary and secondary) run side‑by‑side; the proxy routes live traffic to the primary and shadows it to the secondary. After back‑filling, the secondary is promoted. This approach enabled migration of hundreds of Cassandra‑2 clusters to Cassandra‑3.
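The shadow-then-promote flow can be modeled with a small in-process router. This is an illustrative sketch, not the gateway's proxy logic: `ShadowingRouter`, the tuple request format, and the response comparison are assumptions, and the shadow call is made synchronously here for simplicity, whereas a real proxy would typically mirror requests without blocking the caller.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Request = Tuple[str, str, Optional[str]]   # (operation, key, value)
Backend = Callable[[Request], Any]


class ShadowingRouter:
    """Serve from the primary backend and mirror every request to the secondary."""

    def __init__(self, primary: Backend, secondary: Backend) -> None:
        self.primary = primary
        self.secondary = secondary
        self.mismatches: List[Request] = []

    def handle(self, request: Request) -> Any:
        response = self.primary(request)
        try:
            # The shadow response is never returned to the caller.
            if self.secondary(request) != response:
                self.mismatches.append(request)
        except Exception:
            # A failing secondary must not affect live traffic.
            self.mismatches.append(request)
        return response

    def promote(self) -> None:
        """Swap roles once the secondary has been back-filled and verified."""
        self.primary, self.secondary = self.secondary, self.primary


def make_backend(store: Dict[str, str]) -> Backend:
    """Wrap a dict as a toy key-value backend."""
    def backend(request: Request) -> Any:
        op, key, value = request
        if op == "put":
            store[key] = value
            return "ok"
        return store.get(key)
    return backend


old_store: Dict[str, str] = {}
new_store: Dict[str, str] = {}
router = ShadowingRouter(make_backend(old_store), make_backend(new_store))
router.handle(("put", "k1", "v1"))
result = router.handle(("get", "k1", None))
```

Recording mismatches gives operators a signal for whether the secondary is ready before calling `promote`.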
Conclusion and Future Work
The Data Gateway reduces operational complexity, protects developers from low‑level database APIs, and provides a unified, secure, and scalable data access layer. Future work includes unified authentication/authorization for L4/L7 databases, additional gRPC services (time‑series, entity), and further platform enhancements.
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
