How Netflix’s Data Gateway Simplifies Distributed Database Access

This article explains how Netflix built the Data Gateway platform to abstract and protect complex distributed databases, detailing its motivation, architecture, component overview, declarative runtime and deployment configurations, and real‑world case studies such as key‑value services, secure RDS, and seamless data migration.

JavaEdge

Motivation

Netflix operates many open‑source data stores (Cassandra, EVCache, OpenSearch, etc.) and provides client libraries for developers. While this reduces engineering effort, coupling applications to multiple evolving APIs raises long‑term maintenance cost, the risk of misuse, and the likelihood of product outages. Developers repeatedly reinvent common patterns (e.g., caching) and must integrate service discovery, RPC resilience, authentication, and authorization for each database, which is labor‑intensive and error‑prone.

Data Gateway Overview

The Data Gateway platform provides a stable online data access layer (DAL) that exposes custom gRPC and HTTP APIs, abstracts the underlying distributed databases, prevents anti‑patterns, and enhances security, reliability, and scalability.

Component Overview

EC2 instances – high‑performance Linux VMs tuned for low latency.

Data‑gateway proxy – sidecar process that launches container images and manages service registration.

Container runtime – standard OCI runtime that runs, monitors, restarts, and connects proxy and DAL containers.

Envoy proxy – service‑mesh sidecar acting as a reverse proxy.

Data Abstraction Layer (DAL) – containerized application code exposing HTTP/gRPC data‑access services.

Declarative configuration – concise specs describing target clusters and instance state.

Applications connect via Netflix discovery or AWS load balancers; Envoy terminates TLS, authorizes connections, and forwards requests to the appropriate DAL container.

Declarative Configuration

Runtime Configuration

# Configure proxy listeners
proxy_config:
  public_listeners:
    secure_grpc: {mode: grpc, tls_creds: metatron, authz: gandalf, path: 8980}
# Define DAL containers
container_dals:
  cql:
    container_listeners: {secure_grpc: 8980}
    image: "dgw-kv"
  thrift:
    container_listeners: {secure_grpc: 8980}
    image: "dgw-kv"
    env:
      STORAGE_ENGINE: "thrift"
# Advanced wiring
wiring:
  thrift: {mode: shadow, target: cql}

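This configuration creates two key‑value DAL containers (cql and thrift) from the dgw-kv image and exposes a secure gRPC listener on host port 8980, authenticated with mTLS (Metatron) and authorized through Gandalf. Requests are forwarded to the DAL process listening for secure_grpc on port 8980 inside the container. The wiring section states that calls to thrift are also shadowed (mirrored) to the cql container.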
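
To make the relationship between listeners, containers, and wiring concrete, here is a minimal Python sketch that reads the runtime configuration above (transcribed as a dict) and derives a routing table. The function name `resolve_routes` and the shape of the output are assumptions made for illustration; the source does not describe how the real proxy interprets the spec internally.

```python
from typing import List, Optional

# The runtime configuration from above, transcribed as a Python dict.
RUNTIME_CONFIG = {
    "proxy_config": {
        "public_listeners": {
            "secure_grpc": {"mode": "grpc", "tls_creds": "metatron",
                            "authz": "gandalf", "path": 8980},
        }
    },
    "container_dals": {
        "cql": {"container_listeners": {"secure_grpc": 8980}, "image": "dgw-kv"},
        "thrift": {"container_listeners": {"secure_grpc": 8980}, "image": "dgw-kv",
                   "env": {"STORAGE_ENGINE": "thrift"}},
    },
    "wiring": {"thrift": {"mode": "shadow", "target": "cql"}},
}


def resolve_routes(config: dict) -> List[dict]:
    """Hypothetical helper: match public listeners to DAL containers.

    Returns one entry per (listener, container) pair, including the shadow
    target declared in the wiring section, if any.
    """
    dals = config.get("container_dals", {})
    wiring = config.get("wiring", {})
    routes = []
    for name, listener in config["proxy_config"]["public_listeners"].items():
        for dal, spec in dals.items():
            port = spec.get("container_listeners", {}).get(name)
            if port is None:
                continue  # this container does not serve this listener
            rule = wiring.get(dal, {})
            shadow: Optional[str] = None
            if rule.get("mode") == "shadow":
                shadow = rule["target"]
                if shadow not in dals:
                    raise ValueError(f"wiring for {dal!r} targets unknown DAL {shadow!r}")
            routes.append({
                "listener": name,
                "public_port": listener["path"],
                "container": dal,
                "container_port": port,
                "shadow_to": shadow,
            })
    return routes


for route in resolve_routes(RUNTIME_CONFIG):
    print(route)
```

The useful property of a declarative spec is visible here: a typo in the wiring target is caught when the spec is read, before any traffic is routed.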

Deployment Desires

deploy_desires:
  capacity:
    model_name: org.netflix.key-value
    query_pattern:
      access_pattern: latency
      estimated_read_per_second: {low: 2000, mid: 20000, high: 200000}
      estimated_write_per_second: {low: 2000, mid: 20000, high: 200000}
    data_shape:
      estimated_state_size_gib: {low: 20, mid: 200, high: 2000}
      reserved_instance_app_mem_gib: 20
  service_tier: 0
  version_set:
    artifacts:
      dals/dgw-kv: {kind: branch, value: main}
      configs/main: {kind: sha, value: ${DGW_CONFIG_VERSION}}
  locations:
    - account: prod
      regions: [us-east-2, us-east-1, eu-west-1, us-west-2]
    - account: prod
      regions: [us-east-1]
      stack: leader
  owners:
    - {type: google-group, value: [email protected]}
    - {type: pager, value: our-cool-pagerduty-service}
  consumers:
    - {type: account-app, value: prod-api, group: read-write}
    - {type: account-app, value: studio_prod-ui, group: read-only}

The desires drive capacity planning, automatic instance selection, and staged rollouts. They specify expected read/write rates, data size, service tier, artifact versions, deployment locations, owners, and consuming applications.
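
As an illustration of how such a spec could be consumed, the Python sketch below checks that every low/mid/high estimate is ordered and expands the locations list into individual deployment targets. The helper names `check_intervals` and `expand_locations` are hypothetical, and this is not Netflix's capacity model; it only shows the kind of mechanical processing a declarative spec makes possible.

```python
from typing import List, Optional, Tuple

# A subset of the deployment desires above, transcribed as a Python dict.
DEPLOY_DESIRES = {
    "capacity": {
        "query_pattern": {
            "access_pattern": "latency",
            "estimated_read_per_second": {"low": 2000, "mid": 20000, "high": 200000},
            "estimated_write_per_second": {"low": 2000, "mid": 20000, "high": 200000},
        },
        "data_shape": {
            "estimated_state_size_gib": {"low": 20, "mid": 200, "high": 2000},
            "reserved_instance_app_mem_gib": 20,
        },
    },
    "locations": [
        {"account": "prod",
         "regions": ["us-east-2", "us-east-1", "eu-west-1", "us-west-2"]},
        {"account": "prod", "regions": ["us-east-1"], "stack": "leader"},
    ],
}


def check_intervals(capacity: dict) -> List[str]:
    """Return the names of estimates whose low/mid/high values are out of order."""
    bad = []
    for section in capacity.values():
        if not isinstance(section, dict):
            continue
        for name, estimate in section.items():
            # Only low/mid/high intervals are checked; scalar fields are skipped.
            if isinstance(estimate, dict) and {"low", "mid", "high"} <= set(estimate):
                if not estimate["low"] <= estimate["mid"] <= estimate["high"]:
                    bad.append(name)
    return bad


def expand_locations(locations: List[dict]) -> List[Tuple[str, str, Optional[str]]]:
    """Flatten the locations list into (account, region, stack) targets."""
    return [
        (loc["account"], region, loc.get("stack"))
        for loc in locations
        for region in loc["regions"]
    ]


print(check_intervals(DEPLOY_DESIRES["capacity"]))
for target in expand_locations(DEPLOY_DESIRES["locations"]):
    print(target)
```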

Case Study: Key‑Value Service

proxy_config:
  public_listeners:
    secure_grpc: {authz: gandalf, mode: grpc, path: "8980", tls_creds: metatron}
    secure_http: {authz: gandalf, mode: http, path: "8443", tls_creds: metatron}
container_dals:
  kv:
    container_cmd: /apps/dgw-kv/start.sh
    container_listeners: {http: "8080", secure_grpc: "8980", secure_http: "8443"}
    env:
      MEMORY: 8000m
      spring.app.property: property_value
    healthcheck:
      test:
        - CMD-SHELL
        - /usr/bin/curl -f -s --connect-timeout 0.500 --max-time 2 http://envoy:8080/admin/health
    image: "dgw-kv"
registrations:
  - address: shard.dgwkvgrpc,shard.dgwkv
    mode: nflx-discovery

The KV DAL runs a Spring Boot application exposing gRPC and HTTP interfaces. It combines multiple storage engines and implements hedging, side‑cache, large‑data chunking, adaptive paging, and circuit breaking. Envoy creates listeners for the public ports, registers the shard via Netflix discovery, and routes traffic to the KV container.
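
Of the techniques listed, large‑data chunking is the easiest to show in isolation. The Python sketch below splits a value into fixed‑size chunks, stores them under derived keys, and writes a small manifest last. The key layout (`key#chunkN`), the manifest format, and the tiny chunk size are all assumptions for illustration; the source does not describe how the KV DAL actually lays out chunks.

```python
import hashlib
from typing import Dict

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def put_chunked(store: Dict[str, bytes], key: str, value: bytes,
                chunk_size: int = CHUNK_SIZE) -> None:
    """Store a value as fixed-size chunks plus a manifest under the main key."""
    chunks = [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]
    for n, chunk in enumerate(chunks):
        store[f"{key}#chunk{n}"] = chunk
    # Write the manifest last, so a reader that finds the manifest
    # can expect all of its chunks to be present already.
    manifest = f"{len(chunks)}:{hashlib.sha256(value).hexdigest()}"
    store[key] = manifest.encode()


def get_chunked(store: Dict[str, bytes], key: str) -> bytes:
    """Reassemble a chunked value and verify it against the manifest checksum."""
    count, digest = store[key].decode().split(":")
    value = b"".join(store[f"{key}#chunk{n}"] for n in range(int(count)))
    if hashlib.sha256(value).hexdigest() != digest:
        raise ValueError(f"checksum mismatch while reading {key!r}")
    return value


store: Dict[str, bytes] = {}
put_chunked(store, "profile:42", b"hello world")
print(sorted(store))
print(get_chunked(store, "profile:42"))
```

Doing this inside the DAL rather than in each client is the point of the abstraction: callers see a single put and get, regardless of how the value is split underneath.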

Case Study: Secure RDS

proxy_config:
  public_listeners:
    secure_postgres: {mode: tcp, path: "5432", tls_creds: metatron, authz: gandalf}
container_dals: {}
network_dals:
  rds:
    listeners:
      secure_postgres: postgresql://rds-db.example.com:5432
    mode: logical_dns

Secure RDS uses the gateway as a transparent pass‑through for PostgreSQL/MySQL, with no DAL container involved (container_dals is empty). Clients run a local forward proxy that discovers the gateway, listens on localhost:5432, and tunnels traffic to the gateway over mTLS. Envoy terminates mTLS, authorizes the connection, and forwards the raw TCP stream to the backend RDS cluster named in network_dals.
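
The client‑side forward proxy can be pictured as a small TCP tunnel. The following Python sketch, built on `asyncio` streams, listens on localhost and copies bytes in both directions to an upstream address. It is an assumption‑laden illustration: `start_forward_proxy` is a hypothetical name, gateway discovery is left out, and TLS is optional (pass an `ssl.SSLContext` loaded with a client certificate to approximate the mTLS leg).

```python
import asyncio
import ssl
from typing import Optional


async def _pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # the peer went away; fall through and close our side
    finally:
        writer.close()


async def start_forward_proxy(
    listen_port: int,
    gateway_host: str,
    gateway_port: int,
    tls: Optional[ssl.SSLContext] = None,
) -> asyncio.AbstractServer:
    """Listen on 127.0.0.1 and tunnel each connection to the gateway."""

    async def handle(client_reader: asyncio.StreamReader,
                     client_writer: asyncio.StreamWriter) -> None:
        try:
            upstream_reader, upstream_writer = await asyncio.open_connection(
                gateway_host, gateway_port, ssl=tls
            )
        except OSError:
            client_writer.close()  # gateway unreachable: drop the client
            return
        # Relay both directions until either side closes.
        await asyncio.gather(
            _pipe(client_reader, upstream_writer),
            _pipe(upstream_reader, client_writer),
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)


# Usage sketch (hostname is a placeholder, not a real endpoint):
#   server = await start_forward_proxy(5432, "gateway.example.internal", 5432, tls=ctx)
#   async with server:
#       await server.serve_forever()
```

With a tunnel like this in place, the application keeps using an ordinary PostgreSQL driver pointed at localhost:5432 and needs no knowledge of certificates or the gateway address.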

Case Study: Seamless Data Migration

proxy_config:
  public_listeners:
    secure_grpc: {mode: grpc, path: 8980}
container_dals:
  cql:
    container_listeners:
      secure_grpc: 8980
  thrift:
    container_listeners:
      secure_grpc: 8980
wiring:
  thrift: {mode: shadow, target: cql}

The platform supports traffic shadowing to migrate data stores. Two DAL containers (a primary and a secondary) run side by side; the proxy serves live traffic from the primary and mirrors each request to the secondary. In the configuration above, thrift plays the primary role and cql is the shadow target. Once historical data has been back‑filled into the secondary, it is promoted to primary. This approach enabled the migration of hundreds of Cassandra 2 clusters to Cassandra 3.
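
The shadowing behaviour can be sketched in a few lines of Python. In this simplified model the caller only ever receives the primary's response, while the shadow's response is compared and any difference or failure is recorded. The `ShadowRouter` class is hypothetical, and the shadow call is made synchronously for readability; a real proxy would mirror requests without adding latency to the caller.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]


class ShadowRouter:
    """Serve requests from the primary and mirror them to a shadow backend."""

    def __init__(self, primary: Handler, shadow: Handler) -> None:
        self.primary = primary
        self.shadow = shadow
        self.mismatches: List[Tuple[str, str, str]] = []
        self.shadow_errors = 0

    def handle(self, request: str) -> str:
        response = self.primary(request)  # the only response the caller sees
        try:
            shadow_response = self.shadow(request)
        except Exception:
            # A failing shadow must never affect live traffic.
            self.shadow_errors += 1
        else:
            if shadow_response != response:
                self.mismatches.append((request, response, shadow_response))
        return response


# Two in-memory stores stand in for the old and new databases.
old_store = {"a": "1", "b": "2"}
new_store = {"a": "1"}  # "b" has not been back-filled yet

router = ShadowRouter(primary=old_store.__getitem__, shadow=new_store.__getitem__)
print(router.handle("a"), router.handle("b"))
print("shadow errors:", router.shadow_errors, "mismatches:", router.mismatches)
```

Tracking mismatches and shadow errors over time gives a signal for when the secondary is ready to be promoted: both should approach zero once back‑filling is complete.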

Conclusion and Future Work

The Data Gateway reduces operational complexity, protects developers from low‑level database APIs, and provides a unified, secure, and scalable data access layer. Future work includes unified authentication/authorization for L4/L7 databases, additional gRPC services (time‑series, entity), and further platform enhancements.
