
How Vivo Optimized Ingress‑NGINX for High‑Performance AI Workloads on Kubernetes

This article details Vivo's AI container platform, explaining the design, deployment options, performance tuning, stability measures, and future enhancements of the ingress‑nginx gateway that powers AI services within their Kubernetes clusters.

MaGe Linux Operations

Background

Since late 2018, Vivo's AI Computing Platform team has built an AI container platform (VContainer) on Kubernetes and has accumulated extensive development and operations experience with both online services and offline deep‑learning training. To schedule resources more efficiently, online services (consumer‑facing applications, inference, etc.) are being migrated from VMs and bare metal to the AI container platform.

Ingress Overview

Kubernetes abstracts a group of containers as a Pod and provides workload types such as Deployment, StatefulSet, and DaemonSet. A Service handles east‑west traffic inside the cluster, while an Ingress, served by a cluster gateway, manages north‑south traffic from outside the cluster to Services.

Ingress resources define routing rules; the most common controller is NGINX Ingress. Vivo uses the community‑provided ingress‑nginx controller.
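As an illustration of what such a routing rule looks like, here is a minimal Ingress sketch. The host, namespace, Service name, and port are hypothetical placeholders, not values from Vivo's platform, and the manifest assumes a cluster that serves the networking.k8s.io/v1 API and has an IngressClass named "nginx".

```yaml
# Hypothetical Ingress: every name, host, and port below is a placeholder.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-demo
  namespace: default
spec:
  ingressClassName: nginx          # assumes an IngressClass called "nginx" exists
  rules:
    - host: inference.example.com  # requests for this host...
      http:
        paths:
          - path: /                # ...under this path prefix...
            pathType: Prefix
            backend:
              service:
                name: inference-svc   # ...are routed to this Service
                port:
                  number: 8080
```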

Ingress‑NGINX Components

The controller watches cluster resources (Ingress, Endpoints, ConfigMap, etc.) and translates routing rules into NGINX configuration; NGINX then forwards external HTTP requests to the appropriate Pods.

Routing Schemes

Two upstream routing schemes are used:

Scheme 1 (default): the NGINX upstream lists Pod IPs, which ngx‑lua updates dynamically as Pods come and go.

Scheme 2: the NGINX upstream points at the Service rather than at individual Pods, and kube‑proxy's iptables rules select the backend Pod; NGINX therefore reacts less dynamically to Pod changes.

Vivo adopts Scheme 1 for its stability.
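In the community ingress‑nginx controller, Pod‑IP upstreams are the default behavior, and Service‑based upstreams are opted into per Ingress with the service-upstream annotation. The fragment below is a sketch of that opt‑in; the source does not say how Vivo's build exposes the choice, and the Ingress name is a placeholder.

```yaml
# Fragment of an Ingress manifest (metadata only); the name is hypothetical.
metadata:
  name: inference-demo
  annotations:
    # Scheme 2: proxy to the Service's ClusterIP and let kube-proxy pick the Pod.
    # Omit the annotation (or set "false") to keep Scheme 1, the default.
    nginx.ingress.kubernetes.io/service-upstream: "true"
```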

Architecture

Ingress‑NGINX can be deployed via Deployment + Service (NodePort) or DaemonSet + hostNetwork.

Deployment + Service

Each business line deploys its own isolated Ingress controller as a Deployment exposed through a NodePort Service. Advantages: complete isolation between business lines. Disadvantages: NodePort adds iptables overhead, risks port exhaustion, and increases latency; running a full Ingress controller for each of many small services also wastes resources.

DaemonSet + hostNetwork

A single Ingress pod runs per node using hostNetwork, sharing the Ingress cluster across services. Advantages: lower latency (bypasses iptables/DNAT/conntrack) and better resource utilization. Disadvantages: potential interference between services and need for node planning.
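A minimal sketch of the DaemonSet + hostNetwork layout follows. It is not Vivo's manifest: the names, labels, node selector, and image reference are hypothetical, and controller arguments, RBAC, and probes are omitted for brevity.

```yaml
# Hypothetical DaemonSet: one controller Pod per selected node, on the host network.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      hostNetwork: true                    # NGINX binds directly to the node's ports
      dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS while on the host network
      nodeSelector:
        ingress-node: "true"               # hypothetical label marking planned gateway nodes
      containers:
        - name: controller
          image: registry.example.com/ingress-nginx/controller:TAG   # placeholder image
          ports:
            - containerPort: 80
            - containerPort: 443
```

The nodeSelector is where the "node planning" cost shows up: only labeled nodes run the gateway, and ports 80 and 443 on those nodes must be free.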

Performance Optimizations

To meet AI workload demands, Vivo tuned both NGINX and kernel parameters.

Host NIC and Interrupt Optimization

ethtool -l eth0                   # view NIC queue capabilities
ethtool -L eth0 combined 8        # enable 8 queues
service irqbalance stop
sh set_irq_affinity -X all eth0   # Intel interrupt distribution script

Kernel Tuning

sysctl -w net.core.somaxconn=32768
sysctl -w net.ipv4.ip_local_port_range="1024 65000"
sysctl -w fs.file-max=1048576
sysctl -w net.ipv4.tcp_tw_reuse=1

These settings are applied via an initContainer:

initContainers:
  - name: sysctl
    image: alpine:3.10
    securityContext:
      privileged: true
    command:
    - sh
    - -c
    - sysctl -w net.core.somaxconn=32768; sysctl -w net.ipv4.ip_local_port_range='1024 65000'; sysctl -w fs.file-max=1048576; sysctl -w net.ipv4.tcp_tw_reuse=1

NGINX Connection Settings

keep-alive: "75"
keep-alive-requests: "10000"
upstream-keepalive-connections: "200"
upstream-keepalive-requests: "10000"
upstream-keepalive-timeout: "100"
proxy-connect-timeout: "1"
proxy-read-timeout: "3"
proxy-send-timeout: "3"

Stability Enhancements

Active health checks using nginx_upstream_check_module and a /healthz endpoint replace passive checks, preventing one failing service from affecting others.

upstream ingress-backend {
  server 10.192.168.1 max_fails=0 fail_timeout=10s;
  server 10.192.168.2 max_fails=0 fail_timeout=10s;
  check interval=1000 rise=2 fall=2 timeout=1000 type=http default_down=false;
  check_keepalive_requests 1;
  check_http_send "GET /healthz HTTP/1.0\r\n\r\n";
  check_http_expect_alive http_2xx;
  zone ingress-backend 1M;
}

Zero‑downtime deployments are achieved by integrating OpenResty‑based upstream registration with Kubernetes Pod lifecycle hooks, so that changes to the upstream list are coordinated with Pod startup and shutdown.
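The source does not show how the lifecycle hooks are wired up. As a generic sketch of the idea, the Pod below delays shutdown with a preStop hook so the gateway has time to stop routing to it before the process receives SIGTERM. The sleep is a stand‑in for whatever deregistration call Vivo's OpenResty integration actually makes; the image, port, and timings are hypothetical.

```yaml
# Hypothetical backend Pod: names, image, port, and durations are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: inference-app
spec:
  terminationGracePeriodSeconds: 60   # must exceed the preStop delay plus drain time
  containers:
    - name: app
      image: registry.example.com/inference-app:TAG
      readinessProbe:                 # Pod receives traffic only after /healthz passes
        httpGet:
          path: /healthz
          port: 8080
      lifecycle:
        preStop:
          exec:
            # Stand-in for an explicit deregistration step: wait so the upstream
            # list can drop this Pod before the container is sent SIGTERM.
            command: ["sh", "-c", "sleep 15"]
```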

Validating Webhook

Ingress‑NGINX’s validating webhook checks custom snippets (main, http, location) before applying them, preventing malformed configurations from crashing the controller.

containers:
  - args:
    - --validating-webhook=:9090
    - --validating-webhook-certificate=/usr/local/certificates/validating-webhook.pem
    - --validating-webhook-key=/usr/local/certificates/validating-webhook-key.pem
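The controller flags only start the webhook server; the API server also needs a ValidatingWebhookConfiguration that points at it. The sketch below follows the general shape of the community manifests, but the resource names are hypothetical, the path depends on the controller and Ingress API version in use, and a caBundle matching the certificate above must be supplied.

```yaml
# Hypothetical registration of the admission webhook; adjust names, path, and CA.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: ingress-nginx-admission
webhooks:
  - name: validate.nginx.ingress.kubernetes.io
    rules:
      - apiGroups: ["networking.k8s.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["ingresses"]
    failurePolicy: Fail              # reject the Ingress if validation cannot run
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      # caBundle: <base64 CA certificate>   # required; omitted here on purpose
      service:
        namespace: ingress-nginx
        name: ingress-nginx-controller-admission   # assumed Service forwarding 443 to :9090
        path: /networking/v1/ingresses             # version-dependent; verify for your release
```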

Monitoring & Alerts

Grafana dashboards and internal alerting cover request success rate, latency, CPU/memory usage, pod health, reload failures, etc., enabling rapid issue detection.

Retry Policy

Both the edge NGINX and Ingress‑NGINX retry mechanisms are tightened: the edge NGINX limits retries with proxy_next_upstream_tries, while Ingress‑NGINX disables retries entirely (proxy-next-upstream="off").
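On the Ingress‑NGINX side, this setting lives in the controller's ConfigMap alongside the connection settings listed earlier. A minimal sketch, assuming a ConfigMap name and namespace that match the controller's --configmap flag (both are placeholders here):

```yaml
# Hypothetical ConfigMap: name and namespace must match the controller's --configmap flag.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  # Never pass a failed request on to another upstream Pod.
  proxy-next-upstream: "off"
```

With retries disabled at this layer, a failed request is retried at most as many times as the edge NGINX allows, rather than being multiplied by retries at both layers.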

Future Work

Planned improvements include service‑level rate limiting per Ingress, native gRPC (plaintext HTTP/2) support, enhanced log collection for centralized analysis, and further performance and stability refinements.
