
How Zhihu Migrated 2,000+ Microservices to Istio Service Mesh

Zhihu describes its journey from a custom Kodor RPC framework to a full Istio service‑mesh deployment, detailing the background challenges, Consul‑based service discovery, migration strategy, traffic management, platform tooling, and the numerous operational pitfalls encountered along the way.

dbaplus Community

Background

Zhihu has long run a fully containerized micro‑service architecture with over two thousand services, using an internal RPC framework and the Kodor system for service connectivity. The main pain points were high maintenance cost of basic components, single‑point failures, inconsistent client‑side features (circuit‑breakers, retries), difficulty updating client versions, capability gaps compared with other large providers, and limited integration with cloud‑native open‑source projects.

Service mesh was identified as the way to address these problems: it improves governance with precise circuit breaking, rate limiting, and traffic management; reduces client maintenance; speeds up inter‑service communication; and provides fault injection and dynamic routing.

Kodor System Overview

Each micro‑service in Kodor is fronted by an HAProxy container that records metrics, logs, and implements authentication, rate‑limiting, and black‑listing. This proxy‑based approach is conceptually similar to a service mesh.

Service Discovery & Registration

Consul is used for service discovery. HAProxy nodes are registered as service instances, and service metadata is stored in Consul KV. Clients discover the HAProxy address via Consul, and HAProxy learns upstream nodes through consul‑template.

Migration Plan to Service Mesh

Goals

No code changes for business services.

Rollback capability.

High availability.

No noticeable performance degradation.

Mesh services and Kodor services must inter‑communicate.

Traffic Inter‑connectivity

Two cases are considered:

Caller inside the mesh: if the target is outside, Discovery returns the HAProxy address; if inside, it returns the ServiceIP and the sidecar routes the request.

Caller outside the mesh: the caller continues to use Consul to reach the HAProxy endpoint.

Traffic Management

Sandbox testing is supported by labeling sandbox workloads (e.g., branch=box-xxx) and adding matching rules to VirtualService objects.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sm-verify-title
  namespace: sm-verify
spec:
  gateways:
  - mesh
  - istio-system/svc-ingress
  hosts:
  - sm-verify-title.sm-verify.svc.cluster.local
  http:
  - match:
    - sourceLabels:
        branch: box-10326
    name: box-10326--default
    route:
    - destination:
        host: sm-verify-title--box-10326.sm-verify--box-10326.svc.cluster.local
        subset: default
  - name: master--default
    route:
    - destination:
        host: sm-verify-title--master.sm-verify.svc.cluster.local
        subset: default
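
The subset: default referenced by these routes has to be defined in a DestinationRule for the same host. The article does not show that object; a minimal sketch, assuming the master workloads carry a branch: master pod label (an assumption made by analogy with the sandbox label), might look like this:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: sm-verify-title--master      # hypothetical name
  namespace: sm-verify
spec:
  # Must match the destination host used in the VirtualService route.
  host: sm-verify-title--master.sm-verify.svc.cluster.local
  subsets:
  - name: default                    # the subset the routes refer to
    labels:
      branch: master                 # assumed pod label, not from the source
```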

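Version Rollout & 503 Handling

Subset changes must be applied in dependency order; otherwise traffic can briefly be routed to a subset that does not exist yet (or no longer exists) and fail with 503s. When adding a version subset, update the DestinationRule (DR) first and then the VirtualService (VS); when removing one, update the VS first and then the DR. The istioctl experimental wait command or a fixed delay (e.g., 30 s) can be used to make sure the first change has been distributed before the second is applied. A custom Router CRD is introduced to declare version and routing changes declaratively: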

apiVersion: router.service-mesh.zhihu.com/v1
kind: Router
metadata:
  name: sm-verify-web
  namespace: sm-verify
spec:
  port: 9090
  protocol: http
  subsets:
  - branch: master
    versions:
      v1.0.1: 90
      v1.0.2: 10
  - branch: box-0999
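
How the Router controller expands this object is not shown in the source. As a rough sketch only, the 90/10 split for the master branch could correspond to a DestinationRule and VirtualService pair like the one below. The host name, the subset names (v1-0-1, v1-0-2), and the version pod label are all assumptions made for illustration.

```yaml
# Applied first: defines the subsets the routes will reference.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: sm-verify-web
  namespace: sm-verify
spec:
  host: sm-verify-web.sm-verify.svc.cluster.local   # assumed host name
  subsets:
  - name: v1-0-1               # hypothetical subset name (dots avoided)
    labels:
      version: v1.0.1          # assumed pod label
  - name: v1-0-2
    labels:
      version: v1.0.2
---
# Applied second: splits traffic 90/10 between the two subsets.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sm-verify-web
  namespace: sm-verify
spec:
  hosts:
  - sm-verify-web.sm-verify.svc.cluster.local
  http:
  - name: master
    route:
    - destination:
        host: sm-verify-web.sm-verify.svc.cluster.local
        subset: v1-0-1
        port:
          number: 9090
      weight: 90
    - destination:
        host: sm-verify-web.sm-verify.svc.cluster.local
        subset: v1-0-2
        port:
          number: 9090
      weight: 10
```

The route weights must add up to 100, which mirrors the versions map in the Router spec.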

Migration Steps

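Add the injection label to the workloads that need a sidecar. The label given is istio-injection=enabled; note that in stock Istio this label is set on the namespace, while per‑pod opt‑in uses sidecar.istio.io/inject: "true".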

Create a Router object for each service.

After these steps, migration is as simple as adding the label and redeploying the workload.
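
As an illustration of the labeling step, the sketch below opts a single workload into injection through its pod template. It assumes a recent stock Istio release, where the per‑pod label is sidecar.istio.io/inject; the Deployment name, app label, and image are hypothetical, and Zhihu's actual labeling convention may differ.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sm-verify-web                # hypothetical workload
  namespace: sm-verify
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sm-verify-web
  template:
    metadata:
      labels:
        app: sm-verify-web
        # Per-pod opt-in to sidecar injection (stock Istio label).
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: web
        image: registry.example.com/sm-verify-web:v1.0.1   # placeholder image
        ports:
        - containerPort: 9090
```

Because injection happens at pod creation, the label only takes effect once the workload is redeployed; changing it back and redeploying is the service‑level rollback path.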

Rollback

System‑wide rollback: if Istio suffers a large‑scale failure, Discovery is downgraded to a pure Consul proxy.

Service‑level rollback: modify the label and redeploy.

Service‑Mesh Platform

A custom platform was built to let developers modify mesh configuration (routing, rate‑limiting, black‑listing, auth, traffic mirroring, load‑balancing, circuit‑breakers, connection pools, automatic retries, service discovery management, etc.) without touching databases. The platform uses an IstioFilter CRD to patch VirtualService and DestinationRule resources.

Operational Optimizations

Sidecar Performance

With a large number of services, sidecar memory and CPU usage spike because Istio by default pushes the full cluster configuration to every sidecar. To mitigate this, sidecar injection is done per service, and configuration push behavior is tuned via environment variables (PILOT_XDS_SEND_TIMEOUT, PILOT_FILTER_GATEWAY_CLUSTER_CONFIG, PILOT_ENABLE_FLOW_CONTROL, etc.).

Sidecar Scoped Configuration

With the Sidecar CRD, each workload declares only the dependencies it needs, which reduces unnecessary configuration pushes. Example for the Bookinfo page service:

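apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: page
spec:
  # Apply this scope only to pods labeled app=page.
  workloadSelector:
    labels:
      app: page
  egress:
  # Each entry is <namespace>/<dnsName>; "." means the Sidecar's own namespace.
  # Note: Istio documents dnsName as an FQDN, so short names such as "reviews"
  # may need to be expanded to the full service hostname.
  - hosts:
    - "./reviews"
    - "./details"
    - "istio-system/*"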

Istiod Performance Tuning

Several istiod environment variables (PILOT_DEBOUNCE_AFTER, PILOT_DEBOUNCE_MAX, PILOT_PUSH_THROTTLE, etc.) are adjusted to reduce push volume, avoid overly frequent pushes, and improve throughput.
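
One way to set such variables is through the istiod component of an IstioOperator resource. The sketch below shows only the mechanism; the values are illustrative placeholders, not Zhihu's settings, and suitable numbers depend on cluster size and how often configuration changes.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
        # Quiet period after a config change before a push is triggered.
        - name: PILOT_DEBOUNCE_AFTER
          value: "500ms"             # placeholder value
        # Upper bound on how long changes may keep being batched.
        - name: PILOT_DEBOUNCE_MAX
          value: "10s"               # placeholder value
        # Limit on concurrent pushes to proxies.
        - name: PILOT_PUSH_THROTTLE
          value: "50"                # placeholder value
```

Longer debounce windows trade configuration freshness for fewer, larger pushes.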

Common Pitfalls & Fixes

Tracing Integration

Envoy does not support Jaeger’s UDP protocol, so spans are exported over the OpenCensus agent protocol to OpenTelemetry instead, configured via the IstioOperator.
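
The source does not include the tracing configuration. A minimal sketch of an OpenCensus extension provider in the IstioOperator mesh config is shown below; it assumes an Istio version that still ships the opencensus provider, and the collector service name and namespace are hypothetical.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultProviders:
      tracing:
      - opencensus
    extensionProviders:
    - name: opencensus
      opencensus:
        # Hypothetical collector address; 55678 is the conventional
        # OpenCensus receiver port.
        service: otel-collector.observability.svc.cluster.local
        port: 55678
        context:
        - W3C_TRACE_CONTEXT
```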

Log Overhead

Sidecar access logs generate a massive volume of data; the solution is to disable them by default and enable them dynamically when needed.
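
The article does not say which mechanism is used for this. One option in Istio versions that support access logging through the Telemetry API is to leave mesh‑wide access logging off and switch it on for a single workload while debugging. The resource name and the app label below are hypothetical.

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: debug-access-log             # hypothetical name
  namespace: sm-verify
spec:
  # Limit the setting to one workload instead of the whole namespace.
  selector:
    matchLabels:
      app: sm-verify-web             # assumed pod label
  accessLogging:
  - providers:
    - name: envoy                    # Istio's built-in Envoy file access log provider
```

Deleting the resource turns logging off again without redeploying the workload.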

Kubernetes Challenges

IPVS latency spikes due to kernel timer overload – resolved in newer kernels.

DNS latency after switching to ClusterFirst – mitigated by enabling NodeLocalDNS.

Compatibility Issues

HTTP/1.0 support – set PILOT_HTTP10=true on istiod.

Service startup race – set holdApplicationUntilProxyStarts: true, which uses a postStart hook to hold the application container until the sidecar is ready.

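Connection exhaustion – the sidecar connects to the local application from the fixed source address 127.0.0.6, so the number of concurrent connections is capped by the available ports; prefer HTTP/2 (gRPC), which multiplexes many requests over few connections.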

Host header rewriting – solved with a Lua filter that moves the original host to a custom header and rewrites it to the service VIP.

Default retry policy causing traffic amplification – custom retry policies are applied to gRPC services.
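
A per‑service retry policy is set on the route in a VirtualService. The sketch below is illustrative only: the service name is hypothetical, and the attempt count, timeout, and retryOn conditions are assumptions rather than the values Zhihu uses.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sm-verify-grpc               # hypothetical gRPC service
  namespace: sm-verify
spec:
  hosts:
  - sm-verify-grpc.sm-verify.svc.cluster.local
  http:
  - route:
    - destination:
        host: sm-verify-grpc.sm-verify.svc.cluster.local
    retries:
      attempts: 1                    # at most one retry per request
      perTryTimeout: 2s              # placeholder timeout
      # Retry only when the request cannot have reached the application.
      retryOn: connect-failure,refused-stream
```

Restricting retryOn to connection‑level failures keeps retries from multiplying load on a backend that is already slow or erroring; setting attempts: 0 disables retries for the route entirely.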

Deployment Considerations

Each Istio cluster needs a root certificate; using a “global root” that signs the per‑cluster CA certificates simplifies a future move to multi‑cluster. Always install with an explicitly pinned version so that canary upgrades stay manageable.

Conclusion

Service Mesh provides an elegant, cloud‑native solution for micro‑service governance. Istio, as the de‑facto standard, now powers roughly a quarter of Zhihu’s services, including many critical S‑level services, and the migration continues to accelerate. Future work aims to unify service mesh with distributed runtimes (e.g., DB Mesh) to offer language‑agnostic, platform‑agnostic business capabilities.

