How Pinterest Scales Service Discovery with ZooKeeper—and Overcomes Its Pitfalls

Pinterest shares its real‑world experience using ZooKeeper for service discovery and dynamic configuration, detailing the challenges of connection overload, transaction spikes, network partitions, and human error, and explains the multi‑step solutions that improved resilience and reduced operational risk.

Java High-Performance Architecture

1 Service Discovery

Pinterest runs many servers that call backend services; each service is replicated across multiple machines for load balancing and fault tolerance. To keep the list of available instances up to date, each service instance registers an ephemeral node in ZooKeeper, and callers read and watch those nodes to discover live instances.

If a service instance fails, its ZooKeeper session expires, its ephemeral node is deleted, and watching callers are notified; new instances register new nodes, keeping the service list current.
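The register, expire, and notify cycle can be sketched with a toy in-memory registry. This is not a ZooKeeper client: `ToyRegistry`, the session IDs, and the addresses are all hypothetical, and the callbacks here stay registered, whereas a standard ZooKeeper watch fires once and must be set again.

```python
from collections import defaultdict


class ToyRegistry:
    """In-memory stand-in for a coordination service; not a ZooKeeper client."""

    def __init__(self):
        self._nodes = {}                    # node path -> owning session id
        self._watchers = defaultdict(list)  # service name -> callbacks

    def register(self, session_id, service, address):
        # An ephemeral node belongs to the session that created it.
        self._nodes[f"/{service}/{address}"] = session_id
        self._notify(service)

    def expire_session(self, session_id):
        # When a session ends, every node it owns is removed.
        dead = [p for p, s in self._nodes.items() if s == session_id]
        for path in dead:
            del self._nodes[path]
        for service in {p.split("/")[1] for p in dead}:
            self._notify(service)

    def instances(self, service):
        prefix = f"/{service}/"
        return sorted(p[len(prefix):] for p in self._nodes if p.startswith(prefix))

    def watch(self, service, callback):
        self._watchers[service].append(callback)

    def _notify(self, service):
        for callback in self._watchers[service]:
            callback(self.instances(service))


registry = ToyRegistry()
seen = []  # snapshots of the instance list as a caller would observe them
registry.watch("search", seen.append)

registry.register("session-a", "search", "10.0.0.1:9090")
registry.register("session-b", "search", "10.0.0.2:9090")
registry.expire_session("session-a")  # instance A crashes; its session ends
```

After the last call, the caller's most recent snapshot contains only the surviving instance, without any explicit deregistration by the failed one.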

2 Dynamic Configuration

Pinterest’s distributed database is sharded by user ID. The front‑end Data Service layer, composed of many machines, must know which shard holds a particular user’s data. The mapping between user IDs and database shards is configuration data that changes when new users are added.

To propagate configuration changes quickly, Pinterest stores this mapping in ZooKeeper. Data Service nodes watch the ZooKeeper node and update their local configuration as soon as it changes.
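As a rough illustration of what a Data Service node might do when the watched mapping changes, the sketch below keeps a local user-to-shard lookup and swaps it on update. The range-based layout, the `ShardMap` class, and the shard names are assumptions made for the example; the source does not describe the actual format.

```python
import bisect


class ShardMap:
    """Maps user IDs to shard names via sorted range starts (hypothetical layout)."""

    def __init__(self, ranges):
        self.update(ranges)

    def update(self, ranges):
        # ranges: list of (first_user_id, shard_name) pairs.
        # Would be called from the watch callback when the config node changes.
        ordered = sorted(ranges)
        self._starts = [start for start, _ in ordered]
        self._shards = [shard for _, shard in ordered]

    def shard_for(self, user_id):
        # Find the last range whose start is <= user_id.
        index = bisect.bisect_right(self._starts, user_id) - 1
        if index < 0:
            raise KeyError(f"no shard covers user {user_id}")
        return self._shards[index]


shard_map = ShardMap([(0, "db001"), (1_000_000, "db002")])
before = shard_map.shard_for(2_500_000)

# A configuration change adds a shard for newer users.
shard_map.update([(0, "db001"), (1_000_000, "db002"), (2_000_000, "db003")])
after = shard_map.shard_for(2_500_000)
```

Here the same user ID resolves to `db002` before the update and to `db003` afterwards, which is why every node needs the new mapping quickly.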

Factors That Caused ZooKeeper Problems

1 Too Many Connections

At Pinterest's scale, the sheer number of ZooKeeper client connections can slow ZooKeeper down or even make it unavailable.

2 Excessive Transaction Volume

Restarting many servers at once generates a burst of registration events, each of which is a ZooKeeper write, and the resulting write traffic can overwhelm the ensemble.

3 Network Partitions

Rare network failures can split the ZooKeeper ensemble into separate partitions. A partition that lacks a majority of the servers cannot process updates, which disrupts the services that depend on it.

4 Human Error

Operational mistakes have occasionally rendered ZooKeeper unusable.

Attempted Solutions

1 Increase Capacity

Adding more ZooKeeper servers helped only up to about ten nodes; beyond that, write performance degraded, because every write must be acknowledged by a majority of the voting servers.
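The arithmetic behind majority quorums can be sketched as follows. The function names are made up for illustration; the point is only that each added voting server raises the number of acknowledgements a write needs.

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of a voting ensemble."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority remains."""
    return voters - quorum_size(voters)


for voters in (3, 5, 7, 9, 11):
    print(voters, quorum_size(voters), tolerated_failures(voters))
# prints:
# 3 2 1
# 5 3 2
# 7 4 3
# 9 5 4
# 11 6 5
```

Going from 5 to 11 voters raises the tolerated failures from 2 to 5, but it also doubles the acknowledgements required per write from 3 to 6.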

2 Add Observers

ZooKeeper supports an observer role: observers receive committed updates and serve reads, but they do not vote in leader election or on write proposals, so adding them does not enlarge the quorum. Pinterest converted several followers to observers, lowering read-side load, though write performance saw little improvement.

3 Use Multiple ZooKeeper Ensembles

Separate ensembles were assigned to different functions (e.g., deployment system, HBase). This mitigated load but did not fully solve the problem.

4 Fallback to Static Files

Static files were used as a backup for service lists and configuration. While functional, managing large static files became cumbersome.
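A minimal sketch of the fallback idea, assuming the static copy is a JSON list and that a failed lookup surfaces as an `OSError` subclass. Both are assumptions for the example; `load_instances` and the file name are hypothetical, and a real client library would raise its own exception types.

```python
import json
import os
import tempfile


def load_instances(fetch_live, fallback_path):
    """Return the live instance list, or the static copy if the lookup fails."""
    try:
        return fetch_live()
    except OSError:  # assumption: lookup failures surface as OSError subclasses
        with open(fallback_path, encoding="utf-8") as handle:
            return json.load(handle)


def unreachable():
    # Stand-in for a lookup against an unavailable coordination service.
    raise ConnectionError("coordination service unavailable")


# Prepare a static backup file in a scratch directory.
fallback_dir = tempfile.mkdtemp()
fallback_path = os.path.join(fallback_dir, "search.json")
with open(fallback_path, "w", encoding="utf-8") as handle:
    json.dump(["10.0.0.1:9090", "10.0.0.2:9090"], handle)

from_live = load_instances(lambda: ["10.0.0.3:9090"], fallback_path)
from_file = load_instances(unreachable, fallback_path)
```

The maintenance burden the text mentions shows up here too: something has to keep `search.json` reasonably fresh, or the fallback serves stale data.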

Current Best Solution

Each server now runs a daemon that maintains a single ZooKeeper connection. The daemon watches for changes (service list or configuration) and writes updates to a local file. All applications on that server read the local file instead of connecting directly to ZooKeeper.

This architecture dramatically reduces the number of ZooKeeper connections, isolates failures to the daemon process, and allows custom validation logic before writing new data, improving overall fault tolerance.
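The daemon's write path might look roughly like the sketch below: validate the update, then replace the local file atomically so readers never see a partial write. The validation rule (reject an empty list), the JSON format, and the function names are assumptions for illustration, not Pinterest's implementation.

```python
import json
import os
import tempfile


def is_valid(instances):
    """Hypothetical rule: reject updates that would leave callers with no instances."""
    return isinstance(instances, list) and len(instances) > 0


def publish(instances, path):
    """Write the update to the local file atomically; return False if rejected."""
    if not is_valid(instances):
        return False  # keep the last known-good file in place
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(instances, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Rename within one directory: readers see the old or new file, never a partial one.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True


def read_local(path):
    """What an application does instead of opening its own ZooKeeper connection."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


workdir = tempfile.mkdtemp()
local_file = os.path.join(workdir, "search.json")
accepted = publish(["10.0.0.1:9090"], local_file)
rejected = publish([], local_file)  # e.g. a bad or truncated update
current = read_local(local_file)
```

Because the rejected update never touches the file, applications keep reading the last good list even when the upstream data is wrong or ZooKeeper is unreachable.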

For more details, see the original article: https://engineering.pinterest.com/blog/zookeeper-resilience-pinterest

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
