Efficient Ops
Sep 16, 2025 · Backend Development

Why Tomcat Thread‑Pool Saturation Crashed Our Service and How to Avoid It

A detailed post‑mortem explains how a sudden traffic surge, insufficient pod count, and a custom thread‑pool bottleneck caused Tomcat thread‑pool saturation, health‑check failures, and a zone‑wide outage, and offers concrete lessons on capacity planning, monitoring, and safe coding practices.

CapacityPlanning · Java · Performance
28 min read
dbaplus Community
Feb 28, 2023 · Operations

How Container SRE at DeWu Boosts Reliability: Practices, Metrics, and Incident Playbooks

This article details DeWu's container SRE approach, covering SRE fundamentals, on‑call response, SLO/SLA design, change management, capacity planning, kernel‑parameter monitoring, security safeguards, and a real‑world incident analysis, providing actionable insights for building resilient cloud‑native services.

CapacityPlanning · IncidentResponse · Kubernetes
24 min read
DataFunTalk
Oct 27, 2020 · Databases

Didi's Large‑Scale Elasticsearch Upgrade: Architecture, Migration Strategy, and Performance Gains

This article systematically details Didi's migration of over 30 Elasticsearch clusters, 3,500 nodes and 8 PB of data from version 2.3.3 to 6.6.1, covering background, problem analysis, multi‑version architecture redesign, capacity planning, tiered storage, FastIndex, query replay, upgrade pitfalls, and the resulting cost reduction and performance improvements.

CapacityPlanning · Elasticsearch · Performance
15 min read