Container SRE Practices and Incident Management at DeWu

DeWu’s container SRE team combines software‑engineered reliability with routine operations, using defined on‑call roles, SLO/SLA targets, progressive change management, capacity forecasting, four‑metric monitoring, MTTR/MTTF tracking, kernel‑parameter tuning, and namespace‑protected security policies to swiftly resolve incidents such as Redis latency spikes.

DeWu Technology

Feb 8, 2023

This article introduces the concept of Site Reliability Engineering (SRE) and shares concrete practices of the DeWu container SRE team.

SRE Definition : An SRE is a stability engineer who applies software engineering to solve complex operational problems. The role is split roughly 50/50 between routine operations and building software that keeps services stable and scalable, including monitoring, logging, alerting, and performance tuning.

On‑call and Emergency Response : The on‑call model typically involves two engineers (one senior, one shadow). Key roles include Incident Commander (IC), Communication Lead (CL), Operations Lead (OL), and Incident Responders (IR) such as SREs, developers, DBAs, and QA. Clear escalation paths and step‑by‑step incident handling procedures are defined.

SLO / SLA : A Service Level Indicator (SLI) is a measured quantity such as latency, throughput, error rate, or availability. A Service Level Objective (SLO) sets the target value for a specific SLI, while a Service Level Agreement (SLA) is the formal contract with users. The article stresses balancing reliability gains against the cost of higher availability, which grows non‑linearly.
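The trade-off is easier to see when an availability SLO is converted into an error budget. The following is a minimal Python sketch, assuming a 30-day window and a simple availability ratio; the function name and the window length are illustrative and do not come from the DeWu article.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage that an availability SLO tolerates per window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


# Compare the budget for two, three, and four nines over 30 days.
for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {allowed_downtime_minutes(slo):.2f} min per 30 days")
```

Each additional nine shrinks the budget by a factor of ten, which is one way to read the non-linear cost of higher availability.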

Change Management : Best practices include progressive releases, rapid detection of failures, and safe rollbacks.
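A progressive release with fast failure detection and rollback can be sketched as a loop over traffic stages. This Python sketch is hypothetical: the stage percentages, the `error_rate_at` callback, and the threshold are assumptions for illustration, not the team's actual release tooling.

```python
from typing import Callable, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, int]:
    """Advance through traffic percentages and stop at the first unhealthy stage.

    Returns the outcome and the last traffic percentage that was healthy.
    """
    last_healthy = 0
    for percent in stages:
        # error_rate_at stands in for whatever monitoring query judges the stage.
        if error_rate_at(percent) > max_error_rate:
            return ("rolled_back", last_healthy)
        last_healthy = percent
    return ("completed", last_healthy)


# Example: the release misbehaves once it reaches 50% of traffic.
outcome = progressive_rollout(
    [1, 10, 50, 100],
    lambda percent: 0.05 if percent >= 50 else 0.001,
    max_error_rate=0.01,
)
print(outcome)
```

Keeping the early stages small limits how many users see a bad release before the rollback triggers.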

Capacity Planning : Emphasizes accurate demand‑forecast models, planning horizons that are longer than the lead time for acquiring resources, and periodic stress testing.
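The relationship between the forecast and the resource lead time can be made concrete with a small calculation. This Python sketch assumes linear demand growth, which is a deliberate simplification; the function names, units, and numbers are illustrative only.

```python
def days_until_exhausted(current: float, capacity: float, daily_growth: float) -> float:
    """Days until usage reaches capacity under linear growth."""
    if daily_growth <= 0:
        return float("inf")  # flat or shrinking demand never exhausts capacity
    return max(0.0, (capacity - current) / daily_growth)


def must_order_now(
    current: float,
    capacity: float,
    daily_growth: float,
    procurement_days: float,
    safety_days: float = 0.0,
) -> bool:
    """True when the remaining runway is no longer than the acquisition lead time."""
    runway = days_until_exhausted(current, capacity, daily_growth)
    return runway <= procurement_days + safety_days


# 600 of 1000 units in use, growing by 10 units per day: 40 days of runway.
print(days_until_exhausted(600, 1000, 10))
print(must_order_now(600, 1000, 10, procurement_days=45))
```

If procurement takes longer than the remaining runway, the order is already late, which is why the forecast has to look further ahead than the acquisition lead time.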

Monitoring System : Describes the four golden metrics—latency, traffic, errors, and saturation—and explains why each is critical for building effective alerting.
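As a rough illustration, the four metrics can be derived from a window of request samples. This Python sketch makes several assumptions that are not in the article: each request is a `(latency_ms, ok)` pair, latency is reported as a nearest-rank p99, and saturation is approximated as traffic divided by a known request-rate capacity.

```python
import math
from typing import Dict, List, Tuple


def golden_signals(
    requests: List[Tuple[float, bool]],
    window_s: float,
    capacity_rps: float,
) -> Dict[str, float]:
    """Compute latency, traffic, errors, and saturation for one time window."""
    if not requests:
        return {"latency_p99_ms": 0.0, "traffic_rps": 0.0, "error_rate": 0.0, "saturation": 0.0}
    latencies = sorted(latency for latency, _ in requests)
    rank = math.ceil(0.99 * len(latencies))  # nearest-rank percentile, 1-based
    traffic = len(requests) / window_s
    errors = sum(1 for _, ok in requests if not ok)
    return {
        "latency_p99_ms": latencies[rank - 1],
        "traffic_rps": traffic,
        "error_rate": errors / len(requests),
        "saturation": traffic / capacity_rps,  # assumed proxy for saturation
    }


# Made-up sample: four requests in a two-second window, one of them failed.
sample = [(10.0, True), (20.0, True), (30.0, True), (400.0, False)]
print(golden_signals(sample, window_s=2.0, capacity_rps=10.0))
```

In practice saturation is usually measured on the most constrained resource (CPU, memory, connections), so the ratio used here is only a placeholder.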

Reliability Measurement : Highlights MTTR (Mean Time To Recovery) and MTTF (Mean Time To Failure) as key indicators; automation reduces MTTR.
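Both indicators can be computed from a list of incident intervals. This is a minimal Python sketch under stated assumptions: incidents are `(failed_at, recovered_at)` pairs in hours, they do not overlap, and MTTF is taken as the mean uptime before each failure, measured from an assumed observation start.

```python
from typing import List, Tuple

Incident = Tuple[float, float]  # (failed_at, recovered_at), same time unit


def mttr(incidents: List[Incident]) -> float:
    """Mean time to recovery: average outage duration."""
    return sum(end - start for start, end in incidents) / len(incidents)


def mttf(incidents: List[Incident], observation_start: float) -> float:
    """Mean time to failure: average uptime preceding each failure."""
    uptimes = []
    last_recovered = observation_start
    for start, end in sorted(incidents):
        uptimes.append(start - last_recovered)
        last_recovered = end
    return sum(uptimes) / len(uptimes)


# Made-up history: two incidents, lasting 2 h and 1 h.
history = [(100.0, 102.0), (200.0, 201.0)]
print(mttr(history), mttf(history, observation_start=0.0))
```

Automation shortens the `recovered_at - failed_at` term directly, which is why it shows up in MTTR rather than MTTF.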

Case Study – Latency Issue : A Redis response‑time spike caused request timeouts. Investigation steps included:

Network check – network latency stayed within a few milliseconds (2 ms → 4 ms).

Packet drop analysis revealed high drop counts and TCP memory pressure.

IO wait times were unusually high, prompting deeper kernel inspection. The query used to check for high IO wait on the affected host over the previous four hours:

select * from cpus where time > now() - 4h and host = 'i-bp11f8g5h7oofu5pqgr8' and iowait > 50.0

Root cause was TCP memory exhaustion. The team consulted the kernel source (tcp_input.c) and observed OOM logs.

Remediation involved increasing the TCP memory limits. Shell commands used:

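# view the current tcp_mem, tcp_rmem, and tcp_wmem parameters
sysctl -a | grep -E 'tcp_(mem|rmem|wmem)'

# increase TCP memory (tcp_mem is counted in pages; tcp_rmem and tcp_wmem are in bytes)
echo "net.ipv4.tcp_mem = 1104864 5872026 8388608" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem = 4096 25165824 33554432" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem = 4096 25165824 33554432" >> /etc/sysctl.conf

# load the new values; appending to /etc/sysctl.conf alone does not change the running kernel
sysctl -p

# verify the active tcp_mem value
cat /proc/sys/net/ipv4/tcp_mem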
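To judge how close a host is to the limit, the TCP page count in /proc/net/sockstat can be compared with the three tcp_mem thresholds (low, pressure, high). The Python sketch below works on the text of those two files; the state names and the sample sockstat line are made up for illustration, and the classification is a simplification of the kernel's actual pressure handling.

```python
from typing import Tuple


def tcp_pages_in_use(sockstat_text: str) -> int:
    """Return the 'mem' field (in pages) from the 'TCP:' line of /proc/net/sockstat."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()
            return int(fields[fields.index("mem") + 1])
    raise ValueError("no TCP line found in sockstat text")


def tcp_memory_state(sockstat_text: str, tcp_mem_text: str) -> Tuple[int, str]:
    """Classify current TCP memory use against the tcp_mem thresholds."""
    _low, pressure, high = (int(value) for value in tcp_mem_text.split())
    pages = tcp_pages_in_use(sockstat_text)
    if pages >= high:
        state = "exhausted"
    elif pages >= pressure:
        state = "pressure"
    else:
        state = "ok"
    return pages, state


# Made-up sockstat content, checked against the tcp_mem values set above.
SAMPLE_SOCKSTAT = "sockets: used 285\nTCP: inuse 8 orphan 0 tw 2 alloc 12 mem 6000000\nUDP: inuse 4 mem 2\n"
print(tcp_memory_state(SAMPLE_SOCKSTAT, "1104864\t5872026\t8388608"))
```

On a Linux host the two inputs would come from reading /proc/net/sockstat and /proc/sys/net/ipv4/tcp_mem.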

Additional diagnostics to locate offending containers and processes:

# find which container owns a given host PID (4078683 here)
for i in $(docker ps | grep Up | awk '{print $1}'); do echo && docker top "$i" && echo "ID=$i"; done | grep -A 15 4078683

# list processes holding more than 1000 open fds (prints "pid count")
for pid in $(ls -1 /proc/ | grep -E '^[0-9]+$'); do pnum=$(ls -1 /proc/${pid}/fd/ 2>/dev/null | wc -l); if [ "$pnum" -gt 1000 ]; then echo "${pid} ${pnum}"; fi; done

After fixing the TCP parameters, the service recovered within 30 minutes.

Kernel Parameter Monitoring & Optimization : The team cataloged 55 kernel metrics (e.g., vm.min_free_kbytes, fs.file-max, net.ipv4.tcp_rmem) and built automated collection via node‑exporter extensions and custom scripts. Visual dashboards and periodic audits ensure parameters stay within recommended ranges.
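A periodic audit of this kind can be reduced to comparing collected values against a table of recommended ranges. The Python sketch below is hypothetical: the ranges are placeholders rather than DeWu's recommendations, and only single-valued parameters are handled.

```python
from typing import Dict, List, Tuple


def sysctl_path(name: str) -> str:
    """Map a sysctl name such as 'vm.min_free_kbytes' to its /proc/sys path."""
    return "/proc/sys/" + name.replace(".", "/")


def audit(
    current: Dict[str, int],
    recommended: Dict[str, Tuple[int, int]],
) -> List[str]:
    """Return one finding per parameter that is missing or out of range."""
    findings = []
    for name, (lowest, highest) in sorted(recommended.items()):
        value = current.get(name)
        if value is None:
            findings.append(f"{name}: not collected")
        elif not lowest <= value <= highest:
            findings.append(f"{name}: {value} outside [{lowest}, {highest}]")
    return findings


# Placeholder ranges for illustration only.
RECOMMENDED = {
    "fs.file-max": (100000, 10000000),
    "vm.min_free_kbytes": (65536, 1048576),
}
print(audit({"fs.file-max": 1000}, RECOMMENDED))
```

Multi-valued parameters such as net.ipv4.tcp_rmem would need a per-field range, which this sketch leaves out.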

Container Security – Namespace Protection : Implements hard and soft delete policies. Critical namespaces (e.g., kube-system) are locked from deletion. Non‑critical namespaces use a soft‑delete mechanism that counts delete attempts and only allows removal after a threshold. Webhook interceptors enforce these rules, and a “control label” can downgrade checks for bulk deletions.
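The decision a webhook makes on a namespace DELETE request can be sketched as a pure function. Everything named here is an assumption: the label key, the threshold of three attempts, the in-memory counter, and the choice that the control label bypasses only the soft-delete counter and never the hard lock.

```python
from typing import Dict, Mapping, Tuple

PROTECTED_NAMESPACES = frozenset({"kube-system"})
CONTROL_LABEL = "example.com/allow-bulk-delete"  # hypothetical label key


def review_namespace_delete(
    namespace: str,
    labels: Mapping[str, str],
    attempts: Dict[str, int],
    threshold: int = 3,
) -> Tuple[bool, str]:
    """Decide whether a namespace deletion request should be admitted."""
    if namespace in PROTECTED_NAMESPACES:
        return False, "hard protection: namespace is locked"
    if labels.get(CONTROL_LABEL) == "true":
        return True, "control label present: soft-delete check skipped"
    # Soft protection: count this attempt and admit only at the threshold.
    attempts[namespace] = attempts.get(namespace, 0) + 1
    if attempts[namespace] < threshold:
        return False, f"soft protection: attempt {attempts[namespace]} of {threshold}"
    del attempts[namespace]
    return True, "soft protection: threshold reached"


counter: Dict[str, int] = {}
for _ in range(3):
    print(review_namespace_delete("dev", {}, counter))
```

A real webhook would need to persist the counter outside the process, for example as an annotation on the namespace, so that restarts do not reset it.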

Ingress Configuration Validation : Webhooks reject insecure configurations such as wildcard hosts (rule.host = '*') to prevent accidental production outages.
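The wildcard check itself is small. This Python sketch assumes the Ingress object arrives as a plain dictionary with the usual `spec.rules[].host` layout and covers only the `'*'` case mentioned above; the function name and error wording are illustrative.

```python
from typing import Any, Dict, List


def validate_ingress(ingress: Dict[str, Any]) -> List[str]:
    """Return a list of rejection reasons; an empty list means the object passes."""
    errors = []
    rules = ingress.get("spec", {}).get("rules") or []
    for index, rule in enumerate(rules):
        host = (rule.get("host") or "").strip()
        if host == "*":
            errors.append(f"spec.rules[{index}].host: wildcard host '*' is not allowed")
    return errors


# Made-up Ingress objects: one acceptable, one with a wildcard host.
good = {"spec": {"rules": [{"host": "shop.example.com"}]}}
bad = {"spec": {"rules": [{"host": "shop.example.com"}, {"host": "*"}]}}
print(validate_ingress(good), validate_ingress(bad))
```

A validating webhook would return these reasons in its denial response so the person applying the manifest sees exactly which rule was rejected.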

Summary : The DeWu container SRE team applies systematic monitoring, incident response, capacity planning, and security controls to maintain high service reliability. Their practices illustrate how SRE principles can be concretely applied in a cloud‑native container environment.
