How Container SRE at DeWu Boosts Reliability: Practices, Metrics, and Incident Playbooks

This article details DeWu's container SRE approach, covering SRE fundamentals, on‑call response, SLO/SLA design, change management, capacity planning, kernel‑parameter monitoring, security safeguards, and a real‑world incident analysis, providing actionable insights for building resilient cloud‑native services.

SRE Definition and Scope

Site Reliability Engineering (SRE) blends software engineering and operations. Practitioners split their time roughly 50/50 between operational work (monitoring, logging, alerting, performance tuning) and building software that improves service stability and scalability.

On‑Call and Incident Response

Each on‑call rotation consists of at most two engineers (a senior and a junior shadow). The response framework defines clear escalation paths and roles:

Incident Commander (IC): coordinates the response without executing tasks.

Communication Lead (CL): gathers information and communicates status.

Operations Lead (OL): directs execution of runbooks.

Incident Responders (IR): engineers, developers, DBAs, and QA who perform the actual remediation.

Typical workflow:

Monitor services, detect latency or error spikes, and collect incident data.

Analyze root cause, then apply fixes via scripts, code changes, or automation.

SLO / SLA Design

Key concepts:

Absolute 100 % uptime is unattainable; focus on user‑centric Service Level Indicators (SLIs) such as latency, throughput, error rate, availability, durability.

Service Level Objectives (SLOs) set target values for specific SLIs.

Service Level Agreements (SLAs) formalize expectations with customers and may include penalties.

Raising SLO targets incurs non‑linear cost (each additional "nine" is disproportionately more expensive), so balance cost against benefit; a quick error‑budget calculation is sketched below.
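As a rough illustration (the SLO value here is an example, not a DeWu figure), the error budget implied by an availability SLO can be computed directly:

# Error budget for a hypothetical 99.95% availability SLO over a 30-day window
awk 'BEGIN { slo = 99.95; budget = (100 - slo) / 100;
             printf "error budget = %.2f%% of requests, or %.1f minutes of downtime per 30 days\n",
                    budget * 100, budget * 30 * 24 * 60 }'
# => error budget = 0.05% of requests, or 21.6 minutes of downtime per 30 days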

Change Management

Approximately 70 % of production incidents stem from deployment changes. Recommended practices:

Use progressive rollout mechanisms (canary, blue‑green, feature flags).

Detect regressions quickly with automated health checks and alerting.

Provide safe rollback paths (e.g., versioned deployments, immutable images); a minimal rollout/rollback sketch follows this list.
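A minimal sketch of a progressive rollout and rollback using plain Kubernetes primitives (the deployment and image names are placeholders; DeWu's actual pipeline may rely on canary or blue‑green tooling instead):

# Roll out the new image and wait for the new ReplicaSet to become healthy
kubectl set image deployment/order-api app=registry.example.com/order-api:v1.8.0
kubectl rollout status deployment/order-api --timeout=300s
# If automated health checks or alerts fire, return to the previous revision
kubectl rollout undo deployment/order-api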

Capacity Planning

Essential steps:

Maintain a demand‑forecast model whose horizon extends beyond the lead time needed to acquire new resources.

Track non‑natural demand sources (marketing campaigns, seasonal spikes).

Run periodic stress tests to map raw resource limits to business capacity (see the load‑test sketch below).
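A load‑test sketch, assuming wrk is installed and using a placeholder endpoint: the idea is to find the QPS at which the latency SLO starts to break and treat that as the business capacity of the current resource pool.

# Drive a fixed load against the service and record the latency distribution
wrk -t8 -c200 -d120s --latency http://order-api.internal/ping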

Four Golden Monitoring Metrics

Effective monitoring and alerting focus on four dimensions:

Latency – measured in milliseconds; affected by packet loss, congestion, jitter.

Traffic – workload pressure expressed as QPS/TPS.

Error – service error rate, including explicit HTTP errors and implicit content errors.

Saturation – resource utilization (CPU, memory, disk, I/O) indicating how close the service is to overload. Example queries for all four signals are sketched after this list.
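Example PromQL queries for the four signals, issued against the Prometheus HTTP API (the metric names follow common client‑library conventions and depend on how services are actually instrumented):

PROM=http://prometheus.example.com:9090/api/v1/query
# Latency: p99 request duration
curl -sG "$PROM" --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
# Traffic: requests per second
curl -sG "$PROM" --data-urlencode 'query=sum(rate(http_requests_total[5m]))'
# Errors: fraction of 5xx responses
curl -sG "$PROM" --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# Saturation: average CPU utilization across nodes
curl -sG "$PROM" --data-urlencode 'query=1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))'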

Reliability Measurement

Reliability is expressed as Mean Time To Failure (MTTF) and Mean Time To Recovery (MTTR). Automation reduces MTTR, thereby improving overall availability.
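As a quick worked example (standard reliability arithmetic with made‑up numbers, not DeWu figures), availability ≈ MTTF / (MTTF + MTTR), so shortening MTTR directly raises availability:

# 720 hours between failures, 30 minutes to recover
awk 'BEGIN { mttf = 720; mttr = 0.5; printf "availability = %.4f%%\n", 100 * mttf / (mttf + mttr) }'
# => availability = 99.9306%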

Real‑World Container Incident: Redis Latency Spike

Background: A service experienced Redis response‑time spikes, causing timeouts across several pods.

Root‑cause analysis steps:

Rule out network latency (observed 2 ms → 4 ms, negligible).

Identify abnormal packet drops and high tcpofo / tcprcvq counters, indicating TCP memory limits.
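One way to read these drop counters on the host (counter names can differ slightly across kernel versions):

nstat -az | grep -E 'TcpExtTCPOFODrop|TcpExtTCPRcvQDrop'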

Observe excessive I/O wait times (seconds‑level) via iowait metrics.

Query hosts with high iowait:

select * from cpus where time > now() - 4h and host = 'i-bp11f8g5h7oofu5pqgr8' and iowait > 50.0

Inspect kernel source (e.g., tcp_input.c in Linux v4.19) to understand that failed TCP buffer allocation increments drop counters.

Check the current TCP memory settings:

sysctl -a | grep -i tcp_mem

Increase the TCP memory limits:

# Expand TCP total memory
echo "net.ipv4.tcp_mem = 1104864 5872026 8388608" >> /etc/sysctl.conf
# Expand per‑socket read buffer
echo "net.ipv4.tcp_rmem = 4096 25165824 33554432" >> /etc/sysctl.conf
# Expand per‑socket write buffer
echo "net.ipv4.tcp_wmem = 4096 25165824 33554432" >> /etc/sysctl.conf

Verify the new settings:

cat /proc/sys/net/ipv4/tcp_mem

Identify processes with excessive file descriptors (FDs):

# Find which container owns a given host PID (here 4078683, taken from the incident)
for i in $(docker ps | grep Up | awk '{print $1}'); do echo && docker top $i && echo ID=$i; done | grep -A 15 4078683
# List processes holding more than 1000 open file descriptors
for pid in $(ls -1 /proc/ | grep -Eo '[0-9]+'); do fds=$(ls -1 /proc/${pid}/fd/ | wc -l); if [ $fds -gt 1000 ]; then echo "${pid} ${fds}"; fi; done

After adjusting TCP buffers and cleaning stray sockets, the service recovered within ~30 minutes and metrics returned to normal.

Kernel Parameter Monitoring and Optimization

DeWu built a systematic framework to monitor and tune kernel parameters:

Identify relevant host network statistics from /proc/net/netstat.

Extend node-exporter with custom collectors for missing metrics (e.g., tcp.socket.mem); a minimal collector sketch follows this list.

Visualize 55 key network and kernel indicators, each mapped to business impact.
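A minimal sketch of such a collector using node-exporter's textfile mechanism (the output path, metric name, and scheduling are assumptions; DeWu extended the exporter itself, this only illustrates the idea):

#!/bin/sh
# Export TCP socket memory usage (in pages) from /proc/net/sockstat for node-exporter's
# textfile collector, started with --collector.textfile.directory=/var/lib/node_exporter/textfile
OUT=/var/lib/node_exporter/textfile/tcp_socket_mem.prom
pages=$(awk '/^TCP:/ { for (i = 1; i <= NF; i++) if ($i == "mem") print $(i+1) }' /proc/net/sockstat)
cat > "${OUT}.tmp" <<EOF
# HELP node_tcp_socket_mem_pages TCP socket memory in use, in pages
# TYPE node_tcp_socket_mem_pages gauge
node_tcp_socket_mem_pages ${pages}
EOF
mv "${OUT}.tmp" "${OUT}"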

Representative parameters (grouped by workload type):

vm.min_free_kbytes – reserves a minimum amount of free memory; recommended < 5 % of total memory.

fs.file-max – system‑wide maximum number of file handles; monitor usage via /proc/sys/fs/file-nr.

net.netfilter.nf_conntrack_max – upper bound for connection‑track table; important for high‑concurrency workloads.

net.ipv4.tcp_max_syn_backlog – size of the half‑open connection queue; ensure it is not a bottleneck.

net.ipv4.tcp_rmem / net.ipv4.tcp_wmem – per‑socket read/write buffer sizes; tune based on observed tcpofo and tcprcvq drops.

Parameters are tuned per workload category (default, high‑density compute, big‑data) and version‑controlled in a Git repository.
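An illustrative layout for one such version‑controlled profile (the file path and values are examples only, not DeWu's production settings):

# profiles/high-density-compute/99-sysctl.conf
fs.file-max = 2097152
net.netfilter.nf_conntrack_max = 1048576
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432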

Container Security Safeguards

Security controls are enforced via Kubernetes admission webhooks and a custom “Defender” component:

Namespace protection: Core namespaces (e.g., kube-system) are immutable; non‑core namespaces have rate‑limited soft‑delete policies.

CRD/CR protection: A webhook intercepts operations on critical resources (Ingress, Service, ConfigMap, Secret, Pod) and can exempt bulk deletions via label‑based rules.

High‑risk configuration checks: Block dangerous Ingress rules such as host: '*' that would expose all domains (an audit sketch follows).
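The enforcement itself happens in the admission webhook; the following kubectl/jq one‑liner only sketches the equivalent check for auditing Ingress objects that already exist:

kubectl get ingress -A -o json \
  | jq -r '.items[] | select(any(.spec.rules[]?; .host == "*")) | "\(.metadata.namespace)/\(.metadata.name)"'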

Conclusion

The practices described—structured on‑call, SLO‑driven reliability targets, progressive change management, capacity forecasting, the four‑metric monitoring model, kernel‑parameter observability, and admission‑webhook security—form a systematic approach to improve service reliability, reduce MTTR, and provide a scalable foundation for container‑based workloads.

Tags: Kubernetes, SRE, Reliability, Incident Response, Capacity Planning
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
