Rapidly Diagnose Production Bugs with Linux Tools, Performance Tricks & Design Patterns
This article guides developers through classifying system‑level and business‑level bugs, using Linux utilities like perf, ps, and vmstat for quick root‑cause analysis, and outlines effective code‑design patterns and architectural strategies—caching, rate‑limiting, and high‑availability—to prevent and resolve production incidents.
Hello, I'm Sanyou. For programmers, production issues are routine. When a group of colleagues gathers late at night, they are not discussing philosophy; they are troubleshooting a production bug. A bug that affects a core business process becomes an incident and demands immediate attention, whatever else you were doing.
Bug Classification
Online problems are diverse; we generally split bugs into system‑level and business‑level categories.
System‑level Bugs
These affect the entire system, such as CPU saturation, service unavailability, or server crashes. Rapid resolution is critical.
Linux Diagnostic Tools
1. High CPU Usage (up to 100%)
perf is a Linux performance analysis tool that can generate flame graphs to visualize hotspot functions and call stacks, making it easy to spot functions that consume excessive CPU.
In the well-known Bilibili outage of July 13 (the "713" incident), the team used perf to produce a flame graph that identified a Lua hotspot function driving CPU usage to 100%.
2. Inspecting a Suspect Process
The ps command lists running processes and their resource usage.
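<code># ps -ef | grep queuejob
root 1303 1 0 Apr17 ? 00:00:00 /usr/sbin/queuejob
root 3260 3087 0 Apr17 ? 00:00:00 /usr/bin/queuejob /bin/sh -c exec -l /bin/bash -c "env GNOME_SHELL_SESSION_MODE=classic gnome-session --session gnome-classic"
root 24174 19508 0 11:39 pts/0 00:00:00 grep --color=auto queuejob</code>
To rank processes by resource usage, sort the ps aux output numerically. Column 3 is %CPU and column 4 is %MEM. The first command sorts by CPU in ascending order, so the heaviest consumers appear last; the second sorts by memory in descending order, so the heaviest consumers appear first.
<code># ps aux | sort -nk 3</code>
<code># ps aux | sort -rnk 4</code>
3. Memory & Disk Usage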
vmstat (Virtual Memory Statistics) monitors memory, disk, and CPU metrics.
Key fields include:
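<code>Procs: r (processes running or waiting for CPU time), b (processes blocked in uninterruptible sleep, usually on I/O)
Memory: swpd (virtual memory used), free (idle memory), buff (buffers), cache (page cache)
Swap: si (memory swapped in from disk per second), so (memory swapped out to disk per second)
IO: bi (blocks received from a block device per second), bo (blocks sent to a block device per second)
System: in (interrupts per second), cs (context switches per second)
CPU (%): us (user), sy (system), id (idle), wa (I/O wait)</code>
Read together, these fields show where the server is constrained: an r value that stays above the number of CPU cores points to CPU saturation, nonzero si and so values indicate memory pressure, and a high wa value suggests an I/O bottleneck.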
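As a rough illustration of how these fields can be read together, the sketch below pairs a vmstat header line with one data line and flags a few conditions. The class name `VmstatCheck`, the sample numbers, and the 20% I/O-wait threshold are all assumptions made for this example, not values from any standard; tune them for your own servers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that interprets one vmstat sample. Thresholds are illustrative assumptions. */
class VmstatCheck {

    /** Pairs each column name from the vmstat header line with the value on the data line. */
    static Map<String, Long> parse(String header, String data) {
        String[] names = header.trim().split("\\s+");
        String[] values = data.trim().split("\\s+");
        if (names.length != values.length) {
            throw new IllegalArgumentException("header and data have different column counts");
        }
        Map<String, Long> sample = new HashMap<>();
        for (int i = 0; i < names.length; i++) {
            sample.put(names[i], Long.parseLong(values[i]));
        }
        return sample;
    }

    /** Returns human-readable warnings for one sample, given the number of CPU cores. */
    static List<String> warnings(Map<String, Long> s, int cpuCores) {
        List<String> out = new ArrayList<>();
        if (s.get("r") > cpuCores) {
            out.add("CPU pressure: run queue exceeds core count");
        }
        if (s.get("si") > 0 || s.get("so") > 0) {
            out.add("memory pressure: system is swapping");
        }
        if (s.get("wa") > 20) { // assumed threshold, not a standard value
            out.add("I/O pressure: high I/O wait");
        }
        return out;
    }
}

class VmstatDemo {
    public static void main(String[] args) {
        // Made-up sample in the column order vmstat prints; not taken from a real server.
        String header = "r b swpd free buff cache si so bi bo in cs us sy id wa st";
        String data = "9 2 1024 51200 1000 20000 12 30 400 800 1500 3000 40 10 15 35 0";
        for (String warning : VmstatCheck.warnings(VmstatCheck.parse(header, data), 4)) {
            System.out.println(warning);
        }
    }
}
```

Mapping values by header name rather than by fixed position keeps the sketch tolerant of vmstat versions that print extra columns.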
Business‑level Bugs
When a bug is at the business level, the responsibility lies with developers or testers. The first step is to check the logs. Comprehensive logging allows quick identification of the affected code path.
Next, inspect the data. Verify whether database tables contain unexpected values; data issues often surface as user-reported anomalies or workflow errors. Always back up data before making changes, and involve DBAs for production modifications.
If the data is correct, the problem likely resides in the code. A fix must not break other functionality, so careful design and testing are essential.
Solution Design
Code Design
Many companies enforce coding standards: each method should have a single responsibility and stay under 100 lines, and each class should stay under 1,000 lines. A clear structure also makes code reviews easier.
Two common approaches are linear logic and encapsulation via design patterns. Below are two frequently used patterns.
Factory Pattern
The factory pattern defines an interface for creating objects, letting subclasses decide which class to instantiate. It promotes extensibility by delegating object creation to a factory class.
Creator: the factory class that creates product instances.
Abstract Product: defines the common interface for all products.
Concrete Product: actual implementations of the product.
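The three roles above can be sketched as follows. This is a minimal illustration, not code from any particular project: `Notifier`, `EmailNotifier`, `SmsNotifier`, and the factory classes are hypothetical names chosen for the example.

```java
/** Abstract Product: the common interface every product implements. */
interface Notifier {
    String send(String message);
}

/** Concrete Product (hypothetical): delivers by email. */
class EmailNotifier implements Notifier {
    @Override
    public String send(String message) {
        return "EMAIL: " + message;
    }
}

/** Concrete Product (hypothetical): delivers by SMS. */
class SmsNotifier implements Notifier {
    @Override
    public String send(String message) {
        return "SMS: " + message;
    }
}

/** Creator: declares the factory method and uses whatever product it returns. */
abstract class NotifierFactory {
    protected abstract Notifier create();

    String notifyUser(String message) {
        // The caller never names a concrete product class.
        return create().send(message);
    }
}

/** Subclasses decide which product class to instantiate. */
class EmailNotifierFactory extends NotifierFactory {
    @Override
    protected Notifier create() {
        return new EmailNotifier();
    }
}

class SmsNotifierFactory extends NotifierFactory {
    @Override
    protected Notifier create() {
        return new SmsNotifier();
    }
}

class FactoryDemo {
    public static void main(String[] args) {
        NotifierFactory factory = new SmsNotifierFactory();
        System.out.println(factory.notifyUser("order shipped")); // prints "SMS: order shipped"
    }
}
```

Supporting a new channel then means adding one product class and one factory subclass, while existing callers of `notifyUser` stay unchanged.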
Decorator Pattern
The decorator pattern adds responsibilities to objects dynamically without altering their structure or inheritance hierarchy, by wrapping the original object.
Main: core business logic.
MainComponent: concrete implementation of the core.
Decorator: interface for additional behavior.
DecoratorComponent: concrete implementation of the added behavior.
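A minimal sketch of these four roles, assuming a hypothetical order service. All class names here are invented for the example and map to the roles as noted in the comments. The decorator role is written as an abstract class holding the wrapped object, which is one common way to express it; an interface would also work.

```java
import java.util.ArrayList;
import java.util.List;

/** "Main" role: the core business contract. */
interface OrderService {
    String placeOrder(String item);
}

/** "MainComponent" role: the concrete core implementation. */
class BasicOrderService implements OrderService {
    @Override
    public String placeOrder(String item) {
        return "ordered " + item;
    }
}

/** "Decorator" role: wraps another OrderService and shares its interface. */
abstract class OrderServiceDecorator implements OrderService {
    protected final OrderService delegate;

    protected OrderServiceDecorator(OrderService delegate) {
        this.delegate = delegate;
    }
}

/** "DecoratorComponent" role: adds auditing without modifying BasicOrderService. */
class AuditedOrderService extends OrderServiceDecorator {
    final List<String> auditLog = new ArrayList<>();

    AuditedOrderService(OrderService delegate) {
        super(delegate);
    }

    @Override
    public String placeOrder(String item) {
        auditLog.add("placeOrder(" + item + ")"); // the added responsibility
        return delegate.placeOrder(item);         // then the original behavior
    }
}

class DecoratorDemo {
    public static void main(String[] args) {
        AuditedOrderService service = new AuditedOrderService(new BasicOrderService());
        System.out.println(service.placeOrder("book")); // prints "ordered book"
        System.out.println(service.auditLog);           // prints "[placeOrder(book)]"
    }
}
```

Because the decorator implements the same interface as the object it wraps, decorators can be stacked, and callers that depend only on `OrderService` need no changes.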
Use patterns judiciously; avoid over‑engineering or using obscure patterns that increase maintenance burden.
Architecture Design
High Performance & High Availability
Cache: store frequently accessed data to improve read speed and reduce database load.
Rate limiting & degradation: throttle traffic during spikes to protect core services.
Distributed systems & service splitting: decompose monoliths into independent services communicating via middleware.
High‑availability deployment: active‑active or active‑passive across data centers to survive node failures.
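To make the rate limiting and degradation item concrete, here is a small token-bucket sketch. The token bucket is one common limiting technique, chosen here for illustration; the article does not prescribe a specific algorithm. `TokenBucket`, `GuardedCall`, and the capacity and refill numbers are hypothetical, and production systems would more likely rely on an established library or gateway feature.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical token-bucket limiter: holds up to `capacity` tokens, refilled at a fixed rate. */
class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private final LongSupplier nanoClock; // injectable clock, so tests need not sleep
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full
        this.lastRefill = nanoClock.getAsLong();
    }

    /** Takes one token if available; returns false when the caller should be throttled. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

/** Degradation: when the limiter rejects a call, serve a cheaper fallback instead of failing. */
class GuardedCall {
    static <T> T callOrDegrade(TokenBucket limiter, Supplier<T> primary, Supplier<T> fallback) {
        return limiter.tryAcquire() ? primary.get() : fallback.get();
    }
}

class RateLimitDemo {
    public static void main(String[] args) {
        // Assumed settings: bursts of 2 requests, refilled at 1 token per second.
        TokenBucket limiter = new TokenBucket(2, 1.0, System::nanoTime);
        for (int i = 0; i < 4; i++) {
            System.out.println(GuardedCall.callOrDegrade(
                    limiter, () -> "live result", () -> "cached default"));
        }
    }
}
```

The fallback supplier is where a cached value or a simplified response would go, so that a traffic spike degrades non-critical features while core services stay protected.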
Incident Post‑mortem
After an incident, a detailed report should capture the timeline, responsible parties, root cause, resolution steps, and follow‑up actions. Post‑mortems drive continuous improvement and help prevent recurrence.
Sanyou's Java Diary
Passionate about technology, though not great at solving problems; eager to share, never tire of learning!