
Rapidly Diagnose Production Bugs with Linux Tools, Performance Tricks & Design Patterns

This article guides developers through classifying system‑level and business‑level bugs, using Linux utilities like perf, ps, and vmstat for quick root‑cause analysis, and outlines effective code‑design patterns and architectural strategies—caching, rate‑limiting, and high‑availability—to prevent and resolve production incidents.

Sanyou's Java Diary

Aug 11, 2022

Hello, I'm Sanyou. As a programmer, encountering online issues is routine. When a group of colleagues gathers late at night, they're not discussing philosophy—they're troubleshooting a production bug. If a bug impacts core business processes, it becomes an incident, demanding immediate attention regardless of personal activities.

Bug Classification

Online problems are diverse; we generally split bugs into system‑level and business‑level categories.

System‑level Bugs

These affect the entire system, such as CPU saturation, service unavailability, or server crashes. Rapid resolution is critical.

Linux Diagnostic Tools

1. High CPU Usage (up to 100%)

perf is a Linux performance analysis tool that can generate flame graphs to visualize hotspot functions and call stacks, making it easy to spot functions that consume excessive CPU.

In the well-known Bilibili ("B station") outage of July 13, 2021, the team used perf to produce a flame graph that identified a hot Lua function driving CPU usage to 100%.
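To make concrete what a flame graph aggregates, here is a minimal sketch in Python. It assumes the stack samples have already been collapsed into "folded" text (frames joined by semicolons, followed by a sample count), the intermediate form commonly fed to flame graph generators. The sample data and every function name in it, including lua_hot_loop, are invented for illustration and are not taken from the incident.

```python
from collections import Counter

# Hypothetical folded stacks: "frame1;frame2;... sample_count".
# All frames and counts below are invented sample data.
FOLDED = """\
main;handle_request;lua_call;lua_hot_loop 870
main;handle_request;parse_headers 60
main;handle_request;lua_call;lua_helper 40
main;idle_wait 30
"""


def self_time_by_function(folded_text):
    """Sum sample counts per leaf frame (the function that was on-CPU)."""
    totals = Counter()
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        leaf = stack.split(";")[-1]
        totals[leaf] += int(count)
    return totals


def hottest(folded_text):
    """Return (function, share of all samples) for the busiest leaf frame."""
    totals = self_time_by_function(folded_text)
    name, samples = totals.most_common(1)[0]
    return name, samples / sum(totals.values())


if __name__ == "__main__":
    name, share = hottest(FOLDED)
    print(f"{name}: {share:.0%} of samples")
```

The widest box at the top of a flame graph corresponds to the leaf frame with the largest share here; the graph adds the visual call-stack context that this tally leaves out.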

2. Inspecting a Suspect Process

The ps command lists processes along with their resource usage.

# ps -ef | grep queuejob
root       1303  1  0 Apr17 ?        00:00:00 /usr/sbin/queuejob
root       3260 3087 0 Apr17 ?        00:00:00 /usr/bin/queuejob /bin/sh -c exec -l /bin/bash -c "env GNOME_SHELL_SESSION_MODE=classic gnome-session --session gnome-classic"
root      24174 19508 0 11:39 pts/0    00:00:00 grep --color=auto queuejob

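Sort by CPU usage (column 3, ascending, so the heaviest processes are listed last):

# ps aux | sort -nk 3

Sort by memory usage (column 4, descending, so the heaviest processes are listed first):

# ps aux | sort -rnk 4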
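The same ranking can be scripted when the result needs to feed an alert or a report. The sketch below is in Python and parses text in the column layout that ps aux prints; the sample rows, process names, and the helper names parse_ps_aux and top_by are invented for illustration.

```python
# Invented sample in the column layout printed by `ps aux`.
PS_AUX = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1 128144  6720 ?        Ss   Apr17   0:05 /usr/lib/systemd/systemd
app       2201 93.5 12.4 905312 50120 ?        Sl   10:02  41:10 java -jar order-service.jar
mysql     1840  4.2 35.8 180000 14500 ?        Sl   Apr17  90:31 /usr/sbin/mysqld
"""


def parse_ps_aux(text):
    """Turn `ps aux` output into a list of dicts, skipping the header row."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        # COMMAND may contain spaces, so split into at most 11 fields.
        fields = line.split(None, 10)
        rows.append({
            "user": fields[0],
            "pid": int(fields[1]),
            "cpu": float(fields[2]),
            "mem": float(fields[3]),
            "command": fields[10],
        })
    return rows


def top_by(rows, key, n=1):
    """Return the n rows with the largest value for key ('cpu' or 'mem')."""
    return sorted(rows, key=lambda row: row[key], reverse=True)[:n]


if __name__ == "__main__":
    rows = parse_ps_aux(PS_AUX)
    print("top CPU:", top_by(rows, "cpu")[0]["command"])
    print("top MEM:", top_by(rows, "mem")[0]["command"])
```

On a live host, the sample text could be replaced with the output of subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout.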

3. Memory & Disk Usage

vmstat (Virtual Memory Statistics) monitors memory, disk, and CPU metrics.

Key fields include:

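Procs: r (processes running or waiting for CPU time), b (processes blocked in uninterruptible sleep, usually waiting on I/O)
Memory: swpd (swap in use), free (idle memory), buff (buffers), cache (page cache)
Swap: si (memory swapped in from disk), so (memory swapped out to disk)
IO: bi (blocks read from block devices), bo (blocks written to block devices)
System: in (interrupts per second), cs (context switches per second)
CPU (%): us (user), sy (system), id (idle), wa (I/O wait)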

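Read together, these fields show whether a server is CPU-bound (high r, us, and sy), short on memory (nonzero si and so), or waiting on disk (high b and wa).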
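As a rough illustration of reading these fields programmatically, the Python sketch below parses one vmstat sample and reports what it suggests. The sample numbers are invented, and the thresholds (run queue longer than the CPU count, I/O wait of 20% or more) are assumptions chosen for the example, not established limits.

```python
# Invented sample in the layout printed by `vmstat`.
VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  2 204800  51200  10240 409600  120  340   800  1500 2100 9800 85 10  3  2  0
"""


def parse_vmstat(text):
    """Map vmstat column names to the values of the last sample line."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[1].split()          # second line holds the field names
    values = [int(v) for v in lines[-1].split()]
    return dict(zip(header, values))


def findings(sample, cpu_count, wa_threshold=20):
    """Return plain-language hints; both thresholds are illustrative."""
    notes = []
    if sample["r"] > cpu_count:
        notes.append("run queue exceeds CPU count: likely CPU-bound")
    if sample["si"] > 0 or sample["so"] > 0:
        notes.append("swap activity: likely memory pressure")
    if sample["wa"] >= wa_threshold:
        notes.append("high I/O wait: likely disk-bound")
    return notes


if __name__ == "__main__":
    sample = parse_vmstat(VMSTAT)
    for note in findings(sample, cpu_count=4):
        print(note)
```

Because the first line vmstat prints reports averages since boot, a real check would normally sample repeatedly (for example vmstat 1 5) and evaluate the later lines, which is why the parser reads the last line.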

Business‑level Bugs

When a bug is at the business level, the responsibility lies with developers or testers. The first step is to check the logs. Comprehensive logging allows quick identification of the affected code path.

Next, inspect the data. Verify whether database tables contain unexpected values; data issues often surface as user‑reported anomalies or workflow errors. Always back up data before making changes, and involve DBAs for production modifications.

If the data is correct, the problem most likely lies in the code. A fix must not break other functionality, so careful design and testing are essential.

Solution Design

Code Design

Most companies enforce coding standards: methods should have a single responsibility, stay under 100 lines, and classes under 1,000 lines. Clear structure aids code reviews.

Two common approaches are linear logic and encapsulation via design patterns. Below are two frequently used patterns.

Factory Pattern

The factory pattern defines an interface for creating objects, letting subclasses decide which class to instantiate. It promotes extensibility by delegating object creation to a factory class.

Creator: the factory class that creates product instances.

Abstract Product: defines the common interface for all products.

Concrete Product: actual implementations of the product.
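The three roles above can be sketched in Python as follows. This is one minimal arrangement using a single factory class with a registry; the notification domain and the names Notifier, EmailNotifier, SmsNotifier, and NotifierFactory are invented for illustration.

```python
from abc import ABC, abstractmethod


class Notifier(ABC):
    """Abstract Product: the interface every product implements."""

    @abstractmethod
    def send(self, message: str) -> str:
        ...


class EmailNotifier(Notifier):
    """Concrete Product."""

    def send(self, message: str) -> str:
        return f"email: {message}"


class SmsNotifier(Notifier):
    """Concrete Product."""

    def send(self, message: str) -> str:
        return f"sms: {message}"


class NotifierFactory:
    """Creator: callers ask for a channel and never name a concrete class."""

    _registry = {"email": EmailNotifier, "sms": SmsNotifier}

    @classmethod
    def create(cls, channel: str) -> Notifier:
        try:
            return cls._registry[channel]()
        except KeyError:
            raise ValueError(f"unknown channel: {channel}") from None


if __name__ == "__main__":
    print(NotifierFactory.create("sms").send("order shipped"))
```

Adding a new channel means writing one new Concrete Product and registering it; calling code that depends only on Notifier stays unchanged.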

Decorator Pattern

The decorator pattern adds responsibilities to objects dynamically without altering their structure or inheritance hierarchy, by wrapping the original object.

Main: core business logic.

MainComponent: concrete implementation of the core.

Decorator: interface for additional behavior.

DecoratorComponent: concrete implementation of the added behavior.
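A minimal Python sketch of these four roles follows. The order-placing domain and all class names are invented for illustration; the comments map each class to the role names used above.

```python
from abc import ABC, abstractmethod


class OrderService(ABC):
    """Main: the core business interface."""

    @abstractmethod
    def place_order(self, item: str) -> str:
        ...


class BasicOrderService(OrderService):
    """MainComponent: the concrete core implementation."""

    def place_order(self, item: str) -> str:
        return f"order placed: {item}"


class OrderServiceDecorator(OrderService):
    """Decorator: wraps another OrderService and forwards calls to it."""

    def __init__(self, inner: OrderService):
        self._inner = inner

    def place_order(self, item: str) -> str:
        return self._inner.place_order(item)


class AuditedOrderService(OrderServiceDecorator):
    """DecoratorComponent: adds auditing around the wrapped call."""

    def __init__(self, inner: OrderService, audit_log: list):
        super().__init__(inner)
        self._audit_log = audit_log

    def place_order(self, item: str) -> str:
        self._audit_log.append(f"placing {item}")
        result = super().place_order(item)
        self._audit_log.append(f"done {item}")
        return result


if __name__ == "__main__":
    log = []
    service = AuditedOrderService(BasicOrderService(), log)
    print(service.place_order("book"))
    print(log)
```

BasicOrderService is never modified, and decorators can be stacked because each one both accepts and implements OrderService.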

Use patterns judiciously; avoid over‑engineering or using obscure patterns that increase maintenance burden.

Architecture Design

High Performance & High Availability

Cache: store frequently accessed data to improve read speed and reduce database load.

Rate limiting & degradation: throttle traffic during spikes to protect core services.

Distributed systems & service splitting: decompose monoliths into independent services communicating via middleware.

High‑availability deployment: active‑active or active‑passive across data centers to survive node failures.
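To show how the first two items can work together, here is a hedged Python sketch: a token-bucket rate limiter guards a cache-aside reader, and when the limiter refuses a request the reader degrades by serving stale or fallback data instead of hitting the backend. The class names TokenBucket and ProtectedReader, the TTL, and the rates are all assumptions for illustration; a production system would typically rely on a shared cache and a dedicated gateway or library rather than in-process state.

```python
import time


class TokenBucket:
    """Allows roughly `rate` calls per second with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class ProtectedReader:
    """Cache-aside reads with rate-limited backend access and degradation."""

    def __init__(self, load, limiter, ttl, clock=time.monotonic):
        self._load = load          # function that reads from the backend
        self._limiter = limiter
        self._ttl = ttl            # seconds a cached value stays fresh
        self._clock = clock
        self._cache = {}           # key -> (value, expires_at)

    def get(self, key, fallback=None):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]                      # fresh cache hit
        if self._limiter.allow():
            value = self._load(key)              # cache miss: hit the backend
            self._cache[key] = (value, now + self._ttl)
            return value
        # Degraded path: serve stale data if present, otherwise the fallback.
        return entry[0] if entry is not None else fallback


if __name__ == "__main__":
    reader = ProtectedReader(lambda k: f"row for {k}", TokenBucket(5, 5), ttl=30)
    print(reader.get("user:1"))
```

Passing the clock in as a parameter is a design choice that keeps the limiter and the TTL logic testable without real sleeps.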

Incident Post‑mortem

After an incident, a detailed report should capture the timeline, responsible parties, root cause, resolution steps, and follow‑up actions. Post‑mortems drive continuous improvement and help prevent recurrence.
