Mastering Server Performance: A Practical Guide to CPU, Memory, and I/O Optimization
This article provides a comprehensive guide to server performance optimization, covering the fundamentals of CPU, memory, and I/O analysis, practical methodologies, essential tools, and real‑world case studies to help operations engineers identify bottlenecks and improve system stability.
Introduction: In operations work, besides keeping the platform stable, you must also optimize server performance; good performance is the foundation of stable operation. Wang Wei (Simon) of the Tencent Interactive DBA team compiled this set of performance-optimization materials to give concrete direction for performance work.
Overview
What Is Performance?
The most intuitive metric for performance is "time"; CPU utilization represents the proportion of time the CPU spends computing, and disk utilization represents the proportion of time spent on disk operations.
When CPU utilization reaches 100%, some requests cannot be processed in time, leading to increased response latency or timeouts.
When disk utilization reaches 100%, some requests must wait for I/O, also increasing latency or causing timeouts.
In other words, if all operations complete within ideal time, there is no performance‑optimization problem. Performance analysis starts by identifying what causes response time slowdown, typically focusing on CPU and I/O because applications are usually CPU‑bound or I/O‑bound.
CPU‑bound means compute‑intensive; I/O‑bound means read/write‑intensive. Memory issues often manifest as CPU or I/O bottlenecks because memory is designed to improve kernel instruction and application read/write performance.
Insufficient memory can trigger heavy swapping, making the disk a bottleneck; page faults, memory allocation, release, copying, and address‑space mapping can cause CPU bottlenecks. Severe memory problems may affect functionality, which goes beyond pure performance.
Performance optimization is not isolated; besides response time, you must also consider functional completeness, security, and other aspects.
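The utilization-as-time idea above can be made concrete: given two snapshots of the aggregate `cpu` line from `/proc/stat`, utilization over the interval is the non-idle share of elapsed ticks. A minimal Python sketch (the helper name is ours, not from the article):

```python
def cpu_utilization(stat_before: str, stat_after: str) -> float:
    """Estimate overall CPU utilization (%) between two /proc/stat samples.

    Each sample is the aggregate 'cpu' line, e.g.
    'cpu 100 0 50 850 0 0 0 0 0 0'  (user nice system idle iowait ...).
    Utilization = non-idle ticks / total ticks over the interval.
    """
    def split(line: str):
        fields = [int(x) for x in line.split()[1:]]
        idle = fields[3] + fields[4]   # idle + iowait count as "not computing"
        return idle, sum(fields)

    idle0, total0 = split(stat_before)
    idle1, total1 = split(stat_after)
    d_total = total1 - total0
    d_idle = idle1 - idle0
    return 0.0 if d_total == 0 else 100.0 * (d_total - d_idle) / d_total
```

Sampling the line twice a second apart and feeding both snapshots to this helper reproduces the utilization figure tools like `vmstat` report.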
Fundamentals of Performance Analysis
Effective performance optimization requires solid foundational knowledge:
Operating System – Manages all resources needed by applications, such as CPU and I/O. Issues in file system type, disk type, RAID configuration, etc., are all OS‑managed.
System Programming Techniques – Determines how to use system resources, e.g., buffered I/O vs. direct I/O, synchronous vs. asynchronous, multi‑process vs. multi‑thread.
Application Layer – Database component types, engines, indexes, replication, configuration parameters, backup, high‑availability, etc., can all be sources of performance problems.
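The buffered-vs-direct distinction in the system-programming item can be illustrated with ordinary file writes: a buffered write may linger in the page cache, while a write followed by fsync forces the data to the device, trading latency for durability. A hedged Python sketch (function names are illustrative):

```python
import os

def write_buffered(path: str, data: bytes) -> None:
    # Buffered write: data may sit in the page cache until the kernel
    # flushes it; fast, but not durable at the moment the call returns.
    with open(path, "wb") as f:
        f.write(data)

def write_durable(path: str, data: bytes) -> None:
    # Write followed by fsync: forces the data down to the device,
    # the trade-off that direct/synchronous I/O styles favor.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Which variant is right depends on the application: a log that must survive a crash wants the durable path; a scratch file is fine buffered.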
Performance Analysis Methodology
Problem‑analysis frameworks such as Pyramid Thinking, 5W2H, and McKinsey’s Seven‑Step method provide direction. Applying 5W2H yields questions like:
What – What is the observed phenomenon?
When – When does it occur?
Why – Why does it happen?
Where – Where does it happen?
How much – How many resources are consumed, and how much can be saved after fixing?
How to do – How to solve it?
Beyond these high‑level guides, Chapter 2 of Brendan Gregg's Systems Performance: Enterprise and the Cloud introduces concrete methods such as the tools method, the USE method, workload characterization, performance monitoring, static performance tuning, and latency analysis.
CPU
Understanding the CPU
Key concepts include processor, core, hardware thread, CPU cache, clock frequency, CPI/IPC, instruction set, utilization, user vs. kernel time, scheduler, run queue, preemption, multi‑process, multi‑thread, and word size.
For applications, we usually focus on kernel CPU scheduler behavior and performance.
Thread‑state analysis distinguishes:
on‑CPU: executing (user time + system time).
off‑CPU: waiting for the next CPU time slice, I/O, locks, or paging, with sub‑states such as runnable, anonymous paging, sleep, lock, and idle.
If a large portion of time is on‑CPU, CPU profiling quickly reveals the cause; if most time is off‑CPU, diagnosis becomes more time‑consuming.
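A thread's on‑CPU share can be read straight from /proc/<pid>/stat: utime and stime are the user‑ and kernel‑mode tick counters, and everything else the thread spends is off‑CPU. A small parsing sketch (assumes the standard Linux stat field layout):

```python
def on_cpu_jiffies(stat_line: str):
    """Extract (utime, stime) clock ticks from a /proc/<pid>/stat line.

    on-CPU time = utime + stime; time spent runnable, in I/O wait,
    on locks, or sleeping is off-CPU and does not appear here.
    """
    # comm (field 2) may contain spaces, so split after the closing ')'.
    rest = stat_line.rsplit(')', 1)[1].split()
    # rest[0] is state (field 3); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return utime, stime
```

Sampling these counters twice and dividing the delta by the interval (in clock ticks) gives the thread's on‑CPU fraction.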
Analysis Methods and Tools
When observing CPU performance, use load‑characteristic summarization to check:
Overall system CPU load and per‑CPU utilization.
Concurrency of CPU load (single‑threaded or multi‑threaded, thread count).
Which application and how much CPU it consumes.
Which kernel thread consumes CPU.
Interrupt CPU usage.
User‑space vs. kernel‑space call paths.
Types of stall cycles encountered.
Answering these questions is most economical with system performance tools:
uptime – load averages
vmstat – system‑wide CPU utilization averages
top – per‑process/thread CPU usage monitoring
pidstat – CPU usage breakdown per process/thread
ps – process state
perf – CPU profiling and performance counters
For call‑path and stall‑cycle analysis, perf or DTrace can be used.
Practical Case
Flame graphs help visualize CPU call paths. In a MySQL non‑in‑place update benchmark, perf top showed function call frequencies, while flame graphs revealed the hierarchical call relationships.
Memory
Understanding Memory
Key memory concepts include physical memory, virtual memory, resident set, address space, OOM, page cache, page faults, swapping, swap space, allocators (libc, glibc, libmalloc, mtmalloc), and the Linux SLUB allocator.
Analysis Methods and Tools
Brendan Gregg’s book suggests examining memory‑bus balance, NUMA node allocation, and so on, but practical analysis follows a checklist:
System‑wide physical and virtual memory usage.
Swapping, OOM events.
Kernel and filesystem cache usage.
Per‑process memory distribution.
Reasons for process memory allocation.
Reasons for kernel memory allocation.
Processes that continuously swap.
Potential memory leaks.
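The swapping item at the top of the checklist can be answered from /proc/meminfo: swap in use is simply SwapTotal minus SwapFree. A minimal parsing sketch (the function name is ours):

```python
def swap_used_kb(meminfo: str) -> int:
    """Compute swap in use (kB) from /proc/meminfo text:
    SwapTotal - SwapFree. Nonzero and growing values suggest the
    memory pressure / swapping problems described above."""
    vals = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(':')
        if rest:
            vals[key.strip()] = int(rest.split()[0])  # values are in kB
    return vals['SwapTotal'] - vals['SwapFree']
```

The same parser trivially extends to MemFree, Cached, and the other checklist items, since /proc/meminfo uses one `Key: value kB` line per metric.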
Typical tools:
Tool
Description
free
Cache size statistics
vmstat
Virtual memory statistics
top
Monitor per‑process memory usage
ps
Process state
DTrace
Allocation tracing
Only allocation tracing (e.g., DTrace) can pinpoint memory leaks; other tools provide statistical views.
Practical Case
A memory‑leak investigation revealed that a Lua script allocated memory quickly; the driver’s periodic service reclaimed memory in bulk, causing occasional CPU pressure. The solution was staged reclamation: reclaim a portion each cycle and perform full reclamation periodically.
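The staged-reclamation fix described above can be sketched generically: instead of freeing everything in one burst (which causes the CPU spike), each cycle releases only a bounded fraction of dead objects, with a periodic full pass catching the remainder. A hypothetical Python sketch (class and parameter names are ours, not the driver's actual API):

```python
class StagedReclaimer:
    """Sketch of staged reclamation: free a fraction per cycle,
    and do a full sweep every `full_every` cycles."""

    def __init__(self, fraction: float = 0.25, full_every: int = 10):
        self.dead = []              # objects queued for reclamation
        self.fraction = fraction
        self.full_every = full_every
        self.cycles = 0

    def queue(self, obj) -> None:
        self.dead.append(obj)

    def tick(self) -> int:
        """Run one reclamation cycle; return how many objects were freed."""
        self.cycles += 1
        if self.cycles % self.full_every == 0:
            n = len(self.dead)      # periodic full reclamation
        elif self.dead:
            n = max(1, int(len(self.dead) * self.fraction))
        else:
            n = 0
        del self.dead[:n]           # release a bounded batch per cycle
        return n
```

The design choice is the usual amortization trade-off: per-cycle work stays small and predictable, at the cost of dead objects living slightly longer before the full sweep.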
I/O
Logical I/O vs. Physical I/O
I/O load usually refers to disk I/O (physical I/O). Metrics from iostat such as avgqu‑sz, svctm, and await describe it.
Most read/write operations go through the filesystem (VFS) rather than raw devices. The kernel checks page cache first; if data is missing, it issues block‑device requests, which the I/O scheduler dispatches to the disk driver.
Sequential reads benefit from prefetching; random reads may cause read amplification.
Write paths have similar amplification or reduction effects due to filesystem buffering, metadata, alignment, compression, etc.
Filesystem Analysis and Tools
Key filesystem concepts: filesystem, VFS, page cache, buffer cache, directory cache, inode, inode cache.
Filesystem cache structures store virtual memory pages, improving file and directory performance. The kernel’s writeback (flusher) threads flush dirty pages to disk after a timeout, and kswapd frees pages when memory runs low.
Filesystem latency includes time spent in the filesystem, kernel I/O subsystem, and waiting for the disk device.
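Since all three components fold into what the application observes, the simplest starting point is to time the read syscall itself. A minimal sketch (the helper name is ours):

```python
import os
import time

def file_read_latency(path: str, size: int = 4096) -> float:
    """Measure end-to-end read latency (seconds) as the application sees it:
    filesystem, kernel I/O subsystem, and device wait all fold into
    this one number. A page-cache hit will be orders of magnitude
    faster than a read that must go to disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        t0 = time.perf_counter()
        os.read(fd, size)
        return time.perf_counter() - t0
    finally:
        os.close(fd)
```

Comparing a cold read against an immediate re-read of the same file makes the page-cache contribution visible directly.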
Disk Analysis and Tools
Important disk concepts: virtual disk, sector, I/O request, command, bandwidth, throughput, latency, service time, wait time, random vs. sequential I/O, sync vs. async, interface, RAID.
Typical analysis checklist:
Per‑disk utilization.
Queue length per disk.
Average service and wait times.
Which application or user is using the disk.
Read/write patterns (random vs. sequential, sync vs. async).
Kernel call path that initiates I/O.
Read/write ratio.
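The service-time and wait-time items in the checklist can be approximated from /proc/diskstats deltas, the same way iostat derives await: total milliseconds spent on reads and writes divided by I/Os completed over the interval. A sketch assuming the standard Linux diskstats field layout:

```python
def disk_await_ms(sample0: str, sample1: str) -> float:
    """Compute average I/O completion time (await, ms) from two
    /proc/diskstats lines for the same device:
    await = delta(read_ms + write_ms) / delta(reads + writes completed)."""
    def parse(line: str):
        f = line.split()
        reads, read_ms = int(f[3]), int(f[6])      # reads completed, ms reading
        writes, write_ms = int(f[7]), int(f[10])   # writes completed, ms writing
        return reads + writes, read_ms + write_ms
    ios0, ms0 = parse(sample0)
    ios1, ms1 = parse(sample1)
    d_ios = ios1 - ios0
    return 0.0 if d_ios == 0 else (ms1 - ms0) / d_ios
```

A rising await with a flat I/O count is the classic sign of queueing at the device, which the tools below help attribute to a process.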
Common tools:
iostat – per‑disk statistics
iotop, pidstat – disk I/O per process
perf, DTrace – tracing tools
In a MySQL non‑in‑place update benchmark, tracing block‑device events showed that single‑instance runs spent roughly 30% of events in blk_finish_plug and 70% in blk_queue_bio, while multi‑instance runs showed the opposite distribution.
References
Brendan Gregg, Systems Performance: Enterprise and the Cloud (http://www.brendangregg.com)
Robert Love, Linux Kernel Development
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you through your operations career as we grow together.