Ops Engineer Core Skills: From Basic Commands to High‑Availability Architecture
This article provides a comprehensive roadmap for operations engineers, covering essential Linux commands, core system concepts, service principles, fault‑diagnosis methods, high‑availability architecture designs, data security, backup strategies, performance tuning, and automation scripts to handle both single‑machine and large‑scale cluster environments.
1. Basic Operations and Core Concepts
Solid fundamentals are the cornerstone of efficient operations. Whether performing routine inspections or emergency troubleshooting, the following commands and concepts are indispensable.
1. Linux System Management Essential Commands
Resource monitoring : top, free -h, uptime to view CPU, memory and load; df -i to check inode usage, since a filesystem that runs out of inodes (typically because of huge numbers of small files) rejects writes even when blocks are still free.
Network and services : netstat -tulnp or ss -tulnp to view port usage; ps -ef | grep java to locate Java processes; route -n or ip route to see the default gateway.
Daily ops tricks : tail -f for real-time log tracking, or tail -F to keep following a log across rotation; kill -9 to force-terminate a process that ignores a plain kill (SIGTERM); nginx -t to test the configuration before nginx -s reload; $? to check the exit status of the previous command (0 = success).
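The "test before reload" habit and the $? check can be combined in a small wrapper. This is a minimal sketch, assuming nginx is on PATH and the caller is allowed to signal the master process; the function name safe_nginx_reload is made up for illustration.

```shell
# Reload Nginx only when the configuration test passes.
# Assumption: `nginx` is on PATH; safe_nginx_reload is a hypothetical helper name.
safe_nginx_reload() {
    if nginx -t; then
        nginx -s reload
    else
        local rc=$?   # exit status of the failed `nginx -t`
        echo "config test failed (exit $rc); reload skipped" >&2
        return "$rc"
    fi
}
```

Calling safe_nginx_reload from a deploy script keeps a broken configuration from ever reaching the running workers, and the non-zero return value lets the caller stop the rollout.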
2. Core Concept Clarifications
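HTTP status codes : 200 (OK), 301 (permanent redirect), 302 (temporary redirect), 304 (Not Modified, the client's cached copy is still valid), 404 (not found), 499 (client closed the connection; a non-standard code logged by Nginx).
File system : soft link (a separate file that stores the target path; can cross filesystems but dangles if the target is removed) vs hard link (an additional name for the same inode; cannot cross filesystems); the inode stores metadata, blocks store the actual data; as a rule of thumb, buffer = data waiting to be written, cache = data kept for reads (in free output, buffers hold raw block-device data and cache is the page cache for file contents).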
RAID levels : RAID 0 (striping, fastest but no redundancy), RAID 1 (mirroring, high safety), RAID 5 (distributed parity, balanced performance and redundancy).
Docker vs VM : Docker shares the host kernel, is lightweight and starts quickly; virtual machines emulate full hardware, provide strong isolation and pre‑allocated resources.
Main tools overview : Redis persistence – RDB (snapshot) and AOF (log); ELK stack – Elasticsearch (search/analysis), Logstash (pipeline), Kibana (visualisation); Keepalived implements HA based on the VRRP protocol.
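The soft link vs hard link distinction above is easiest to see by removing the original name. The sketch below works in a throwaway directory created with mktemp; the function name link_demo and the file names are illustrative only.

```shell
# Show how hard and soft links behave once the original name is removed.
# Assumption: runs in a scratch directory; link_demo is a hypothetical helper name.
link_demo() {
    local dir
    dir=$(mktemp -d) || return 1
    echo "payload" > "$dir/original"
    ln    "$dir/original" "$dir/hard"   # second name for the same inode
    ln -s "$dir/original" "$dir/soft"   # small file that stores the target path
    rm "$dir/original"
    # The hard link still reaches the data; the soft link now points at nothing.
    if [ -e "$dir/hard" ]; then echo "hard: $(cat "$dir/hard")"; fi
    if [ -e "$dir/soft" ]; then echo "soft: ok"; else echo "soft: dangling"; fi
    rm -rf "$dir"
}
```

Running link_demo prints "hard: payload" followed by "soft: dangling", because the data is freed only when the last hard link to the inode is removed.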
2. Service Principles and Fault Diagnosis
1. Core Service Underlying Principles
MySQL master-slave replication : The master enables the binlog to record data changes (DML and DDL); the slave's IO thread pulls binlog events into its relay log, and the SQL thread replays them to keep the data in sync.
LVS load‑balancing modes : NAT (IP rewrite, limited scalability), TUN (IP tunnel, cross‑subnet), DR (MAC rewrite, best performance, requires same broadcast domain).
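HTTP and DNS resolution : HTTP runs over TCP, so a connection begins with the three-way handshake (SYN, SYN-ACK, ACK) before any request is sent; for DNS, the client sends a recursive query to its resolver, which checks its cache and otherwise queries the root, TLD and authoritative servers in turn.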
CDN workflow : Request resolves via DNS to global LB, dispatched to optimal node; cached node returns directly, otherwise origin is fetched and cached.
Nginx + PHP (FastCGI) : Nginx forwards PHP requests over FastCGI to the PHP-FPM socket (Unix socket or TCP port); the PHP-FPM master hands each request to a worker process from its pool, which executes the script and returns the result to Nginx.
2. Common Fault‑Diagnosis Thinking
Nginx 502/504/500 errors : 502 – backend service down or port unreachable; 504 – backend timeout (increase timeout or optimise); 500 – internal error (code bug, DB connection failure, disk/handle shortage).
MySQL performance issues : For slow queries, enable the slow-query log and analyse the offending statements with EXPLAIN; a large number of Sleep threads usually means the application is not closing its connections, so fix the connection handling and lower wait_timeout.
Web server slowness : Follow the “outside‑in” principle – check client network, server CPU/IO/bandwidth, service concurrency, DB slow queries, and static resource caching.
Disk has free space but cannot be written to : The usual root cause is inode exhaustion; confirm with df -i, then find and remove the directories full of unnecessary small files.
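The inode check can be automated by filtering df output. The following is a minimal sketch, assuming the six-column layout printed by df -iP (Filesystem, Inodes, IUsed, IFree, IUse%, Mounted on) and mount points without spaces; the function name inode_alert and the default threshold of 90 are illustrative choices.

```shell
# Print mount points whose inode usage is at or above a threshold.
# Input: the output of `df -iP` on stdin. inode_alert is a hypothetical name.
inode_alert() {
    local threshold=${1:-90}
    awk -v t="$threshold" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)            # "95%" -> "95"
            # Skip filesystems that report "-" instead of a percentage.
            if (use ~ /^[0-9]+$/ && use + 0 >= t + 0) print $6, use "%"
        }'
}
```

Typical use would be df -iP | inode_alert 90 from cron, with any output forwarded to the alerting channel.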
3. Architecture Design and High Availability
When business scale expands, a single‑machine architecture can no longer meet demand; high‑availability and load‑balancing become the norm.
1. Portal Site HA Solution
Typical tiered topology : client → CDN → load‑balancer layer → application layer → cache layer → database layer → storage layer.
Load‑balancing options : LVS + Keepalived (layer 4, handles massive concurrency), Nginx + Keepalived (layer 7, static‑dynamic separation), cloud SLB.
Data‑layer HA : MySQL uses master‑slave replication plus MHA/MMM automatic failover; Redis evolves from master‑slave to Sentinel (auto failover) to Redis Cluster (decentralised sharding).
2. Containerization and Kubernetes (K8s)
Master components : kube-apiserver (cluster entry), kube-controller-manager (resource control), kube-scheduler (pod scheduling), etcd (key‑value store).
Node components : kubelet (node agent), kube-proxy (network proxy), container runtime (containerd or CRI-O; Docker Engine needs the cri-dockerd adapter since Kubernetes 1.24 removed dockershim).
Storage mechanisms : PV (admin‑created persistent volume) and PVC (developer‑requested storage) supporting RWO (single‑node read/write), ROX (multi‑node read‑only), RWX (multi‑node read/write).
3. Enterprise‑grade CI/CD Automation
Standard pipeline : GitLab code push triggers WebHook → Jenkins builds, packages and tests → Ansible/SaltStack performs batch deployment → K8s/Docker runs containers, achieving full‑process automation and gray‑release.
4. Data Security, Backup and Performance Optimization
1. Data Backup and Disaster Recovery
MySQL backup strategy : For large data (>500 GB) use XtraBackup for hot physical backup; for smaller data use mysqldump for logical backup. Single‑table recovery can be done by extracting with sed/grep from a full dump.
Enterprise DR : Combine local DR (RAID, master‑slave, scheduled backups) with off‑site DR (cross‑datacenter sync, OSS remote storage) and conduct regular restore drills.
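The single-table recovery mentioned above can be sketched with awk instead of sed/grep. This assumes the dump contains the comment headers mysqldump writes by default ("-- Table structure for table `name`"); it would not work on a dump taken with comments disabled. The function name extract_table is hypothetical.

```shell
# Extract one table's DDL and data from a full mysqldump read on stdin.
# Assumption: the dump has the default "-- Table structure for table `x`" headers.
extract_table() {
    local table=$1
    awk -v start="-- Table structure for table \`${table}\`" '
        # Every table header switches printing on (wanted table) or off (any other).
        index($0, "-- Table structure for table `") == 1 { printing = ($0 == start) }
        printing { print }
    '
}
```

For example, extract_table orders < full.sql > orders.sql (file and table names are placeholders). For the last table in a dump, the output also carries the trailing footer statements, so review the result and restore it into a scratch database first.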
2. System Security Hardening
Basic hardening : Close unused ports, enforce SSH key login (disable password), change default ports, grant minimal sudo privileges.
Attack mitigation : Enable tcp_syncookies to defend against SYN floods, apply WAF/firewall rate-limiting, and configure Nginx anti-hotlinking, SQL-injection filters and HTTPS.
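Whether SYN cookies are active can be verified from the kernel's sysctl file. A minimal sketch for Linux follows; the function name check_syncookies is made up, and the optional path argument exists only so the check can be exercised against a test file.

```shell
# Report whether TCP SYN cookies are enabled (Linux).
# Kernel values: 0 = off, 1 = used when the SYN backlog overflows, 2 = always.
# check_syncookies is a hypothetical helper; the path argument is for testing.
check_syncookies() {
    local f=${1:-/proc/sys/net/ipv4/tcp_syncookies}
    [ -r "$f" ] || { echo "unknown"; return 2; }
    case $(cat "$f") in
        0) echo "disabled"; return 1 ;;
        *) echo "enabled" ;;
    esac
}
```

If the result is "disabled", sysctl -w net.ipv4.tcp_syncookies=1 turns the feature on for the running kernel; persisting it requires an entry in the sysctl configuration.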
3. Core Component Performance Tuning
Nginx : Use the epoll event model on Linux, match worker_processes to the CPU core count (worker_processes auto), turn on gzip, and configure static-file caching and anti-hotlinking.
MySQL : Use SSDs and large memory, increase innodb_buffer_pool_size, create appropriate indexes to avoid full table scans.
Elasticsearch : Deploy on SSD, keep the JVM heap below about 32 GB so compressed object pointers stay enabled, set reasonable shard/replica numbers, and use the bulk API for efficient writes.
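The heap limit is usually combined with the common guideline of giving the JVM no more than half of the machine's RAM. The helper below is only an arithmetic sketch of that rule of thumb; the function name es_heap_gb and the default cap of 31 GB are assumptions, not values from this article, and the result should be checked against the Elasticsearch version in use.

```shell
# Suggest an Elasticsearch heap size in GB: half of RAM, capped below 32 GB.
# Assumptions: whole-GB input; the 31 GB default cap is illustrative.
es_heap_gb() {
    local ram_gb=$1 cap_gb=${2:-31}
    local half=$(( ram_gb / 2 ))
    if [ "$half" -gt "$cap_gb" ]; then
        echo "$cap_gb"
    else
        echo "$half"
    fi
}
```

For a 16 GB node this suggests 8, and for a 128 GB node it stops at the cap, leaving the rest of the memory to the filesystem cache.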
4. Common Automation Script Scenarios
Log rotation : Script splits Nginx logs daily and reloads the service.
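Alive monitoring : Ping a list of IPs to check server status, or run SHOW SLAVE STATUS (SHOW REPLICA STATUS on MySQL 8.0.22 and later) to monitor MySQL replication and trigger alerts.
Periodic cleanup : Use find /path -name "*.png" -mtime +15 -delete to delete .png files last modified more than 15 days ago; run the same command with -print in place of -delete first to preview what will be removed.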
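As an illustration of the replication check, the sketch below parses the vertical output of SHOW SLAVE STATUS\G read on stdin. It assumes the classic field names (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master); the renamed Replica_* fields of newer MySQL versions would need the patterns adjusted. The function name replication_status and the 60-second default are hypothetical.

```shell
# Summarise MySQL replication health from `SHOW SLAVE STATUS\G` output on stdin.
# Exit status: 0 = OK, 1 = lagging or lag unknown, 2 = a replication thread is down.
# Empty input (server is not a replica) is also reported as critical.
replication_status() {
    local max_lag=${1:-60}
    awk -v max="$max_lag" '
        $1 == "Slave_IO_Running:"      { io  = $2 }
        $1 == "Slave_SQL_Running:"     { sql = $2 }
        $1 == "Seconds_Behind_Master:" { lag = $2 }
        END {
            if (io != "Yes" || sql != "Yes") {
                print "CRITICAL: replication threads not running"; exit 2
            }
            if (lag == "NULL" || lag + 0 > max + 0) {
                print "WARNING: lag " lag "s"; exit 1
            }
            print "OK: lag " lag "s"
        }'
}
```

A cron job could run mysql -e 'SHOW SLAVE STATUS\G' | replication_status 60 and send an alert whenever the exit status is non-zero; credentials and the alert command are left out here.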
Mastering this knowledge map enables operators to confidently tackle challenges ranging from single‑machine deployments to large‑scale clusters.
Xiao Liu Lab
An operations lab passionate about server tinkering 🔬 Sharing automation scripts, high-availability architecture, alert optimization, and incident reviews. Using technology to reduce overtime and experience to avoid major pitfalls. Follow me for easier, more reliable operations!
