
What Makes a Great Linux Ops Engineer? 13 Practical Principles

A developer‑turned‑ops manager shares a 13‑point guide to treating operations as resource‑centric engineering. It covers what ops is (and isn’t), writing reliable programs, abstracting resources, configuration management, monitoring, and practical tips for building stable, scalable Linux systems.

MaGe Linux Operations

Mar 11, 2015


1. What Operations Is Not

Operations is not menial labor, not a customer‑service role, and not a service that merely supports developers; it requires genuine collaboration.

2. What Operations Is

Operations serves the entire product. It ensures a sensible architecture and a stable system, and its one overriding responsibility is business continuity.

3. Why Write Programs?

Programs exist to fulfill specific functions, which involve acquiring, using, managing, and eventually releasing resources such as memory, CPU, disk, network, file descriptors, external APIs, caches, and databases.
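The acquire, use, release cycle described above can be sketched in Python with a context manager. This is a minimal illustration, not code from the original article; the helper names `open_fd` and `read_head` are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def open_fd(path):
    """Acquire a file descriptor, lend it to the caller, always release it."""
    fd = os.open(path, os.O_RDONLY)   # acquire
    try:
        yield fd                      # use
    finally:
        os.close(fd)                  # release, even if the caller raised


def read_head(path, size=64):
    """Read up to `size` bytes; the descriptor is closed on every exit path."""
    with open_fd(path) as fd:
        return os.read(fd, size)
```

The same shape applies to the other resources in the list (connections, locks, cache handles): the release step sits in a `finally` block so that an error during use cannot leak the resource.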

4. What Makes a Good Program?

Logically correct and resource‑efficient.

Bug‑free and does not exhaust machine resources.

Highly stable, never crashes unexpectedly.

Highly available, with an HA solution in place and no single point of failure.

Easily extensible by adding resources (CPU, memory, disk, machines) without complex migrations.

Easy to maintain, configure, deploy, and monitor.

5. How to Write Good Programs

Fewer lines of code mean fewer errors; simple logic, proper abstraction, layering, cohesion, decoupling, and resource isolation reduce mistakes. Stateless programs are easier to scale and provide HA. Simple configuration, rich logging, and state‑query capabilities aid operations.

6. What Is a System?

A system comprises network, machines, and programs organized into an architecture.

Machines should have single responsibilities; architectures should have simple data flows and service‑oriented components.

System design must consider change cost (scaling, adding/removing machines) rather than just current cost.

Operations should be simple enough for newcomers to adopt quickly.

7. Development and Operations Overlap

Less code often leads to more stability; reuse is key.

Operations can be less error‑prone because it focuses on resource management rather than complex business logic.

However, developers’ complex logic is part of the system that ops must support, creating inherent tension.

8. Understanding Your Resources

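When CPU usage is high, check whether the time is going to sys, user, or iowait, and whether a single core is saturated or the whole machine.

There is no universal threshold: tools such as top report per‑process CPU relative to a single core, so one program may be healthy at 90% while a multi‑threaded one is healthy at 350%. Know each program's normal range.

Load average is not always about CPU. On Linux it also counts tasks blocked in uninterruptible I/O, so memory exhaustion that pushes the machine into swap can raise load together with a spike in disk I/O.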
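As a rough sketch of the sys/user/iowait and per‑core checks above, the following Python reads the `cpu` lines of `/proc/stat` and turns two samples into percentages. The function names are hypothetical, and the field order assumed is the one documented for Linux `/proc/stat` (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_cpu_line(line):
    """Parse one 'cpu...' line from /proc/stat into named jiffy counters."""
    fields = line.split()
    names = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")
    values = [int(v) for v in fields[1:1 + len(names)]]
    return fields[0], dict(zip(names, values))


def cpu_breakdown(before, after):
    """Percent of elapsed CPU time per category between two samples."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values())
    if total == 0:
        return {k: 0.0 for k in delta}
    return {k: 100.0 * v / total for k, v in delta.items()}


def read_proc_stat(path="/proc/stat"):
    """Return {cpu_name: counters} for the aggregate 'cpu' and each 'cpuN'."""
    with open(path) as f:
        return dict(parse_cpu_line(line) for line in f if line.startswith("cpu"))
```

Calling `read_proc_stat()` twice, about a second apart, and passing the matching entries to `cpu_breakdown` gives the split for the whole machine (`cpu`) and for each core (`cpu0`, `cpu1`, ...). A high `iowait` share with a low `user` share points at storage or swap rather than at the program's own computation.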

9. Monitoring Resources Correctly

Beyond disk usage, monitor disk I/O capacity, RAID health, and disk failures; for network, monitor packet loss as well as bandwidth.
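For the network side, one local signal is the per‑interface drop and error counters in `/proc/net/dev`. The sketch below parses that file's text and computes deltas between two samples; the function names are hypothetical. Interface drops are only a proxy: measuring end‑to‑end packet loss would need active probing, which this example does not attempt.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into per-interface packet/drop/error counters."""
    stats = {}
    for line in text.splitlines()[2:]:          # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        f = [int(x) for x in data.split()]
        stats[name.strip()] = {
            "rx_packets": f[1], "rx_errs": f[2], "rx_drop": f[3],
            "tx_packets": f[9], "tx_errs": f[10], "tx_drop": f[11],
        }
    return stats


def counter_deltas(before, after):
    """Change in each counter for one interface between two samples."""
    return {k: after[k] - before[k] for k in before}
```

Bandwidth graphs can look healthy while `rx_drop` or `tx_errs` keeps climbing, which is why the counters deserve their own alert.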

10. Do All Resources Map to Hardware?

File descriptors, available ports, and process counts are resources with no direct hardware counterpart.

Routing tables, iptables, and cron jobs are also resources.

MySQL replication and third‑party REST APIs count as resources too.
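File descriptors are a convenient example of a non‑hardware resource that can still be measured against a limit. The sketch below, for Linux only, counts the current process's open descriptors via `/proc/self/fd` and compares them with the soft `RLIMIT_NOFILE` limit; the function name `fd_headroom` is hypothetical.

```python
import os
import resource


def fd_headroom():
    """Return (open_fds, soft_limit, ratio) for the current process."""
    # Listing the directory briefly opens one descriptor of its own,
    # so the count may be one higher than the steady state.
    used = len(os.listdir("/proc/self/fd"))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return used, None, None
    return used, soft, used / soft
```

A ratio creeping toward 1.0 over days usually indicates a descriptor leak, long before any hardware metric moves.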

11. Why Abstract Everything as Resources?

Linux treats everything as a file; abstracting all ops objects as resources enables uniform management (configuration, monitoring) and simplifies onboarding new machines.
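One way to picture the uniform management this enables: if every object exposes the same small interface, a single loop can check all of them. The following is a minimal sketch under that assumption; the `Resource` class, its fields, and the 80% warning threshold are illustrative choices, not something the original prescribes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Resource:
    name: str
    probe: Callable[[], float]   # returns current usage, in the same unit as limit
    limit: float
    warn_at: float = 0.8         # assumed warning threshold (fraction of limit)

    def status(self):
        """Classify current usage as ok, warn, or critical."""
        used = self.probe()
        ratio = used / self.limit
        if ratio < self.warn_at:
            level = "ok"
        elif ratio < 1.0:
            level = "warn"
        else:
            level = "critical"
        return {"name": self.name, "used": used, "limit": self.limit, "level": level}


def check_all(resources):
    """Run the same check over every resource, whatever it represents."""
    return [r.status() for r in resources]
```

Disk space, file descriptors, connection slots, or a third‑party API quota then differ only in the `probe` function supplied, and adding a new machine means registering its resources rather than writing new checks.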

12. Operations Principles

All online changes must go through configuration management; the live system should be read‑only for humans.

Before deployment, ask how to ensure HA, scalability, and operability/monitoring; if unanswered, delay the release.

Isolate complexity via abstraction, layering, and resource pools (e.g., cache pool, DB pool) managed by tools like Puppet.

Solve the immediate problem first, then design to prevent recurrence.

Never make the same mistake three times; a third occurrence points to a systemic problem rather than an accident.

Continuously look for ways to “work smarter,” reducing manual effort and increasing stability.

13. How Configuration Management Handles Resources

Packages: all software/scripts are installed via package managers (e.g., rpm).

Files: persistent changes are expressed as configuration files (sysctl, iptables, routes, cron, etc.).

Processes: services are started and managed through configuration files or init scripts.

When every aspect of a system can be expressed through these three abstractions and managed by configuration tools, scaling, upgrading, and lifecycle operations become trivial and less error‑prone.
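To make the three abstractions concrete, here is a small Python model of packages, files, and services as declarative values, with a `plan` function that lists what still differs from the desired state. This is only a sketch of the idea; real tools such as Puppet have their own resource model, and every name below (`Package`, `ConfigFile`, `Service`, `plan`, `myapp`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    name: str
    version: str


@dataclass(frozen=True)
class ConfigFile:
    path: str
    content: str


@dataclass(frozen=True)
class Service:
    name: str
    running: bool = True


def plan(desired, actual):
    """Return the resources in `desired` that are missing or differ in `actual`."""
    current = set(actual)   # frozen dataclasses are hashable and compare by value
    return [r for r in desired if r not in current]


# Hypothetical desired state for one machine role.
DESIRED = [
    Package("myapp", "2.0"),
    ConfigFile("/etc/sysctl.d/99-myapp.conf", "vm.swappiness = 10\n"),
    Service("myapp"),
]
```

Because the desired state is plain data, applying it is idempotent: once the machine matches, `plan` returns an empty list, and bringing up a tenth machine uses the same description as the first.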

Monitoring

Both correctness and business response time must be monitored.

Comprehensive baseline monitoring is essential, even if alerts aren’t real‑time; it helps uncover subtle issues like swap‑induced latency.
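A check that covers both points, correctness and response time, can be as small as the sketch below. The `probe` function and its parameters are hypothetical; `call` stands for any zero‑argument function that exercises the business path being monitored.

```python
import time


def probe(call, expected, max_seconds):
    """Run `call`, then report whether the result was right and fast enough."""
    start = time.monotonic()
    try:
        result = call()
        error = None
    except Exception as exc:          # a failure counts as an incorrect result
        result, error = None, repr(exc)
    elapsed = time.monotonic() - start
    return {
        "correct": error is None and result == expected,
        "fast_enough": elapsed <= max_seconds,
        "elapsed": elapsed,
        "error": error,
    }
```

Recording `elapsed` on every run, not only when a threshold is crossed, is what builds the baseline that later reveals slow drifts such as swap‑induced latency.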

Ops Tips

Reinstall the OS and apply Puppet configuration to guarantee a known good state.

Separate stateless from stateful machines; centralize state to simplify management, and prefer stateless designs for easier ops.
