Real‑World Ops Pitfalls and Proven Ways to Avoid Them
This article collects system administrators' practical experience with common operational pitfalls, their root causes, and concrete mitigation steps. The cases range from HAProxy timeouts missing a unit and risky rm commands to Ansible async quirks and cron-job failures.
Overview
System administrators frequently encounter configuration oversights, scripting errors, and process‑related issues that cause intermittent failures. Sharing concrete examples and fixes helps teams diagnose and prevent similar problems.
HAProxy timeout unit omission
In one HAProxy configuration, timeout values were written without a time unit (10 instead of 10s). HAProxy interprets a bare number as milliseconds, so an intended 10 seconds became 10 ms, which produced very short timeouts and intermittent connection failures. The fix is to give every timeout directive an explicit unit, such as 10s.
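A unit-less timeout is valid syntax, so HAProxy will not reject it; a small lint step can flag it before deployment. The following is a minimal sketch, not an official HAProxy tool. The function name find_unitless_timeouts and the sample configuration are made up for illustration, and the check only covers lines of the form "timeout <name> <value>".

```python
import re

# Matches "timeout <name> <value>" lines; name and value are captured.
TIMEOUT_RE = re.compile(r"^\s*timeout\s+(\S+)\s+(\S+)")


def find_unitless_timeouts(config_text):
    """Return (line_number, name, value) for each timeout given as a bare number."""
    findings = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        line = line.split("#", 1)[0]  # ignore trailing comments
        match = TIMEOUT_RE.match(line)
        if match and match.group(2).isdigit():
            findings.append((lineno, match.group(1), match.group(2)))
    return findings


# Hypothetical configuration fragment used only for this demonstration.
SAMPLE = """\
defaults
    timeout connect 10
    timeout client 30s
    timeout server 30000ms
"""

print(find_unitless_timeouts(SAMPLE))  # [(2, 'connect', '10')]
```

Each finding names the line to fix; here "timeout connect 10" would be read as 10 ms rather than 10 s.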
RHEL 5 yum authentication bug
On RHEL 5, unlike CentOS, the yum client requires authentication before it can be used. While diagnostics were being collected, sosreport was interrupted with Ctrl+C, and the server crashed. The source describes this as a known bug that was fixed in RHEL 6; on RHEL 5 the workaround is to let sosreport run to completion, or to upgrade to a newer release.
Dangerous rm usage
Using rm with a variable that may be empty can delete unintended directories: rm -rf $DIR/tmp removes /tmp when $DIR is empty or unset. Safe practices:
Never pass an unvalidated variable to rm; confirm it is set and non-empty first (in shell, "${DIR:?}" aborts the command when DIR is unset or empty).
Test commands in a non-production environment first.
Document every step and run destructive commands during off-peak hours.
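The empty-variable check can be sketched as a guard that runs before anything is deleted. This is an illustrative Python sketch, not a drop-in replacement for rm; the function name safe_rmtree and its specific rules (absolute base, not the filesystem root, target strictly inside the base) are assumptions chosen for the example.

```python
import os
import shutil
import tempfile


def safe_rmtree(base_dir, subdir):
    """Delete base_dir/subdir, refusing when base_dir is empty, relative, or root."""
    if not base_dir or not base_dir.strip():
        raise ValueError("base_dir is empty; refusing to delete")
    if not os.path.isabs(base_dir):
        raise ValueError("base_dir must be an absolute path")
    base = os.path.realpath(base_dir)
    if base == os.path.abspath(os.sep):
        raise ValueError("base_dir resolves to the filesystem root")
    target = os.path.realpath(os.path.join(base_dir, subdir))
    # The resolved target must sit strictly inside the resolved base.
    if target == base or os.path.commonpath([base, target]) != base:
        raise ValueError(f"{target} is not inside {base}")
    shutil.rmtree(target)
    return target


# Demonstration in a throwaway directory.
work = tempfile.mkdtemp()
os.mkdir(os.path.join(work, "tmp"))
safe_rmtree(work, "tmp")          # removes only <work>/tmp
try:
    safe_rmtree("", "tmp")        # the empty-variable case from the text
except ValueError as exc:
    print("refused:", exc)
shutil.rmtree(work)
```

The point of the design is that the failure mode becomes an error rather than a deletion: an empty base raises before any path is built.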
Missing cron job for data cleanup
A script that was supposed to purge old data on a schedule was never added to crontab. The table grew to millions of rows, queries against it turned into scans of the whole table, and performance degraded severely. The resolution is to add critical maintenance scripts to cron and then verify that the entry exists and the job actually runs.
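Verifying that the entry exists can be automated. The sketch below checks crontab text for an active (uncommented) line that references a given script; the function name and the paths under /opt/maint are hypothetical and exist only for illustration.

```python
def has_active_cron_entry(crontab_text, script_path):
    """Return True if a non-comment crontab line references script_path."""
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and commented-out entries
        if script_path in stripped.split():
            return True
    return False


# Hypothetical crontab content; the purge job is present but commented out.
SAMPLE_CRONTAB = """\
# m h dom mon dow command
#30 2 * * * /opt/maint/purge_old_rows.sh
0 3 * * * /opt/maint/rotate_logs.sh
"""

print(has_active_cron_entry(SAMPLE_CRONTAB, "/opt/maint/purge_old_rows.sh"))  # False
print(has_active_cron_entry(SAMPLE_CRONTAB, "/opt/maint/rotate_logs.sh"))     # True
```

In practice the text would come from the output of crontab -l for the relevant user. A check like this only confirms that the job is scheduled; confirming that it runs still requires looking at its logs or its effect on the data.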
Group membership update delay
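After a supplementary group was removed with usermod -G, the groups command still showed the old membership until the user logged out and back in. A session's group list is set when the session starts, so already-running sessions keep the old groups. The fix is to log in again (or start a new session) to pick up the updated group information.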
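The mismatch between a running session and the account database can be made visible by comparing the two. This is a rough sketch for Unix-like systems using Python's standard os, pwd, and grp modules; the helper names database_gids and stale_gids are invented for the example.

```python
import grp
import os
import pwd


def database_gids(username):
    """Group IDs the account database currently assigns to username."""
    primary = pwd.getpwnam(username).pw_gid
    supplementary = {g.gr_gid for g in grp.getgrall() if username in g.gr_mem}
    return supplementary | {primary}


def stale_gids(process_gids, db_gids):
    """Group IDs still held by the running session but absent from the database."""
    return set(process_gids) - set(db_gids)


if __name__ == "__main__":
    user = pwd.getpwuid(os.getuid()).pw_name
    leftover = stale_gids(os.getgroups(), database_gids(user))
    print("stale groups in this session:", sorted(leftover))
```

A non-empty result suggests the session predates a group change and needs a fresh login before the removal takes effect for that session.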
Service monitor cron race condition
A cron job monitored a service and restarted it whenever the process was missing. After the job was removed, a stale /etc/crontab entry was still being read, so the service came back to life after it had been terminated manually. The fix is to make the order of operations explicit: remove the cron entry first, confirm it is gone, and only then kill the process.
Ansible async task pitfalls
Pitfall 1 – In the reported case, the chdir parameter was ignored when the task ran with async. Set the working directory inside the command itself, or use a wrapper script.
Pitfall 2 – Nesting async tasks caused the inner task to be skipped in roughly 10% of runs (an estimate, not a measured rate). Avoid nesting; chain the tasks sequentially or combine the work into a single async task.
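For Pitfall 1, setting the working directory inside the command amounts to prefixing it with a cd. The sketch below builds such a command line with safe quoting, using Python's standard shlex module; the helper name command_with_workdir and the example paths are hypothetical. Because the result relies on "&&", it would have to be run through a shell rather than executed directly.

```python
import shlex


def command_with_workdir(workdir, argv):
    """Build a shell command line that changes directory before running argv."""
    quoted_cmd = " ".join(shlex.quote(arg) for arg in argv)
    return f"cd {shlex.quote(workdir)} && {quoted_cmd}"


# Hypothetical directory and script, including a space to show the quoting.
print(command_with_workdir("/opt/app releases", ["./deploy.sh", "--env", "prod"]))
# cd '/opt/app releases' && ./deploy.sh --env prod
```

A wrapper script achieves the same thing by doing the cd on its first line, which keeps the task definition itself free of shell syntax.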
Port binding issue
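Binding a public web service to port 87 leads browsers to block access. Browsers keep a list of restricted ports that they refuse to connect to, and 87 is on it, so the request fails before it ever reaches the server. Use standard ports (80 or 443) for public services. If a custom port is unavoidable, choose one that is not on the restricted list, since reconfiguring every client browser to allow the port is rarely practical for a public service.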
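A pre-deployment check can catch this kind of port choice early. The sketch below classifies a port against a deliberately partial list; KNOWN_RESTRICTED_PORTS is an illustrative subset, not the full list any browser ships, and the function name check_public_port is made up for the example.

```python
# Partial, illustrative subset of ports that major browsers refuse to connect to.
# The real lists are longer and are maintained by the browser vendors.
KNOWN_RESTRICTED_PORTS = frozenset({21, 22, 23, 25, 87, 110, 143, 6000})
STANDARD_WEB_PORTS = frozenset({80, 443})


def check_public_port(port):
    """Classify a port as 'standard', 'restricted', or 'custom'."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")
    if port in STANDARD_WEB_PORTS:
        return "standard"
    if port in KNOWN_RESTRICTED_PORTS:
        return "restricted"
    return "custom"


for candidate in (443, 87, 8080):
    print(candidate, check_public_port(candidate))
```

A "custom" result only means the port is not in this subset; it should still be checked against the current restricted list of the browsers that matter for the service.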
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
