
Real‑World Ops Pitfalls and Proven Ways to Avoid Them

This article compiles practical experiences from system administrators about common operational pitfalls, their root causes, and concrete mitigation steps, ranging from misconfigured HAProxy timeouts and risky rm commands to Ansible async quirks and cron‑job failures.

ITPUB

Overview

System administrators frequently encounter configuration oversights, scripting errors, and process‑related issues that cause intermittent failures. Sharing concrete examples and fixes helps teams diagnose and prevent similar problems.

HAProxy timeout unit omission

In an HAProxy configuration, the timeout values were specified without a time unit (e.g., 10 instead of 10s). HAProxy interprets a bare number as milliseconds, so a value intended as 10 seconds became 10 milliseconds, which produced very short timeouts and intermittent connection failures. The fix is to append the appropriate unit, such as 10s, to every timeout directive.
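
A quick way to catch this before a reload is to scan the configuration for timeout directives whose value is a bare number. The sketch below is one possible check, not an official HAProxy tool; the function name find_unitless_timeouts and the file path in the usage comment are assumptions made for illustration.

```shell
# Print (with line numbers) every "timeout <name> <value>" directive whose
# value is a bare integer, which HAProxy reads as milliseconds.
# $1: path to a HAProxy configuration file.
find_unitless_timeouts() {
  grep -nE '^[[:space:]]*timeout[[:space:]]+[a-z-]+[[:space:]]+[0-9]+[[:space:]]*(#.*)?$' "$1"
}

# Example (hypothetical path):
#   find_unitless_timeouts /etc/haproxy/haproxy.cfg
```

A bare number is still valid syntax, so a match is not necessarily a mistake; the output is a list of lines worth reviewing by hand.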

RHEL 5 yum authentication bug

Two separate problems surfaced on RHEL 5. First, unlike on CentOS, the yum client requires authentication. Second, interrupting sosreport with Ctrl+C while it was collecting diagnostics crashed the server. The crash was a known bug that was fixed in RHEL 6; on RHEL 5 the workaround is to let sosreport run to completion, or to upgrade to a newer release.

Dangerous rm usage

Using rm with a variable that may be empty can delete unintended directories (e.g., rm -rf $DIR/tmp when $DIR is empty removes /tmp). The safe practice is to:

Never pass an unvalidated variable to rm; confirm that it is set and non‑empty first.

Test commands in a non‑production environment first.

Document every step and run destructive commands during off‑peak hours.
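
The variable check can be enforced in the script itself. The following is a minimal sketch, assuming a cleanup helper named safe_clean_tmp (a hypothetical name): it refuses to run when the base directory is empty or missing, and uses the shell's ${var:?} expansion as a second guard, which aborts if the variable is unset or empty.

```shell
# Remove "<base>/tmp" only when <base> is a non-empty path to an existing directory.
# $1: base directory (the role $DIR plays in the example above).
safe_clean_tmp() {
  if [ -z "${1:-}" ] || [ ! -d "$1" ]; then
    echo "refusing to remove: base directory is empty or not a directory" >&2
    return 1
  fi
  target_base=$1
  # ${target_base:?} makes the shell abort instead of expanding to "/tmp"
  # if the variable is ever empty at this point.
  rm -rf -- "${target_base:?}/tmp"
}
```

With this guard, calling safe_clean_tmp "" prints an error and returns a non-zero status instead of touching /tmp.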

Missing cron job for data cleanup

A scheduled script that was supposed to purge old data was never added to crontab. The table it should have trimmed held millions of rows, and queries that scanned it caused severe performance degradation. The resolution is to add critical maintenance scripts to cron and then verify that the entry is actually installed.
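
The verification step can be automated. As a rough sketch, the helper below checks crontab text for an active (uncommented) line that mentions a given script. The name has_cron_entry and the script path in the usage comment are illustrative and do not come from the original incident.

```shell
# Succeed when the crontab text on stdin contains a non-comment line
# that includes the string given as $1.
has_cron_entry() {
  grep -v '^[[:space:]]*#' | grep -Fq -- "$1"
}

# Usage (hypothetical script path):
#   crontab -l | has_cron_entry /opt/scripts/purge_old_data.sh \
#     || echo "purge job is missing from crontab" >&2
```

Reading from stdin keeps the check usable for a user crontab (crontab -l) as well as for files such as /etc/crontab.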

Group membership update delay

After an auxiliary group was removed with usermod -G, the groups command still showed the old membership until the user logged out and back in. Group membership is assigned to a session at login, so existing sessions keep the old list. The fix is to log in again (or start a new session) to refresh group information.
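
To see whether a session is still carrying old memberships, compare the groups of the current process with those recorded in the account database. The sketch below assumes a hypothetical helper named stale_groups; id -Gn with no user argument reports the running process's groups, while id -Gn with a user name looks that user up afresh.

```shell
# Print each group in the first list that is absent from the second list.
# $1: space-separated groups held by the current session
# $2: space-separated groups recorded in the account database
stale_groups() {
  for g in $1; do
    case " $2 " in
      *" $g "*) ;;                 # still a member
      *) printf '%s\n' "$g" ;;     # held by the session only
    esac
  done
}

# Usage: any output means the session predates the change and needs a fresh login.
#   stale_groups "$(id -Gn)" "$(id -Gn "$(id -un)")"
```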

Service monitor cron race condition

A cron job monitored a service and restarted it whenever it was missing. After the job's entry was removed from /etc/crontab, the stale entry was still being acted on, so the service was resurrected after it had been terminated manually. Adding an explicit step to kill the process after removing the entry prevents the race condition.

Ansible async task pitfalls

Pitfall 1 – The chdir parameter is ignored in an async task; specify the working directory inside the command or use a wrapper script.

Pitfall 2 – Nesting async tasks can cause the inner task to be skipped, with an estimated probability of roughly 10%. Avoid nesting; instead, chain tasks sequentially or use a single async block.
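
For Pitfall 1, the wrapper-script route can be as small as the sketch below. The function name run_in_dir and the paths in the usage comment are hypothetical; the idea is simply that the directory change happens inside the command that Ansible launches, so it does not depend on chdir being honored.

```shell
# Run a command from a given working directory.
# $1: directory to switch to; remaining arguments: the command and its arguments.
run_in_dir() {
  workdir=$1
  shift
  # The subshell keeps the caller's own working directory unchanged.
  ( cd "$workdir" && "$@" )
}

# Usage from an async task's command line (hypothetical paths):
#   run_in_dir /opt/app ./long_running_job.sh
```

If the directory does not exist, cd fails and the command is never started, so the task reports a failure instead of running in the wrong place.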

Port binding issue

Binding a public web service to port 87 led browsers to block access: browsers refuse connections to a set of ports they treat as unsafe, and 87 is one of them. Use standard ports (e.g., 80 or 443) for public services, or configure the browser to allow the custom port.
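
A deployment script can reject such ports up front. The check below is a sketch only: is_browser_blocked_port is a hypothetical helper name, and the port list is a small, non-exhaustive excerpt of the ports browsers commonly refuse (87 comes from the incident above; the others are well-known service ports on the same restricted list).

```shell
# Succeed when the port is on the (partial, illustrative) list of ports
# that browsers refuse to connect to.
# $1: TCP port number
is_browser_blocked_port() {
  case "$1" in
    21|22|23|25|87|110|143|6000|6667) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if is_browser_blocked_port 87; then
#     echo "port 87 will be blocked by browsers; pick another port" >&2
#   fi
```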
