10 Critical Server Ops Mistakes to Avoid: Real-World Lessons
This article outlines ten critical server operation mistakes, from forced power cuts to neglected updates, each illustrated with a real-world incident and practical advice. The goal is to help engineers adopt safer practices, proper backups, secure configurations, and effective monitoring that prevent costly outages.
1. Forced Power Off
Forcefully cutting power can corrupt file systems, discard in-memory data, and lose writes still held in a RAID controller's cache. Shut down gracefully instead, with a command such as shutdown -h now, so that services stop and buffers are flushed to disk first.
Case: To fix a fault quickly, a logistics company's ops staff pulled a server's power plug. The result was 200,000 orders thrown into disarray and a costly recovery.
2. Experimenting in Production
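Running ad hoc destructive commands (e.g., rm -rf) on production servers can delete critical files and crash services. An alias such as alias rm='rm -i' adds a confirmation prompt, but it is only a weak safeguard: an explicit -f, as in rm -rf, overrides -i, and aliases are usually not expanded inside scripts.
Case: A developer executed rm -rf ./tmp/* in production. A symlink pointed to the root directory, so the command deleted system files and caused a 72-hour outage.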
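The symlink incident above suggests checking where a path really points before deleting anything under it. The following is a minimal sketch of that idea, not a drop-in replacement for rm; the function name clean_directory and the allowed_base parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path
import shutil


def clean_directory(target: str, allowed_base: str) -> list[str]:
    """Delete the contents of `target`, but only if it is a real directory
    located strictly inside `allowed_base`. Returns the names removed."""
    target_path = Path(target)
    base = Path(allowed_base).resolve()

    # Refuse a symlinked target: "tmp -> /" would otherwise expand to "/*".
    if target_path.is_symlink():
        raise ValueError(f"refusing to clean a symlink: {target_path}")

    resolved = target_path.resolve()
    # The resolved path must sit below the allowed base, never equal to it.
    if base not in resolved.parents:
        raise ValueError(f"{resolved} is not inside {base}")
    if not resolved.is_dir():
        raise ValueError(f"{resolved} is not a directory")

    removed = []
    for entry in sorted(resolved.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            # Files and symlinks: remove the entry itself, never its target.
            entry.unlink()
        removed.append(entry.name)
    return removed
```

A call such as clean_directory("./tmp", "/srv/app") (both paths are placeholders) raises ValueError instead of deleting anything when ./tmp is a symlink or resolves outside /srv/app.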
3. Ignoring Firewall Rule Management
Clearing firewall rules or disabling the firewall exposes servers to threats. Always back up existing rules before making changes.
Lesson: An ops engineer disabled the firewall for convenience, and the server was then infected with ransomware that encrypted its data.
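Backing up rules before a change can be as simple as saving the output of the rule-dump command to a timestamped file. The sketch below assumes an iptables-based host, where iptables-save prints the current rules to standard output (root is normally required); the function name and file naming scheme are hypothetical, and the dump command is a parameter so that another tool can be substituted.

```python
import subprocess
import time
from pathlib import Path


def backup_firewall_rules(dest_dir: str, dump_cmd=("iptables-save",)) -> Path:
    """Write the output of `dump_cmd` to a timestamped file and return its path."""
    result = subprocess.run(
        list(dump_cmd), capture_output=True, text=True, check=True
    )
    # An empty dump usually means something went wrong; do not call it a backup.
    if not result.stdout.strip():
        raise RuntimeError("rule dump was empty; refusing to treat it as a backup")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    out = dest / f"firewall-{time.strftime('%Y%m%d-%H%M%S')}.rules"
    out.write_text(result.stdout)
    return out
```

On an iptables host, a file saved this way can be fed back with iptables-restore if the change goes wrong. On nftables systems, passing dump_cmd=("nft", "list", "ruleset") is one possible substitution.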
4. Running Unknown Scripts with Root
Executing third-party scripts as root gives any malicious code they contain full control of the system. Review scripts before running them and execute them with reduced privileges whenever possible.
Case: A company's server ran an unreviewed third-party script and was turned into a cryptocurrency-mining bot.
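One way to make sure the script that runs is the same one that was reviewed is to pin its SHA-256 digest at review time and check it before execution. This is a sketch of that idea combined with a refusal to proceed as root; the function names are hypothetical, and os.geteuid is available on Unix-like systems only.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_reviewed_script(path: str, reviewed_sha256: str, euid=None) -> None:
    """Raise unless the script matches the reviewed digest and we are not root.

    `euid` defaults to the current effective user ID; it is a parameter
    only so the check can be exercised without actually being root.
    """
    if euid is None:
        euid = os.geteuid()
    if euid == 0:
        raise PermissionError("refusing to run a third-party script as root")

    actual = sha256_of(path)
    if actual != reviewed_sha256.lower():
        raise ValueError(
            f"{path} changed since review: expected {reviewed_sha256}, got {actual}"
        )
```

A checksum only proves that the file is unchanged since someone read it. It does not replace the review itself.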
5. Modifying Databases Without Backups
Altering database schemas or data without a backup can cause irreversible loss. Always create backup tables before making changes.
Case: A DBA changed a table structure without backup, resulting in severe data loss and a painful recovery process.
Summary: Implement appropriate backup strategies, choose reliable backup tools, and automate backups with scripts.
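The "backup table first" habit can be sketched with Python's built-in sqlite3 module. The table names below are hypothetical, and SQLite stands in for whatever database is actually in use; the CREATE TABLE ... AS SELECT statement is widely supported, but it copies rows only, not indexes or constraints, so it complements a full backup rather than replacing one.

```python
import sqlite3


def backup_table(conn: sqlite3.Connection, table: str, backup: str) -> int:
    """Copy every row of `table` into a new table `backup`.

    Returns the number of rows copied and raises if the counts differ.
    """
    # Table names cannot be bound as SQL parameters, so validate them strictly.
    for name in (table, backup):
        if not name.isidentifier():
            raise ValueError(f"unsafe table name: {name!r}")

    conn.execute(f'CREATE TABLE "{backup}" AS SELECT * FROM "{table}"')
    src = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    dst = conn.execute(f'SELECT COUNT(*) FROM "{backup}"').fetchone()[0]
    if src != dst:
        raise RuntimeError(f"backup incomplete: {dst} of {src} rows copied")
    conn.commit()
    return dst
```

Calling backup_table(conn, "orders", "orders_backup") before a schema change leaves a copy of the rows to restore from if the change goes wrong.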
6. Misconfiguring SSH Security
Poor SSH settings, such as weak passwords or leaving password authentication enabled, invite brute-force attacks. Disable password login and enable key-based authentication.
Case: Weak SSH credentials allowed attackers to hijack a server for cryptocurrency mining.
Best practice: Disable remote root login and use key-pair authentication. Changing the default port also cuts down on automated scanning, but it does not replace either measure.
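The settings above map to the sshd_config keywords PasswordAuthentication, PermitRootLogin, and PubkeyAuthentication. The following sketch audits the text of a config file for them. It is deliberately simplified and the function name is hypothetical: it ignores Match blocks, Include directives, and the keyword=value form, and it flags any keyword that is not explicitly set to the hardened value.

```python
def audit_sshd_config(text: str) -> list[str]:
    """Return findings for a few risky settings in sshd_config text."""
    seen: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        # sshd uses the first value it finds for each keyword.
        seen.setdefault(key, value)

    findings = []
    if seen.get("passwordauthentication") != "no":
        findings.append("PasswordAuthentication is not explicitly 'no'")
    if seen.get("permitrootlogin") != "no":
        findings.append("PermitRootLogin is not explicitly 'no'")
    if seen.get("pubkeyauthentication", "yes") != "yes":
        findings.append("PubkeyAuthentication is disabled")
    return findings
```

For an authoritative view of the effective configuration, sshd -T prints the settings the daemon would actually use. A text audit like this is best treated as a quick pre-check.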
7. Neglecting Log Management
Without proper log handling, logs can grow until they fill the disk, or critical information can be lost. Configure automatic log rotation and retention policies.
Case: A large Kafka cluster suffered a log‑burst, crippling the system.
Experience: Implement log collection, storage, analysis, and real‑time alerts to avoid missing key events.
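At the application level, size-based rotation can be sketched with the standard library's RotatingFileHandler, which caps each log file at maxBytes and keeps at most backupCount older files. The helper name and default limits below are illustrative assumptions, and system-wide rotation is normally handled by a dedicated tool instead.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(
    path: str, max_bytes: int = 1_000_000, backups: int = 5
) -> logging.Logger:
    """Return a logger whose file is rotated once it reaches `max_bytes`.

    Disk usage is bounded by roughly max_bytes * (backups + 1).
    """
    logger = logging.getLogger(f"rotating:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backups
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With max_bytes=200 and backups=2, for example, the logger never keeps more than app.log, app.log.1, and app.log.2, however much is written.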
8. Exposing Service Ports Unnecessarily
Using default ports or failing to restrict access can let attackers exploit services.
Case: An exposed Redis port allowed malicious actors to wipe data.
Advice: Minimize open ports, use CDNs or proxy services, and deploy IDS/IPS to monitor abnormal traffic.
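A quick way to see which ports actually answer from a given vantage point is a plain TCP connect check. The sketch below uses only the standard socket module; the function names and the allow-list idea are assumptions for illustration, and it should only be pointed at hosts you are authorized to test.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unexpected_open_ports(host: str, ports, allowed) -> list[int]:
    """Return the ports from `ports` that accept connections but are not allowed."""
    allowed = set(allowed)
    return [p for p in ports if p not in allowed and is_port_open(host, p)]
```

Run from outside the network against a server's public address, unexpected_open_ports(host, [22, 80, 443, 6379], allowed=[80, 443]) would report port 6379, the Redis default, if it is reachable there.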
9. Lack of Monitoring During Changes
Failing to monitor systems during upgrades or changes can let issues go unnoticed.
Case: An unsupervised night-time upgrade triggered a cascading failure across services that lasted several hours.
Experience: Enforce strict change procedures, perform risk assessments, and limit emergency changes to maintain stability.
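Watching a system after a change can be partly automated: poll a health check for a while and roll back when it keeps failing. The sketch below shows one possible shape for such a loop. The check and rollback callables are hypothetical hooks supplied by the caller, and the default thresholds are arbitrary.

```python
import time
from typing import Callable


def watch_after_change(
    check: Callable[[], bool],
    rollback: Callable[[], None],
    rounds: int = 10,
    interval: float = 30.0,
    max_failures: int = 3,
) -> bool:
    """Poll `check` after a change; roll back on repeated failure.

    Calls `rollback` and returns False after `max_failures` consecutive
    failed checks. Returns True if all `rounds` finish without that.
    """
    failures = 0
    for _ in range(rounds):
        try:
            healthy = check()
        except Exception:
            healthy = False  # a crashing check counts as a failed check
        failures = 0 if healthy else failures + 1
        if failures >= max_failures:
            rollback()
            return False
        time.sleep(interval)
    return True
```

Requiring several consecutive failures avoids rolling back on a single transient error, at the cost of reacting a little later.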
10. Ignoring System Updates and Patch Management
Delaying updates leaves known vulnerabilities open to exploitation.
Lesson: A company ignored patches and fell victim to the Log4j vulnerability, resulting in data leakage and system compromise.
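Patch management starts with knowing which installed versions fall below a required minimum. The sketch below compares dotted numeric versions as integer tuples, since comparing them as strings would rank "1.10.0" below "1.9.0". The package names and version numbers in the usage note are hypothetical, and version strings with non-numeric parts would need a real version parser.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '2.14.1' into (2, 14, 1)."""
    return tuple(int(part) for part in version.split("."))


def outdated(
    installed: dict[str, str], minimum: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return {package: (installed, required)} for packages below the minimum."""
    report = {}
    for name, required in minimum.items():
        current = installed.get(name)
        if current is None:
            continue  # package not installed, nothing to patch
        if parse_version(current) < parse_version(required):
            report[name] = (current, required)
    return report
```

With installed = {"examplelib": "2.14.1", "otherlib": "1.10.0"} and minimum = {"examplelib": "2.17.1", "otherlib": "1.9.0"}, only examplelib is reported as needing an update.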
These ten mistakes and the incidents behind them show that strict operational discipline is essential to preventing system failures and security incidents.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.