20 Critical Server Ops Mistakes to Avoid: Real Cases & Fixes

Drawing from over 500 enterprise server failure incidents, this guide outlines twenty absolutely prohibited server actions across security configuration, system operation, data management, and architecture design, each paired with a real-world case, risk rating, and concrete remediation steps.

dbaplus Community

Apr 14, 2025

1. Security Configuration Taboo (5 items)

Taboo 1: Use weak passwords or default accounts (CVE‑2023‑12345)

Risk Level: ★★★★★

Case: In 2022, a government cloud platform kept the default "admin:admin" account, which attackers brute-forced, leaking 10 TB of sensitive data.

Solution:

Enable password complexity (minimum 16 characters, at least three character types).

Deploy a centralized LDAP authentication system.

Disable default accounts (e.g., usermod -L admin).
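The complexity rule above (at least 16 characters and three character types) can be sketched as a small check. This is a minimal illustration, not a replacement for PAM or LDAP policy enforcement; the function name and the four character classes chosen here are assumptions.

```python
import string

MIN_LENGTH = 16          # from the policy above
MIN_CHARACTER_TYPES = 3  # from the policy above


def meets_policy(password: str) -> bool:
    """Return True if the password satisfies the length and character-type rules."""
    # Assumed character classes: lowercase, uppercase, digits, punctuation.
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(c in string.punctuation for c in password),
    ]
    return len(password) >= MIN_LENGTH and sum(classes) >= MIN_CHARACTER_TYPES


if __name__ == "__main__":
    for candidate in ("admin", "correct-Horse-battery-9"):
        print(candidate, meets_policy(candidate))
```

In practice the same rule would normally be enforced centrally (for example in the directory service) rather than in application code.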

Taboo 2: Fail to apply security patches promptly

Risk Level: ★★★★☆

Case: An e-commerce platform did not patch an Apache Struts vulnerability (CVE-2017-5638), allowing attackers to implant a cryptominer.

Solution:

Configure automatic updates: yum‑cron (CentOS) or unattended‑upgrades (Ubuntu).

Set up a sandbox environment for patch testing.

Use vulnerability scanners such as Nessus or OpenVAS.

Taboo 3: Expose unnecessary high‑risk ports

Risk Level: ★★★★★

Case: Public exposure of Redis port 6379 led to ransomware infection.

Solution:

Adopt a minimal‑exposure port policy.

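Configure security-group or host-firewall rules that admit management traffic only from trusted subnets and drop everything else on sensitive ports, e.g.:

iptables -A INPUT -p tcp --dport 22 -s 192.168.1.0/24 -j ACCEPT

iptables -A INPUT -p tcp --dport 22 -j DROP

iptables -A INPUT -p tcp --dport 6379 -j DROP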

Enable port‑knocking techniques.
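One way to verify a minimal-exposure policy is to probe a host from outside and list which TCP ports actually accept connections. The sketch below assumes a plain TCP connect test is sufficient; the function name and the host in the usage example are placeholders, and it should only be run against systems you are authorized to test.

```python
import socket


def reachable_ports(host: str, ports: list[int], timeout: float = 0.5) -> list[int]:
    """Return the subset of ports that accept a TCP connection on host."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


if __name__ == "__main__":
    # 22 (SSH) and 6379 (Redis) come from the text; the host is a placeholder.
    print(reachable_ports("127.0.0.1", [22, 443, 6379]))
```

Any port in the result that is not on the approved list is a candidate for a firewall rule or for binding the service to a private interface.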

Taboo 4: Use expired or mis‑configured SSL certificates

Risk Level: ★★★☆☆

Case: A bank’s API service was down for 12 hours due to an expired certificate.

Solution:

Deploy automated certificate management (e.g., Certbot).

Enable OCSP stapling (e.g., ssl_stapling on; in Nginx).

Set up certificate expiry alerts via monitoring tools like Zabbix.
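An expiry alert boils down to comparing a certificate's notAfter timestamp with the current time. The following sketch assumes the timestamp is already available in the text format that Python's ssl module reports; the function name is hypothetical, and a monitoring system such as Zabbix would normally perform this check itself.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires - now.timestamp()) / 86400


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    left = days_until_expiry("Jan  5 09:34:43 2018 GMT", now)
    print("expired" if left < 0 else f"{left:.1f} days left")
```

Alerting well before the value reaches zero leaves time to renew manually if automated renewal fails.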

Taboo 5: No two‑factor authentication (2FA)

Risk Level: ★★★★☆

Case: An ops engineer’s GitHub account was hijacked; the SSH keys it exposed were then used to break into production servers.

Solution:

Deploy Google Authenticator through PAM (the pam_google_authenticator.so module).

Use hardware tokens such as YubiKey.

Integrate biometric access controls where feasible.

2. System Operation Taboo (5 items)

Taboo 6: Abuse of root privileges

Risk Level: ★★★★☆

Case: An engineer mistakenly ran chmod -R 777 /, causing chaotic permissions.

Solution:

Create tiered privilege accounts (e.g., group sysadmin with ID 2000).

Define fine‑grained sudo policies, e.g.:

%sysadmin ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nginx

Taboo 7: Execute unknown‑source scripts directly

Risk Level: ★★★★★

Case: A third‑party “optimization” script triggered rm -rf /* in production.

Solution:

Establish a script review workflow.

Test scripts in Docker sandboxes, e.g.:

docker run --rm -v "$(pwd)":/script:ro alpine sh -c "apk add --no-cache bash && bash /script/demo.sh"

Timestamp shell history to support auditing (export HISTTIMEFORMAT="%F %T ").
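Part of a review workflow can be automated by flagging obviously destructive commands before a human reads the script. The sketch below is only a first-pass filter under that assumption: the pattern list is illustrative (the first two patterns mirror the cases in this article), the names are hypothetical, and a clean result does not mean a script is safe.

```python
import re

# Illustrative patterns only; a real review needs far more than a regex pass.
RISKY_PATTERNS = {
    "recursive delete from root": re.compile(r"\brm\s+-rf\s+/(\*|\s|$)"),
    "world-writable recursive chmod": re.compile(r"\bchmod\s+-R\s+777\b"),
    "download piped to shell": re.compile(r"\b(curl|wget)\b.*\|\s*(ba)?sh\b"),
}


def flag_risky_lines(script_text: str) -> list[tuple[int, str]]:
    """Return (line number, label) pairs for lines matching a risky pattern."""
    findings = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        for label, pattern in RISKY_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


if __name__ == "__main__":
    sample = "echo tuning\nrm -rf /*\nchmod -R 777 /\nrm -rf /tmp/cache\n"
    for number, label in flag_risky_lines(sample):
        print(f"line {number}: {label}")
```

Scripts that pass this filter should still go through manual review and a sandboxed test run.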

Taboo 8: Debug directly in production

Risk Level: ★★★☆☆

Case: Unvalidated SQL executed on a production database caused transaction locks.

Solution:

Build a pre‑production mirror environment.

Use SQL review tools (e.g., Yearning or Archery).

Enable database audit plugins (e.g., MySQL Audit Plugin).

Taboo 9: Unplanned service restarts

Risk Level: ★★★☆☆

Case: Restarting load balancers during peak traffic caused a service avalanche.

Solution:

Define change windows (e.g., every second Thursday 00:00‑02:00).

Prefer rolling or blue-green deployments over hard restarts. In Kubernetes, for example, a rolling restart replaces pods gradually:

kubectl rollout restart deployment/nginx -n prod

Configure health‑check probes for services.
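A change window can be enforced in tooling by refusing to run restart jobs outside it. The sketch below assumes "every second Thursday" means the second Thursday of each month and that times are local server time; both are interpretations, and the function name is hypothetical.

```python
from datetime import datetime


def in_change_window(moment: datetime) -> bool:
    """True on the second Thursday of the month between 00:00 and 02:00."""
    is_thursday = moment.weekday() == 3      # Monday is 0, so Thursday is 3
    is_second_week = 8 <= moment.day <= 14   # second occurrence of any weekday
    return is_thursday and is_second_week and 0 <= moment.hour < 2


if __name__ == "__main__":
    print(in_change_window(datetime.now()))
```

A deployment pipeline could call such a check as a gate and require an explicit override for emergency changes.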

Taboo 10: No storage‑space monitoring

Risk Level: ★★★★☆

Case: Log files filled the disk, crashing the database.

Solution:

Deploy Prometheus alert rules, e.g.:

- alert: DiskSpaceCritical
  expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 90

Configure log rotation under /etc/logrotate.d/ and verify it with a forced run (logrotate -f /etc/logrotate.d/nginx).
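The arithmetic in the alert expression above can be reproduced on a single host for a quick manual check. This sketch uses the same 90% threshold; the function names are hypothetical, and it complements rather than replaces Prometheus alerting.

```python
import shutil

CRITICAL_PERCENT = 90  # same threshold as the alert rule above


def percent_used(free_bytes: int, total_bytes: int) -> float:
    """Mirror the alert expression: 100 - (free / size * 100)."""
    return 100 - (free_bytes / total_bytes * 100)


def is_critical(path: str = "/") -> bool:
    """True if the filesystem holding `path` is above the critical threshold."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.free, usage.total) > CRITICAL_PERCENT


if __name__ == "__main__":
    print("critical" if is_critical("/") else "ok")
```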

3. Data Management Taboo (5 items)

Taboo 11: No effective backup strategy

Risk Level: ★★★★★

Case: RAID failure without backups resulted in total data loss.

Solution:

Implement the 3-2-1 backup rule: three copies of the data, on two different media types, with one copy off-site.

Use BorgBackup for incremental backups:

borg create /backup::'{hostname}-{now}' /data --stats

Conduct regular restore drills.
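A restore drill is only useful if the restored data is compared with the source. One simple approach, sketched below, is to compare SHA-256 digests of every file in both directory trees. The function names are hypothetical, and the sketch assumes files small enough to read into memory and a source that has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_matches(source: Path, restored: Path) -> bool:
    """True if both trees contain the same files with identical content."""
    return tree_digests(source) == tree_digests(restored)


if __name__ == "__main__":
    # Placeholder paths for illustration; /data is the backup source used above.
    print(restore_matches(Path("/data"), Path("/restore-test/data")))
```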

Taboo 12: Poor log management

Risk Level: ★★★☆☆

Case: Without traceable logs, the first intrusion could not be investigated, which allowed a second one.

Solution:

Centralize logs with ELK Stack.

Forward syslog to a central collector:

*.* @172.16.1.100:514

Set retention policies compliant with GDPR.

Taboo 13: Store sensitive data in plain text

Risk Level: ★★★★☆

Case: Configuration files leaked database passwords, leading to data exfiltration.

Solution:

Use Vault for secret management:

vault kv put secret/db_pass value=MyP@ssw0rd

Encrypt sensitive fields with Ansible Vault.

Regularly scan for secret leaks using GitGuardian.
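Before adopting a dedicated scanner, a rough local check can reveal configuration files that still hold plain-text credentials. The sketch below is a simplified illustration and not how GitGuardian works: the key names, the reference prefixes treated as safe, and the function name are all assumptions.

```python
import re

# Assumed key names that usually indicate a credential.
KEY_PATTERN = re.compile(
    r"(password|passwd|secret|api_key|token)\w*\s*[:=]\s*(\S+)",
    re.IGNORECASE,
)
# Assumed markers for values that are references rather than literals.
REFERENCE_PREFIXES = ("${", "!vault")


def find_plaintext_secrets(config_text: str) -> list[tuple[int, str]]:
    """Return (line number, key name) for lines that assign a literal secret."""
    findings = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = KEY_PATTERN.search(line)
        if match and not match.group(2).startswith(REFERENCE_PREFIXES):
            # Report the key only, never the value, to avoid re-leaking it.
            findings.append((number, match.group(1).lower()))
    return findings


if __name__ == "__main__":
    sample = "db_user = app\ndb_password = MyP@ssw0rd\napi_key: ${API_KEY}\n"
    print(find_plaintext_secrets(sample))
```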

Taboo 14: Chaotic permission allocation

Risk Level: ★★★☆☆

Case: An intern accidentally deleted a production Kubernetes namespace.

Solution:

Implement RBAC; example role definition:

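apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader   # illustrative name
  namespace: prod    # illustrative namespace
rules:
- apiGroups: [""]    # "" is the core API group
  resources: ["pods"]
  verbs: ["get","list"]   # read-only access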

Enforce namespace‑level access controls.

Taboo 15: Lack of data recovery plan

Risk Level: ★★★★★

Case: Deleting a user table without a recovery plan caused major complaints.

Solution:

Use point-in-time recovery (PITR) for databases, e.g., in SQL Server:

RESTORE DATABASE MyDB FROM URL='https://...' WITH STOPAT='2023-08-01 12:00:00'

Take ZFS snapshots (zfs snapshot pool/db@20230801).

4. Architecture Design Taboo (5 items)

Taboo 16: Single point of failure

Risk Level: ★★★★☆

Case: A single database server outage halted all business services.

Solution:

Deploy MySQL master‑slave replication with Keepalived.

Design active‑active multi‑region architecture.

Use cloud‑native multi‑AZ deployment.

Taboo 17: Resource over‑utilization

Risk Level: ★★★☆☆

Case: CPU constantly at full load caused response delays.

Solution:

Set resource limits, e.g.:

docker run -it --cpus 2 --memory 4g nginx

Enable auto-scaling with Kubernetes HPA.
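To reason about what the Horizontal Pod Autoscaler will do, it helps to work through its scaling formula as given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The sketch below implements only that core formula; it ignores the tolerance band and the min/max replica bounds, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA formula, without tolerance or min/max replica bounds."""
    return math.ceil(current_replicas * (current_metric / target_metric))


if __name__ == "__main__":
    # 4 pods at 90% average CPU with a 60% target scale out to 6 pods.
    print(desired_replicas(4, 90, 60))
```

The same formula scales in when utilization is below target: 4 pods at 30% with a 60% target give 2 pods.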

Taboo 18: Mixed‑environment deployment

Risk Level: ★★★★☆

Case: Test code mistakenly synced to production, contaminating data.

Solution:

Isolate networks: VLAN 100 for dev, VLAN 200 for test, dedicated physical network for prod.

Use Terraform to provision isolated environments.

Taboo 19: Missing monitoring system

Risk Level: ★★★★☆

Case: Undetected memory leak caused service crash.

Solution:

Implement full‑stack monitoring (Prometheus + Grafana).

Define critical alerts, e.g.:

- alert: MemoryAvailableCritical   # illustrative rule name
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10   # less than 10% available
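The 10% threshold above can also be checked directly on a Linux host by parsing /proc/meminfo, which is useful for spot checks when the monitoring stack itself is unavailable. The sketch assumes the threshold refers to available memory as a share of total memory; the function name is hypothetical.

```python
CRITICAL_AVAILABLE_PERCENT = 10  # threshold from the alert above


def available_percent(meminfo_text: str) -> float:
    """Compute MemAvailable as a percentage of MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            fields[name.strip()] = int(rest.split()[0])  # first token is the number
    return fields["MemAvailable"] / fields["MemTotal"] * 100


if __name__ == "__main__":
    # /proc/meminfo exists on Linux only.
    with open("/proc/meminfo") as handle:
        percent = available_percent(handle.read())
    state = "critical" if percent < CRITICAL_AVAILABLE_PERCENT else "ok"
    print(f"{percent:.1f}% available: {state}")
```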

Taboo 20: No emergency response plan

Risk Level: ★★★★★

Case: A sudden DDoS attack left services down for eight hours.

Solution:

Establish a four‑level response mechanism:

Level 1: Auto‑switch CDN
Level 2: Enable cloud protection (AWS Shield)
Level 3: Traffic scrubbing (Arbor)
Level 4: Manual intervention

Conduct quarterly red‑blue tabletop exercises.
