
20 Must‑Know Production Ops Issues and Quick Fixes

This guide presents twenty common production‑environment problems—from log analysis and database recovery to Kubernetes scheduling—detailing real‑world scenarios, step‑by‑step command solutions, and preventive measures that help engineers quickly diagnose, resolve, and avoid outages.

Full-Stack DevOps & Kubernetes

1. Log Analysis: Quickly Identify Application Crash Causes

Production scenario: A web application crashes and fails to restart; logs show OutOfMemoryError indicating insufficient JVM heap.

Solution:

Run grep "OutOfMemoryError" /var/log/application.log to filter relevant entries.

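Once the log confirms heap exhaustion, raise the JVM heap limits by adding -Xms2g -Xmx4g to JAVA_OPTS (size the values to the memory actually available on the host), then restart the application.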
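Raising -Xmx only helps when the error really is heap exhaustion; the JVM also reports OutOfMemoryError for Metaspace and for native thread limits. As a minimal sketch (the function name oom_summary is illustrative, and it assumes the reason text after the colon contains only letters and spaces), the grep step can be extended to tally which reason appears most often:

```shell
# oom_summary LOGFILE
# Tally each distinct "OutOfMemoryError: <reason>" message, most frequent first.
# Hypothetical helper; assumes the reason contains only letters and spaces.
oom_summary() {
  grep -o 'OutOfMemoryError: [A-Za-z ]*' "$1" | sort | uniq -c | sort -rn
}
```

Called as oom_summary /var/log/application.log, a result dominated by "Java heap space" supports the -Xmx change, while "Metaspace" points toward -XX:MaxMetaspaceSize instead.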

Prevention: Configure log rotation to prevent excessive disk usage.

2. Backup & Recovery: Restoring Lost Database Data

Production scenario: An important table is accidentally deleted.

Solution:

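Restore from the most recent backup: mysql -u root -p < /backup/backup_2024-12-01.sql

If there is no backup, or to recover changes made after it was taken, replay the MySQL binary log. Stop the replay just before the accidental DROP or DELETE, otherwise the destructive statement is re-applied as well (the timestamp here is a placeholder):

mysqlbinlog --stop-datetime="2024-12-01 10:00:00" /var/lib/mysql/mysql-bin.000001 | mysql -u root -p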

Prevention: Perform regular backups, verify them, and enable incremental backups.
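One inexpensive way to verify a backup is to confirm that the dump file was not cut short. The sketch below assumes the backup was produced by mysqldump with comments enabled (the default), in which case the file ends with a "-- Dump completed" trailer; the function name is illustrative.

```shell
# dump_is_complete FILE
# Succeed if FILE is non-empty and ends with the "-- Dump completed" trailer
# that mysqldump writes when it finishes normally. The trailer is absent when
# the dump was taken with --skip-comments, so this assumes comments are on.
dump_is_complete() {
  [ -s "$1" ] && tail -n 5 "$1" | grep -q '^-- Dump completed'
}
```

A passing check shows only that the dump ran to the end; restoring it into a scratch instance remains the fuller test.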

3. Disk Management: Disk Space Exhaustion

Production scenario: Disk is full, causing service interruption.

Solution:

Check usage: df -h and du -sh /var/log/* to locate large log files.

Delete expired logs or temporary files in /tmp.

Set up logrotate for automatic log rotation.

Prevention: Enable disk‑space monitoring and alerts.
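The df and du checks can be wrapped in two small helpers for use in a cron job or alert script. This is a sketch under simple assumptions: the function names are illustrative, and the 90 percent threshold in the usage note is an arbitrary example.

```shell
# disk_usage_pct PATH
# Print the used-space percentage (digits only) of the filesystem holding PATH.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# largest_entries DIR [N]
# List the N largest entries directly under DIR (default 5), sizes in KiB.
largest_entries() {
  du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}
```

For example, [ "$(disk_usage_pct /var)" -ge 90 ] && largest_entries /var/log prints the biggest log entries only once the filesystem is nearly full.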

4. Permission Management: Avoid Privilege Abuse

Production scenario: A developer with root rights mistakenly brings down services.

Solution:

Use sudo to grant minimal required privileges, e.g., user ALL=(ALL) NOPASSWD: /bin/systemctl restart nginx.

Edit sudoers rules only through visudo, which checks the syntax before saving, and remove entries that grant more than a user needs.

Prevention: Regularly review permission settings.

5. Network Fault Diagnosis: Server Cannot Reach Internet

Production scenario: Server cannot access external networks, blocking updates and API calls.

Solution:

Ping an external IP to separate routing problems from DNS problems: ping -c 4 8.8.8.8.

Trace the route to find where packets stop: traceroute 8.8.8.8.

Inspect firewall rules for accidental blocks: iptables -L -n -v.

Prevention: Deploy network monitoring tools for stable connectivity.

6. Process Management: High Load Causes Slow Responses

Production scenario: High concurrency leads to CPU‑intensive processes, raising system load.

Solution:

Inspect CPU usage: top or htop (e.g., top -o %CPU).

Lower priority of heavy processes: renice -n 10 -p <PID>.

If the bottleneck is the database, optimize SQL or add indexes.

Prevention: Set up automated monitoring and proactive alerts.
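Whether a load figure counts as high depends on how many CPUs the host has: a load of 8 is heavy on 4 cores and unremarkable on 32. A Linux-only sketch that normalizes the 1-minute load average by core count (the function name is illustrative):

```shell
# load_per_core
# Print the 1-minute load average divided by the number of online CPUs.
# Linux-only: reads /proc/loadavg and uses nproc from coreutils.
load_per_core() {
  awk -v cores="$(nproc)" '{ printf "%.2f\n", $1 / cores }' /proc/loadavg
}
```

Values that stay well above 1.00 suggest that tasks are queuing for CPU or, on Linux, waiting in uninterruptible I/O, which is a reasonable signal to alert on.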

7. Scheduled Tasks: Missed Backup Jobs

Production scenario: A periodic backup task fails, leading to data loss.

Solution:

Check crontab: crontab -l to verify task definitions.

Review the cron execution log: /var/log/cron on RHEL-family systems, or grep CRON /var/log/syslog on Debian and Ubuntu.

Prevention: Use monitoring tools (e.g., Prometheus) to watch task success.
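One way to make cron results visible to Prometheus is to wrap each job so that it records its exit code in a file that node_exporter's textfile collector can read. The sketch below assumes that setup; the function name and both metric names are illustrative, not an established convention.

```shell
# run_and_record JOB OUTFILE COMMAND [ARGS...]
# Run COMMAND, then write its exit code and finish time to OUTFILE in the
# Prometheus text exposition format. Metric names are illustrative.
run_and_record() {
  job=$1
  out=$2
  shift 2
  status=0
  "$@" || status=$?
  {
    printf 'cron_job_last_exit_code{job="%s"} %d\n' "$job" "$status"
    printf 'cron_job_last_run_timestamp_seconds{job="%s"} %d\n' "$job" "$(date +%s)"
  } > "$out.tmp" && mv "$out.tmp" "$out"
  return "$status"
}
```

Saved as a script and used to wrap the backup command in crontab, with OUTFILE placed in the directory given to --collector.textfile.directory, this allows alerts both on a non-zero exit code and on a timestamp that has stopped advancing, which catches jobs that never ran at all.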

8. Service Management: Service Crash Without Auto‑Restart

Production scenario: Critical services like Nginx or MySQL crash and do not restart automatically.

Solution:

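Start the service and enable it at boot: systemctl enable --now nginx. Note that enable only covers boot; it does not restart a service that crashes.

Set Restart=always (or Restart=on-failure) in the [Service] section of the unit, preferably through a drop-in created with systemctl edit nginx. If you edit the unit file directly, run systemctl daemon-reload afterwards.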

Prevention: Configure service health monitoring.
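The restart policy can also be applied from a script by writing a systemd drop-in file. This is a sketch: the file name restart.conf and the 5-second delay are arbitrary choices, and the optional second argument exists only so the function can be tried outside /etc.

```shell
# write_restart_dropin UNIT [BASEDIR]
# Create BASEDIR/UNIT.service.d/restart.conf so systemd restarts the unit
# whenever it exits. BASEDIR defaults to /etc/systemd/system.
# The file name and the 5-second delay are illustrative choices.
write_restart_dropin() {
  dir="${2:-/etc/systemd/system}/$1.service.d"
  mkdir -p "$dir" || return 1
  printf '[Service]\nRestart=always\nRestartSec=5s\n' > "$dir/restart.conf"
}
```

Run as root, write_restart_dropin nginx && systemctl daemon-reload puts the policy into effect without modifying the packaged unit file.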

9. High Availability: Load Balancer Failure

Production scenario: Load balancer stops routing traffic, causing user outages.

Solution:

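Deploy Nginx or HAProxy as the load balancer, and run more than one instance so the balancer itself is not a single point of failure.

Configure health checks for backend servers. The check keyword is HAProxy server syntax and is not a valid server parameter in open-source Nginx, which instead offers passive health checks through max_fails and fail_timeout, for example:

upstream backend { server backend1.example.com max_fails=3 fail_timeout=30s; server backend2.example.com max_fails=3 fail_timeout=30s; }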

Prevention: Use health‑check mechanisms to ensure backend availability.

10. Database Performance: Slow Queries Degrade Application Speed

Production scenario: Slow database responses make front‑end pages lag.

Solution:

Analyze with EXPLAIN: EXPLAIN SELECT * FROM orders WHERE order_date='2024-12-20';

Add indexes: CREATE INDEX idx_order_date ON orders(order_date);

Prevention: Enable slow‑query logging and regularly review queries.
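Once slow-query logging is enabled (in MySQL, through slow_query_log and long_query_time), a quick count of entries above a threshold shows whether the situation is improving. The sketch assumes the standard MySQL slow log layout, where each entry has a "# Query_time:" header line; the function name is illustrative.

```shell
# slow_queries_over SECONDS LOGFILE
# Count entries in a MySQL slow query log whose Query_time exceeds SECONDS.
slow_queries_over() {
  awk -v limit="$1" '
    /^# Query_time:/ { if ($3 + 0 > limit + 0) n++ }
    END { print n + 0 }
  ' "$2"
}
```

For instance, slow_queries_over 2 followed by the path of your slow log reports how many logged statements took longer than two seconds.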

11. Containerization: Docker Container Resource Leak

Production scenario: A Docker container leaks memory, consuming excessive resources.

Solution:

Inspect usage: docker stats.

Check logs: docker logs <container_id> for leak signs.

Prevention: Set memory limits, e.g., docker run -m 512m --memory-swap 1g my-container.
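To spot containers approaching their limit, the docker stats output can be filtered by memory percentage. The helper below is a sketch that reads "name percentage" pairs on standard input; its name and the 80 percent threshold in the usage note are illustrative.

```shell
# containers_over_mem PCT
# Read "name mem%" pairs on stdin and print the names whose memory
# percentage is above PCT.
containers_over_mem() {
  awk -v limit="$1" '{ pct = $2; sub(/%/, "", pct); if (pct + 0 > limit + 0) print $1 }'
}
```

It can be fed with docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | containers_over_mem 80. The percentage is relative to the container's memory limit, or to host memory when no limit is set.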

12. Network Security: Prevent DDoS Attacks

Production scenario: Server overwhelmed by DDoS, exhausting bandwidth.

Solution:

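Limit the per‑IP connection rate. Note that -m limit applies a single global limit across all clients; per‑source limiting needs the hashlimit match, and the rule must drop the traffic that exceeds the limit:

iptables -A INPUT -p tcp --dport 80 --syn -m hashlimit --hashlimit-name http --hashlimit-mode srcip --hashlimit-above 10/min -j DROP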

Use services like Cloudflare or AWS Shield.

Prevention: Deploy traffic monitoring to detect anomalies early.
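A simple first look at traffic anomalies is to rank clients by request count in the web server's access log. The sketch assumes the client address is the first field of each line, as in Nginx's default combined format; the function name is illustrative.

```shell
# top_clients LOGFILE [N]
# Print the N client addresses (default 10) with the most requests, assuming
# the address is the first whitespace-separated field of each log line.
top_clients() {
  awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

A handful of addresses accounting for most requests suggests a source that per-IP limits can contain; a flat distribution across thousands of addresses usually calls for an upstream mitigation service.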

13. SSL/TLS Configuration: Ensure HTTPS Security

Production scenario: Site lacks HTTPS, exposing data to MITM attacks.

Solution:

Obtain Let’s Encrypt certificate: certbot --nginx -d example.com.

Force HTTP→HTTPS redirect in Nginx config.

Prevention: Regularly check certificate validity and renew before expiry.
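Certificate expiry can be checked from a script with openssl's -checkend option, which tests whether a certificate expires within a given number of seconds. A minimal sketch, assuming the openssl command-line tool is installed; the function name is illustrative.

```shell
# cert_expires_within DAYS CERTFILE
# Succeed (exit 0) if the PEM certificate in CERTFILE expires within DAYS
# days. An unreadable or invalid file is also reported as expiring.
cert_expires_within() {
  ! openssl x509 -checkend "$(( $1 * 86400 ))" -noout -in "$2" >/dev/null 2>&1
}
```

With certbot's usual layout this could be run as cert_expires_within 14 /etc/letsencrypt/live/example.com/fullchain.pem && echo "renew soon", for example from a daily cron job that raises an alert.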

14. Service Dependency: Microservice Startup Failure

Production scenario: One microservice fails to start, breaking dependent services.

Solution:

Orchestrate with docker-compose to enforce start order.

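Configure health checks so that downstream services start only after their upstream dependencies report healthy (in Compose, depends_on with condition: service_healthy). Plain depends_on controls start order only, not readiness.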

Prevention: Periodically test microservice resilience and high‑availability.
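Where the orchestrator cannot express readiness, a service's entrypoint script can wait for its dependency itself. The following is a sketch of a generic retry helper; the function name is illustrative, and the health URL in the usage note is hypothetical.

```shell
# wait_until ATTEMPTS COMMAND [ARGS...]
# Run COMMAND up to ATTEMPTS times, one second apart, until it succeeds.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
  remaining=$1
  shift
  until "$@"; do
    remaining=$((remaining - 1))
    [ "$remaining" -le 0 ] && return 1
    sleep 1
  done
}
```

An entrypoint might call wait_until 30 curl -fsS http://upstream:8080/health before starting the application, and exit with an error if the dependency never becomes healthy.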

15. Automated Ops: Bulk Configuration Management

Production scenario: New batch of servers needs uniform firewall rules and packages.

Solution:

Use Ansible or Puppet, e.g., ansible-playbook -i inventory setup.yml, to apply configurations at scale.

Prevention: Integrate CI/CD pipelines to maintain consistent, secure server states.

16. Memory Leak: Detection and Resolution

Production scenario: Application’s memory usage continuously grows, eventually crashing the system.

Solution:

Monitor with top and free -h.

Use valgrind for native programs, or jmap -histo:live <pid> for JVM applications, to locate leaks.

Prevention: Regularly audit memory usage and schedule restarts to clear leaks.
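Before reaching for a profiler, it helps to confirm that memory really grows over time rather than settling at a plateau. The Linux-only sketch below samples a process's resident set size from /proc; the function names are illustrative.

```shell
# rss_kb PID
# Print the resident set size of PID in KiB (Linux-only: reads /proc).
rss_kb() {
  awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# sample_rss PID COUNT INTERVAL
# Print COUNT lines of "epoch_seconds rss_kib", INTERVAL seconds apart.
sample_rss() {
  i=0
  while [ "$i" -lt "$2" ]; do
    printf '%s %s\n' "$(date +%s)" "$(rss_kb "$1")"
    i=$((i + 1))
    if [ "$i" -lt "$2" ]; then
      sleep "$3"
    fi
  done
}
```

For example, sample_rss <pid> 60 60 records one hour of readings; a steady climb under constant load is the pattern worth investigating with valgrind or jmap.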

17. Centralized Log Management

Production scenario: Multiple servers generate massive logs, making manual inspection impossible.

Solution:

Deploy ELK stack (Elasticsearch, Logstash, Kibana) for collection and analysis.

Use Filebeat to ship logs to Logstash.

Prevention: Set up log categorization and real‑time monitoring to avoid performance degradation.

18. Virtualization Management: KVM Host Performance Tuning

Production scenario: VMs under KVM run poorly, starving host resources.

Solution:

Adjust VM memory and CPU allocations.

Enable CPU pinning (for example, with virsh vcpupin) and back guest memory with hugepages in the libvirt domain configuration.

Prevention: Periodically analyze host performance and rebalance resources.

19. Backup & Restore: Cloud Storage Data Recovery

Production scenario: Cloud‑hosted database or files are lost and need restoration.

Solution:

Utilize cloud provider tools (AWS S3, Google Cloud Storage) for backup and recovery.

Prevention: Configure automatic cloud backups and routinely verify backup integrity.
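Backup integrity can be verified independently of the storage provider by keeping a checksum next to each backup file and re-checking it after download. A minimal sketch, assuming GNU coreutils' sha256sum is available; the function names are illustrative.

```shell
# record_checksum FILE
# Write FILE.sha256 next to FILE; upload both to cloud storage.
record_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# verify_checksum FILE
# Succeed only if FILE still matches the digest stored in FILE.sha256.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}
```

Running verify_checksum on a freshly downloaded backup before restoring from it catches truncated or corrupted transfers early.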

20. Service Scheduling: Kubernetes Cluster Scheduling Optimization

Production scenario: Uneven resource allocation in a Kubernetes cluster prevents some pods from starting.

Solution:

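Inspect resources: kubectl describe pod <pod_name>. The Events section shows why scheduling failed, for example Insufficient cpu or Insufficient memory.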

Set appropriate resources.requests and resources.limits in pod specs.

Prevention: Regularly rebalance cluster resources to avoid node overload.

Written by

Full-Stack DevOps & Kubernetes

Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.
