20 Must‑Know Production Ops Issues and Quick Fixes
This guide presents twenty common production‑environment problems—from log analysis and database recovery to Kubernetes scheduling—detailing real‑world scenarios, step‑by‑step command solutions, and preventive measures that help engineers quickly diagnose, resolve, and avoid outages.
1. Log Analysis: Quickly Identify Application Crash Causes
Production scenario: A web application crashes and fails to restart; logs show OutOfMemoryError indicating insufficient JVM heap.
Solution:
Run grep "OutOfMemoryError" /var/log/application.log to filter relevant entries.
If the entries confirm heap exhaustion, raise the heap limits by editing JAVA_OPTS to include -Xms2g -Xmx4g, then restart the application.
Prevention: Configure log rotation to prevent excessive disk usage.
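The triage step above can be sketched as a pair of small helpers. This is a minimal sketch, not a definitive tool: the function names count_oom and last_oom are hypothetical, and the heap-dump flag is an addition to the settings in the text.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that mention OutOfMemoryError.
# Usage: count_oom /var/log/application.log
count_oom() {
    # grep -c prints the number of matching lines; it exits 1 when there are
    # none, so "|| true" keeps the function usable under "set -e".
    grep -c "OutOfMemoryError" "$1" || true
}

# Hypothetical helper: show recent matches with two lines of context,
# which usually includes the start of the stack trace.
# Usage: last_oom /var/log/application.log
last_oom() {
    grep -A 2 "OutOfMemoryError" "$1" | tail -n 20
}

# Heap settings from the text; size them to the memory the host really has.
# -XX:+HeapDumpOnOutOfMemoryError is a standard HotSpot flag that writes a
# heap dump on the next OOM, which helps tell a leak from an undersized heap.
JAVA_OPTS="-Xms2g -Xmx4g -XX:+HeapDumpOnOutOfMemoryError"
export JAVA_OPTS
```

A rising count after the heap has been raised suggests a leak rather than an undersized heap.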
2. Backup & Recovery: Restoring Lost Database Data
Production scenario: An important table is accidentally deleted.
Solution:
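Restore from backup: mysql -u root -p < /backup/backup_2024-12-01.sql
If there is no backup, or to recover changes made after it, replay the MySQL binlog, stopping before the statement that deleted the table (otherwise the replay deletes it again):
mysqlbinlog --stop-datetime="<time just before the deletion>" /var/lib/mysql/mysql-bin.000001 | mysql -u root -p
Prevention: Perform regular backups, verify them, and enable incremental backups.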
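Because a full binlog replay would also re-run the destructive statement, it helps to review the exact pipeline before executing it. The sketch below only prints the command. The helper name binlog_replay_cmd and the timestamp in the usage line are hypothetical; --stop-datetime is a standard mysqlbinlog option.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a binlog replay pipeline that stops
# before the destructive statement, so it can be reviewed first.
# Usage: binlog_replay_cmd "2024-12-01 10:29:59" /var/lib/mysql/mysql-bin.000001
binlog_replay_cmd() {
    stop_at=$1
    shift
    # --stop-datetime makes mysqlbinlog stop at the first event whose
    # timestamp is equal to or later than the given time.
    printf 'mysqlbinlog --stop-datetime="%s" %s | mysql -u root -p\n' "$stop_at" "$*"
}
```

Running mysqlbinlog without the pipe to mysql first is a cheap way to confirm where the deletion appears in the log.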
3. Disk Management: Disk Space Exhaustion
Production scenario: Disk is full, causing service interruption.
Solution:
Check usage: df -h and du -sh /var/log/* to locate large log files.
Delete expired logs or temporary files in /tmp.
Set up logrotate for automatic log rotation.
Prevention: Enable disk‑space monitoring and alerts.
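A disk-space alert can start as a filter over df output. This is a minimal sketch assuming the POSIX df -P layout and mount points without spaces; the name full_mounts and the 90 percent threshold are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "df -P" output on stdin and print the mount points
# whose usage is at or above a threshold percentage.
# Usage: df -P | full_mounts 90
full_mounts() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                 # "95%" -> "95"
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}
```

Run from cron, any output from this filter can be mailed or forwarded to whatever alerting channel is already in place.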
4. Permission Management: Avoid Privilege Abuse
Production scenario: A developer with root rights mistakenly brings down services.
Solution:
Use sudo to grant minimal required privileges, e.g., user ALL=(ALL) NOPASSWD: /bin/systemctl restart nginx.
Edit sudoers only through visudo, which checks the syntax before saving, and remove any permissions that are not needed.
Prevention: Regularly review permission settings.
5. Network Fault Diagnosis: Server Cannot Reach Internet
Production scenario: Server cannot access external networks, blocking updates and API calls.
Solution:
Ping external IP: ping 8.8.8.8.
Trace the route: traceroute 8.8.8.8 to locate the hop where packets stop.
Inspect firewall rules: iptables -L -n -v to look for accidental blocks.
Prevention: Deploy network monitoring tools for stable connectivity.
6. Process Management: High Load Causes Slow Responses
Production scenario: High concurrency leads to CPU‑intensive processes, raising system load.
Solution:
Inspect CPU usage: top or htop (e.g., top -o %CPU).
Lower priority of heavy processes: renice -n 10 -p <PID>.
If the bottleneck is the database, optimize SQL or add indexes.
Prevention: Set up automated monitoring and proactive alerts.
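A proactive load check can compare the load average with the core count. The sketch below uses a common rule of thumb, not a hard limit; the names load_is_high and top_cpu are hypothetical, and top_cpu assumes the procps version of ps found on most Linux systems.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the load average exceeds the core count.
# Usage: load_is_high "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
load_is_high() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}

# Hypothetical helper: list the five most CPU-hungry processes.
# "--sort" is a procps extension, so this part is Linux-specific.
top_cpu() {
    ps -eo pid,pcpu,comm --sort=-pcpu | head -n 6
}
```

A typical use is `load_is_high "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)" && top_cpu`, which prints the heaviest processes only when the host looks saturated.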
7. Scheduled Tasks: Missed Backup Jobs
Production scenario: A periodic backup task fails, leading to data loss.
Solution:
Check crontab: crontab -l to verify task definitions.
Review the cron execution log: /var/log/cron on RHEL-family systems, or grep CRON /var/log/syslog on Debian and Ubuntu.
Prevention: Use monitoring tools (e.g., Prometheus) to watch task success.
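One way to let Prometheus watch a cron job is to wrap the job and write its result for node_exporter's textfile collector. This is a sketch under assumptions: the metric names, the wrapper name run_and_record, and the paths in the usage line are all hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: run a job and record its outcome in the Prometheus
# text exposition format.
# Usage: run_and_record /var/lib/node_exporter/textfile/backup.prom /usr/local/bin/backup.sh
run_and_record() {
    out_file=$1
    shift
    status=0
    "$@" || status=$?
    # Write to a temp file and rename, so a scrape never reads a partial file.
    tmp_file="$out_file.$$"
    {
        printf 'cron_job_last_exit_code %d\n' "$status"
        printf 'cron_job_last_run_timestamp_seconds %d\n' "$(date +%s)"
    } > "$tmp_file"
    mv "$tmp_file" "$out_file"
    # Pass the job's exit code through so cron mail and logs stay accurate.
    return "$status"
}
```

An alert on a non-zero exit code, or on a timestamp that stops advancing, then catches both failed and silently skipped runs.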
8. Service Management: Service Crash Without Auto‑Restart
Production scenario: Critical services like Nginx or MySQL crash and do not restart automatically.
Solution:
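Start the service at boot and bring it back up now: systemctl enable nginx and systemctl restart nginx. Note that enable only covers boot; it does not restart a crashed service.
Set Restart=always in the [Service] section of the unit file, then run systemctl daemon-reload.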
Prevention: Configure service health monitoring.
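The restart policy is usually added as a drop-in file rather than by editing the packaged unit. A minimal sketch follows; the helper name write_restart_dropin, the drop-in file name, and the 5-second delay are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that restarts a crashed service.
# Usage (as root):
#   write_restart_dropin /etc/systemd/system/nginx.service.d/restart.conf
#   systemctl daemon-reload
write_restart_dropin() {
    mkdir -p "$(dirname "$1")"
    cat > "$1" <<'EOF'
[Service]
# Restart whenever the main process exits, after a 5-second pause.
Restart=always
RestartSec=5
EOF
}
```

After the reload, `systemctl show nginx -p Restart` should report the new policy.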
9. High Availability: Load Balancer Failure
Production scenario: Load balancer stops routing traffic, causing user outages.
Solution:
Deploy Nginx or HAProxy as load balancer with multiple instances.
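Configure health checks for backend servers. Stock Nginx supports passive checks through max_fails and fail_timeout (e.g.,
upstream backend { server backend1.example.com max_fails=3 fail_timeout=30s; server backend2.example.com max_fails=3 fail_timeout=30s; }). The check keyword on a server line is HAProxy syntax; active checks in Nginx need NGINX Plus (health_check) or a third-party module.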
Prevention: Use health‑check mechanisms to ensure backend availability.
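A minimal sketch of backend health checking with stock Nginx's passive checks (max_fails, fail_timeout), written out as a config file. The helper name write_upstream_conf and the thresholds are assumptions; the hostnames are the ones from the text.

```shell
#!/bin/sh
# Hypothetical helper: write an Nginx config with passive backend checks.
# Usage (as root): write_upstream_conf /etc/nginx/conf.d/backend.conf && nginx -t
write_upstream_conf() {
    cat > "$1" <<'EOF'
upstream backend {
    # After 3 failed attempts within 30s, skip the server for 30s.
    server backend1.example.com max_fails=3 fail_timeout=30s;
    server backend2.example.com max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
    }
}
EOF
}
```

Passive checks only react to failed client requests, so a dead backend still costs a few errors before it is taken out of rotation.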
10. Database Performance: Slow Queries Degrade Application Speed
Production scenario: Slow database responses make front‑end pages lag.
Solution:
Analyze with EXPLAIN: EXPLAIN SELECT * FROM orders WHERE order_date='2024-12-20';
Add indexes: CREATE INDEX idx_order_date ON orders(order_date);
Prevention: Enable slow‑query logging and regularly review queries.
11. Containerization: Docker Container Resource Leak
Production scenario: A Docker container leaks memory, consuming excessive resources.
Solution:
Inspect usage: docker stats.
Check logs: docker logs <container_id> for leak signs.
Prevention: Set memory limits, e.g., docker run -m 512m --memory-swap 1g my-container.
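Spotting a leaking container early can be as simple as filtering docker stats output. In this sketch the helper name mem_hogs and the 80 percent threshold are hypothetical; the --no-stream and --format options are standard docker stats flags.

```shell
#!/bin/sh
# Hypothetical helper: read "<name> <mem%>" lines on stdin and print the
# containers at or above a memory-usage threshold.
# Usage: docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | mem_hogs 80
mem_hogs() {
    awk -v limit="$1" '{
        pct = $2
        sub(/%/, "", pct)                 # "91.20%" -> "91.20"
        if (pct + 0 >= limit + 0) print $1
    }'
}
```

The percentage is relative to the container's memory limit when one is set, which is one more reason to set -m on every container.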
12. Network Security: Prevent DDoS Attacks
Production scenario: Server overwhelmed by DDoS, exhausting bandwidth.
Solution:
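Limit the per‑IP rate of new connections. The limit match is global, so use hashlimit keyed on the source IP and drop the excess:
iptables -A INPUT -p tcp --syn --dport 80 -m hashlimit --hashlimit-name http --hashlimit-mode srcip --hashlimit-above 10/min -j DROP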
Use services like Cloudflare or AWS Shield.
Prevention: Deploy traffic monitoring to detect anomalies early.
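A first pass at anomaly detection is counting requests per client in the web server's access log. This sketch assumes a log format whose first field is the client IP (as in the default Nginx and Apache formats); the helper name top_clients is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the busiest client IPs in an access log.
# Usage: top_clients /var/log/nginx/access.log 10
top_clients() {
    # Field 1 is the client IP; count duplicates and sort by count, descending.
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

A handful of addresses far above the rest is a hint to rate-limit or block them upstream.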
13. SSL/TLS Configuration: Ensure HTTPS Security
Production scenario: Site lacks HTTPS, exposing data to MITM attacks.
Solution:
Obtain Let’s Encrypt certificate: certbot --nginx -d example.com.
Force HTTP→HTTPS redirect in Nginx config.
Prevention: Regularly check certificate validity and renew before expiry.
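The redirect and the expiry check can be sketched as two helpers. The function names are hypothetical and the 14-day window in the usage line is an assumption; the certificate path shown is where certbot normally stores it.

```shell
#!/bin/sh
# Hypothetical helper: write an Nginx server block that redirects HTTP to HTTPS.
# Usage (as root): write_redirect_conf /etc/nginx/conf.d/redirect.conf && nginx -t
write_redirect_conf() {
    # The quoted 'EOF' keeps $host and $request_uri literal for Nginx.
    cat > "$1" <<'EOF'
server {
    listen 80;
    server_name example.com;
    return 301 https://$host$request_uri;
}
EOF
}

# Hypothetical helper: succeed if the certificate is still valid N days from now.
# Usage: cert_valid_for_days /etc/letsencrypt/live/example.com/fullchain.pem 14
cert_valid_for_days() {
    # "openssl x509 -checkend S" exits non-zero if the cert expires within S seconds.
    openssl x509 -checkend "$(( $2 * 86400 ))" -noout -in "$1" > /dev/null
}
```

Run from a daily job, a failing cert_valid_for_days is a signal that automatic renewal has stopped working.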
14. Service Dependency: Microservice Startup Failure
Production scenario: One microservice fails to start, breaking dependent services.
Solution:
Orchestrate with docker-compose and declare depends_on to enforce start order.
Configure health checks and use condition: service_healthy so that downstream services start only after their upstream dependencies report healthy.
Prevention: Periodically test microservice resilience and high‑availability.
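Start order plus readiness might look like the Compose file below. This is a sketch with hypothetical services: a PostgreSQL database and an api service whose image name is made up.

```shell
#!/bin/sh
# Hypothetical helper: write a Compose file where "api" waits for a healthy "db".
# Usage: write_compose_file docker-compose.yml && docker compose up -d
write_compose_file() {
    cat > "$1" <<'EOF'
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
  api:
    image: example/api:latest   # hypothetical image
    depends_on:
      db:
        condition: service_healthy
EOF
}
```

Without the condition, depends_on only waits for the db container to start, not for the database to accept connections.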
15. Automated Ops: Bulk Configuration Management
Production scenario: New batch of servers needs uniform firewall rules and packages.
Solution:
Use Ansible or Puppet, e.g., ansible-playbook -i inventory setup.yml, to apply configurations at scale.
Prevention: Integrate CI/CD pipelines to maintain consistent, secure server states.
16. Memory Leak: Detection and Resolution
Production scenario: Application’s memory usage continuously grows, eventually crashing the system.
Solution:
Monitor with top and free -h.
Use valgrind (native code) or jmap -histo:live <pid> (JVM) to locate leaks.
Prevention: Regularly audit memory usage and schedule restarts to clear leaks.
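Before reaching for a profiler, it helps to confirm that memory really grows over time. The sketch below samples a process's resident set size and checks for steady growth; the names rss_kib and always_growing are hypothetical, and strictly increasing samples are only a hint of a leak, not proof.

```shell
#!/bin/sh
# Print the resident set size of a process in KiB.
# Usage: rss_kib 1234
rss_kib() {
    ps -o rss= -p "$1" | tr -d ' '
}

# Hypothetical check: succeed when every sample on stdin (one number per line)
# is larger than the one before it. Fewer than two samples count as failure.
always_growing() {
    awk 'NR > 1 && $1 + 0 <= prev + 0 { flat = 1 }
         { prev = $1 }
         END { exit (NR < 2 || flat) }'
}

# Example: five samples, one minute apart, for a process ID held in $PID.
#   for i in 1 2 3 4 5; do rss_kib "$PID"; sleep 60; done | always_growing \
#       && echo "RSS grew in every sample"
```

Longer sampling windows reduce false alarms from caches that are still warming up.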
17. Centralized Log Management
Production scenario: Multiple servers generate massive logs, making manual inspection impossible.
Solution:
Deploy ELK stack (Elasticsearch, Logstash, Kibana) for collection and analysis.
Use Filebeat to ship logs to Logstash.
Prevention: Set up log categorization and real‑time monitoring to avoid performance degradation.
18. Virtualization Management: KVM Host Performance Tuning
Production scenario: VMs under KVM run poorly, starving host resources.
Solution:
Adjust VM memory and CPU allocations.
Enable CPU pinning and hugepages in the libvirt domain XML (the <cputune>/<vcpupin> and <memoryBacking>/<hugepages> elements).
Prevention: Periodically analyze host performance and rebalance resources.
19. Backup & Restore: Cloud Storage Data Recovery
Production scenario: Cloud‑hosted database or files are lost and need restoration.
Solution:
Utilize cloud provider tools (AWS S3, Google Cloud Storage) for backup and recovery.
Prevention: Configure automatic cloud backups and routinely verify backup integrity.
20. Service Scheduling: Kubernetes Cluster Scheduling Optimization
Production scenario: Uneven resource allocation in a Kubernetes cluster prevents some pods from starting.
Solution:
Inspect resources: kubectl describe pod <pod_name>.
Set appropriate resources.requests and resources.limits in pod specs.
Prevention: Regularly rebalance cluster resources to avoid node overload.
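Requests and limits are declared per container. A minimal sketch follows, in which the pod name, image tag, and resource values are all hypothetical and should be sized from observed usage.

```shell
#!/bin/sh
# Hypothetical helper: write a pod spec with explicit requests and limits.
# Usage: write_pod_spec web-pod.yaml && kubectl apply -f web-pod.yaml
write_pod_spec() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: nginx:1.27
      resources:
        requests:        # what the scheduler reserves on a node
          cpu: 250m
          memory: 256Mi
        limits:          # hard ceiling enforced at runtime
          cpu: 500m
          memory: 512Mi
EOF
}
```

The scheduler places pods by requests, not limits, so `kubectl describe nodes` (the "Allocated resources" section) shows how much of each node is already reserved.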
Full-Stack DevOps & Kubernetes
Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.
