Operations 30 min read

Common Production Faults and Their Handling Guide

This guide outlines the most common production failures (network, server, database, software, security, storage, configuration, and third-party service issues) and provides steps for detecting, diagnosing, and resolving each type to keep systems stable and reliable.

Wukong Talks Architecture

Common Production Faults

In production environments, typical fault types include network failures, server failures, database failures, software bugs, security vulnerabilities or attacks, storage failures, configuration errors, and third‑party service failures.

Network Failure Handling

How to Detect Network Failures

Connection status: check indicator lights on network devices and servers.

Ping test: use ping to verify communication with target devices.

Traffic monitoring: use tools such as Wireshark or ntop to spot abnormal packets, packet loss, or congestion.

Latency test: employ ping, traceroute, or MTR to measure network delay.

Log analysis: review server and network device logs for error messages related to connectivity.
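
The ping and latency checks above can also be scripted. The following is a minimal sketch, not a replacement for ping or MTR: it times a TCP connection instead of an ICMP echo, so it also works where ICMP is filtered. The function name tcp_latency_ms and the default timeout are assumptions made for this illustration.

```python
import socket
import time
from typing import Optional


def tcp_latency_ms(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if unreachable."""
    start = time.perf_counter()
    try:
        # create_connection raises OSError on refusal, timeout, or routing failure.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None
```

Calling tcp_latency_ms("127.0.0.1", 22) returns a float when something is listening on that port and None otherwise. Running it on a schedule against a few known hosts gives a rough latency baseline to compare against during an incident.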

How to Diagnose Network Failures

Check physical connections: ensure cables are properly inserted and undamaged.

Restart network equipment: reboot routers, switches, modems, etc.

Verify network configuration: confirm IP address, subnet mask, gateway, and other settings are correct.

Validate DNS settings: test domain name resolution by pinging the hostname, then access the service directly by IP address to see whether only name resolution is failing.

Inspect firewall rules: ensure no rules are blocking legitimate traffic.

Test other devices or websites to determine if the issue is isolated.
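
The DNS step in this list comes down to one question: does the service answer by IP while the name fails to resolve? A hedged sketch of that comparison follows. The helper names resolve and classify, and the wording of the returned messages, are assumptions for this example.

```python
import socket
from typing import List


def resolve(hostname: str) -> List[str]:
    """Return the sorted unique addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def classify(hostname: str, ip: str, port: int, timeout: float = 2.0) -> str:
    """Separate a DNS problem from a connectivity problem for one service."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            ip_ok = True
    except OSError:
        ip_ok = False
    dns_ok = bool(resolve(hostname))
    if not ip_ok:
        return "connectivity problem: the IP itself is unreachable"
    if not dns_ok:
        return "DNS problem: the IP answers but the name does not resolve"
    return "name resolution and connectivity both look fine"
```

The IP passed to classify should be the address the service is known to use, taken from documentation or an earlier healthy lookup rather than from the failing resolver.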

How to Resolve Network Failures

Repair physical cabling: reseat or replace damaged cables.

Restart routers, switches, or other network devices.

Correct network configuration errors.

Contact the ISP if the problem is beyond local control.

Server Failure Handling

How to Detect Server Failures

No response: services or websites hosted on the server cannot be reached.

Error logs: examine system and application logs for failure‑related entries.

Monitoring tools: watch CPU, memory, disk usage, and other performance metrics for anomalies.
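
Where no monitoring tool is installed yet, a few of these metrics can be sampled with the Python standard library alone. The sketch below is illustrative only: the metric names, the limits passed in, and the choice of disk usage plus load average are assumptions, and memory is left out because the standard library has no portable call for it.

```python
import os
import shutil
from typing import Dict, List


def collect_metrics(path: str = "/") -> Dict[str, float]:
    """Gather disk usage (percent) and, where available, 1-minute load per CPU."""
    usage = shutil.disk_usage(path)
    metrics = {"disk_used_pct": usage.used / usage.total * 100.0}
    if hasattr(os, "getloadavg"):  # not available on Windows
        try:
            metrics["load_per_cpu"] = os.getloadavg()[0] / (os.cpu_count() or 1)
        except OSError:
            pass  # load average could not be obtained on this system
    return metrics


def find_anomalies(metrics: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return a message for every metric that exceeds its limit."""
    return [
        f"{name}={value:.1f} exceeds limit {limits[name]:.1f}"
        for name, value in sorted(metrics.items())
        if name in limits and value > limits[name]
    ]
```

For example, find_anomalies(collect_metrics(), {"disk_used_pct": 90.0, "load_per_cpu": 1.0}) returns an empty list on a healthy host and one message per exceeded limit otherwise.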

How to Diagnose Server Failures

Check server status indicators (power, fan, disk activity lights).

Remote connection: attempt SSH or other remote access to verify reachability.

Restart the server after backing up critical data.

Inspect hardware components such as disks, memory, NICs, and power supply.

Verify that essential services and processes are running.

Review logs for error or warning messages.

Contact vendor or technical support if the issue cannot be resolved internally.
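
For the log review step, a quick count of error and warning lines often shows whether a log is worth reading in full. This is a rough sketch that assumes each line carries an upper-case severity keyword; real log formats vary, so the pattern will likely need adjusting.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed convention: the severity appears as an upper-case word somewhere in the line.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN(?:ING)?|CRITICAL|FATAL)\b")


def count_levels(lines: Iterable[str]) -> Counter:
    """Count log lines per severity, folding WARN and WARNING together."""
    counts: Counter = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            level = "WARNING" if match.group(1).startswith("WARN") else match.group(1)
            counts[level] += 1
    return counts
```

Because the function accepts any iterable of strings, an open file object can be passed directly, for example count_levels(open(path, encoding="utf-8", errors="replace")) with path pointing at the log to inspect.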

How to Resolve Server Failures

Restart the server to clear temporary issues.

Check and secure physical connections (power, network, data cables).

Inspect hardware health (disk, memory, CPU, power).

Analyze logs to pinpoint root causes.

Restore data from backups if corruption or loss occurred.

Apply OS, driver, and software patches.

Use diagnostic tools for hardware or performance troubleshooting.

Seek professional support when necessary.

Database Failure Handling

How to Detect Database Failures

Connection issues: applications cannot connect or receive connection refusals.

Database error logs: review MySQL, Oracle, or other DB logs for failure messages.

Monitoring tools: track CPU, memory, I/O, and other DB performance metrics.

How to Diagnose Database Failures

Check database service status.

Remote connection test from application servers or clients.

Verify configuration parameters, ports, and network settings.

Inspect disk space for data and log files.

Analyze transaction, error, and audit logs for anomalies.

Run health-check utilities (e.g., DBVERIFY for Oracle, CHECK TABLE for MySQL).

Restart the database service after ensuring backups are in place.
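
To show what a health-check step looks like in code, the sketch below uses SQLite's PRAGMA integrity_check. This is a stand-in chosen because it runs with the Python standard library alone; it is not DBVERIFY or CHECK TABLE, and the function name integrity_ok is an assumption for this example.

```python
import sqlite3


def integrity_ok(db_path: str) -> bool:
    """Return True if SQLite reports the database file as structurally sound."""
    conn = sqlite3.connect(db_path)
    try:
        # A healthy database yields exactly one row containing the text "ok".
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a valid SQLite database.
        return False
    finally:
        conn.close()
    return rows == [("ok",)]
```

The same pattern (run the vendor's check, treat anything other than a clean result as a failure, and only then consider a restart) carries over to other database engines with their own tools.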

How to Resolve Database Failures

Repair or restore corrupted database files using vendor tools.

Adjust database parameters based on the specific issue.

Perform performance tuning: optimize queries, indexes, and resources.

Upgrade or patch the database version to fix known bugs.

Restore from backups if data loss is severe.

Engage professional database support for complex problems.

Software Bug Handling

How to Detect Software Bugs

Application error messages or stack traces.

Abnormal behavior such as crashes, hangs, or unresponsiveness.

User feedback reporting unexpected issues.

How to Diagnose Software Bugs

Reproduce the issue to understand trigger conditions.

Analyze logs for exception details.

Use debugging tools to step through code.

Conduct code reviews to spot logical errors.

Verify environment and dependency configurations.

Check for available patches or updates.
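
Reproducing an issue is easier when the failing call's inputs and stack trace are already in the logs. One possible way to capture them is a thin wrapper like the one below. The name run_with_diagnostics is hypothetical, and the sketch assumes the standard logging module is in use.

```python
import logging
from typing import Any, Callable


def run_with_diagnostics(func: Callable[..., Any], *args: Any,
                         logger: logging.Logger, **kwargs: Any) -> Any:
    """Call func; on failure, log its inputs and full stack trace, then re-raise."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # logger.exception logs at ERROR level and appends the current traceback.
        logger.exception(
            "call to %s failed with args=%r kwargs=%r", func.__name__, args, kwargs
        )
        raise
```

Re-raising keeps the application's behavior unchanged. The wrapper only adds the trigger conditions to the log, which is usually the missing piece when a bug cannot be reproduced on demand. Arguments that may contain secrets should be masked before logging.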

How to Resolve Software Bugs

Fix the source code based on findings.

Adjust application configuration if misconfiguration caused the issue.

Apply updates or upgrade to newer software versions.

Security Vulnerability Handling

How to Detect Security Vulnerabilities

Regular security audits and scanning with dedicated tools.

Analyze security logs for suspicious activity.

Use IDS/IPS to detect exploitation attempts.

Monitor vendor security bulletins and vulnerability disclosures.
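
As one concrete case of analyzing security logs, repeated failed SSH logins from a single address are a common sign of a brute-force attempt. The sketch below assumes OpenSSH-style "Failed password for ... from ... port ..." lines; the function name and the threshold of 5 are arbitrary choices for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches OpenSSH-style lines, with or without the "invalid user" prefix.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+"
)


def suspicious_sources(lines: Iterable[str], threshold: int = 5) -> Dict[str, int]:
    """Return source addresses with at least `threshold` failed logins."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}
```

A dedicated IDS/IPS does this far more thoroughly. The value of a small script is mainly as a quick first look when those tools are not yet in place.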

How to Diagnose Security Vulnerabilities

Review system and application configurations for weaknesses.

Audit access control and permission settings.

Monitor and analyze network traffic for anomalies.

Run malware scanning tools with up‑to‑date signatures.

How to Resolve Security Vulnerabilities

Apply security patches and updates promptly.

Strengthen access controls and enable multi‑factor authentication.

Encrypt data at rest and in transit using strong algorithms.

Configure firewalls, IDS/IPS, and other network defenses.

Implement continuous security monitoring and regular audits.

Conduct security awareness training for staff.

Perform periodic vulnerability assessments and penetration testing.

Ensure compliance with relevant security standards.

Develop and test incident response and disaster recovery plans.

Engage professional security consultants when needed.

Apply network segmentation and isolation to limit breach impact.

Maintain robust log management and analysis processes.

Secure physical access to servers and networking equipment.

Assess and improve supply‑chain security.

Respond quickly to incidents, isolate affected systems, collect evidence, and remediate.

Storage Failure Handling

How to Detect Storage Failures

Monitor storage health via vendor tools or third‑party solutions.

Check device indicator lights for fault signals.

Review server and storage error logs.

Observe application errors related to storage access.

How to Diagnose Storage Failures

Validate storage connections (power, cables, fiber, network).

Inspect disk health using SMART data or vendor diagnostics.

Run storage diagnostic utilities provided by the vendor.

Restart storage devices and associated servers if needed.

Consider data recovery and verify backup integrity.
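
Verifying backup integrity can start with comparing checksums of the source file and its backup copy. The following is a minimal sketch using SHA-256. The function names are assumptions, and it only applies to file-level copies, not to backups stored in a compressed or deduplicated format.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source_path: str, backup_path: str) -> bool:
    """Return True if the backup is byte-for-byte identical to the source."""
    return sha256_of(source_path) == sha256_of(backup_path)
```

Reading in chunks keeps memory use flat for large files. A matching checksum shows the copy is intact, but a restore test is still the only way to confirm that a backup is usable.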

How to Resolve Storage Failures

Replace failed disks following vendor guidelines.

Repair file‑system errors with appropriate tools.

Expand storage capacity by adding disks or upgrading arrays.

Migrate or rebuild data as required.

Seek vendor support for complex issues.

Configuration Error Handling

How to Detect Configuration Errors

Monitor system logs and error reports for configuration‑related messages.

Collect user feedback indicating abnormal behavior.

Perform comprehensive functional testing.

How to Diagnose Configuration Errors

Review configuration files for incorrect settings.

Check environment variables and command‑line parameters.

Compare against official documentation and best‑practice guides.
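
Comparing a live configuration against a known-good baseline can be automated once both are loaded as dictionaries. The sketch below assumes flat key-value settings with string keys; the function name diff_config and the report layout are illustrative choices.

```python
from typing import Any, Dict, Tuple


def diff_config(expected: Dict[str, Any],
                actual: Dict[str, Any]) -> Dict[str, Tuple[str, Any, Any]]:
    """Report keys that are missing, unexpected, or changed.

    Each entry maps a key to (kind, expected_value, actual_value).
    """
    report: Dict[str, Tuple[str, Any, Any]] = {}
    for key in sorted(set(expected) | set(actual)):
        if key not in actual:
            report[key] = ("missing", expected[key], None)
        elif key not in expected:
            report[key] = ("unexpected", None, actual[key])
        elif expected[key] != actual[key]:
            report[key] = ("changed", expected[key], actual[key])
    return report
```

An empty result means the two configurations agree. Keeping the baseline in version control makes it possible to tell when a setting drifted, not just that it did.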

How to Resolve Configuration Errors

Correct the configuration files.

Update environment variables and command‑line arguments.

Restart affected applications or services.

Conduct functional and performance testing to confirm resolution.

Third‑Party Service Failure Handling

How to Detect Third‑Party Service Failures

Monitor the provider's status page or any health-check endpoints it offers.

Gather user feedback reporting issues related to the integration.

Inspect application logs for third‑party error messages.

How to Diagnose Third‑Party Service Failures

Determine the scope of the problem (local vs. provider‑wide).

Verify network connectivity and integration configuration (API keys, URLs, credentials).

Check the provider’s status page for maintenance or incidents.

Contact the provider’s support with detailed logs.

How to Resolve Third‑Party Service Failures

Correct integration configuration settings.

Restart your application or service.

Ensure network paths are open and not blocked by firewalls or proxies.

Refer to the provider’s status page for ongoing issues.

Engage the provider’s support team for assistance.

Consider alternative services if the provider cannot resolve the issue promptly.

Implement backup or fallback plans for critical third‑party dependencies.
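
A fallback plan for a critical dependency can be as simple as retrying a few times and then switching to an alternative such as a cache or a secondary provider. The sketch below shows one way to do that with exponential backoff. The function name, the retry count, and the delay values are assumptions, and the sleep parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T],
                       retries: int = 3, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Try primary up to `retries` times with exponential backoff, then use fallback."""
    for attempt in range(retries):
        try:
            return primary()
        except Exception:
            # Wait before the next attempt, but not after the final one.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()
```

In real use, the broad except clause would normally be narrowed to the errors the integration is known to raise, so that programming mistakes are not silently turned into fallback calls.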

operations, system reliability, troubleshooting, fault handling, Production
Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.
