22 Essential Ops Manager Tips for Building Resilient Web Infrastructure
This article compiles 22 practical recommendations from an operations manager covering domain management, CDN usage, image servers, data center selection, monitoring, security, redundancy, high‑availability architecture, disaster‑recovery planning, and team coordination to help ensure stable and secure online services.
1. Domain
Purchase multiple domains (e.g., 50‑100) from a reliable registrar such as GoDaddy and enable its domain protection. Note that registrar privacy protection conceals registrant details; hiding the real server IP depends on proxying traffic through a CDN (see tip 2). Manage DNS records on services like Cloudflare, DNSPod, or a self‑hosted DNS server for faster updates and multi‑IP resolution.
2. CDN
Buy a CDN service (e.g., Cloudflare) to cache and forward traffic, absorb large‑scale attacks (the source cites up to 200 GB, presumably meaning attack traffic on the order of 200 Gbps), and improve global access speed.
3. Image Server
Deploy dedicated image cache servers (NGINX can serve this role) separate from other services to accelerate image delivery.
4. Data Center
Select data centers with high reliability, strong DDoS protection, and responsive support; diversify across regions (e.g., Hong Kong for core servers, US for high‑defense nodes) to avoid single points of failure.
5. Homepage
Host a simple landing page on a cloud instance; for restricted content, use a CDN or hosting that is not subject to ICP filing (备案) to reduce the risk of domain or IP takedowns.
6. Monitoring System
Implement real‑time monitoring and graphing (e.g., Cacti), log aggregation (e.g., syslog), and alerting to detect traffic spikes and potential attacks.
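As a rough illustration of spike detection, the sketch below flags any sample that exceeds a multiple of the recent average. The function name `detect_spikes`, the window of 5 samples, and the factor of 3 are assumptions chosen for the example, not values from the original article; a real monitoring system would tune them against its own traffic.

```python
from collections import deque
from statistics import mean


def detect_spikes(samples, window=5, factor=3.0):
    """Return indices of samples that exceed `factor` times the mean of the
    preceding `window` samples. Window and factor are illustrative only."""
    recent = deque(maxlen=window)
    spikes = []
    for index, value in enumerate(samples):
        # Only compare once a full baseline window has been collected.
        if len(recent) == window and value > factor * mean(recent):
            spikes.append(index)
        recent.append(value)
    return spikes


# Hypothetical requests-per-minute readings; index 6 is an obvious spike.
requests_per_minute = [100, 110, 95, 105, 100, 102, 900, 120, 98]
print(detect_spikes(requests_per_minute))  # prints [6]
```

A flagged index would normally trigger an alert rather than an automatic action, since a spike can be legitimate traffic as well as an attack.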
7. Attack Defense
Use NGINX and iptables for low‑volume attacks; rely on high‑defense data centers and CDN for large‑scale DDoS, and be ready to switch domains to backup servers.
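For low‑volume attacks, a common first step is finding which client IPs are sending an unusual number of requests. The sketch below counts requests per IP from access‑log lines. It assumes the client IP is the first whitespace‑separated field, as in NGINX's default "combined" log format; the function name `noisy_clients` and the threshold are hypothetical.

```python
from collections import Counter


def noisy_clients(log_lines, threshold=100):
    """Return {ip: request_count} for IPs with more than `threshold` requests.

    Assumes each non-empty line starts with the client IP (NGINX "combined"
    access log format). The threshold is an illustrative assumption.
    """
    counts = Counter(
        line.split(maxsplit=1)[0] for line in log_lines if line.strip()
    )
    return {ip: count for ip, count in counts.items() if count > threshold}


# Hypothetical log excerpt using documentation-reserved IP ranges.
sample = (
    ['203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 612'] * 5
    + ['198.51.100.2 - - [01/Jan/2024:00:00:02 +0000] "GET / HTTP/1.1" 200 612'] * 2
)
print(noisy_clients(sample, threshold=3))  # prints {'203.0.113.7': 5}
```

The resulting IPs are candidates for review before being blocked at the NGINX or iptables level; shared proxies and crawlers can legitimately appear in such a list.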
8. Redundancy
Design for at least double the expected concurrent users (e.g., 2 000 concurrent users for a 1 000‑user load) to handle traffic spikes.
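The headroom rule is simple arithmetic, sketched here under stated assumptions. The 2x factor comes from the tip above; `users_per_server` is a hypothetical figure that would have to come from your own load tests.

```python
import math


def required_capacity(expected_users, headroom=2.0):
    """Concurrent users to provision for, given a headroom multiplier."""
    return math.ceil(expected_users * headroom)


def servers_needed(expected_users, users_per_server, headroom=2.0):
    """Servers required when each one handles `users_per_server` users.

    `users_per_server` is an assumed, measured-elsewhere value.
    """
    return math.ceil(required_capacity(expected_users, headroom) / users_per_server)


print(required_capacity(1000))    # prints 2000
print(servers_needed(1000, 300))  # prints 7
```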
9. Server Configuration
Equip servers with three network interfaces (public, internal, SSH management), multiple IPs, RAID‑1 storage, dual CPUs, and dual power supplies so that no single component becomes a point of failure.
10. Database
Set up master‑slave replication with off‑site backups; separate front‑end and back‑end services onto different machines; consider virtual machines for auxiliary services.
11. Test Environments
Maintain three environments: developer machines, internal LAN testing, and internet‑facing testing, each with version control (SVN or Git) and stable hardware.
12. Shield and Core Servers
Verify connectivity between shield (front‑end) servers and core servers with regular ping tests along each network path.
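A ping check like this can be scripted. The sketch below assumes a Linux host with the iputils `ping` command, where `-c` sets the packet count and `-W` the reply timeout in seconds. The function names and the example addresses are hypothetical, and the `run` parameter exists only so the check can be exercised without real network access.

```python
import subprocess


def build_ping_command(host, count=3, timeout_s=2):
    """Build a Linux iputils ping command: -c packets, -W timeout (seconds)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def check_paths(hosts, run=subprocess.run):
    """Return {host: reachable}; reachable means ping exited with status 0."""
    results = {}
    for host in hosts:
        completed = run(
            build_ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[host] = completed.returncode == 0
    return results


# Example call from a shield server (addresses are placeholders):
# check_paths(["10.0.0.10", "10.0.0.11"])
```

Ping only confirms ICMP reachability; a path can answer pings while the application port is filtered, so a port‑level check may be worth adding.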
13. Operations Staff
Staff at least two operators (one manager, one engineer) plus a network administrator, with documented procedures and 24‑hour on‑call coverage.
14. Linux Optimization & Security
Tune NGINX and other services for CPU and memory usage, and rotate passwords regularly (e.g., every three months), especially for domain and email accounts.
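Rotation is easier to enforce when overdue accounts are listed automatically. The following is a minimal sketch, assuming you keep a record of when each password was last changed; the account names and the 90‑day limit (an approximation of three months) are illustrative.

```python
from datetime import date, timedelta


def overdue_passwords(last_changed, today, max_age_days=90):
    """Return sorted account names whose password is older than max_age_days.

    `last_changed` maps account name -> date of last password change.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, changed in last_changed.items() if changed < cutoff)


# Hypothetical records for a registrar account and a mail account.
records = {"registrar": date(2024, 1, 15), "email": date(2024, 5, 1)}
print(overdue_passwords(records, today=date(2024, 6, 1)))  # prints ['registrar']
```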
15. LAN
Provide a stable LAN with at least 10 Mbps bandwidth, redundant cables, and a mobile Wi‑Fi hotspot for staff.
16. Large‑Scale Architecture
For extensive networks, build a dedicated core data center staffed by engineers across databases, networking, security, and storage.
17. Operations Tools
Standardize tools such as SQLyog for databases, SecureCRT for SSH, KeePass for passwords, and WinSCP for file transfers; encourage continuous learning, including reading English‑language documentation.
18. Disaster Recovery Plan
Maintain a documented failover plan, regularly practice restoration drills, and ensure backups are reliable.
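One small, automatable part of "ensure backups are reliable" is confirming that a backup copy is byte‑identical to its source. The sketch below compares SHA‑256 digests; the function names are hypothetical, and the approach assumes the source file is not being modified while it is hashed.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source_path, backup_path):
    """True when the backup file has the same SHA-256 digest as the source."""
    return sha256_of(source_path) == sha256_of(backup_path)
```

A matching checksum shows the copy is intact, not that it can be restored into a working service, so it complements restoration drills rather than replacing them.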
19. Server Security
Implement comprehensive security hardening covering user, application, system, and file security.
20. High‑Concurrency Testing
Simulate 2 000 concurrent users to evaluate load handling; invest in necessary hardware and bandwidth.
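To show the shape of such a test, here is a minimal sketch using Python's `asyncio`. It launches a configurable number of simulated users against an async callable and reports successes, failures, and the worst latency. `simulate_users` and `fake_request` are hypothetical names, and `fake_request` is a stand‑in that only sleeps; a real test would replace it with an actual HTTP call or use a dedicated load‑testing tool.

```python
import asyncio
import time


async def simulate_users(request, users=2000, max_in_flight=2000):
    """Run `users` calls of the async callable `request` concurrently,
    with at most `max_in_flight` running at once."""
    gate = asyncio.Semaphore(max_in_flight)
    latencies = []
    failures = 0

    async def one_user(user_id):
        nonlocal failures
        async with gate:
            start = time.perf_counter()
            try:
                await request(user_id)
                latencies.append(time.perf_counter() - start)
            except Exception:
                # Any exception from the request counts as a failed user.
                failures += 1

    await asyncio.gather(*(one_user(i) for i in range(users)))
    return {
        "ok": len(latencies),
        "failed": failures,
        "max_latency_s": max(latencies, default=0.0),
    }


async def fake_request(user_id):
    """Stand-in for a real request; fails for every 100th simulated user."""
    await asyncio.sleep(0.01)
    if user_id % 100 == 0:
        raise RuntimeError("simulated failure")


result = asyncio.run(simulate_users(fake_request, users=2000))
print(result["ok"], result["failed"])  # prints 1980 20
```

A single client machine may itself become the bottleneck at this concurrency, so results are most meaningful when the load generator has spare CPU and bandwidth.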
21. Operations Information Sharing
Share all operational details (passwords, configurations) within the team, fostering a collaborative and skilled environment.
22. Ongoing Operations
After launch, continue with version upgrades, monitoring, performance tuning, database optimization, scaling architecture with traffic changes, security updates, and DevOps automation.
Article originally published on 简书 (Jianshu).
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
