22 Essential Ops Manager Tips for Building Resilient Web Infrastructure

This article compiles 22 practical recommendations from an operations manager covering domain management, CDN usage, image servers, data center selection, monitoring, security, redundancy, high‑availability architecture, disaster‑recovery planning, and team coordination to help ensure stable and secure online services.

Efficient Ops

Aug 13, 2017

1. Domain

Purchase multiple domains (e.g., 50‑100) from a reliable registrar such as GoDaddy, with domain privacy protection enabled to hide registrant details. Privacy protection does not hide the real server IP; that requires proxying traffic through a CDN (see tip 2). Manage DNS records on a service such as Cloudflare or DNSPod, or on a self‑hosted DNS server, for faster updates and multi‑IP resolution.
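To check that multi‑IP resolution is actually in effect, a small script can list the distinct addresses a domain resolves to. This is a minimal sketch using Python's standard `socket` module; the function name `resolve_ipv4` and the injectable `resolver` parameter are illustrative choices, not something from the original article.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses for a hostname.

    `resolver` defaults to socket.getaddrinfo and is injectable so the
    function can be exercised without network access.
    """
    infos = resolver(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) tuple.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # example.com is a placeholder; substitute one of your own domains.
    print(resolve_ipv4("example.com"))
```

If the list contains only one address for a domain that should be multi‑homed, the DNS records are worth a second look.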

2. CDN

Buy a CDN service (e.g., Cloudflare) to cache and proxy traffic, absorb large‑scale attacks (up to roughly 200 Gbps), and improve global access speed.

3. Image Server

Deploy dedicated image cache servers (NGINX can serve this role) separate from other services to accelerate image delivery.

4. Data Center

Select data centers with high reliability, strong DDoS protection, and responsive support; diversify across regions (e.g., Hong Kong for core servers, US for high‑defense nodes) to avoid single points of failure.

5. Homepage

Host a simple landing page on a cloud instance; for restricted content, use a CDN or hosting that does not require ICP filing (备案) to avoid domain or IP takedowns.

6. Monitoring System

Implement real‑time monitoring, log aggregation (e.g., syslog), traffic graphing (e.g., Cacti), and alerting to detect traffic spikes and potential attacks.
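One simple way to detect a traffic spike is to compare each new sample against a rolling average of recent normal traffic. The sketch below assumes you already collect a requests‑per‑second figure from your monitoring system; the class name, window size, and threshold factor are hypothetical and would need tuning against real traffic.

```python
from collections import deque
from statistics import mean


class SpikeDetector:
    """Flag a sample that exceeds `factor` times the rolling average."""

    def __init__(self, window=5, factor=3.0):
        self.samples = deque(maxlen=window)  # recent normal samples
        self.factor = factor

    def observe(self, requests_per_second):
        """Record one sample; return True if it looks like a spike."""
        is_spike = (
            len(self.samples) == self.samples.maxlen  # baseline is full
            and requests_per_second > self.factor * mean(self.samples)
        )
        # Only fold normal samples into the baseline, so an ongoing
        # attack does not gradually raise the threshold.
        if not is_spike:
            self.samples.append(requests_per_second)
        return is_spike


if __name__ == "__main__":
    detector = SpikeDetector(window=3, factor=2.0)
    for rps in [100, 110, 90, 105, 500, 95]:
        print(rps, detector.observe(rps))
```

A fixed multiple of the rolling mean is crude; it is meant as a starting point for alerting, not a replacement for a proper monitoring stack.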

7. Attack Defense

Use NGINX and iptables for low‑volume attacks; rely on high‑defense data centers and CDN for large‑scale DDoS, and be ready to switch domains to backup servers.

8. Redundancy

Design for at least double the expected concurrent users (e.g., 2 000 concurrent users for a 1 000‑user load) to handle traffic spikes.
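The headroom rule translates into simple arithmetic once you know how many users one server can handle. In this sketch, `users_per_server` is an assumed figure that you would measure with your own load tests; it does not come from the article.

```python
import math


def servers_needed(expected_users, users_per_server, headroom=2.0):
    """Number of servers required to carry `headroom` times the expected load."""
    target_users = expected_users * headroom
    return math.ceil(target_users / users_per_server)


if __name__ == "__main__":
    # 1,000 expected users, doubled to 2,000, at an assumed 300 users per server.
    print(servers_needed(1000, 300))  # 7
```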

9. Server Configuration

Equip servers with three network interfaces (public, internal, and SSH management), multiple IPs, RAID‑1 storage, dual CPUs, and dual power supplies, so that no single component becomes a single point of failure.

10. Database

Set up master‑slave replication with off‑site backups; separate front‑end and back‑end services onto different machines; consider virtual machines for auxiliary services.

11. Test Environments

Maintain three environments: developer machines, internal LAN testing, and internet‑facing testing, each with version control (SVN or Git) and stable hardware.

12. Shield and Core Servers

Ensure connectivity between shield (front‑end) servers and core servers via ping tests to verify network paths.
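The ping check can be scripted so it runs from each shield server against the list of core servers. This is a sketch under stated assumptions: the IP addresses are placeholders, the helper names are hypothetical, and it relies on the system `ping` binary (`-c` on Linux and macOS, `-n` on Windows). Note that a failed ping is not conclusive if a firewall drops ICMP.

```python
import platform
import subprocess


def ping_command(host, count=3):
    """Build a ping command line for the current operating system."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def is_reachable(host):
    """Return True if the system ping exits with status 0."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(hosts, check=is_reachable):
    """Return the hosts that fail the reachability check."""
    return [host for host in hosts if not check(host)]


if __name__ == "__main__":
    core_servers = ["10.0.0.1", "10.0.0.2"]  # placeholder addresses
    print("unreachable:", unreachable_hosts(core_servers))
```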

13. Operations Staff

Staff at least two operators (one manager, one engineer), plus a network administrator, with documented procedures and 24‑hour on‑call coverage.

14. Linux Optimization & Security

Tune NGINX and other services for CPU and memory usage, and rotate passwords regularly (e.g., every three months), especially for domain registrar and email accounts.
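A rotation policy is easier to keep when something reports which accounts are overdue. The following sketch assumes you track the last password change date per account yourself; the account names, the 90‑day period (approximating three months), and the function name are illustrative.

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)  # roughly three months


def overdue_accounts(last_changed, today):
    """Return account names whose password is older than ROTATION_PERIOD.

    `last_changed` maps an account name to the date of its last change.
    """
    return sorted(
        name
        for name, changed in last_changed.items()
        if today - changed > ROTATION_PERIOD
    )


if __name__ == "__main__":
    records = {
        "registrar": date(2017, 1, 1),  # hypothetical accounts
        "mail": date(2017, 7, 1),
    }
    print(overdue_accounts(records, today=date(2017, 8, 13)))  # ['registrar']
```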

15. LAN

Provide a stable LAN with at least 10 Mbps bandwidth, redundant cables, and a mobile Wi‑Fi hotspot for staff.

16. Large‑Scale Architecture

For extensive networks, build a dedicated core data center staffed by engineers across databases, networking, security, and storage.

17. Operations Tools

Standardize tools such as SQLyog for databases, SecureCRT for SSH, KeePass for passwords, and WinSCP for file transfers; encourage continuous learning and reading documentation in English.

18. Disaster Recovery Plan

Maintain a documented failover plan, regularly practice restoration drills, and ensure backups are reliable.
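One basic reliability check is to confirm that a backup copy is byte‑identical to its source by comparing checksums. This is a minimal sketch with hypothetical function names; it verifies file integrity only and does not replace an actual restoration drill.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source, backup):
    """Return True if the backup file has the same content as the source."""
    return sha256_of(source) == sha256_of(backup)


if __name__ == "__main__":
    # Placeholder paths; point these at a real dump and its off-site copy.
    src, dst = Path("db_dump.sql"), Path("backup/db_dump.sql")
    if src.exists() and dst.exists():
        print("backup OK" if backup_matches(src, dst) else "backup MISMATCH")
```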

19. Server Security

Implement comprehensive security hardening covering user, application, system, and file security.

20. High‑Concurrency Testing

Simulate 2 000 concurrent users to evaluate load handling; invest in necessary hardware and bandwidth.
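A rough load test can be driven from Python's standard thread pool. In the sketch below, `request_fn` is a hypothetical callable that performs one request against your service and raises on failure; `total_requests` and `workers` are assumed parameter names. True 2,000‑way concurrency needs `workers=2000` or a dedicated load‑testing tool, so treat this as a smoke test rather than a benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(request_fn, total_requests=2000, workers=200):
    """Call request_fn total_requests times across a pool of worker threads."""
    if total_requests < 1:
        raise ValueError("total_requests must be at least 1")

    def timed_call(_):
        start = time.perf_counter()
        try:
            request_fn()
            ok = True
        except Exception:
            ok = False  # count any exception as a failed request
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        # Approximate 95th percentile by index into the sorted latencies.
        "p95_seconds": latencies[int(0.95 * (len(latencies) - 1))],
    }


if __name__ == "__main__":
    # Stand-in for a real HTTP request: sleep for 10 ms.
    print(run_load_test(lambda: time.sleep(0.01), total_requests=200, workers=50))
```

Run the generator from a machine outside the target's network so the test also exercises bandwidth, not just the application.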

21. Operations Information Sharing

Share all operational details (passwords, configurations) within the team, fostering a collaborative and skilled environment.

22. Ongoing Operations

After launch, continue with version upgrades, monitoring, performance tuning, database optimization, scaling architecture with traffic changes, security updates, and DevOps automation.

Article originally published on 简书 (Jianshu).