How to Scale Internet Operations with Standardization, Config Management, and Monitoring
This article explores how large‑scale internet operations can achieve order and efficiency by applying entropy theory, standardizing configuration and monitoring, adopting automated deployment practices, and leveraging open‑source tools like Open‑Falcon to build a fully automated, resilient infrastructure.
Applying Entropy Theory to Operations
An isolated system naturally drifts toward disorder; external energy is required to impose order. Large internet systems, with thousands of servers and rapid iteration, tend toward chaos, demanding deliberate effort to maintain stability.
Scaling Internet Operations
Supporting 10 servers versus 2,000 servers illustrates the power of standardized monitoring and deployment. When services share identical monitoring, deployment, and automation, a single engineer can comfortably manage thousands of machines, as demonstrated by Didi’s growth from a few servers to tens of thousands.
Standardization is the prerequisite for automation.
Service Standardization Efforts
Configuration Management
Common configuration items:
Feature switches (degrade, debug, A/B testing)
Adjustable parameters (timeouts, concurrency limits, log levels)
Upstream connection information
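As a concrete illustration of these three categories, they might be grouped into a single structure like the Go sketch below; every field, tag, and name here is hypothetical, not taken from any particular system.

```go
// Hypothetical illustration only: names are not from the article.
package config

// ServiceConfig groups the three kinds of items listed above:
// feature switches, adjustable parameters, and upstream info.
type ServiceConfig struct {
	// Feature switches: degrade, debug, A/B testing.
	Switches struct {
		DegradeMode bool   `json:"degrade_mode"` // serve cached/partial results under load
		DebugLog    bool   `json:"debug_log"`
		ABVariant   string `json:"ab_variant"` // e.g. "A" or "B"
	} `json:"switches"`

	// Adjustable parameters: timeouts, concurrency limits, log levels.
	Params struct {
		TimeoutMs      int    `json:"timeout_ms"`
		MaxConcurrency int    `json:"max_concurrency"`
		LogLevel       string `json:"log_level"` // "debug" | "info" | "warn" | "error"
	} `json:"params"`

	// Upstream connection information: addresses of services we call.
	Upstreams map[string][]string `json:"upstreams"` // service name -> "ip:port" list
}
```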
Upstream information is the most tightly coupled and complex part of a system.
Typical management approaches:
Via LVS: vip:port → real‑server list. Benefits: high availability, load balancing, health checks. Drawback: the director itself can become a single point of failure.
Via Nginx: ip:port/server/location → upstream list. Similar pros and cons, operating at layer 7.
Via DNS : domain → IP list. Simple but limited load‑balancing and slow failover.
Via Zookeeper/etcd : service discovery. Mature but vulnerable to network partitions.
Via local config files : direct embedding in modules. Leads to scattered topology and slow failover.
We need a configuration system that decouples code from environment, centralizes management, propagates changes instantly, provides health checks and load balancing, and can be layered with automatic service registration and discovery.
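As a sketch of what the client side of such a system can look like, the Go fragment below watches a key in etcd (one of the stores surveyed above) and hot-reloads the upstream list; the key layout and endpoint address are assumptions for illustration, not a prescribed design.

```go
// A minimal sketch, assuming etcd as the centralized config store.
package main

import (
	"context"
	"encoding/json"
	"log"
	"sync/atomic"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// current holds the latest upstream list; readers load it lock-free.
var current atomic.Value // stores []string

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	key := "/config/my-service/upstreams" // hypothetical key layout

	// Initial load: fetch the current value once at startup.
	resp, err := cli.Get(context.Background(), key)
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		apply(kv.Value)
	}

	// Watch: a change written to the config center is applied at once,
	// with no process restart; this is the "instant effect" property.
	for wresp := range cli.Watch(context.Background(), key) {
		for _, ev := range wresp.Events {
			apply(ev.Kv.Value)
		}
	}
}

func apply(raw []byte) {
	var upstreams []string
	if err := json.Unmarshal(raw, &upstreams); err != nil {
		log.Printf("bad upstream config: %v", err)
		return
	}
	current.Store(upstreams)
	log.Printf("upstreams reloaded: %v", upstreams)
}
```

Because request handlers read the list through an atomic value, an update takes effect on the next request without a restart or redeploy.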
Monitoring
Monitoring runs through the entire product lifecycle and is among its most critical activities, enabling early fault detection and supplying the data needed for post‑mortem analysis.
Current problematic practices
Online log analysis with complex regexes – high maintenance and performance cost.
Offline log analysis – low timeliness and high resource consumption.
Ad‑hoc status endpoints – inconsistent formats across teams.
External monitoring “add‑ons” – indicate lack of built‑in observability.
These approaches lead to tightly coupled, hard‑to‑maintain monitoring solutions.
Two fundamental principles
Metric collection must be part of the code itself, with steadily increasing coverage (see the instrumentation sketch after this list).
Monitoring methods and metrics must be standardized and tool‑supported.
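A minimal sketch of the first principle, using only the Go standard library: measurement happens as a side effect of serving each request, so coverage grows with the code instead of being bolted on afterwards. All names here are illustrative, not any specific library's API.

```go
// In-code metric collection for an HTTP API (illustrative sketch).
package metrics

import (
	"net/http"
	"sync"
	"time"
)

// apiStats accumulates, per API, the raw data behind the standard
// metrics: request count (for cps), error count, and latencies
// (from which percentiles are computed at report time).
type apiStats struct {
	mu        sync.Mutex
	requests  int64
	errors    int64
	latencies []time.Duration
}

var (
	stats   = map[string]*apiStats{}
	statsMu sync.Mutex
)

func getStats(api string) *apiStats {
	statsMu.Lock()
	defer statsMu.Unlock()
	s, ok := stats[api]
	if !ok {
		s = &apiStats{}
		stats[api] = s
	}
	return s
}

// Instrument wraps a handler so every call is measured as a side
// effect of serving it; the metric code lives with the business code.
func Instrument(api string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)

		s := getStats(api)
		s.mu.Lock()
		s.requests++
		if rec.status >= 500 {
			s.errors++
		}
		s.latencies = append(s.latencies, time.Since(start))
		s.mu.Unlock()
	}
}

// statusRecorder captures the status code written by the handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}
```

Wiring it up is one line per API, e.g. `http.HandleFunc("/api/query", metrics.Instrument("query", handleQuery))`; rates and percentiles are then derived from the accumulated data each reporting period.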
Monitoring standards
Every API should be monitorable and report at least:
cps (calls per second)
latency: 50th/75th/95th/99th percentiles
error_rate
error_count
Optional custom metrics (e.g., caller, callee) can provide call‑graph detail. All metrics should be pushed proactively without prior registration.
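To sketch that push model, the fragment below reports one period's values to a local Open‑Falcon agent, assuming the agent's HTTP push interface (POST /v1/push on port 1988); the metric values, step, and tags are illustrative.

```go
// A sketch of proactively pushing standardized metrics to a local
// Open-Falcon agent; assumes the agent's documented push endpoint.
package metrics

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// falconMetric mirrors the JSON body the agent accepts.
type falconMetric struct {
	Endpoint    string  `json:"endpoint"`    // usually the hostname
	Metric      string  `json:"metric"`      // e.g. "cps", "error_rate"
	Timestamp   int64   `json:"timestamp"`   // unix seconds
	Step        int     `json:"step"`        // reporting period, seconds
	Value       float64 `json:"value"`
	CounterType string  `json:"counterType"` // "GAUGE" or "COUNTER"
	Tags        string  `json:"tags"`        // e.g. "api=query,caller=web"
}

// Push reports one period's metrics. No prior registration is needed:
// a series is created on first push.
func Push(api string, cps, errorRate float64) error {
	host, _ := os.Hostname()
	now := time.Now().Unix()
	mk := func(metric string, value float64) falconMetric {
		return falconMetric{
			Endpoint: host, Metric: metric, Timestamp: now,
			Step: 60, Value: value, CounterType: "GAUGE",
			Tags: "api=" + api,
		}
	}
	body := []falconMetric{mk("cps", cps), mk("error_rate", errorRate)}
	buf, err := json.Marshal(body)
	if err != nil {
		return err
	}
	resp, err := http.Post("http://127.0.0.1:1988/v1/push",
		"application/json", bytes.NewReader(buf))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("push failed: %s", resp.Status)
	}
	return nil
}
```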
Example: nginx metrics collection (see implementation link) includes API tags, error counts, and upstream stats.
Standardized metrics enable a few alert rules to cover most services and allow tailored dashboards for different stakeholders.
Deployment
Deployment means keeping a defined number of service instances running across available resources. In practice, the pain points include high onboarding cost, diverse language stacks, chaotic upstream information, fragmented user experience, and the coordination burden of incremental updates.
Deployment principles
Version‑based releases.
Unified packaging (e.g., Docker images).
Centralized configuration decoupled from environments.
Standardized rollout flow with preview, canary releases, integrated monitoring, and trend analysis (see the canary sketch after this list).
Logs shipped over the network rather than tied to local disks.
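The rollout flow in particular benefits from standardized monitoring: because every service exposes the same error_rate metric, one generic gate can guard every release. The sketch below is hypothetical; deployBatch, errorRate, and rollback stand in for whatever the deployment system actually provides.

```go
// A minimal canary-rollout sketch under the assumptions above.
package deploy

import (
	"fmt"
	"log"
)

// Rollout releases a version in stages: a one-instance canary first,
// then widening batches, checking standardized metrics between steps.
func Rollout(version string, instances []string) error {
	if len(instances) < 2 {
		return deployBatch(version, instances)
	}
	batches := [][]string{
		instances[:1],                   // canary: one instance
		instances[1 : len(instances)/2], // half
		instances[len(instances)/2:],    // the rest
	}
	for i, batch := range batches {
		if err := deployBatch(version, batch); err != nil {
			rollback(version, instances)
			return fmt.Errorf("batch %d failed: %w", i, err)
		}
		// Every service reports the same error_rate metric, so one
		// generic check gates every rollout.
		if r := errorRate(batch); r > 0.01 {
			rollback(version, instances)
			return fmt.Errorf("error_rate %.2f%% after batch %d", r*100, i)
		}
		log.Printf("batch %d ok (%d instances)", i, len(batch))
	}
	return nil
}

// Hypothetical hooks into the underlying deployment system.
func deployBatch(version string, hosts []string) error { return nil }
func errorRate(hosts []string) float64                 { return 0 }
func rollback(version string, hosts []string)          {}
```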
Benefits of sustained standardization
Configuration and environment decoupled.
Monitoring standardized.
Deployment standardized.
Logs networked.
Data services exposed.
Instances self‑discover.
Resources containerized.
Result: fully automated scheduling and operation.
About Open‑Falcon
Open‑Falcon is an open‑source, enterprise‑grade, highly available, and scalable monitoring system originally launched by Xiaomi’s operations team. It now enjoys a large community of over 2,000 contributors from companies such as Xiaomi, Meituan, Kuaiyun, and Didi.
Efficient Ops
Efficient Ops is a public account maintained by Xiaotianguo and friends that regularly publishes widely read original technical articles. We focus on operations transformation and hope to accompany you throughout your operations career, growing together.