
How to Scale Internet Operations with Standardization, Config Management, and Monitoring

This article explores how large‑scale internet operations can achieve order and efficiency by applying entropy theory, standardizing configuration and monitoring, adopting automated deployment practices, and leveraging open‑source tools like Open‑Falcon to build a fully automated, resilient infrastructure.

Applying Entropy Theory to Operations

An isolated system naturally drifts toward disorder; external energy is required to impose order. Large internet systems, with thousands of servers and rapid iteration, tend toward chaos, demanding deliberate effort to maintain stability.

Scaling Internet Operations

Supporting 10 servers versus 2,000 servers illustrates the power of standardized monitoring and deployment. When services share identical monitoring, deployment, and automation, a single engineer can comfortably manage thousands of machines, as demonstrated by Didi’s growth from a few servers to tens of thousands.

Standardization is the prerequisite for automation.

Service Standardization Efforts

Configuration Management

Common configuration items

Feature switches (degrade, debug, A/B testing)

Adjustable parameters (timeouts, concurrency limits, log levels)

Upstream connection information
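The three categories above can be made concrete with a small sketch. The names, values, and services below are illustrative only, not taken from any real system:

```python
# Hypothetical service configuration illustrating the three common item types.
# Every name and address here is an assumption for illustration.
SERVICE_CONFIG = {
    # Feature switches: degrade, debug, A/B testing
    "switches": {
        "degrade_recommendations": False,
        "debug_logging": False,
        "ab_test_new_ranker": True,
    },
    # Adjustable parameters: timeouts, concurrency limits, log levels
    "params": {
        "upstream_timeout_ms": 500,
        "max_concurrency": 128,
        "log_level": "INFO",
    },
    # Upstream connection information -- the most tightly coupled part
    "upstreams": {
        "user-service": ["10.0.0.11:8080", "10.0.0.12:8080"],
        "order-service": ["10.0.1.21:8080"],
    },
}
```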

Upstream information is the most tightly coupled and complex part of a system.

Typical management approaches:

Via LVS: vip:port → real-server list. Benefits: high availability, load balancing, health checks. Drawback: the LVS layer itself can become a single point of failure or a traffic bottleneck.

Via Nginx: ip:port/server/location → upstream list. Similar pros and cons, operating at layer 7.

Via DNS: domain → IP list. Simple, but load balancing is limited and failover is slow.

Via Zookeeper/etcd: service discovery. Mature, but vulnerable to network partitions.

Via local config files: upstream lists embedded directly in each module. Leads to scattered topology and slow failover.

We need a configuration system that decouples code from environment, centralizes management, takes effect instantly, and supports health checks and load balancing, with automatic registration and discovery layered on top.
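The "centralized management with instant effect" idea can be sketched in-process. A real system would back this with an HTTP service or a store like etcd; the class and key names below are assumptions for illustration:

```python
import threading
from typing import Any, Callable, Dict, List

class ConfigCenter:
    """Minimal sketch of a centralized config store with instant effect:
    subscribers are notified on every update, so modules never hard-code
    environment-specific values and never need a restart to pick up changes."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._watchers: Dict[str, List[Callable[[Any], None]]] = {}
        self._lock = threading.Lock()

    def watch(self, key: str, callback: Callable[[Any], None]) -> None:
        # Register a callback fired whenever `key` changes.
        with self._lock:
            self._watchers.setdefault(key, []).append(callback)

    def set(self, key: str, value: Any) -> None:
        # Update a key and push the new value to all watchers immediately.
        with self._lock:
            self._data[key] = value
            watchers = list(self._watchers.get(key, []))
        for cb in watchers:
            cb(value)

    def get(self, key: str, default: Any = None) -> Any:
        with self._lock:
            return self._data.get(key, default)

# Usage: a module reacts to an upstream change without redeploying.
center = ConfigCenter()
current_upstreams: List[str] = []

def on_upstreams_change(value):
    current_upstreams[:] = value  # instant effect: no restart needed

center.watch("user-service.upstreams", on_upstreams_change)
center.set("user-service.upstreams", ["10.0.0.11:8080", "10.0.0.12:8080"])
```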

Monitoring

Monitoring is the most critical part of the product lifecycle, providing early fault detection and post‑mortem data.

Current problematic practices

Online log analysis with complex regexes – high maintenance and performance cost.

Offline log analysis – low timeliness and high resource consumption.

Ad‑hoc status endpoints – inconsistent formats across teams.

External monitoring “add‑ons” – indicate lack of built‑in observability.

These approaches lead to tightly coupled, hard‑to‑maintain monitoring solutions.

Two fundamental principles

Metric collection must be part of the code; increase coverage.

Monitoring methods and metrics must be standardized and tool‑supported.
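The first principle, collection built into the code, is commonly realized with a decorator around each handler. This is a generic sketch, not the article's actual implementation; the metric store and names are assumptions:

```python
import time
from collections import defaultdict
from functools import wraps

# In-process metric store; a real agent would flush these on a fixed step.
METRICS = defaultdict(lambda: {"count": 0, "errors": 0, "latencies": []})

def monitored(api_name):
    """Metric collection as part of the code: wrap any handler to
    record call count, error count, and latency automatically."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                METRICS[api_name]["errors"] += 1
                raise
            finally:
                METRICS[api_name]["count"] += 1
                METRICS[api_name]["latencies"].append(time.monotonic() - start)
        return wrapper
    return decorator

@monitored("get_user")
def get_user(uid):
    return {"uid": uid}

get_user(42)
```

Because instrumentation rides along with every handler, coverage grows with the codebase instead of being bolted on afterward.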

Monitoring standards

Every API should be monitorable and report at least:

cps (calls per second)

latency: 50th/75th/95th/99th percentiles

error_rate

error_count

Optional custom metrics (e.g., caller, callee) can provide call‑graph detail. All metrics should be pushed proactively without prior registration.
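Proactive pushing without prior registration maps directly onto Open-Falcon's agent push interface, which accepts a JSON list of metric items. The payload fields below follow that API; the exact metric names (e.g. "latency.95th") and values are illustrative assumptions:

```python
import json
import time

def build_falcon_payload(endpoint, api, cps, latency_ms, error_rate):
    """Build metric items in Open-Falcon's push format. Field names
    (endpoint, metric, timestamp, step, counterType, tags) follow the
    agent's /v1/push API; metric names mirror the standard set above."""
    common = {
        "endpoint": endpoint,
        "timestamp": int(time.time()),
        "step": 60,                 # reporting interval in seconds
        "counterType": "GAUGE",
        "tags": f"api={api}",       # custom tags enable call-graph detail
    }
    return [
        dict(common, metric="cps", value=cps),
        dict(common, metric="latency.95th", value=latency_ms),
        dict(common, metric="error_rate", value=error_rate),
    ]

payload = build_falcon_payload("web-01", "get_user", 1200, 35.0, 0.002)
body = json.dumps(payload)
# A real push would POST `body` to the local agent, e.g.
# http://127.0.0.1:1988/v1/push -- no prior metric registration required.
```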

Example: nginx metrics collection (see implementation link) includes API tags, error counts, and upstream stats.

Standardized metrics enable a few alert rules to cover most services and allow tailored dashboards for different stakeholders.
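Because every service emits the same metric names, a single generic rule set can cover them all. A minimal sketch of such an evaluator, with illustrative rule names and thresholds:

```python
# One generic rule set covers every service, since metric names are uniform.
# Thresholds here are illustrative assumptions, not recommendations.
ALERT_RULES = [
    {"metric": "error_rate", "threshold": 0.01},
    {"metric": "latency.99th", "threshold": 500},  # milliseconds
]

def evaluate(service, metrics):
    """Return (service, metric, value) for each rule the service violates."""
    fired = []
    for rule in ALERT_RULES:
        value = metrics.get(rule["metric"])
        if value is not None and value > rule["threshold"]:
            fired.append((service, rule["metric"], value))
    return fired

alerts = evaluate("order-service", {"error_rate": 0.05, "latency.99th": 120})
```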

Deployment

Deployment means keeping a defined number of instances running across resources, but real‑world pain points include high onboarding cost, diverse language stacks, chaotic upstream info, fragmented user experience, and incremental update coordination.

Deployment principles

Version‑based releases.

Unified packaging (e.g., Docker images).

Centralized configuration decoupled from environments.

Standardized rollout flow with preview, canary, integrated monitoring, and trend analysis.

Logs shipped over the network rather than tied to local disks.
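The rollout flow above can be sketched as a batched canary loop that watches the standardized error_rate metric and halts automatically on regression. Every function name and parameter here is an illustrative assumption:

```python
import time

def canary_rollout(instances, deploy_fn, error_rate_fn,
                   batch_size=1, threshold=0.01, soak_seconds=0):
    """Sketch of a standardized rollout: deploy in small batches,
    check the shared error_rate metric after each batch, and halt
    if it crosses the threshold instead of rolling forward."""
    deployed = []
    for i in range(0, len(instances), batch_size):
        for inst in instances[i:i + batch_size]:
            deploy_fn(inst)
            deployed.append(inst)
        time.sleep(soak_seconds)  # let metrics accumulate after the batch
        if error_rate_fn() > threshold:
            return {"status": "halted", "deployed": deployed}
    return {"status": "complete", "deployed": deployed}

# Usage with stubbed deploy and metric functions:
result = canary_rollout(
    ["web-01", "web-02", "web-03"],
    deploy_fn=lambda inst: None,        # stand-in for the real deploy step
    error_rate_fn=lambda: 0.0,          # stand-in for a metrics query
)
```

The key design point is that canarying only works because monitoring is standardized: the rollout tool can query the same error_rate metric for any service without per-service wiring.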

Benefits of sustained standardization

Configuration and environment decoupled.

Monitoring standardized.

Deployment standardized.

Logs networked.

Data services exposed.

Instances self‑discover.

Resources containerized.

Result: fully automated scheduling and operation.

About Open‑Falcon

Open‑Falcon is an open‑source, enterprise‑grade, highly available, and scalable monitoring system originally launched by Xiaomi’s operations team. It now enjoys a large community of over 2,000 contributors from companies such as Xiaomi, Meituan, Kuaiyun, and Didi.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Automation, Operations, Scalability, Configuration Management
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.