
Comprehensive Online Load‑Testing and Stability Assurance Framework

The stability‑assurance squad built an online load‑testing framework that injects global TraceIds via a Java agent, records real traffic, routes test writes to shadow databases and caches, enforces automatic stop rules, and provides a UI platform. The result is lower load‑testing cost, better capacity insight, and safe fault‑injection drills.

NetEase Media Technology Team

As online services become increasingly complex, the number of upstream and downstream request links grows, making multi‑dimensional stability assurance a critical daily task for technical teams. To proactively discover problems before incidents occur, a stability assurance squad was formed and a series of projects were launched.

Traditional load‑testing mainly targets test environments and monolithic services, which differ from production in configuration, hardware resources, and call‑chain characteristics. With the rise of microservices, the conventional approach of small‑scale offline deployment and mock services no longer works. The ideal solution is to conduct load‑testing in the production environment using traffic that closely resembles real user flow.

The squad developed a full‑link tracing project. The core idea is to inject a globally unique TraceId into request headers and propagate it across services, enabling correlation of the same request across different services. By also adding a load‑test marker in the headers via a tracer agent, the system can distinguish test traffic from normal traffic, solving the biggest obstacle of online load‑testing.
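The propagation step described above can be sketched in plain Java. This is only an illustration, not the squad's tracer agent: the header names `X-Trace-Id` and `X-Load-Test`, the `TraceHeaders` class, and the flag value `"1"` are all assumptions, and a plain `Map` stands in for real HTTP or RPC headers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Minimal sketch of trace-context propagation through request headers. */
final class TraceHeaders {
    // Hypothetical header names; the article does not state the real ones.
    static final String TRACE_ID = "X-Trace-Id";
    static final String LOAD_TEST = "X-Load-Test";

    /** What the agent carries for one request. */
    static final class TraceContext {
        final String traceId;
        final boolean loadTest;

        TraceContext(String traceId, boolean loadTest) {
            this.traceId = traceId;
            this.loadTest = loadTest;
        }
    }

    /** Inbound side: reuse the caller's TraceId, or mint one at the entry point. */
    static TraceContext extract(Map<String, String> inbound) {
        String id = inbound.get(TRACE_ID);
        if (id == null || id.isEmpty()) {
            id = UUID.randomUUID().toString();
        }
        boolean loadTest = "1".equals(inbound.get(LOAD_TEST));
        return new TraceContext(id, loadTest);
    }

    /** Outbound side: copy the TraceId and, if set, the load-test flag downstream. */
    static Map<String, String> inject(TraceContext ctx, Map<String, String> outbound) {
        outbound.put(TRACE_ID, ctx.traceId);
        if (ctx.loadTest) {
            outbound.put(LOAD_TEST, "1");
        }
        return outbound;
    }
}
```

Calling `extract` on every inbound request and `inject` on every outbound call keeps the same TraceId and flag along the whole chain, which is what lets downstream storage interceptors recognize test traffic.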

3.1 Load‑traffic Construction

The tracer agent starts with each service, intercepts all requests, and propagates the TraceId and load‑test flag downstream. Based on real‑time configuration from a distributed config center, it can optionally record the request path, headers, and parameters, which are stored in an Elasticsearch cluster. Compared with tcpcopy or GoReplay, which replicate traffic captured at the network level on a host, this solution can capture traffic at any point in the call chain, offering greater flexibility.
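A rough sketch of config‑driven recording follows. The `RequestRecorder` class, the `onConfigChange` callback, and the in‑memory list standing in for Elasticsearch are hypothetical. Skipping requests that already carry the load‑test flag is also an assumption, made here so that replayed traffic is not recorded a second time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: record request path, headers, and parameters while a config switch is on. */
final class RequestRecorder {
    /** One captured request with the three fields the article lists. */
    static final class RecordedRequest {
        final String path;
        final Map<String, String> headers;
        final Map<String, String> params;

        RecordedRequest(String path, Map<String, String> headers, Map<String, String> params) {
            this.path = path;
            this.headers = headers;
            this.params = params;
        }
    }

    // Flipped by a config-center listener (hypothetical wiring).
    private final AtomicBoolean enabled = new AtomicBoolean(false);
    // In-memory stand-in for the Elasticsearch cluster.
    private final List<RecordedRequest> sink = Collections.synchronizedList(new ArrayList<>());

    void onConfigChange(boolean recordingEnabled) {
        enabled.set(recordingEnabled);
    }

    /** Returns true if the request was stored. */
    boolean maybeRecord(String path, Map<String, String> headers,
                        Map<String, String> params, boolean loadTest) {
        if (!enabled.get() || loadTest) {
            return false;
        }
        // Defensive copies so later mutation by the caller does not alter the record.
        sink.add(new RecordedRequest(path, new TreeMap<>(headers), new TreeMap<>(params)));
        return true;
    }

    List<RecordedRequest> recorded() {
        synchronized (sink) {
            return new ArrayList<>(sink);
        }
    }
}
```

Because the hook lives in the agent rather than at the network edge, the same recorder can sit in front of any service in the chain, which is the flexibility the paragraph above refers to.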

3.2 Data Pollution

To avoid contaminating production data, the team adopted shadow‑storage techniques for databases, caches, logs, and message queues. Interception covers JDBC, Redis, Memcached, Kafka, RabbitMQ, Logback, HTTP, RPC, etc., routing write operations generated by load‑test traffic to shadow instances.

For example, the database shadow logic intercepts HikariCP's HikariConfig.setJdbcUrl and HikariDataSource.getConnection methods, so that a shadow datasource replaces the real one whenever the current request is marked as load‑test traffic.

private void addHikariInterceptor() {
    this.transformTemplate.transform("com.zaxxer.hikari.HikariConfig", new TransformCallback() {
        @Override
        public byte[] doInTransform(Instrumentor instrumentor, ClassLoader classLoader, String className,
                Class<?> classBeingRedefined, ProtectionDomain protectionDomain, byte[] classfileBuffer)
                throws InstrumentException {
            InstrumentClass target = instrumentor.getInstrumentClass(classLoader, className, classfileBuffer);
            InstrumentMethod setJdbcUrlMethod = InstrumentUtils.findMethod(target, "setJdbcUrl",
                    String.class.getName());
            setJdbcUrlMethod.addInterceptor(SetJdbcUrlInterceptor.class.getName());
            return target.toBytecode();
        }
    });
    addGetConnectionInterceptor("com.zaxxer.hikari.HikariDataSource", TransformHandler.EMPTY_HANDLER);
}

private void addGetConnectionInterceptor(String className, TransformHandler handler) {
    this.transformTemplate.transform(className, new TransformCallback() {
        @Override
        public byte[] doInTransform(Instrumentor instrumentor, ClassLoader classLoader, String className,
                Class<?> classBeingRedefined, ProtectionDomain protectionDomain, byte[] classfileBuffer)
                throws InstrumentException {
            InstrumentClass target = instrumentor.getInstrumentClass(classLoader, className, classfileBuffer);
            handler.handle(target);
            InstrumentMethod getConnectionMethod = InstrumentUtils.findMethod(target, "getConnection");
            getConnectionMethod.addInterceptor(JdbcInterceptor.class.getName());
            return target.toBytecode();
        }
    });
}

The LoadTestGetConnectionInterceptor checks the current span context; if the request is marked as load‑test traffic, it returns a connection from the shadow datasource, otherwise the original getConnection call proceeds unchanged.

@Override
public Ret before(Object target, Object[] args) {
    SofaTracerSpan span = SofaTraceContextHolder.getSofaTraceContext().getCurrentSpan();
    if (span != null && span.getSofaTracerSpanContext().isLoadTest()) {
        Object testDb = DataSourceHolder.getInstance().getLoadTestDb(target);
        if (testDb != null) {
            DataSource dataSource = (DataSource) testDb;
            try {
                return Ret.newInstanceForReturn(dataSource.getConnection());
            } catch (SQLException e) {
                if (logger.isWarnEnabled()) {
                    logger.warn("Failed to getConnection. {}", e.getMessage(), e);
                }
            }
        }
        if (DataSourceHolder.getInstance().isLoadTestDb(target)) {
            // shadow DB intercept, do nothing
            return Ret.newInstanceForNone();
        } else {
            // no shadow config or shadow connection failed: return null so load-test traffic never reaches the production DB
            return Ret.newInstanceForReturn(null);
        }
    }
    return Ret.newInstanceForNone();
}
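The JDBC interceptor above is one case of a routing decision that, per the article, is also applied to caches, message queues, and logs. A minimal sketch of that decision in isolation is shown here. `ShadowRouter` and its thread‑local flag are hypothetical; in the real agent the flag would come from the span context, and the resources would be client objects rather than strings.

```java
/** Sketch: pick the primary or the shadow resource based on a per-thread load-test flag. */
final class ShadowRouter<T> {
    // Stand-in for the load-test marker carried in the trace context.
    private static final ThreadLocal<Boolean> LOAD_TEST = ThreadLocal.withInitial(() -> false);

    static void markLoadTest(boolean value) {
        LOAD_TEST.set(value);
    }

    static void clear() {
        LOAD_TEST.remove();
    }

    private final T primary;
    private final T shadow; // may be null when no shadow instance is configured

    ShadowRouter(T primary, T shadow) {
        this.primary = primary;
        this.shadow = shadow;
    }

    /** Normal traffic gets the primary; load-test traffic gets the shadow or fails closed. */
    T select() {
        if (!LOAD_TEST.get()) {
            return primary;
        }
        if (shadow == null) {
            // Fail closed: never fall back to production storage for test traffic.
            throw new IllegalStateException("no shadow resource configured for load-test traffic");
        }
        return shadow;
    }
}
```

Failing closed when no shadow is configured mirrors the intent of the interceptor above, which returns a null connection rather than letting test writes reach the production database.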

3.3 Load‑Test Risk Control

The platform integrates with the internal monitoring system, allowing users to define stop rules (e.g., QPS, error rate, CPU, memory). When a rule’s threshold is reached, traffic is automatically halted. Rules are expressed as JSON arrays.
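The article does not show the rule schema, so the following is only a sketch of how threshold rules might be evaluated once parsed. The field names (`metric`, `op`, `threshold`), the metric keys, and the example JSON in the comment are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Sketch: evaluate stop rules against the latest monitoring snapshot. */
final class StopRules {
    enum Op { GT, LT }

    // Hypothetical JSON form of a rule array:
    //   [{"metric":"error_rate","op":"GT","threshold":0.01},
    //    {"metric":"qps","op":"GT","threshold":5000}]
    static final class Rule {
        final String metric;
        final Op op;
        final double threshold;

        Rule(String metric, Op op, double threshold) {
            this.metric = metric;
            this.op = op;
            this.threshold = threshold;
        }

        boolean violatedBy(double value) {
            return op == Op.GT ? value > threshold : value < threshold;
        }
    }

    /** Returns the first rule tripped by the snapshot; metrics missing from it are skipped. */
    static Optional<Rule> firstViolation(List<Rule> rules, Map<String, Double> snapshot) {
        for (Rule rule : rules) {
            Double value = snapshot.get(rule.metric);
            if (value != null && rule.violatedBy(value)) {
                return Optional.of(rule);
            }
        }
        return Optional.empty();
    }
}
```

A load generator would poll the monitoring system on a short interval and halt traffic as soon as `firstViolation` returns a rule.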

3.4 Load‑Test Platform

The UI‑driven platform manages traffic recording, replay, stop‑rule configuration, and automatic report generation. All tests must first be rehearsed in a test environment, followed by a small‑scale production validation before full load‑testing.

4 Advantages

Java‑agent implementation of load‑test markers and shadow storage without code intrusion.

Real‑traffic recording provides high fidelity and eliminates the need for manual script preparation.

Automatic stop rules prevent impact on real users.

Visual workflow greatly reduces operational cost for testing teams.

5 Application Benefits

Reduced load‑testing cost through automated topology discovery and traffic recording.

Capacity assessment: determine QPS limits and adjust cluster size accordingly.

Fault‑injection drills using low‑traffic periods to verify degradation, flow‑control, and circuit‑breaker behavior.
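For the capacity‑assessment benefit listed above, the sizing arithmetic can be sketched as follows. The formula, the `headroom` factor, and the numbers in the usage note are illustrative assumptions, not figures from the article.

```java
/** Sketch: turn a measured per-instance QPS limit into a cluster size. */
final class CapacityPlan {
    /**
     * @param expectedPeakQps     peak traffic the cluster must absorb
     * @param perInstanceQpsLimit QPS limit of one instance, as measured by the load test
     * @param headroom            fraction of that limit to use in production, in (0, 1]
     */
    static int instancesNeeded(double expectedPeakQps, double perInstanceQpsLimit, double headroom) {
        if (expectedPeakQps < 0 || perInstanceQpsLimit <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("invalid capacity inputs");
        }
        // Round up: a fractional instance still requires a whole machine.
        return (int) Math.ceil(expectedPeakQps / (perInstanceQpsLimit * headroom));
    }
}
```

With an expected peak of 12000 QPS, a measured limit of 800 QPS per instance, and a headroom of 0.7, `instancesNeeded(12000, 800, 0.7)` gives 22 instances, since 12000 / 560 is roughly 21.4.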

6 Optimization Directions

Migrate shadow‑storage functionality to a service mesh for language‑agnostic, non‑intrusive load‑testing.

Develop tools to simplify pre‑test data synchronization.

Introduce algorithmic anomaly detection to replace manual threshold tuning.

Formalize a ticket‑based approval workflow with clear responsibility.

Build a scoring system for load‑test results to drive continuous service quality improvement.
