
How JD’s ForceBot Revolutionizes Full‑Chain Load Testing for Massive Shopping Events

ForceBot is JD.com’s full‑chain load‑testing platform. It simulates user behavior across the entire purchase flow, isolates test traffic from production traffic, and combines Docker‑based agents, gRPC services, and real‑time data analytics to identify bottlenecks, improve capacity planning, and support both routine and peak‑traffic scenarios.


ForceBot Vision

1. Background

As JD’s business has expanded, its R&D systems have grown into a tightly coupled set of core services. A bottleneck in any one of them can degrade the entire processing chain and hurt the shopping experience.

Preparing for major sales events (e.g., 618, Double‑11) traditionally required months of effort and heavy manual load‑testing work, and offline test results correlated poorly with online behavior. Capacity planning relied on historical experience, leading to excessive resource requests for each promotion.

In 2016, the foundation platform team launched the ForceBot full‑chain load‑testing project to address these challenges; every participating system was required to distinguish test traffic from real traffic.

2. Capabilities

ForceBot simulates high‑concurrency user behavior across the entire purchase journey: homepage, login, search, product detail, cart, checkout, and JD Pay.

It models various scenarios, such as normal traffic versus promotion spikes, and supports dynamic concurrency scaling based on historical peak values.
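The purchase journey above can be sketched as a virtual‑user script. ForceBot scripts are written in Python or Groovy; the class, step names, and `visit` helper below are invented for illustration (a real script would issue HTTP requests and record latencies):

```python
# Hypothetical sketch of a virtual user walking the golden purchase flow.
GOLDEN_FLOW = ["homepage", "login", "search", "product_detail",
               "cart", "checkout", "pay"]

class VirtualUser:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.visited: list[str] = []

    def visit(self, step: str) -> None:
        # In a real script this would issue an HTTP request and time it.
        self.visited.append(step)

    def run_golden_flow(self) -> list[str]:
        for step in GOLDEN_FLOW:
            self.visit(step)
        return self.visited

vu = VirtualUser("loadtest_user_000001")
vu.run_golden_flow()
```

Thousands of such virtual users, each running this loop concurrently, produce the simulated promotion traffic.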

Since 2016, ForceBot has become the primary source for performance data, capacity planning, and bottleneck identification during large‑scale events.

ForceBot Technical Architecture

First‑Generation Platform (nGrinder)

The first platform was built on nGrinder, an open‑source Java load‑testing tool consisting of a controller and multiple agents. JD extended nGrinder to support Python and Groovy scripts and added features such as scheduling and incremental load.

Controller‑agent communication uses blocking I/O (BIO), which limits throughput.

The controller is a single point of failure, and as the number of agents grows it becomes a bottleneck.

Result: the controller’s workload prevented further scaling, which prompted the development of ForceBot.

ForceBot Architecture

The new platform decouples functionality to eliminate bottlenecks and enable horizontal scaling.

Task Service handles task distribution and supports scaling.

Agents register, pull tasks, and execute them.

Monitor Service forwards performance data to JMQ.

Dataflow performs stream processing and stores results in a database.

Git stores test scripts and libraries.

This redesign greatly reduces controller load and improves metric collection.

1. Container Deployment

Agents run in Docker containers, allowing rapid cluster creation, elastic scaling, resource isolation, and standardized virtual‑user capacity.

Problem: Sharing test scripts across Docker instances. Solution: Use a shared host disk for script distribution.

2. Task Allocation

Users define test scenarios (pressure sources, virtual users, scripts, schedule, JVM parameters). The controller scans the database, matches idle agents, and calculates per‑worker thread counts.
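The per‑worker thread‑count calculation can be sketched as an even split of the target virtual‑user count across idle agents. This is an illustrative sketch, not ForceBot’s actual allocation code; the function and agent names are made up:

```python
def allocate_threads(target_users: int, idle_agents: list[str]) -> dict[str, int]:
    """Evenly split a target virtual-user count across idle agents.

    The first `remainder` agents each take one extra thread so the
    per-agent counts sum exactly to target_users.
    """
    if not idle_agents:
        raise ValueError("no idle agents available")
    base, remainder = divmod(target_users, len(idle_agents))
    return {
        agent: base + (1 if i < remainder else 0)
        for i, agent in enumerate(idle_agents)
    }

plan = allocate_threads(1000, ["agent-1", "agent-2", "agent-3"])
# 1000 virtual users over 3 agents -> 334, 333, 333
```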

Two load patterns are supported:

Spike (burst) pattern: sudden high request volume to test concurrency.

Ramp‑up pattern: gradual increase of virtual users at fixed intervals.

Dynamic ramp‑up/down is achieved by adjusting agent thread counts.
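The two patterns can be sketched as schedules of (time offset in seconds, active virtual users) pairs; the function names and schedule representation are assumptions for illustration:

```python
def spike_schedule(peak_users: int) -> list[tuple[int, int]]:
    # Burst pattern: the full virtual-user count starts at t=0.
    return [(0, peak_users)]

def ramp_up_schedule(peak_users: int, step: int, interval_s: int) -> list[tuple[int, int]]:
    # Gradual pattern: add `step` virtual users every `interval_s` seconds
    # until the peak is reached.
    schedule = []
    t, users = 0, 0
    while users < peak_users:
        users = min(users + step, peak_users)
        schedule.append((t, users))
        t += interval_s
    return schedule

print(ramp_up_schedule(100, 30, 10))
# [(0, 30), (10, 60), (20, 90), (30, 100)]
```

Dynamic ramp‑up/down then amounts to the controller pushing the next point of the schedule to each agent as an updated thread count.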

Problem: Shared shopping‑cart conflicts during concurrent test orders. Solution: Bind each test thread to a unique user ID.
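The cart‑conflict fix can be sketched as a one‑to‑one mapping from worker thread slot to a pre‑provisioned test account, so no two concurrent threads share a cart. The account naming scheme here is hypothetical:

```python
# Each (agent, thread) pair maps to a distinct test account, so
# concurrent test orders never touch the same shopping cart.
def bind_test_user(agent_index: int, thread_index: int, threads_per_agent: int) -> str:
    slot = agent_index * threads_per_agent + thread_index
    return f"loadtest_user_{slot:06d}"

assert bind_test_user(0, 0, 500) != bind_test_user(1, 0, 500)
```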

3. Heartbeat and Task Dispatch

Task Service provides registration, task retrieval, and status updates for agents. Agents send heartbeats every few seconds; the controller uses these to determine liveness.

Task Service is built on gRPC (HTTP/2, protobuf3, Netty4) for cross‑network calls, encryption, and authentication.

4. Agent Implementation

Agents generate a UUID, register with Task Service, pull tasks, and launch worker processes. Communication between agent and worker uses stdout/stdin streams.

Problem: High Git load when many Docker instances update test scripts. Solution: Store scripts on shared host storage and use per‑task file locking.

Problem: Simulating instantaneous peak load. Solution: Pre‑allocate threads at collection points, block them, and release the required number to create a sudden surge.
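The pre‑allocate‑and‑release trick maps naturally onto a thread barrier. The real workers are JVM processes, so this Python sketch is only an analogy for the collection‑point mechanism:

```python
import threading

def simulate_spike(num_threads: int, fire_request) -> None:
    """Pre-start all worker threads, block them at a barrier,
    then release them together to create an instantaneous burst."""
    barrier = threading.Barrier(num_threads)

    def worker(i: int) -> None:
        barrier.wait()   # every thread blocks here until the last one arrives
        fire_request(i)  # then all of them fire at once

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

hits = []
simulate_spike(50, lambda i: hits.append(i))
assert len(hits) == 50
```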

5. Data Collection and Computation

Monitor Service receives per‑second metrics from agents via gRPC, forwards them through JMQ to the Dataflow platform, which computes TPS, TP99, TP90, etc., and stores results in Elasticsearch for querying.

Problem: Massive data volume from agents. Solution: Aggregate per‑second metrics before transmission.
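The aggregation step can be sketched as collapsing one second of raw latencies into a few summary numbers before they cross the network; the field names and nearest‑rank percentile method are illustrative assumptions, not the Dataflow platform’s actual computation:

```python
import math
from dataclasses import dataclass

@dataclass
class SecondSummary:
    tps: int        # completed transactions in this second
    tp90_ms: float  # 90th-percentile latency
    tp99_ms: float  # 99th-percentile latency

def summarize_second(latencies_ms: list[float]) -> SecondSummary:
    """Collapse one second of raw latencies into a compact summary,
    so only a few numbers (not every sample) are transmitted."""
    s = sorted(latencies_ms)
    def pct(p: float) -> float:
        # nearest-rank percentile on the sorted sample
        return s[max(0, math.ceil(p * len(s)) - 1)]
    return SecondSummary(tps=len(s), tp90_ms=pct(0.90), tp99_ms=pct(0.99))

summary = summarize_second([float(i) for i in range(1, 101)])  # 1..100 ms
print(summary)  # SecondSummary(tps=100, tp90_ms=90.0, tp99_ms=99.0)
```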

Business System Refactoring

1. Golden‑Flow Business

Identifies the end‑to‑end purchase flow, from browsing through order completion, as the scope of full‑chain testing.

2. Test‑Traffic Identification

Marks users and products to separate test traffic from production metrics, ensuring no impact on real‑world statistics.
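The separation can be sketched as a marker carried with every request and checked before a metric is recorded; the header name and counting logic below are hypothetical, not JD’s actual marking scheme:

```python
TEST_TRAFFIC_HEADER = "x-forcebot-test"  # hypothetical marker name

def record_order_metric(headers: dict[str, str], metrics: dict[str, int]) -> None:
    """Count an order in production stats only when the test marker is absent."""
    bucket = "test_orders" if headers.get(TEST_TRAFFIC_HEADER) == "1" else "real_orders"
    metrics[bucket] = metrics.get(bucket, 0) + 1

m: dict[str, int] = {}
record_order_metric({"x-forcebot-test": "1"}, m)  # simulated test order
record_order_metric({}, m)                        # real order
print(m)  # {'test_orders': 1, 'real_orders': 1}
```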

3. Test Data Storage

Mark data stored in production DB with periodic cleanup.

Separate test database for isolated storage, used for payment system simulations with a mock bank.

Future Plans

Unattended intelligent testing that auto‑scales Docker resources based on performance thresholds and stops when no further gains are observed.

AI‑driven forecasting: using big‑data AI models to predict target order volumes and automatically assess system bottlenecks.

Online full‑chain load testing and continuous readiness are the ultimate goals.

Tags: distributed systems, docker, automation, gRPC, load testing, performance engineering
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.
