How Distributed Sharding and Locking Boost Project Environment Creation to 99% Success

This article details how a large‑scale project environment platform was refactored using domain‑driven design, distributed sharding, thread‑pool parallelism, second‑level scheduling, and distributed locks to achieve over 99% creation success, reduce creation time below 100 seconds, and keep exception rates under 1% despite massive task volume.

Alibaba Cloud Developer

Introduction

A project environment is a platform tool essential for cross‑team testing. It provides dynamic, isolated environments that improve developers' testing experience by abstracting away infrastructure and micro‑service complexity.

Unlike fixed environments, project environments are bound to the lifecycle of a change. This leads to high‑frequency create, deploy, and restart operations that can reach thousands per second during peak development, which demands extreme stability.

Initially, tasks were executed on a single machine via a workflow engine and scheduled jobs, causing duplicate runs, silent failures, and dead tasks as load grew.

Technical Practice

Process Overview

When a change is created, the environment creation starts automatically; non‑cloud‑native apps request resources after creation, while cloud‑native apps provision resources during deployment. Three core work orders (create, deploy, restart) follow the flow shown in the diagram.

Task Characteristics

Tasks have ordered dependencies and are executed asynchronously, forming a pipeline model driven by a workflow engine.

Current Issues

Asynchronous tasks are triggered by users, polled by scheduled jobs, and progress via messages, leading to dead tasks, single‑machine bottlenecks, and duplicate executions.

Task Death

Unexpected exceptions abort a task, preventing the completion message from being sent and breaking the pipeline.
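
The failure mode can be illustrated with a minimal sketch of a message‑driven pipeline. This is not the platform's workflow engine: the function name `run_pipeline`, the step names, and the in‑memory queue standing in for the message channel are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable

Step = tuple[str, Callable[[], None]]  # (step name, action); hypothetical shape


def run_pipeline(steps: list[Step]) -> list[str]:
    """Run steps in order; each completion message releases the next step."""
    messages: deque[int] = deque([0])  # index of the step allowed to start
    completed: list[str] = []
    while messages:
        index = messages.popleft()
        name, action = steps[index]
        action()  # an exception here means no completion message is sent
        completed.append(name)
        if index + 1 < len(steps):
            messages.append(index + 1)  # the "step done" message
    return completed
```

If any action raises, the loop exits before the next message is queued, so every later step waits for a message that never arrives. That is the "dead task" described above.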

Single‑Machine Bottleneck

The tasks are not reentrant, so letting a distributed scheduler run them on several machines duplicates work, wastes resources, and causes errors. Scheduling on a single machine avoids that but leaves the CPUs of every other machine idle.

Duplicate Execution

Short scheduling intervals combined with high task volume cause the same task to be processed multiple times.

Optimization Path

1. Task Death Optimization

Refactored the workflow engine using Domain‑Driven Design, extracting four domain entities: GroupEnv, AppRunningEnv, Operation, and TaskEngine, and rebuilt their dependencies.

Implemented a unified executor interface, factory method for executor selection, centralized exception handling, and comprehensive unit tests, removing the strong dependency on MetaQ.
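
A minimal sketch of how a unified executor interface, a factory, and centralized exception handling can fit together is shown below. The class names (`Executor`, `CreateExecutor`, `DeployExecutor`), the `Status` enum, and the `run_step` function are hypothetical and are not taken from the original system.

```python
from abc import ABC, abstractmethod
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    FAILED = "failed"


class Executor(ABC):
    """Unified interface: every pipeline step implements execute()."""

    @abstractmethod
    def execute(self, env_id: str) -> None: ...


class CreateExecutor(Executor):
    def execute(self, env_id: str) -> None:
        pass  # placeholder for the real creation work


class DeployExecutor(Executor):
    def execute(self, env_id: str) -> None:
        # Simulate an unexpected failure in the middle of a step.
        raise RuntimeError(f"deploy failed for {env_id}")


# Factory table: maps an operation type to its executor class.
_EXECUTORS = {"create": CreateExecutor, "deploy": DeployExecutor}


def executor_for(operation: str) -> Executor:
    try:
        return _EXECUTORS[operation]()
    except KeyError:
        raise ValueError(f"unknown operation: {operation}") from None


def run_step(operation: str, env_id: str) -> tuple[Status, str]:
    """Centralized handling: every step ends in a recorded status,
    so the pipeline never waits on a result that is never reported."""
    try:
        executor_for(operation).execute(env_id)
        return Status.SUCCESS, ""
    except Exception as exc:  # deliberately broad: nothing may escape
        return Status.FAILED, str(exc)
```

Because `run_step` is the only place that catches exceptions, individual executors stay small, and a failure is always turned into an explicit `FAILED` status instead of a silent stall.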

2. Execution Time Optimization

Adopted SchedulerX 2.0’s map‑reduce model to shard tasks across multiple workers so that they run in parallel. With n machines, total execution time ideally drops to about 1/n of the single‑machine time.
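
The sharding idea can be sketched in a few lines. This is a plain Python illustration of the map and reduce phases, not the SchedulerX API; the function names and the round‑robin split are assumptions.

```python
def make_shards(task_ids: list[int], shard_count: int) -> list[list[int]]:
    """Map phase: split pending task IDs into shard_count roughly equal shards."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    shards: list[list[int]] = [[] for _ in range(shard_count)]
    for index, task_id in enumerate(task_ids):
        shards[index % shard_count].append(task_id)  # round-robin assignment
    return shards


def process_shard(shard: list[int]) -> int:
    """Work done on one worker; here it only counts the tasks it handled."""
    return len(shard)


def reduce_results(counts: list[int]) -> int:
    """Reduce phase: aggregate the per-worker results."""
    return sum(counts)


pending = list(range(10))
shards = make_shards(pending, 3)
handled = reduce_results([process_shard(shard) for shard in shards])
```

Each task ID lands in exactly one shard, so no two workers receive the same task from the same split.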

3. Duplicate Execution Optimization

Added a thread pool to each worker for intra‑node parallelism and used second‑level scheduling to ensure that all workers finish one cycle before the next begins. With n workers and x threads per worker, execution time ideally falls to about 1/(n·x) of the original.
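
A minimal sketch of one worker combining a thread pool with non‑overlapping cycles follows. The names `handle`, `run_cycle`, and `run_worker`, the one‑second default interval, and the sleep standing in for real work are assumptions for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def handle(task_id: int) -> int:
    time.sleep(0.01)  # stand-in for real I/O-bound work
    return task_id


def run_cycle(pool: ThreadPoolExecutor, shard: list[int]) -> list[int]:
    """Run one scheduling cycle and return only when every task is done."""
    futures = [pool.submit(handle, task_id) for task_id in shard]
    wait(futures)  # block so the next cycle cannot overlap this one
    return sorted(future.result() for future in futures)


def run_worker(
    cycles: list[list[int]], threads: int, interval_s: float = 1.0
) -> list[list[int]]:
    """Process one shard per cycle, never starting a cycle early."""
    results = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for shard in cycles:
            started = time.monotonic()
            results.append(run_cycle(pool, shard))
            # Sleep only for whatever is left of the interval, if anything.
            remaining = interval_s - (time.monotonic() - started)
            if remaining > 0:
                time.sleep(remaining)
    return results
```

Waiting on the whole batch before the next tick is what prevents a slow cycle from being picked up again by the following one.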

4. Multi‑Machine Busy‑Wait Solution

Replaced intra‑worker synchronization with a distributed lock keyed on the AppRunningEnv ID. A worker skips any environment that is already locked instead of waiting for it, and locks expire so that a worker that dies while holding one cannot block that environment indefinitely.
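
The skip‑if‑locked behavior and lock expiry can be sketched as follows. `LockStore` is an in‑memory stand‑in for whatever shared store the real system uses, which the article does not name; the class, its methods, and the 30‑second default TTL are all assumptions.

```python
import threading
import time


class LockStore:
    """In-memory stand-in for a lock store shared by all workers (assumed)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._guard = threading.Lock()
        # env_id -> (owner, expires_at)
        self._locks: dict[str, tuple[str, float]] = {}

    def try_lock(self, env_id: str, owner: str, ttl_s: float) -> bool:
        now = self._clock()
        with self._guard:
            held = self._locks.get(env_id)
            if held is not None and held[1] > now:
                return False  # still held and not expired: caller should skip
            self._locks[env_id] = (owner, now + ttl_s)  # free or expired
            return True

    def unlock(self, env_id: str, owner: str) -> None:
        with self._guard:
            held = self._locks.get(env_id)
            if held is not None and held[0] == owner:  # only the owner releases
                del self._locks[env_id]


def process_envs(store: LockStore, worker: str, env_ids: list[str],
                 ttl_s: float = 30.0) -> list[str]:
    """Handle every environment this worker can lock; skip the rest."""
    handled = []
    for env_id in env_ids:
        if not store.try_lock(env_id, worker, ttl_s):
            continue  # another worker owns it; skip instead of waiting
        try:
            handled.append(env_id)  # placeholder for the real work
        finally:
            store.unlock(env_id, worker)
    return handled
```

The expiry time is the safety net: if a worker crashes between `try_lock` and `unlock`, the next worker to reach that environment after the TTL has passed simply takes the lock over.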

Results

Environment creation success rate stabilized above 99% (excluding cloud‑native apps). Creation time dropped from over 300 seconds to under 100 seconds despite a hundred‑fold increase in environment count. System exception rate stayed below 1% while parallel work order throughput grew dramatically. Single‑task execution time fell to one‑sixteenth of the original peak.
