How to Achieve 99.9% Consistency for PaaS Agents at Scale
This article explains the consistency problems of PaaS agents in large clusters, introduces an entity‑based authority model and an automated repair workflow, and shows how the solution raised agent consistency to over 99.5% while reducing operational effort.
Problem Background
In a PaaS platform, every machine runs a resident agent. As the cluster grows, weekly agent updates cause agent liveness and consistency problems. Consistency sometimes drops below 99%, which means hundreds of machines run mismatched agent versions. That can lead to unexpected failures such as service disruption and data corruption.
To address this pain point, we cataloged the common operational scenarios and built a solution that aims to keep consistency above 99.9% (three nines).
Solution Idea
We abstract the scenarios into “entities”, which are typically files (binary or configuration) or composite settings such as multi‑queue NIC.
Entity – the object (file, URL, etc.).
Entity State – version, MD5, ID and other status information.
The simplified inconsistency lifecycle is shown in the diagram below.
To achieve reliable automatic repair we define an Entity Authority Center that stores the authoritative information for each entity. The fields are:
Name – entity name
Type – entity type, usually file or url
Path – entity path or URL
Md5 – correct MD5 value for comparison
Version – version information used by business programs
Collect_script – script that normalizes state collection output, including MD5
Repair_script – method to fix an inconsistent entity (deployment system, job system, etc.)
Entity_set – list of machines where the entity is deployed
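As a rough illustration, one record in the Entity Authority Center could be modeled as below. This is a minimal sketch, not the authors' schema: the class name EntityRecord, the Python field types, and every example value (paths, script names, host names, MD5) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntityRecord:
    """One authoritative record, mirroring the fields listed above."""
    name: str            # entity name
    type: str            # entity type, usually "file" or "url"
    path: str            # entity path or URL
    md5: str             # correct MD5 value used for comparison
    version: str         # version information used by business programs
    collect_script: str  # script that emits normalized state, including MD5
    repair_script: str   # how to fix a mismatch (deployment system, job system)
    entity_set: List[str] = field(default_factory=list)  # machines carrying the entity


# Hypothetical example values, for illustration only.
agent = EntityRecord(
    name="paas-agent",
    type="file",
    path="/opt/paas/bin/agent",
    md5="d41d8cd98f00b204e9800998ecf8427e",
    version="1.4.2",
    collect_script="collect_file_md5.sh",
    repair_script="redeploy_via_deploy_system.sh",
    entity_set=["host-001", "host-002"],
)
```

Keeping the collection and repair methods on the record itself means the rest of the pipeline does not need to know how any particular entity is inspected or fixed.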
Automatic Repair Process
1. Inconsistency occurs for various reasons.
2. StatusCollector gathers state information of all entities in the EntitySet (usually a machine list).
3. ConsistencyCalculator compares the collected data with the Entity Authority Center and identifies mismatched entities.
4. ConsistencyCalculator outputs a list of inconsistent entities.
5. Repair tasks are launched according to entity type.
6. For services hosted in a deployment system, invoke the deployment system with the correct SVN/Git version to redeploy; for system settings such as NIC multi‑queue or limits.conf, invoke a job system (e.g., Ansible) to re‑apply the configuration.
7. After repair, the cluster returns to a consistent state.
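The comparison and dispatch steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names find_inconsistent and dispatch_repairs, the plain dictionaries standing in for the Entity Authority Center and the StatusCollector output, and the kind labels "deployed_service" and "system_setting" are all hypothetical. The repair callbacks only record what they would do instead of calling a real deployment or job system.

```python
from typing import Callable, Dict, List, Tuple

# Authoritative MD5 per entity name (stand-in for the Entity Authority Center).
Authority = Dict[str, str]
# Collected state: {host: {entity name: observed MD5}}.
Collected = Dict[str, Dict[str, str]]


def find_inconsistent(authority: Authority, collected: Collected) -> List[Tuple[str, str]]:
    """Return sorted (host, entity) pairs whose MD5 differs from the authority."""
    mismatches = []
    for host, states in collected.items():
        for name, expected in authority.items():
            # An entity missing from the collected state also counts as a mismatch.
            if states.get(name) != expected:
                mismatches.append((host, name))
    return sorted(mismatches)


def dispatch_repairs(
    mismatches: List[Tuple[str, str]],
    entity_kind: Dict[str, str],
    repairers: Dict[str, Callable[[str, str], None]],
) -> None:
    """Route each mismatch to the repair callback registered for its kind."""
    for host, name in mismatches:
        repairers[entity_kind[name]](host, name)


# Hypothetical data: three hosts, two entities.
authority = {"paas-agent": "aaa", "nic-multiqueue": "bbb"}
collected = {
    "host-001": {"paas-agent": "aaa", "nic-multiqueue": "bbb"},
    "host-002": {"paas-agent": "old", "nic-multiqueue": "bbb"},
    "host-003": {"paas-agent": "aaa"},
}
entity_kind = {"paas-agent": "deployed_service", "nic-multiqueue": "system_setting"}

calls = []  # records repair requests instead of touching real systems
repairers = {
    "deployed_service": lambda host, name: calls.append(("deploy", host, name)),
    "system_setting": lambda host, name: calls.append(("job", host, name)),
}

mismatches = find_inconsistent(authority, collected)
dispatch_repairs(mismatches, entity_kind, repairers)
```

In this run, host-002 is routed to the deployment path because its agent MD5 is stale, and host-003 is routed to the job path because its NIC setting was never reported.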
Key points in the process:
1. Simplified consistency calculation – For text or binary files, direct MD5 comparison is sufficient. For complex features like NIC multi‑queue, we collect the full /proc/irq/*/smp_affinity output, compute a single MD5, and compare that value, avoiding per‑IRQ comparisons.
2. Integration with deployment system – Before launching a repair task, we query the deployment system to ensure no ongoing deployment and to obtain the correct version information for the target entity, guaranteeing safe and accurate fixes.
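Both key points can be illustrated with a short sketch. The helper names composite_md5 and plan_repair are hypothetical, as is the choice to sort keys and strip whitespace before hashing, which the article does not specify. The deployment system query is reduced to two plain arguments, since its real API is not described in the source.

```python
import hashlib
from typing import Dict, Optional


def composite_md5(values: Dict[str, str]) -> str:
    """Collapse many per-item values (e.g. one per IRQ) into a single MD5.

    Keys are sorted so the digest does not depend on collection order.
    """
    digest = hashlib.md5()
    for key in sorted(values):
        digest.update(f"{key}={values[key].strip()}\n".encode("utf-8"))
    return digest.hexdigest()


def plan_repair(deploy_in_progress: bool, authoritative_version: Optional[str]) -> Optional[str]:
    """Return the version to repair with, or None if the repair must be skipped."""
    if deploy_in_progress:
        return None  # never repair on top of an ongoing deployment
    return authoritative_version


# Hypothetical smp_affinity contents keyed by path.
expected = composite_md5({
    "/proc/irq/24/smp_affinity": "01",
    "/proc/irq/25/smp_affinity": "02",
})
same_host = composite_md5({
    "/proc/irq/25/smp_affinity": "02\n",
    "/proc/irq/24/smp_affinity": "01\n",
})
drifted_host = composite_md5({
    "/proc/irq/24/smp_affinity": "01",
    "/proc/irq/25/smp_affinity": "01",
})
```

Here same_host matches expected despite the different collection order and trailing newlines, while drifted_host does not, so a single string comparison replaces a per-IRQ check. The trade-off is that the digest reports that something differs, not which IRQ drifted.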
Results
After the system went live, agent consistency rose to 99.5% and overall system environment consistency reached 99.9%, reducing stability‑related incidents and saving considerable operational manpower.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
