How to Achieve 99.9% Consistency for PaaS Agents at Scale

This article explains the consistency problems of PaaS agents in large clusters, introduces an entity‑based authority model and an automated repair workflow, and shows how the solution raised agent consistency to 99.5% and overall environment consistency to 99.9% while reducing operational effort.

Efficient Ops

Problem Background

In a PaaS platform, every machine runs a resident agent. As the cluster grows, weekly agent updates cause liveness and consistency problems. Consistency sometimes drops below 99%, which means hundreds of machines run mismatched agent versions, leading to unexpected failures such as service disruption and data corruption.

To address this pain point, we catalogued the common operational scenarios and built a solution that aims to keep consistency above 99.9% (three nines).

Solution Idea

We abstract these scenarios into “entities”. An entity is typically a file (binary or configuration) or a composite setting such as NIC multi‑queue.

Entity – the object being managed (file, URL, etc.).

Entity State – version, MD5, ID and other status information.

The simplified inconsistency lifecycle is shown in the diagram below.

To achieve reliable automatic repair we define an Entity Authority Center that stores the authoritative information for each entity. The fields are:

Name – entity name

Type – entity type, usually file or url

Path – entity path or URL

Md5 – correct MD5 value for comparison

Version – version information used by business programs

Collect_script – script that collects the entity's state and prints it in a normalized form, including the MD5

Repair_script – method to fix an inconsistent entity (deployment system, job system, etc.)

Entity_set – list of machines where the entity is deployed
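
As a rough illustration, one record in the Entity Authority Center could be modelled as below. This is only a sketch: the attribute names mirror the fields listed above, but the class name `EntityRecord`, the Python types, and every example value (paths, host names, commands, the MD5 placeholder) are assumptions made for illustration and do not come from the original system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntityRecord:
    """One authoritative record; attributes mirror the fields listed above."""
    name: str             # entity name
    type: str             # entity type, usually "file" or "url"
    path: str             # entity path or URL
    md5: str              # correct MD5 value used for comparison
    version: str          # version information used by business programs
    collect_script: str   # command that prints the normalized state
    repair_script: str    # how to repair (deployment system, job system, ...)
    entity_set: List[str] = field(default_factory=list)  # machines carrying it


# Hypothetical example values, for illustration only.
agent = EntityRecord(
    name="paas-agent",
    type="file",
    path="/opt/paas/agent/bin/agent",
    md5="d41d8cd98f00b204e9800998ecf8427e",  # placeholder digest
    version="1.0.0",
    collect_script="md5sum /opt/paas/agent/bin/agent",
    repair_script="redeploy paas-agent",
    entity_set=["host-001", "host-002"],
)
```

Keeping the collection and repair methods in the same record as the expected MD5 means the later steps need no per-entity special cases: they only read the record and act on it.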

Automatic Repair Process

1. Inconsistency occurs for various reasons.

2. StatusCollector gathers state information of all entities in the EntitySet (usually a machine list).

3. ConsistencyCalculator compares the collected data with the Entity Authority Center and identifies mismatched entities.

4. ConsistencyCalculator outputs a list of inconsistent entities.

5. Repair tasks are launched based on entity type.

6. For services hosted in a deployment system, invoke the deployment system with the correct SVN/Git version to redeploy; for system settings such as NIC multi‑queue or limits.conf, invoke a job system (e.g., Ansible) to re‑apply the configuration.

7. After repair, the cluster returns to a consistent state.
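
The compare and dispatch steps above (steps 3 to 6) can be sketched as follows. This is a minimal in-memory sketch, not the original implementation: the function names `find_inconsistent` and `dispatch_repairs`, the dictionary shapes, and the host and entity names are all hypothetical, and a real repairer would call the deployment system or job system instead of appending to a list.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Authority = Dict[str, str]               # entity name -> authoritative MD5
Collected = Dict[Tuple[str, str], str]   # (machine, entity name) -> reported MD5
Repairer = Callable[[str, str], None]    # called as repairer(machine, entity)


def find_inconsistent(authority: Authority, collected: Collected) -> List[Tuple[str, str]]:
    """Steps 3-4: return (machine, entity) pairs whose MD5 differs from the authority.

    An entity missing from the authority is also reported, because None never
    equals a collected digest.
    """
    return sorted(
        key for key, md5 in collected.items()
        if authority.get(key[1]) != md5
    )


def dispatch_repairs(
    mismatches: List[Tuple[str, str]],
    entity_types: Dict[str, str],
    repairers: Dict[str, Repairer],
) -> None:
    """Steps 5-6: call the repairer registered for each entity's type."""
    for machine, entity in mismatches:
        repairers[entity_types[entity]](machine, entity)


# Hypothetical data: one host still runs the old agent binary.
good = hashlib.md5(b"agent v2").hexdigest()
stale = hashlib.md5(b"agent v1").hexdigest()
authority = {"paas-agent": good}
collected = {
    ("host-001", "paas-agent"): good,
    ("host-002", "paas-agent"): stale,
}

mismatches = find_inconsistent(authority, collected)

repaired: List[Tuple[str, str]] = []
dispatch_repairs(
    mismatches,
    entity_types={"paas-agent": "file"},
    repairers={"file": lambda machine, entity: repaired.append((machine, entity))},
)
```

Here only `host-002` ends up in `mismatches`, so the repairer is called for that host alone and the consistent host is left untouched.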

Key points in the process:

1. Simplified consistency calculation – For text or binary files, direct MD5 comparison is sufficient. For complex features like NIC multi‑queue, we collect the full /proc/irq/*/smp_affinity output, compute a single MD5, and compare that value, avoiding per‑IRQ comparisons.

2. Integration with deployment system – Before launching a repair task, we query the deployment system to ensure no ongoing deployment and to obtain the correct version information for the target entity, guaranteeing safe and accurate fixes.
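
Both key points can be illustrated with a short sketch. It assumes a Linux host for `read_affinity`; the function names, the `path=value` line format, and sorting by path so that the digest does not depend on read order are assumptions of this sketch, not details given by the original system. The `deployment_in_progress` flag stands in for the query to the deployment system, whose API is not described in the source.

```python
import glob
import hashlib
from typing import Dict


def read_affinity(pattern: str = "/proc/irq/*/smp_affinity") -> Dict[str, str]:
    """Read every file matching the pattern into a {path: content} mapping."""
    result = {}
    for path in glob.glob(pattern):
        with open(path) as fh:
            result[path] = fh.read().strip()
    return result


def fingerprint(values: Dict[str, str]) -> str:
    """Collapse many path/value pairs into one MD5, independent of read order."""
    digest = hashlib.md5()
    for path in sorted(values):
        digest.update(f"{path}={values[path]}\n".encode())
    return digest.hexdigest()


def should_repair(expected_md5: str, actual_md5: str, deployment_in_progress: bool) -> bool:
    """Repair only when the fingerprint differs and no deployment is running."""
    return actual_md5 != expected_md5 and not deployment_in_progress


# Hypothetical affinity masks for two IRQs.
expected = fingerprint({"/proc/irq/1/smp_affinity": "f", "/proc/irq/2/smp_affinity": "3"})
drifted = fingerprint({"/proc/irq/1/smp_affinity": "e", "/proc/irq/2/smp_affinity": "3"})
```

With a single fingerprint per machine, the NIC multi‑queue setting is compared exactly like a file: one expected MD5 against one collected MD5. The cost is that a mismatch does not say which IRQ drifted, which is acceptable when the repair re‑applies the whole configuration anyway.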

Results

After the system went live, agent consistency rose to 99.5% and overall system environment consistency reached 99.9%, reducing stability‑related incidents and saving considerable operational manpower.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
