How We Boosted CI Automation Efficiency by 50%: A Real‑World Case Study
This article details how a development team identified performance, stability, and metric shortcomings in their automated CI pipeline and implemented architectural, process, and tooling improvements—including GPU‑enabled Linux runners, dynamic thread pools, result noise reduction, and comprehensive dashboards—to dramatically increase test throughput, reliability, and overall automation value.
1. Background Challenges
Automated continuous integration (CI) is widely adopted, promising faster testing, reduced manual work, and confidence in releases, but the team faced several issues that caused the actual outcomes to diverge from expectations.
As business complexity grew, automation runs slowed, compute resources became scarce, queue times lengthened, and every release was delayed. Stability problems (script, platform, and environment failures) eroded trust in automation results, producing unreproducible failures and ignored CI checkpoints. Over-optimizing automation metrics led to investments whose cost outweighed their return, and the CI checkpoints themselves became ineffective: many automated cases ran daily yet still missed critical defects.
2. Solution Overview
To restore the true value of automated CI, the team tackled pain points across the CI workflow, infrastructure, and platform capabilities. The goals: keep automation stable, improve execution efficiency and CI checkpoint effectiveness, and reduce investigation costs, all measured against clear metrics.
2.1 Existing Architecture Overview
The CI architecture uses internal DevOps platforms (named Pub and Moon) for front-end and back-end builds and releases. Various automation types (API tests, UI tests, unit tests, mutation tests, and static scans) are scheduled via Jenkins, with most Jenkins agents managed by Kubernetes. The improvements described here focus on API and UI automation.
3. Implementation Details
3.1 Improving Execution Efficiency
CI events (code changes, deployments, scheduled jobs) trigger various test suites; UI automation is the slowest, followed by API automation and then unit tests. The target is to keep each task under 30 minutes.
3.1.1 UI Automation
Initially, UI tests ran on Windows machines with limited resources, causing average queue times of ~30 minutes and execution times over an hour per task. The solution was to migrate to GPU‑enabled Linux physical machines, run tests headlessly, and allocate resources (8 vcuda‑core, 3 vcuda‑memory per pod). Two machines now support up to 96 concurrent tasks. Concurrency was further increased by running tests in parallel at the file level.
After a half‑year migration, most UI cases run smoothly on Linux; a small subset still requires Windows due to rendering issues and is handled via test case tagging and mixed‑environment execution.
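As a concrete illustration of the headless setup, here is a minimal sketch assuming a Selenium WebDriver stack with Chrome; the article does not name the team's browser tooling, so treat the class names and flags as assumptions:

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessDriverFactory {

    // Builds a WebDriver that can run on a Linux machine with no display attached.
    public static WebDriver create() {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");          // no visible window needed
        options.addArguments("--no-sandbox");            // commonly required inside containers
        options.addArguments("--disable-dev-shm-usage"); // work around small /dev/shm in pods
        options.addArguments("--window-size=1920,1080"); // fixed viewport for stable rendering
        return new ChromeDriver(options);
    }
}
```

Removing the dependency on a desktop session is what makes dense packing of browser instances per machine, and file-level parallelism, practical.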
| Stage | Daily Runs | Queue Time | Execution Time |
|---|---|---|---|
| Before Optimization | 300 | 30 min | 60 min |
| After Optimization | 2500‑3000 | 30 s | 15 min |
3.1.2 API Automation
API tests require building test environments, which can be time-consuming. The team caches builds for identical branch/commit pairs, which covers ~70 % of tasks (sketched below). To balance load, they also switched from a single large thread pool to multiple smaller pools with dynamic scaling, compared in the table that follows, achieving better throughput without overloading downstream services.
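A minimal sketch of the caching idea, with illustrative names (the article does not show the team's implementation):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Identical branch+commit pairs reuse a previously built test environment
// instead of rebuilding it from scratch.
public class BuildCache {

    private final Map<String, String> builtEnvironments = new ConcurrentHashMap<>();

    private static String key(String branch, String commit) {
        return branch + "@" + commit;
    }

    // Returns the cached environment id for this branch/commit, if one exists.
    public Optional<String> lookup(String branch, String commit) {
        return Optional.ofNullable(builtEnvironments.get(key(branch, commit)));
    }

    // Records a freshly built environment so later CI tasks can reuse it.
    public void record(String branch, String commit, String environmentId) {
        builtEnvironments.put(key(branch, commit), environmentId);
    }
}
```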
| Strategy | Details | Pros/Cons |
|---|---|---|
| Single 12‑thread pool | All CI tasks share one pool; FIFO ordering; adjustable thread count | Fast execution per task; increases service load; fixed size requires manual intervention when a backlog occurs |
| Thread‑pool groups (3×6 threads, 1.5× elasticity) | Tasks routed to specific pools; FIFO ordering per pool; auto‑scaling per pool; adjustable group count | Fast execution per task; same load‑scaling trade‑offs as above |
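The thread-pool-group strategy maps naturally onto Java's ThreadPoolExecutor. The sketch below assumes a core size of 6 with a maximum of 9 threads per pool (the 1.5× elasticity) and hash-based routing; all names and sizes beyond those in the table are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolGroupRouter {

    private static final int GROUPS = 3;       // number of pools
    private static final int CORE_THREADS = 6; // baseline threads per pool
    private static final int MAX_THREADS = 9;  // 1.5x elasticity over the core size

    private final List<ThreadPoolExecutor> pools = new ArrayList<>();

    public PoolGroupRouter() {
        for (int i = 0; i < GROUPS; i++) {
            // Extra threads beyond the core size are created only once the
            // bounded queue fills up, which gives the "scale under backlog" behavior.
            pools.add(new ThreadPoolExecutor(
                    CORE_THREADS, MAX_THREADS,
                    60, TimeUnit.SECONDS,            // idle time before extra threads retire
                    new ArrayBlockingQueue<>(100))); // bounded FIFO backlog per pool
        }
    }

    // Route each CI task to a fixed pool so tasks with the same key stay FIFO-ordered.
    public void submit(String taskKey, Runnable task) {
        int index = Math.floorMod(taskKey.hashCode(), GROUPS);
        pools.get(index).execute(task);
    }
}
```

Keeping each pool small bounds the load any single pool can put on downstream services, while the elasticity absorbs short bursts without manual intervention.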
3.2 CI Process Optimization
The team introduced left‑shift and right‑shift testing, traffic replay, and log inspection to increase automation value.
3.2.1 Left‑Shift Testing
Attempting to run tests at the merge-request (MR) stage revealed several issues: frequent MR runs blocked pipelines, developers were reluctant to investigate test failures, and MR-stage code was not yet stable enough. The solution was to shift automation to the "test-request" stage, where testers trigger smoke tests and can step in when automation fails.
3.2.2 Automated Issue Creation
Automation results now generate issues or actionable items. To reduce noise, the team added a result‑confirmation step and smart analysis before creating issues, supporting both issue‑based and item‑based workflows to accommodate different agile teams.
3.2.3 Result Notification & Dashboards
Automation outcomes are pushed daily to personal, team, and manager dashboards, providing visibility into failures, trends, and key metrics.
3.3 Automation Stability Governance
3.3.1 Result Stability
To improve result reliability, the team applies noise reduction and intelligent analysis. Failures that persist after retries are re-run on a stable baseline: if the baseline reproduces the same failure, the result is treated as noise (a test or environment problem rather than a regression); otherwise it is reported as a bug. Additional business-specific rules filter out further non-actionable failures.
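Expressed as code, the triage logic might look like the sketch below; the retry policy, environment names, and interfaces are assumptions rather than the team's actual implementation:

```java
// Cross-checks a persistent failure against a stable baseline environment.
public class FailureTriage {

    public enum Verdict { NOISE, BUG }

    // Abstraction over "run this test in that environment"; illustrative only.
    public interface TestRunner {
        boolean passes(String testId, String environment);
    }

    private final TestRunner runner;
    private final int maxRetries;

    public FailureTriage(TestRunner runner, int maxRetries) {
        this.runner = runner;
        this.maxRetries = maxRetries;
    }

    public Verdict triage(String testId) {
        // A pass on any retry means the failure was flaky, i.e. noise.
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (runner.passes(testId, "candidate")) {
                return Verdict.NOISE;
            }
        }
        // Persistent failure: re-run on the stable baseline. A matching failure
        // there means the problem is not caused by the change under test.
        boolean baselinePasses = runner.passes(testId, "stable-baseline");
        return baselinePasses ? Verdict.BUG : Verdict.NOISE;
    }
}
```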
3.3.2 Failure Investigation
For failed API tests, the system records service snapshots, warnings, health status, and logs, as well as code coverage via JaCoCo. For UI failures, screenshots, video recordings, console errors, and network logs are captured.
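On the UI side, a screenshot-on-failure hook is straightforward to express as a JUnit 4 rule using Selenium's TakesScreenshot; this sketch assumes that stack, which the article does not specify:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.rules.TestWatcher;
import org.junit.runner.Description;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;

// Saves a PNG named after the failing test method whenever a UI test fails.
public class ScreenshotOnFailure extends TestWatcher {

    private final WebDriver driver;
    private final Path outputDir;

    public ScreenshotOnFailure(WebDriver driver, Path outputDir) {
        this.driver = driver;
        this.outputDir = outputDir;
    }

    @Override
    protected void failed(Throwable e, Description description) {
        try {
            byte[] png = ((TakesScreenshot) driver).getScreenshotAs(OutputType.BYTES);
            Files.createDirectories(outputDir);
            Files.write(outputDir.resolve(description.getMethodName() + ".png"), png);
        } catch (Exception suppressed) {
            // Artifact capture must never fail the build itself.
        }
    }
}
```

A test class attaches this with @Rule and passes in its WebDriver; video, console-error, and network-log capture can hook into the same failure callback.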
3.3.3 Jenkins Stability
Jenkins, the core CI scheduler, began to strain as daily automation volume grew past 10,000 runs, hitting full GC pauses, disk-space exhaustion, and network bottlenecks. Fixes included limiting retained build history, offloading logs to object storage, and removing unnecessary plugins.
3.4 Metric Measurement
The team defined a three‑level metric hierarchy: primary result metrics (automation fault discovery rate and omission rate), secondary decomposition metrics (coverage, checkpoint interception, key‑scenario coverage), and tertiary improvement metrics (specific actions to boost the primary metrics). These metrics drive continuous improvement without becoming a punitive KPI.
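The article names the two primary metrics without defining them; a common formulation, stated here as an assumption, is:

```latex
% Assumed definitions; the source names these metrics but does not formalize them.
\text{fault discovery rate} =
  \frac{\text{defects caught by CI automation}}{\text{all defects found in the cycle}}
\qquad
\text{omission rate} =
  \frac{\text{defects that escaped CI automation}}{\text{all defects found in the cycle}}
```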
4. Outcomes & Future Plans
After implementing the improvements, the team achieved a 50 % reduction in overall execution time, lowered failure rates from 17 % to 2 %, and increased automation issue effectiveness from 40 % to 70 %. Future work includes integrating all internal regression platforms into CI, expanding left‑ and right‑shift testing, and exploring AI‑assisted test generation, intelligent result analysis, and automated remediation to further boost testing efficiency and software quality.
