How HuoLala Built a Scalable Big Data Warehouse CI/CD Pipeline
This article details how HuoLala designed and implemented a standardized, automated release pipeline for its big‑data warehouse. The pipeline adds admission and exit checkpoints, isolates test data, and applies CI/CD practices to improve efficiency, reduce risk, and bring data‑quality assurance close to zero manual testing.
Background and Challenges
Unlike traditional project releases, HuoLala's big‑data offline warehouse tasks are released through the offline development platform. The existing release process, however, suffered from the problems below.
Warehouse Release Process Issues
Release process is loosely controlled.
Testers have no visibility of releases.
No dedicated test tables, causing test data to pollute production.
Warehouse Task Situation
Many tasks rely on developer self‑testing with little quality awareness, making them prone to problems.
Tasks change frequently and have wide impact, and manual testing is time‑consuming.
To address these problems, we designed a standardized release process and combined it with engineering practices and quality‑efficiency tooling into an integrated solution: a more efficient, higher‑quality continuous‑delivery pipeline that ultimately achieves fully automated testing.
Solution and Goals
We collaborated with the big‑data offline development platform and adopted the following measures:
Release Process Standardization : add approval and entry/exit checkpoints.
Physical Isolation of Test Data : write test data to a gray‑space to avoid contaminating online data.
Automated Admission/Exit : automatically detect changeable tasks, trigger checks, and block releases on failure.
Target outcomes include risk mitigation, early feedback, strict CI/CD standards, and reduced communication cost.
Capability Construction
The Big Data Warehouse Pipeline platform, built on Spring Boot, relies on the big‑data testing platform for task computation. It provides admission and exit functions and a checkpoint mechanism, consisting of presentation, service, storage, and data layers.
3.1 Platform Basic Capabilities
The platform supports pipeline generation and execution, result notification, task‑template application, warehouse task change analysis, node composition, and visualisation.
Pipeline Generation & Execution : automatic creation and management of data‑processing pipelines.
Result Notification : deliver results via alerts, reports, and visual dashboards.
Task Template Application : use predefined templates to standardise test task nodes.
Warehouse Task Change Analysis : generate test nodes based on change types.
Node Composition : flexibly combine admission/exit nodes to meet diverse checkpoints.
Visualisation : intuitive tools for analysing node flows and results.
3.1.1 Pipeline Generation and Execution
After development of a warehouse task change is completed, the system runs it in a gray environment to generate Hive data and then triggers the pipeline.
The platform parses the task, retrieves Hive table metadata, and creates test nodes using admission/exit templates.
Execution is parallel; the final pipeline result is considered successful only if all test tasks succeed.
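The parallel-execution rule (the pipeline succeeds only if every test task succeeds) could be sketched as follows; the class and method names are illustrative, not the platform's actual code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Illustrative sketch, not the platform's code: run all test nodes in
// parallel and report success only when every node succeeds.
public class PipelineRunSketch {
    // Each Supplier<Boolean> stands in for one admission/exit test node.
    static boolean runAll(List<Supplier<Boolean>> nodes) {
        List<CompletableFuture<Boolean>> futures = nodes.stream()
                .map(CompletableFuture::supplyAsync)
                .toList();
        // The pipeline result is the logical AND of all node results.
        return futures.stream().allMatch(CompletableFuture::join);
    }

    public static void main(String[] args) {
        System.out.println(runAll(List.of(() -> true, () -> true)));  // true
        System.out.println(runAll(List.of(() -> true, () -> false))); // false
    }
}
```

A real implementation would also need failure handling for nodes that throw, which the sketch omits.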
Core code (simplified):
// Get all fields, table type, primary key, partition info of the test Hive table
List<Map<String,String>> hiveTaskParams = getTestHiveParams(bigDataPipeline);
// Generate pipeline task name
String pipeTaskName = generatePipelineTaskName(bigDataPipeline, pipeType);
// Create pipeline task via the testing platform
PipelineTask task = monitorModuleService.createPipelineTask(bigDataPipeline, pipeType, pipeTaskName, hiveTaskParams);
// Run the task
Long runResult = monitorSchedulerService.generateTestScheduler(task.getId());
map.put("tektonId", String.valueOf(runResult));
// Callback execution result to pipeline
pipelineService.callback(map);
3.1.2 Pipeline Result Notification
If the pipeline succeeds, no message is sent. If it fails, an automatic group notification is pushed to developers with a link to detailed results.
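This failure-only rule can be sketched as a small helper; the message text and URL format are assumptions, and the actual delivery to the group channel is outside the sketch:

```java
// Illustrative sketch of the failure-only notification rule; the message
// text and URL format are assumptions, not the platform's real template.
public class PipelineNotifier {
    // Returns null on success (no message is sent), otherwise the alert
    // text that would be pushed to the developer group.
    static String buildAlert(boolean pipelineSucceeded, String detailUrl) {
        if (pipelineSucceeded) {
            return null; // success: stay silent
        }
        return "Pipeline failed, see details: " + detailUrl;
    }
}
```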
3.2 Admission/Exit Flow Control
The pipeline manages task changes, ensuring they meet entry standards and satisfy exit quality criteria before release.
3.2.1 Admission Flow
Data existence is a prerequisite; the test template validates partition data presence.
SELECT SUM(cnt) AS cnt
FROM (
SELECT COUNT(1) AS cnt
FROM (
SELECT *
FROM hive_gray_db.hive_table_name_XXX
WHERE 1 = 1
AND dt = '2025-05-08'
LIMIT 1
) a
) a;
3.2.2 Exit Flow
Primary Key Uniqueness : use GROUP BY and HAVING to detect duplicates.
Data Volatility : compare gray‑space and online table row counts to compute the fluctuation rate.
Field Regression : compare unchanged fields between test and production tables.
Field Null‑Rate : calculate null or empty string ratios.
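To illustrate the null-rate and volatility checks above, here is a hedged sketch that assembles the corresponding HiveQL; all table, field, and partition names are placeholders, and the platform's real templates may differ:

```java
// Illustrative sketch: assemble HiveQL for two exit checks. All table,
// field, and partition names are placeholders, not the real templates.
public class ExitCheckSql {
    // Field null-rate: share of NULL or empty-string values in one partition.
    static String nullRateSql(String table, String field, String dt) {
        return "SELECT SUM(CASE WHEN " + field + " IS NULL OR " + field
             + " = '' THEN 1 ELSE 0 END) / COUNT(1) AS null_rate FROM "
             + table + " WHERE dt = '" + dt + "'";
    }

    // Data volatility: gray-space row count relative to the online table
    // for the same partition.
    static String volatilitySql(String grayTable, String onlineTable, String dt) {
        return "SELECT g.cnt / o.cnt AS fluctuation FROM "
             + "(SELECT COUNT(1) AS cnt FROM " + grayTable + " WHERE dt = '" + dt + "') g "
             + "CROSS JOIN "
             + "(SELECT COUNT(1) AS cnt FROM " + onlineTable + " WHERE dt = '" + dt + "') o";
    }
}
```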
SELECT COUNT(1) AS repeatCnt
FROM (
SELECT key_fields, COUNT(1) AS key_cnt
FROM hive_gray_db.hive_table_name_XXX
WHERE 1 = 1
AND dt = '2025-05-08'
GROUP BY key_fields
HAVING COUNT(1) > 1
) a;
3.2.3 Node Assembly
New Hive tables trigger admission nodes for data existence and exit nodes for primary‑key uniqueness and null‑rate checks. Modified tables include all nodes, covering consistency, regression, and volatility.
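The assembly rule above could be sketched as a simple mapping from change type to node set; the node names are illustrative, not the platform's identifiers:

```java
import java.util.List;

// Illustrative sketch of node assembly: the node set depends on whether
// the Hive table is newly created or modified. Node names are made up.
public class NodeAssembler {
    static List<String> nodesFor(boolean isNewTable) {
        if (isNewTable) {
            // New tables: data existence on admission; primary-key
            // uniqueness and null-rate on exit.
            return List.of("data_existence", "pk_uniqueness", "null_rate");
        }
        // Modified tables: the full node set, adding consistency,
        // regression, and volatility checks.
        return List.of("data_existence", "pk_uniqueness", "null_rate",
                       "field_consistency", "field_regression", "data_volatility");
    }
}
```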
3.3 Practical Application
The pipeline integrates with the offline release platform as a checkpoint: only when the pipeline succeeds can the task be released online.
Steps:
After gray‑task completion, the release platform sends task info to the pipeline.
The pipeline creates tasks, generates admission/exit nodes via templates.
Nodes are executed; results are written back.
Failure triggers an alert group; success allows release.
The pipeline result serves as a release gate.
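The gate itself reduces to a single decision; a minimal sketch, assuming a three-state pipeline result (the enum and method names are illustrative):

```java
// Minimal sketch of the release gate, assuming a three-state result;
// the enum and method names are illustrative.
public class ReleaseGate {
    enum PipelineResult { SUCCESS, FAILURE, RUNNING }

    // Release is allowed only when the pipeline reports success; a still
    // running or failed pipeline blocks the release.
    static boolean canRelease(PipelineResult result) {
        return result == PipelineResult.SUCCESS;
    }
}
```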
Results
The platform achieved standardized processes, higher efficiency, and strong quality guarantees.
Process Standardization
Release checkpoints block deployments when admission or exit nodes fail.
Cross‑team transparency via automatic Feishu group notifications.
Efficiency Gains
Full‑pipeline automation eliminates manual testing, covering over 1,000 warehouse tasks, 3,900 iterations, and saving more than 240 person‑days.
Quality Improvements
Expanded coverage from zero to full pipeline, intercepting over 200 releases and discovering more than 300 issues.
Future Outlook
Intelligent Result Judgement : apply AI to reduce noise and false alarms.
More Nodes : add additional admission/exit nodes to increase test coverage.