How HuoLala Built a Scalable Big Data Warehouse CI/CD Pipeline
This article details how HuoLala designed and implemented a standardized, automated release pipeline for its big‑data warehouse. The pipeline adds admission and exit checkpoints, isolates test data, and applies CI/CD practices to improve efficiency, reduce risk, and bring data‑quality assurance close to zero manual testing.
Background and Challenges
Unlike traditional project releases, HuoLala's big‑data offline warehouse tasks are released through the offline development platform. The existing release process, however, suffered from the problems below.
Warehouse Release Process Issues
Release process is loosely controlled.
Testers have no visibility of releases.
No dedicated test tables, causing test data to pollute production.
Warehouse Task Situation
Many tasks rely on developer self‑testing with little quality awareness, making them prone to problems.
Tasks change frequently and have wide impact, and manual testing is time‑consuming.
To address these problems, we designed a standardized release process and combined it with engineering practices and quality‑efficiency tooling into an integrated solution: a more efficient, higher‑quality continuous‑delivery pipeline that ultimately achieves fully automated testing.
Solution and Goals
We collaborated with the big‑data offline development platform and adopted the following measures:
Release Process Standardization : add approval and entry/exit checkpoints.
Physical Isolation of Test Data : write test data to a gray‑space to avoid contaminating online data.
Automated Admission/Exit : automatically detect changeable tasks, trigger checks, and block releases on failure.
Target outcomes include risk mitigation, early feedback, strict CI/CD standards, and reduced communication cost.
Capability Construction
The Big Data Warehouse Pipeline platform, built on Spring Boot, relies on the big‑data testing platform for task computation. It provides admission and exit functions and a checkpoint mechanism, consisting of presentation, service, storage, and data layers.
3.1 Platform Basic Capabilities
The platform supports pipeline generation and execution, result notification, task‑template application, warehouse task change analysis, node composition, and visualisation.
Pipeline Generation & Execution : automatic creation and management of data‑processing pipelines.
Result Notification : deliver results via alerts, reports, and visual dashboards.
Task Template Application : use predefined templates to standardise test task nodes.
Warehouse Task Change Analysis : generate test nodes based on change types.
Node Composition : flexibly combine admission/exit nodes to meet diverse checkpoints.
Visualisation : intuitive tools for analysing node flows and results.
3.1.1 Pipeline Generation and Execution
After development of a warehouse task change is completed, the system runs it in a gray environment to generate Hive data and then triggers the pipeline.
The platform parses the task, retrieves Hive table metadata, and creates test nodes using admission/exit templates.
Execution is parallel; the final pipeline result is considered successful only if all test tasks succeed.
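The parallel-execution rule (the pipeline succeeds only if every test task succeeds) could be sketched as follows; the class and method names are illustrative, not the platform's actual code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Illustrative sketch, not the platform's code: run all test nodes in
// parallel and report success only when every node succeeds.
public class PipelineRunSketch {
    // Each Supplier<Boolean> stands in for one admission/exit test node.
    static boolean runAll(List<Supplier<Boolean>> nodes) {
        List<CompletableFuture<Boolean>> futures = nodes.stream()
                .map(CompletableFuture::supplyAsync)
                .toList();
        // The pipeline result is the logical AND of all node results.
        return futures.stream().allMatch(CompletableFuture::join);
    }

    public static void main(String[] args) {
        System.out.println(runAll(List.of(() -> true, () -> true)));  // true
        System.out.println(runAll(List.of(() -> true, () -> false))); // false
    }
}
```

A real implementation would also need failure handling for nodes that throw, which the sketch omits.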
Core code (simplified):
// Get all fields, table type, primary key, partition info of the test Hive table
List<Map<String,String>> hiveTaskParams = getTestHiveParams(bigDataPipeline);
// Generate pipeline task name
String pipeTaskName = generatePipelineTaskName(bigDataPipeline, pipeType);
// Create pipeline task via the testing platform
PipelineTask task = monitorModuleService.createPipelineTask(bigDataPipeline, pipeType, pipeTaskName, hiveTaskParams);
// Run the task
Long runResult = monitorSchedulerService.generateTestScheduler(task.getId());
map.put("tektonId", String.valueOf(runResult));
// Callback execution result to pipeline
pipelineService.callback(map);
3.1.2 Pipeline Result Notification
If the pipeline succeeds, no message is sent. If it fails, an automatic group notification is pushed to developers with a link to detailed results.
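This failure-only rule can be sketched as a small helper; the message text and URL format are assumptions, and the actual delivery to the group channel is outside the sketch:

```java
// Illustrative sketch of the failure-only notification rule; the message
// text and URL format are assumptions, not the platform's real template.
public class PipelineNotifier {
    // Returns null on success (no message is sent), otherwise the alert
    // text that would be pushed to the developer group.
    static String buildAlert(boolean pipelineSucceeded, String detailUrl) {
        if (pipelineSucceeded) {
            return null; // success: stay silent
        }
        return "Pipeline failed, see details: " + detailUrl;
    }
}
```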
3.2 Admission/Exit Flow Control
The pipeline manages task changes, ensuring they meet entry standards and satisfy exit quality criteria before release.
3.2.1 Admission Flow
Data existence is a prerequisite; the test template validates partition data presence.
SELECT SUM(cnt) AS cnt
FROM (
SELECT COUNT(1) AS cnt
FROM (
SELECT *
FROM hive_gray_db.hive_table_name_XXX
WHERE 1 = 1
AND dt = '2025-05-08'
LIMIT 1
) a
) a;
3.2.2 Exit Flow
Primary Key Uniqueness : use GROUP BY and HAVING to detect duplicates.
Data Volatility : compare gray‑space and online table row counts to compute the fluctuation rate.
Field Regression : compare unchanged fields between test and production tables.
Field Null‑Rate : calculate null or empty string ratios.
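To illustrate the null-rate and volatility checks above, here is a hedged sketch that assembles the corresponding HiveQL; all table, field, and partition names are placeholders, and the platform's real templates may differ:

```java
// Illustrative sketch: assemble HiveQL for two exit checks. All table,
// field, and partition names are placeholders, not the real templates.
public class ExitCheckSql {
    // Field null-rate: share of NULL or empty-string values in one partition.
    static String nullRateSql(String table, String field, String dt) {
        return "SELECT SUM(CASE WHEN " + field + " IS NULL OR " + field
             + " = '' THEN 1 ELSE 0 END) / COUNT(1) AS null_rate FROM "
             + table + " WHERE dt = '" + dt + "'";
    }

    // Data volatility: gray-space row count relative to the online table
    // for the same partition.
    static String volatilitySql(String grayTable, String onlineTable, String dt) {
        return "SELECT g.cnt / o.cnt AS fluctuation FROM "
             + "(SELECT COUNT(1) AS cnt FROM " + grayTable + " WHERE dt = '" + dt + "') g "
             + "CROSS JOIN "
             + "(SELECT COUNT(1) AS cnt FROM " + onlineTable + " WHERE dt = '" + dt + "') o";
    }
}
```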
SELECT COUNT(1) AS repeatCnt
FROM (
SELECT key_fields, COUNT(1) AS key_cnt
FROM hive_gray_db.hive_table_name_XXX
WHERE 1 = 1
AND dt = '2025-05-08'
GROUP BY key_fields
HAVING COUNT(1) > 1
) a;
3.2.3 Node Assembly
New Hive tables trigger admission nodes for data existence and exit nodes for primary‑key uniqueness and null‑rate checks. Modified tables include all nodes, covering consistency, regression, and volatility.
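The assembly rule above could be sketched as a simple mapping from change type to node set; the node names are illustrative, not the platform's identifiers:

```java
import java.util.List;

// Illustrative sketch of node assembly: the node set depends on whether
// the Hive table is newly created or modified. Node names are made up.
public class NodeAssembler {
    static List<String> nodesFor(boolean isNewTable) {
        if (isNewTable) {
            // New tables: data existence on admission; primary-key
            // uniqueness and null-rate on exit.
            return List.of("data_existence", "pk_uniqueness", "null_rate");
        }
        // Modified tables: the full node set, adding consistency,
        // regression, and volatility checks.
        return List.of("data_existence", "pk_uniqueness", "null_rate",
                       "field_consistency", "field_regression", "data_volatility");
    }
}
```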
3.3 Practical Application
The pipeline integrates with the offline release platform as a checkpoint: only when the pipeline succeeds can the task be released online.
Steps:
After gray‑task completion, the release platform sends task info to the pipeline.
The pipeline creates tasks, generates admission/exit nodes via templates.
Nodes are executed; results are written back.
Failure triggers an alert group; success allows release.
The pipeline result serves as a release gate.
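The gate itself reduces to a single decision; a minimal sketch, assuming a three-state pipeline result (the enum and method names are illustrative):

```java
// Minimal sketch of the release gate, assuming a three-state result;
// the enum and method names are illustrative.
public class ReleaseGate {
    enum PipelineResult { SUCCESS, FAILURE, RUNNING }

    // Release is allowed only when the pipeline reports success; a still
    // running or failed pipeline blocks the release.
    static boolean canRelease(PipelineResult result) {
        return result == PipelineResult.SUCCESS;
    }
}
```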
Results
The platform achieved standardized processes, higher efficiency, and strong quality guarantees.
Process Standardization
Release checkpoints block deployments when admission or exit nodes fail.
Cross‑team transparency via automatic Feishu group notifications.
Efficiency Gains
Full‑pipeline automation eliminates manual testing, covering over 1,000 warehouse tasks, 3,900 iterations, and saving more than 240 person‑days.
Quality Improvements
Expanded coverage from zero to full pipeline, intercepting over 200 releases and discovering more than 300 issues.
Future Outlook
Intelligent Result Judgement : apply AI to reduce noise and false alarms.
More Nodes : add additional admission/exit nodes to increase test coverage.