How We Turned AI Coding for Data Warehouses into an End‑to‑End Pipeline with Harness

The article analyzes why AI‑generated SQL alone cannot meet production data‑warehouse requirements, outlines four critical pain points, and presents a seven‑layer Harness framework that adds deterministic engineering controls, state persistence, skill registration, anti‑pattern libraries, and evidence‑based checks, achieving up to 94% time reduction and near‑zero side‑effects.

DataFunSummit
DataFunSummit
DataFunSummit
How We Turned AI Coding for Data Warehouses into an End‑to‑End Pipeline with Harness

Problem Statement

Internal testing showed that large language models (LLMs) can generate syntactically correct SQL (≈90% correct) but fail to meet data‑warehouse requirements when additional constraints are added. Accuracy dropped to 8.6% after enforcing primary‑key uniqueness, no SELECT *, partition naming, data‑volume limits, and CRC32 fingerprint consistency, and fell to 0% when full delivery artifacts (rollback SQL, DQC checks, historical back‑fill scripts, change‑log entries) were required.

The failures stem from four pain points specific to data‑warehouse development:

Semantic drift across layers – business definitions change from ODS to DWD to DWS, and LLMs lose long‑range context.

Sensitivity to business definitions – a single join type or date filter error can break downstream reports.

High cost of back‑filling wide tables – a mistake may require terabytes of re‑processing and cascade to dozens of downstream tables.

Strict SLA constraints – production tables must be ready by early morning; any failure forces overnight firefighting.

Traditional Copilot‑style assistants only speed up typing and cannot address these issues.

Why Copilot Falls Short

Copilot can catch syntax errors post‑generation but cannot enforce cross‑layer business rules, prevent logical omissions, or provide the evidential guarantees required for production‑grade data‑warehouse releases.

Harness Framework Overview

Treat the LLM as a creative but forgetful engineer; use a Master orchestrator, SKILL registry, human‑in‑the‑loop (HITL) gates, persistent state, an anti‑pattern library, and mandatory self‑checks to achieve engineering‑grade reliability.

The framework consists of seven deterministic layers:

L1 – Master Orchestrator : explicit state machine with stage gates; reads a single source of truth JSON ( _state.json) at each session.

L2 – SKILL Registry : modular .md files describing trigger keywords, capability boundaries, dependencies, execution steps, required artifacts, and known pitfalls.

L3 – MCP Tool Bus : sandboxed execution of external tools (SQL, TAPD, Feishu, Tableau) with no direct production writes.

L4 – Human‑in‑the‑Loop Gates : five low‑cost decision points (G1–G5) where humans review evidence before the workflow proceeds.

L5 – State Persistence : _state.json stores the authoritative workflow state and artifact tree, enabling cross‑session continuity.

L6 – Anti‑Pattern Library : anti_pattern_library.md records past AI mistakes (error code, scenario, observed phenomenon, root cause, mitigation).

L7 – Mandatory Self‑Check : at the final T1 stage the LLM must answer five evidence‑based questions (e.g., DQC pass logs, CRC32 fingerprint equality, Feishu changelog revision ID).

Key Design Details

1. SKILL Registry vs. IDE Hooks

Hooks can only enforce post‑hoc syntax checks and provide opaque rejections. SKILL files act as pre‑emptive, explainable constraints that are loaded on demand, allowing the LLM to understand *why* a rule exists. Each SKILL defines:

Trigger semantics (keywords that activate the skill).

Capability boundaries (what the skill can and cannot do).

Dependencies (required MCP tools or files).

Execution steps (e.g., six DQC SQL templates).

Required artifacts (validation reports, generated SQL files).

Common pitfalls (negative samples for anti‑pattern injection).

2. Persistent State Machine

Cross‑session execution is driven by a JSON state file rather than the LLM’s volatile context. Example _state.json:

{
  "req_id": "Indonesia three‑party log sync",
  "phase": "M3_done",
  "feishu_enabled": true,
  "impacted_cols": ["business_type", "third_party_scene_code"],
  "unchanged_cols_fp": "crc32_sum=8843718233",
  "hitl_gates_passed": ["S1","S5","M1","M3"],
  "hitl_gates_pending": ["T1"],
  "current_master_skill_version": "v1.6",
  "anti_pattern_hits": ["feishu_changelog_skipped"],
  "last_updated": "20260601T17:42:00+08:00"
}

The Master reads this file at the start of every new session, reconstructing the workflow without relying on previous chat history.

3. Human‑in‑the‑Loop (HITL) Gates

Five gates are placed where the value of a human decision outweighs the cost of correction:

G1 (S1) – after requirement analysis, verify that the parsed business definition matches the original request.

G2 (S5) – after model design, confirm DDL, mapping, and dependency graph.

G3 (M1) – after code blueprint generation, review the five code artifacts (DDL, ETL, back‑fill, rollback, DQC).

G4 (M3) – after self‑testing, validate DQC results, consistency diffs, and back‑fill readiness.

G5 (T1) – final delivery, require evidence for change‑log entry, CRC32 equality, and rollout declaration.

Each gate presents a structured evidence report rather than an open‑ended question, turning “human review” into a concise “approve or reject” action.

4. Anti‑Pattern Library

Past AI failures are cataloged with fields: error code (e.g., AP‑007), scenario, observed phenomenon, root cause, and mitigation. SKILL files reference relevant entries, and the Master injects these negative samples into the LLM context before the next step. This reduced repeat‑error rates from 47% to 6% in production.

5. Mandatory Self‑Check (Answer‑with‑Evidence)

At the T1 stage the LLM must answer five concrete questions, providing raw evidence such as:

Q1: Did all 6 DQC rules PASS? – attach each DQC SQL <em>query_id</em> and status.
Q2: Are existing fields unchanged? – show dev vs prod CRC32 fingerprint sum for the partition.
Q3: Was the change‑log written to Feishu? – provide <em>documentRevisionId</em> and URL.
Q4: Is the historical back‑fill script complete? – list script path and covered partitions.
Q5: Were downstream owners notified? – list owners and notification method.

These questions map directly to the six hard delivery criteria (syntax, business definition, consistency, DQC, rollback, change‑log), eliminating “virtual done” reports.

End‑to‑End Case Study: Indonesia Three‑Party Log Table Sync

Task: add two new metrics to a 1,800‑row wide table with 500+ core fields, back‑fill from 20260525, and guarantee zero side‑effects.

Performance : total wall‑clock time 47 minutes (33 min AI work, 14 min human review). Compared to the previous manual process of ~2 person‑days, this is a 25× speedup.

Zero‑Side‑Effect Verification – a CRC32 fingerprint query ensures that the dev and prod partitions have identical sums for all existing fields:

SELECT COUNT(*) AS row_cnt,
       SUM(CRC32(CONCAT_WS('|',
           COALESCE(CAST(uuid AS STRING), ''),
           COALESCE(CAST(user_id AS STRING), ''),
           ...,
           COALESCE(CAST(create_time AS STRING), ''
       ))) ) AS fp_sum
FROM ${target_table}
WHERE dt = '${p_date}';

Both environments produced the same fp_sum, confirming no unintended changes.

ROI and Effectiveness

Stage‑level time reductions (manual vs. Harness) ranged from 87% to 95%:

Requirement analysis: 2‑3 h → 8 min (95% reduction).

Upstream profiling: 1‑2 h → 4 min (92%).

Model design: 2‑4 h → 8 min (94%).

ETL coding: 2‑3 h → 12 min (90%).

Self‑testing & DQC: 3‑5 h → 12 min (92%).

CR & release: 1‑2 h → 8 min (87%).

Delivery testing: 1‑2 h → 5 min (92%).

Overall human effort per demand dropped from ~14 person‑hours to ~5 minutes, freeing roughly 10 person‑months per month for a ten‑engineer team.

Zero‑side‑effect success rate improved from 38% (pre‑Harness) to 96% in version v1.6.

Design Philosophy Comparison

Compared to Copilot (assistive) and fully autonomous agents (self‑driving), Harness emphasizes a controlled collaborative workflow where humans make critical decisions, the system proactively intercepts errors, and deterministic layers guarantee reliability.

Future Plans

Refactor the Master Orchestrator into multiple sub‑agents (verifier, reporter, etc.) for better isolation.

Build a visual console to replace raw _state.json, showing progress, gate status, and artifact previews.

Generalize the seven‑layer architecture to other high‑sensitivity domains such as financial risk, marketing metrics, and reporting.

Conclusion

Harness is a living system that evolves with each failure mode. By turning AI‑generated code into an engineering‑grade, zero‑risk delivery pipeline, it transforms AI from a “nice‑to‑have” feature into a reliable production tool for data‑warehouse workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

engineeringAIAutomationLLMData Warehouse
DataFunSummit
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.