How We Turned AI Coding for Data Warehouses into an End‑to‑End Pipeline with Harness
The article analyzes why AI‑generated SQL alone cannot meet production data‑warehouse requirements, outlines four critical pain points, and presents a seven‑layer Harness framework that adds deterministic engineering controls, state persistence, skill registration, anti‑pattern libraries, and evidence‑based checks, achieving up to 94% time reduction and near‑zero side‑effects.
Problem Statement
Internal testing showed that large language models (LLMs) can generate syntactically correct SQL (≈90% correct) but fail to meet data‑warehouse requirements when additional constraints are added. Accuracy dropped to 8.6% after enforcing primary‑key uniqueness, no SELECT *, partition naming, data‑volume limits, and CRC32 fingerprint consistency, and fell to 0% when full delivery artifacts (rollback SQL, DQC checks, historical back‑fill scripts, change‑log entries) were required.
The failures stem from four pain points specific to data‑warehouse development:
Semantic drift across layers – business definitions change from ODS to DWD to DWS, and LLMs lose long‑range context.
Sensitivity to business definitions – a single join type or date filter error can break downstream reports.
High cost of back‑filling wide tables – a mistake may require terabytes of re‑processing and cascade to dozens of downstream tables.
Strict SLA constraints – production tables must be ready by early morning; any failure forces overnight firefighting.
Traditional Copilot‑style assistants only speed up typing and cannot address these issues.
Why Copilot Falls Short
Copilot can catch syntax errors post‑generation but cannot enforce cross‑layer business rules, prevent logical omissions, or provide the evidential guarantees required for production‑grade data‑warehouse releases.
Harness Framework Overview
Treat the LLM as a creative but forgetful engineer; use a Master orchestrator, SKILL registry, human‑in‑the‑loop (HITL) gates, persistent state, an anti‑pattern library, and mandatory self‑checks to achieve engineering‑grade reliability.
The framework consists of seven deterministic layers:
L1 – Master Orchestrator : explicit state machine with stage gates; reads a single source of truth JSON ( _state.json) at each session.
L2 – SKILL Registry : modular .md files describing trigger keywords, capability boundaries, dependencies, execution steps, required artifacts, and known pitfalls.
L3 – MCP Tool Bus : sandboxed execution of external tools (SQL, TAPD, Feishu, Tableau) with no direct production writes.
L4 – Human‑in‑the‑Loop Gates : five low‑cost decision points (G1–G5) where humans review evidence before the workflow proceeds.
L5 – State Persistence : _state.json stores the authoritative workflow state and artifact tree, enabling cross‑session continuity.
L6 – Anti‑Pattern Library : anti_pattern_library.md records past AI mistakes (error code, scenario, observed phenomenon, root cause, mitigation).
L7 – Mandatory Self‑Check : at the final T1 stage the LLM must answer five evidence‑based questions (e.g., DQC pass logs, CRC32 fingerprint equality, Feishu changelog revision ID).
Key Design Details
1. SKILL Registry vs. IDE Hooks
Hooks can only enforce post‑hoc syntax checks and provide opaque rejections. SKILL files act as pre‑emptive, explainable constraints that are loaded on demand, allowing the LLM to understand *why* a rule exists. Each SKILL defines:
Trigger semantics (keywords that activate the skill).
Capability boundaries (what the skill can and cannot do).
Dependencies (required MCP tools or files).
Execution steps (e.g., six DQC SQL templates).
Required artifacts (validation reports, generated SQL files).
Common pitfalls (negative samples for anti‑pattern injection).
2. Persistent State Machine
Cross‑session execution is driven by a JSON state file rather than the LLM’s volatile context. Example _state.json:
{
"req_id": "Indonesia three‑party log sync",
"phase": "M3_done",
"feishu_enabled": true,
"impacted_cols": ["business_type", "third_party_scene_code"],
"unchanged_cols_fp": "crc32_sum=8843718233",
"hitl_gates_passed": ["S1","S5","M1","M3"],
"hitl_gates_pending": ["T1"],
"current_master_skill_version": "v1.6",
"anti_pattern_hits": ["feishu_changelog_skipped"],
"last_updated": "20260601T17:42:00+08:00"
}The Master reads this file at the start of every new session, reconstructing the workflow without relying on previous chat history.
3. Human‑in‑the‑Loop (HITL) Gates
Five gates are placed where the value of a human decision outweighs the cost of correction:
G1 (S1) – after requirement analysis, verify that the parsed business definition matches the original request.
G2 (S5) – after model design, confirm DDL, mapping, and dependency graph.
G3 (M1) – after code blueprint generation, review the five code artifacts (DDL, ETL, back‑fill, rollback, DQC).
G4 (M3) – after self‑testing, validate DQC results, consistency diffs, and back‑fill readiness.
G5 (T1) – final delivery, require evidence for change‑log entry, CRC32 equality, and rollout declaration.
Each gate presents a structured evidence report rather than an open‑ended question, turning “human review” into a concise “approve or reject” action.
4. Anti‑Pattern Library
Past AI failures are cataloged with fields: error code (e.g., AP‑007), scenario, observed phenomenon, root cause, and mitigation. SKILL files reference relevant entries, and the Master injects these negative samples into the LLM context before the next step. This reduced repeat‑error rates from 47% to 6% in production.
5. Mandatory Self‑Check (Answer‑with‑Evidence)
At the T1 stage the LLM must answer five concrete questions, providing raw evidence such as:
Q1: Did all 6 DQC rules PASS? – attach each DQC SQL <em>query_id</em> and status.
Q2: Are existing fields unchanged? – show dev vs prod CRC32 fingerprint sum for the partition.
Q3: Was the change‑log written to Feishu? – provide <em>documentRevisionId</em> and URL.
Q4: Is the historical back‑fill script complete? – list script path and covered partitions.
Q5: Were downstream owners notified? – list owners and notification method.These questions map directly to the six hard delivery criteria (syntax, business definition, consistency, DQC, rollback, change‑log), eliminating “virtual done” reports.
End‑to‑End Case Study: Indonesia Three‑Party Log Table Sync
Task: add two new metrics to a 1,800‑row wide table with 500+ core fields, back‑fill from 20260525, and guarantee zero side‑effects.
Performance : total wall‑clock time 47 minutes (33 min AI work, 14 min human review). Compared to the previous manual process of ~2 person‑days, this is a 25× speedup.
Zero‑Side‑Effect Verification – a CRC32 fingerprint query ensures that the dev and prod partitions have identical sums for all existing fields:
SELECT COUNT(*) AS row_cnt,
SUM(CRC32(CONCAT_WS('|',
COALESCE(CAST(uuid AS STRING), ''),
COALESCE(CAST(user_id AS STRING), ''),
...,
COALESCE(CAST(create_time AS STRING), ''
))) ) AS fp_sum
FROM ${target_table}
WHERE dt = '${p_date}';Both environments produced the same fp_sum, confirming no unintended changes.
ROI and Effectiveness
Stage‑level time reductions (manual vs. Harness) ranged from 87% to 95%:
Requirement analysis: 2‑3 h → 8 min (95% reduction).
Upstream profiling: 1‑2 h → 4 min (92%).
Model design: 2‑4 h → 8 min (94%).
ETL coding: 2‑3 h → 12 min (90%).
Self‑testing & DQC: 3‑5 h → 12 min (92%).
CR & release: 1‑2 h → 8 min (87%).
Delivery testing: 1‑2 h → 5 min (92%).
Overall human effort per demand dropped from ~14 person‑hours to ~5 minutes, freeing roughly 10 person‑months per month for a ten‑engineer team.
Zero‑side‑effect success rate improved from 38% (pre‑Harness) to 96% in version v1.6.
Design Philosophy Comparison
Compared to Copilot (assistive) and fully autonomous agents (self‑driving), Harness emphasizes a controlled collaborative workflow where humans make critical decisions, the system proactively intercepts errors, and deterministic layers guarantee reliability.
Future Plans
Refactor the Master Orchestrator into multiple sub‑agents (verifier, reporter, etc.) for better isolation.
Build a visual console to replace raw _state.json, showing progress, gate status, and artifact previews.
Generalize the seven‑layer architecture to other high‑sensitivity domains such as financial risk, marketing metrics, and reporting.
Conclusion
Harness is a living system that evolves with each failure mode. By turning AI‑generated code into an engineering‑grade, zero‑risk delivery pipeline, it transforms AI from a “nice‑to‑have” feature into a reliable production tool for data‑warehouse workloads.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
