Big Data 26 min read

Boosting SQL Compliance to 95%: Harness Solves AI’s “Memory Loss” in Data Warehouse

The article analyzes the challenges of AI‑generated SQL in a data‑warehouse environment—context loss, unstable rule enforcement, and token overflow—and presents a five‑layer Harness architecture that persists constraints, injects hooks, uses subagents, and refactors SKILL files, raising SQL compliance from 70‑80% to over 95% while reducing context compacting.

dbaplus Community
dbaplus Community
dbaplus Community
Boosting SQL Compliance to 95%: Harness Solves AI’s “Memory Loss” in Data Warehouse

1. AI‑Coding Pain Points in Data‑Warehouse Development

DeWu’s data‑warehouse teams use Claude Code with an IDE plugin. Three structural issues arise:

Context loss : When the token count reaches ~95 % of the limit, Claude Code compresses the conversation to a summary (≈12 % of tokens). Temporary constraints such as “amount field is in thousands” are discarded, leading to errors like a 1,000‑fold data discrepancy.

Inconsistent rule enforcement : Standards (OneData naming, three‑part comments, mandatory PARTITION clauses) are obeyed only 60‑70 % of the time under tight schedules; prompt‑based memory yields 70‑80 % compliance.

Context overflow in large projects : Lineage queries, self‑test result sets, and SKILL files quickly fill the context window, triggering more compact operations and low‑level mistakes.

The contradiction is that higher complexity drives greater reliance on AI, yet complexity also accelerates context overflow and “memory loss”.

2. What Is Harness?

In Claude Code’s update-config description, “Automated behaviors require hooks configured in settings.json — the harness executes these, not Claude.” Harness is the host runtime framework for Claude Code. It manages the context lifecycle, runs hooks deterministically outside the LLM inference loop, and coordinates sub‑agents.

Harness architecture diagram
Harness architecture diagram

3. What Does the Compact Mechanism Discard?

When the context reaches the compact threshold (≈95 % of token limit), Claude replaces the full conversation with a summary, reducing token count to ~12 % of the original. Lost information includes:

Temporary constraints (e.g., “use OVERWRITE mode”, “ignore field_a”).

Early steps of a SKILL file, causing Claude to re‑ask for missing details.

Large self‑test result sets and extensive lineage queries.

Consequently, complex data‑warehouse tasks suffer from missing constraints after compacting.

Compact loss illustration
Compact loss illustration

4. Five‑Layer Defense System

Layer 1 – Persistent .claude/CLAUDE.md Store iteration‑level constraints in this file. After each compact, the file is re‑injected, guaranteeing that critical information (field units, iteration constraints) never disappears.

# Current iteration status
## Developing‑table: db_a.dws_table_a
Version: V1.0
node_id: 1000000001
Status: ETL development step 3/8

## Iteration constraints
- Do not modify: dwd_table_b (read‑only)
- Partition field: partition_dt (format yyyyMMdd)
- amount field unit: thousand (not unit)

## Global warehouse rules
- Partition field must be partition_dt string
- Disallow SELECT *, UPDATE/DELETE without WHERE
- amount field must use DECIMAL(20,4), not DOUBLE
- INSERT must include PARTITION clause

Layer 2 – Auto Memory Claude automatically writes important notes to ~/.claude/projects/<project>/memory/MEMORY.md before compacting. Example prompts: “Remember that the amount field is in thousands” or “field_a may be null in this scenario”. The memory file is re‑injected after each compact.

Layer 3 – Hook‑Based Automatic Validation After every .sql file write, a PostToolUse hook runs validate_sql.sh . Violations (e.g., SELECT * or missing PARTITION ) cause the hook to exit with code 2, forcing Claude to correct the file before proceeding.

# .claude/hooks/validate_sql.sh
#!/bin/bash
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('tool_input',{}).get('file_path',''))")
[[ "$FILE_PATH" != *.sql ]] && exit 0
SQL=$(cat "$FILE_PATH")
ERRORS=()
# Rule 1: Disallow SELECT *
echo "$SQL" | grep -iqE 'SELECT\s+\*' && ERRORS+=("CRITICAL: SELECT * found, specify columns")
# Rule 2: INSERT must have PARTITION
if echo "$SQL" | grep -iqE 'INSERT\s+(INTO|OVERWRITE)'; then
  echo "$SQL" | grep -iqE 'PARTITION\s*\(' || ERRORS+=("CRITICAL: INSERT missing PARTITION clause")
fi
# Rule 3: Avoid DOUBLE for amount fields
echo "$SQL" | grep -iqE '\bDOUBLE\b' && ERRORS+=("WARNING: Use DECIMAL(20,4) for amount fields instead of DOUBLE")
# Rule 4: UPDATE/DELETE must have WHERE
if echo "$SQL" | grep -iqE '\b(UPDATE|DELETE)\b'; then
  echo "$SQL" | grep -iqE '\bWHERE\b' || ERRORS+=("CRITICAL: UPDATE/DELETE without WHERE clause")
fi
if [ ${#ERRORS[@]} -gt 0 ]; then
  echo "=== SQL CHECK FAILED: $FILE_PATH ===" >&2
  for err in "${ERRORS[@]}"; do echo "  $err" >&2; done
  exit 2
fi
echo "SQL check passed: $(basename $FILE_PATH)" >&2
exit 0

Layer 4 – Sub‑Agents for Context Isolation High‑token operations (lineage queries, large self‑test result sets, SKILL file reads) are delegated to dedicated sub‑agents. The main session only receives a concise summary, preventing compact triggers.

# .claude/agents/sql-validator.md
---
name: sql-validator
description: ODPS/MaxCompute SQL syntax validation and rule checking.
tools: Read, Bash, Grep, Glob
model: haiku
permissionMode: dontAsk
---
You are a data‑warehouse SQL compliance expert. Validate the following rules:
1. Disallow SELECT *
2. INSERT must include PARTITION
3. Use snake_case for column names
4. amount fields must be DECIMAL(20,4), not DOUBLE
5. JOIN must have ON clause
6. Detect Cartesian product risk
---
Output format:
- Status: PASS / FAIL
- Issue list (CRITICAL / WARNING / INFO)
- Line‑specific suggestions

Layer 5 – Refactored SKILL Files Instead of loading the entire SKILL file (≈10 KB) into the main context, the file is read inside a sub‑agent. The main session only receives the generated SQL file path.

# .claude/rules/etl-rules.md
paths:
  - "**/*insert*.sql"
  - "**/*_di.sql"
  - "**/*_df.sql"
---
ETL development rules (auto‑loaded by path):
- Must have partition_dt partition
- INSERT OVERWRITE must check partition existence first
- Disallow cross‑database JOIN

5. Practical Harness Architecture for the Data Warehouse

The architecture consists of three responsibility layers:

Persistence Layer : Stores constraints in .claude/CLAUDE.md, eliminating “memory loss”.

Harness Layer (Hooks) : Enforces rules deterministically after each write, raising compliance from ~70‑80 % to >95 %.

Sub‑Agent Layer : Isolates token‑heavy operations, reducing compact frequency by an estimated 50‑70 %.

Eight major workflow steps map onto these layers, ensuring low‑token steps stay in the main session while high‑token steps run in sub‑agents.

6. SKILL‑Based Data‑Warehouse Workflow Design

The eight SKILL steps (需求分析 → 技术设计 → ETL → 自测 → 数据比对 → SR 导入 → 性能优化 → SLA/DQC) remain unchanged; only the invocation method changes. Each step is triggered by a concise prompt that calls the appropriate sub‑agent or hook.

# Example for Step 3 (ETL Development)
Generate DDL and INSERT SQL for db_a.dws_table_a with:
- INSERT must use OVERWRITE mode
- PARTITION clause must contain partition_dt
- amount field must be DECIMAL(20,4) in thousands
After generation, run sql-validator sub‑agent to check both files.

7. Detailed Landing Steps

Create .claude/CLAUDE.md with current iteration status and constraints.

Configure .claude/settings.json and .claude/hooks/ to run validate_sql.sh after each .sql write and block_dangerous_ddl.sh before DDL execution.

Implement three core sub‑agents: sql-validator.md, dw-explorer.md, and data-quality-checker.md, moving heavy operations out of the main context.

8. Core Problems Harness Solves

Forgotten field units : After compact, Claude generated SQL that treated “amount” as units of 1 instead of 1,000, causing a 1,000× error. Harness stores the unit in CLAUDE.md and Auto Memory, eliminating the error.

Requirement misunderstanding : Mismatched business definitions (e.g., GMV from user view vs. transaction view) caused rework. Harness forces explicit requirement drafts and confirmation checklists into CLAUDE.md, raising first‑pass delivery from ~50 % to 90 %.

Inconsistent SQL rules : Missing PARTITION clauses persisted. Post‑tool hooks now enforce rules automatically, lifting compliance to >95 %.

Context exhaustion in large projects : Massive lineage queries and self‑test outputs overflowed the context window. Sub‑agents isolate these operations, cutting compact frequency by roughly half.

9. Quantitative Impact

SQL compliance increased from 70‑80 % (memory‑based) to >95 % (hook‑based).

Context compact triggers reduced by 50‑70 %.

Self‑test result handling shrank from full result sets to concise PASS/FAIL summaries.

Quantitative impact chart
Quantitative impact chart

In summary, Harness converts AI‑assisted data‑warehouse development from a fragile, conversation‑driven aid into a reliable, rule‑embedded pipeline where deterministic hooks and sub‑agents work together to achieve high accuracy and stability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

SQLAIautomationdata-warehouseHooksSubagents
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.