30-Item Data Warehouse Development Checklist for Trustworthy, Efficient Data
This checklist compiles 30 actionable items covering model design, data consistency, performance, quality, metadata governance, cost efficiency, and collaboration to help data warehouse teams build trustworthy, high‑performance, and maintainable data pipelines.
1. Model Design and Layering (Avoid Duplication, Ensure Reusability)
All tables must belong to a specific layer (ODS/DWD/DWS/ADS); prohibit layer-skipping reads (e.g., ADS querying ODS directly).
The DWS layer should be organized by business domain (user, transaction, product, etc.) to avoid catch-all wide tables that mix unrelated business logic.
Core metrics such as GMV or DAU should be defined exactly once, in the DWS layer; the ADS layer should contain only simple SELECT statements and never re-implement the logic (see the sketch after this list).
Avoid complex JOINs or window functions in the ADS layer; move shared logic down to DWD/DWS.
Before creating a new table, search DataHub or the asset catalog to prevent duplicate assets.
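To make the "define once in DWS, select only in ADS" rule concrete, here is a minimal PySpark sketch; the table and column names (dws_trade_summary_1d, ads_gmv_daily, gmv, order_cnt) are hypothetical placeholders, not part of the checklist.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# GMV and order_cnt are computed exactly once, upstream in the DWS layer.
# The ADS job below only projects and filters -- no joins, no re-derived metrics.
spark.sql("""
    INSERT OVERWRITE TABLE ads_gmv_daily PARTITION (dt = '2024-01-01')
    SELECT gmv, order_cnt
    FROM dws_trade_summary_1d
    WHERE dt = '2024-01-01'
""")
```

If an ADS job needs anything more than projection and filtering, that is the signal to push the logic down into DWD/DWS instead.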
2. Data Consistency (Unified Definitions)
All core metrics must have written definitions (calculation logic, filter conditions) stored in Confluence or Yuque.
Metric definitions should be marked as “official” and linked to the specific DWS table to prevent ad‑hoc calculations.
Automatically validate that same-named metrics (e.g., "GMV") are consistent across reports, and trigger an alert when the deviation exceeds a threshold (see the sketch after this list).
When metric definitions change, notify all downstream owners and assess impact via a change‑review mechanism.
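One way to automate the same-named-metric check above is a small reconciliation job. The sketch below assumes two hypothetical report tables (rpt_sales_daily, rpt_finance_daily) that both expose a gmv column, and a 1% alert threshold.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

THRESHOLD = 0.01  # alert when same-named metrics deviate by more than 1%

def fetch_gmv(table: str, dt: str) -> float:
    # Both report tables are hypothetical; adapt names and filters to your stack.
    row = spark.sql(f"SELECT SUM(gmv) AS gmv FROM {table} WHERE dt = '{dt}'").first()
    return float(row.gmv or 0.0)

sales_gmv = fetch_gmv("rpt_sales_daily", "2024-01-01")
finance_gmv = fetch_gmv("rpt_finance_daily", "2024-01-01")

# Relative deviation, guarded against divide-by-zero on empty partitions.
deviation = abs(sales_gmv - finance_gmv) / max(sales_gmv, finance_gmv, 1.0)
if deviation > THRESHOLD:
    # Replace with your real alerting channel (email, on-call, IM bot, ...).
    raise ValueError(f"GMV mismatch across reports: {deviation:.2%} deviation")
```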
3. Task Performance and Stability (SLA)
Partitioned tables must use appropriate partition columns (e.g., dt) so queries prune partitions instead of scanning the full table.
Each partition should contain ≤ 1 billion rows to prevent data skew.
Regularly merge small files (<128 MB) to improve Hive/Spark read efficiency.
Set a reasonable spark.sql.shuffle.partitions for shuffle-heavy jobs (a common heuristic: total shuffle data size / 128 MB; see the shuffle-partition sketch after this list).
Configure Airflow SLA monitoring for critical tasks so timeout alerts fire automatically (see the Airflow sketch after this list).
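The shuffle-partition heuristic from the list can be captured in a small helper. This is a sketch: the 128 MB target is the checklist's rule of thumb, and the 500 GB input size is only an illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # aim for ~128 MB per shuffle partition

def recommended_shuffle_partitions(total_shuffle_bytes: int) -> int:
    # The checklist heuristic: total data size / 128 MB, with a floor of 1.
    return max(1, total_shuffle_bytes // TARGET_PARTITION_BYTES)

# Illustration only: a job shuffling roughly 500 GB -> 4000 partitions.
n = recommended_shuffle_partitions(500 * 1024**3)
spark.conf.set("spark.sql.shuffle.partitions", str(n))
```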
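The Airflow SLA item, sketched with the Airflow 2.x API: a task-level sla plus a DAG-level sla_miss_callback. The DAG and task names are hypothetical, and the callback body is a placeholder for your real alerting channel.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Airflow invokes this on every SLA miss; wire it to your alerting channel.
    print(f"SLA missed for tasks: {task_list}")

with DAG(
    dag_id="dwd_orders_daily",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",        # daily at 02:00
    sla_miss_callback=notify_sla_miss,
    catchup=False,
) as dag:
    BashOperator(
        task_id="build_dwd_orders",
        bash_command="spark-submit dwd_orders.py",  # placeholder command
        sla=timedelta(hours=2),           # alert if not finished within 2 hours
    )
```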
4. Data Quality (Trustworthiness)
The ODS layer should monitor fluctuations in source record counts (e.g., beyond ±20%) to detect upstream interruptions.
The DWD layer must enforce primary-key uniqueness and non-null completeness of required fields, using tools such as Great Expectations (see the sketch after this list).
DWS/ADS layers should validate metric reasonableness (e.g., GMV ≥ 0) to catch negative values or abnormal spikes.
Failing quality checks should block downstream tasks to prevent data contamination.
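A sketch of these checks using the classic pandas-backed Great Expectations API (the pre-1.0 ge.from_pandas style); the path and column names are hypothetical, and the GMV bound folds in the reasonableness check as well. Raising on failure is what lets the scheduler block downstream tasks.

```python
import great_expectations as ge
import pandas as pd

# Load the DWD output to validate (path is a hypothetical placeholder).
df = ge.from_pandas(pd.read_parquet("dwd_orders/dt=2024-01-01"))

results = [
    df.expect_column_values_to_be_unique("order_id"),       # primary-key uniqueness
    df.expect_column_values_to_not_be_null("user_id"),      # required-field completeness
    df.expect_column_min_to_be_between("gmv", min_value=0), # reasonableness: GMV >= 0
]

if not all(r.success for r in results):
    # Fail loudly so the scheduler refuses to run downstream tasks.
    raise RuntimeError("DWD quality checks failed; blocking downstream tasks")
```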
5. Metadata and Governance (Maintainability)
All tables and columns must carry COMMENT metadata; block deployment of any table that lacks comments (see the sketch after this list).
Each table must have an assigned Owner; enforce via DataHub or table‑creation standards.
Deploy automatic lineage collection (e.g., DataHub) to support rapid impact analysis.
Identify and clean “zombie tables” unused for 90 days; schedule monthly scans.
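The COMMENT rule can be enforced as a small pre-deploy gate. Here is a sketch using PySpark's catalog API, assuming Hive-style tables registered in the metastore; the table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def assert_fully_commented(table: str) -> None:
    # spark.catalog.listColumns exposes each column's COMMENT via .description;
    # an empty or missing description means the column was never documented.
    missing = [c.name for c in spark.catalog.listColumns(table) if not c.description]
    if missing:
        raise SystemExit(f"{table}: columns missing COMMENT: {missing}")

# Run as a CI gate over every table touched by a change set (names hypothetical).
for table in ["dwd_orders", "dws_trade_summary_1d"]:
    assert_fully_commented(table)
```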
6. Cost and Efficiency (ROI)
Define table lifecycle policies (e.g., retain ODS data for 30 days) so cold data is archived or deleted automatically (see the sketch after this list).
Monitor per‑task resource consumption (CPU, memory, I/O) to spot inefficient jobs.
Encourage self‑service BI data extraction to reduce ad‑hoc development workload.
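A sketch of the lifecycle item: list a table's dt partitions and drop those older than the retention window. The ods_orders table name is hypothetical, and 30 days is just the checklist's example value.

```python
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

RETENTION_DAYS = 30  # the checklist's ODS example: retain 30 days
cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()

# Partition values come back as strings like "dt=2024-01-01", so a plain
# lexicographic comparison works for zero-padded ISO dates.
for row in spark.sql("SHOW PARTITIONS ods_orders").collect():
    dt_value = row.partition.split("=", 1)[1]
    if dt_value < cutoff:
        spark.sql(
            f"ALTER TABLE ods_orders DROP IF EXISTS PARTITION (dt = '{dt_value}')"
        )
```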
7. Collaboration and Process (Sustainability)
Score new requirements using an evaluation form (impact, reusability, cost) to reject one‑off reports.
Manage SQL scripts in Git for version control, rollback, and code review.
Maintain a data dictionary and synchronize it with business stakeholders to lower communication cost.
Conduct regular data issue retrospectives (e.g., monthly governance meetings) for continuous process improvement.
Clearly differentiate “real‑time estimates” from “offline final values” to avoid business confusion.
Usage Scenarios
New project kickoff – use the checklist as a development gate.
Team weekly meetings – randomly sample five items for continuous improvement.
New‑hire onboarding – adopt the checklist as training material.
Incident post‑mortem – trace which items were missed during the failure.