Data Development and Testing: Process, Key Concerns, and Quality Monitoring
This article outlines the data development lifecycle, distinguishes it from application development, details the responsibilities and focus areas for data testers, and presents a comprehensive end‑to‑end quality monitoring and alert system for big‑data pipelines.
1. Background
Big data and artificial intelligence are current and future focus areas for IT departments. New technologies can break profit bottlenecks and create growth points. Data middle platforms appear frequently in corporate reports, and demand for related technical positions exceeds supply, yet dedicated data testing roles are scarce.
We believe technology creates business value; big data must deliver its product (data) with quality control. Testers can seize this opportunity to transition into data testing, develop their own methodology, and participate in this technology‑driven wave.
2. What Data Development Does
2.1 Data Development Process
Obtain source data from business developers according to requirements.
Synchronize source data into tables on the data platform using appropriate tools.
Clean the data according to the model.
Write the cleaned results into result tables.
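The four steps above can be sketched end to end. This is a minimal illustration, not a production pipeline: the table names (`ods_orders`, `dwd_orders`) and cleaning rules (drop nulls, deduplicate on the key) are assumptions for the example.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for the data platform

# 1. Source data obtained from business developers.
source = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount":   [10.0, 20.0, 20.0, 30.0, None],
})

# 2. Synchronize the source data into a table on the platform.
source.to_sql("ods_orders", conn, index=False)

# 3. Clean according to the model: drop null amounts, deduplicate keys.
cleaned = (pd.read_sql("SELECT * FROM ods_orders", conn)
             .dropna(subset=["amount"])
             .drop_duplicates(subset=["order_id"]))

# 4. Write the cleaned result into the result table.
cleaned.to_sql("dwd_orders", conn, index=False)

print(len(pd.read_sql("SELECT * FROM dwd_orders", conn)))  # → 3
```

In a real pipeline the synchronization step would use a dedicated tool (e.g., a batch sync framework) and the cleaning step would run as scheduled SQL or Spark jobs; the shape of the flow is the same.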
2.2 Differences Between Data Development and Application Development
Data volume is large, leading to high resource consumption and performance challenges.
Frequent use of algorithms and models for tasks such as user tagging, marketing prediction, and risk control.
The final deliverable is data, not code; data must be generated after code deployment.
Primary languages are SQL, Python, and Java; SQL is most used, with Python/Java for data transformation functions.
Strong and broad business knowledge is required because data development touches all business applications.
3. What Testers Need to Focus On
3.1 Starting Point of Data Testing
Testers entering data development testing should prepare:
Continuous tracking of data flow, from source acquisition through cleaning rules to final destinations, supported by metadata management and model design documents.
Deep familiarity with business domains, e.g., user profiling involves account and order systems.
Proficiency in SQL, Python, and Java to construct input data, execute queries/functions, and verify output.
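"Construct input data, execute, verify output" is the core loop of data testing. A small sketch of that loop, assuming a hypothetical cleaning function `clean_phone` that normalizes 11-digit mobile numbers:

```python
import re

# Hypothetical cleaning function under test: strips non-digits and
# returns None for values that cannot be normalized.
def clean_phone(raw):
    digits = re.sub(r"\D", "", raw or "")
    return digits if len(digits) == 11 else None

# Construct inputs covering normal, dirty, and boundary cases,
# then verify each output against the cleaning rule.
cases = {
    "138-0013-8000": "13800138000",  # punctuation stripped
    "13800138000":   "13800138000",  # already clean
    "1380013800":    None,           # too short
    "":              None,           # empty input
}
for raw, expected in cases.items():
    assert clean_phone(raw) == expected
print("all cleaning cases passed")
```

The same pattern applies when the transformation is a SQL job rather than a Python function: insert constructed rows into the source table, run the job, and assert on the result table.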
3.2 Aspects of Data Testing
Accuracy
Completeness: data should be neither missing nor excessive after each transformation.
Correctness: data should meet business requirements or follow statistical patterns (e.g., the 80/20 rule, normal distribution).
Timeliness: data should be cleaned and produced within expected time frames.
Performance
Resource usage.
API performance of data interfaces.
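The completeness and correctness checks above can be expressed as reconciliation queries. A sketch, assuming illustrative table names (`ods_users`, `dwd_users`) and a sample business rule (valid age range):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE ods_users (user_id INTEGER, age INTEGER);
CREATE TABLE dwd_users (user_id INTEGER, age INTEGER);
INSERT INTO ods_users VALUES (1, 25), (2, 31), (3, 47);
INSERT INTO dwd_users VALUES (1, 25), (2, 31), (3, 47);
""")

# Completeness: row counts must match after a 1:1 transformation,
# i.e., data is neither missing nor excessive.
src = cur.execute("SELECT COUNT(*) FROM ods_users").fetchone()[0]
dst = cur.execute("SELECT COUNT(*) FROM dwd_users").fetchone()[0]
assert src == dst, f"missing or excessive rows: {src} vs {dst}"

# Correctness: values must satisfy the business rule.
bad = cur.execute(
    "SELECT COUNT(*) FROM dwd_users WHERE age NOT BETWEEN 0 AND 120"
).fetchone()[0]
assert bad == 0, f"{bad} rows violate the age rule"
print("completeness and correctness checks passed")
```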
4. Data Development Testing Process
4.1 Importance of Documentation Standards
Testing requires documentation similar to functional testing, including:
Product requirement documents.
Model design documents describing ETL rules.
Scheduling design documents detailing job cycles, order, dependencies, and timing.
Release documents indicating whether data re‑runs are needed.
Company‑level database design specifications to align source system changes.
Synchronization of business changes that affect data extraction logic.
4.2 Separation of Development and Production Environments
Multiple independent environments (at least development and production) are needed to avoid impacting live services, with appropriate permission controls for different roles.
4.3 Data Development Testing Workflow
1. Full Process
2. Data Development Flow
Key output documents:
ETL documentation describing cleaning rules.
Scheduling design documentation.
Release operation documentation.
3. Testing Phase – produces test cases and defects.
4. Release Phase – produces release email.
5. Online Defect Repair Process – unlike application development, data issues require correcting the data itself rather than simply rolling back code.
5. End‑to‑End Data Quality Monitoring and Alert System
After code reliability is ensured through testing, production data differs in scale and coverage, necessitating a comprehensive big‑data monitoring system to guarantee timeliness and accuracy.
5.1 Monitoring Compute Resources
Data collection and cleaning consume significant compute resources. Monitoring dimensions include:
CPU usage (e.g., alert when 80 of 100 CPU units are in use).
Memory usage.
Number of waiting jobs.
Network bandwidth for data transfer.
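These four dimensions reduce to threshold alerting. A minimal sketch, where the threshold values and metric names are illustrative assumptions:

```python
# Alert thresholds per monitoring dimension (illustrative values).
THRESHOLDS = {
    "cpu_used_pct":       80,  # e.g., 80 of 100 CPU units in use
    "memory_used_pct":    85,
    "waiting_jobs":       50,
    "bandwidth_used_pct": 90,
}

def check_resources(metrics):
    """Return alert messages for every metric over its threshold."""
    return [
        f"ALERT: {name}={value} exceeds threshold {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

# Example: 85% CPU trips the 80% threshold; the rest are healthy.
alerts = check_resources({"cpu_used_pct": 85, "memory_used_pct": 40,
                          "waiting_jobs": 3, "bandwidth_used_pct": 20})
print(alerts)
```

In practice the metrics would come from the cluster manager (e.g., YARN or Kubernetes) and the alerts would be routed to the channels described in section 5.4.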
5.2 Job Execution Monitoring
After deployment, jobs are scheduled; monitoring includes:
Job status – failures, execution time exceeding expectations, and timeliness violations.
Job logs – each node writes a log entry; missing logs trigger alerts.
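Both checks, job status and missing logs, can be folded into one pass over the scheduler's run records. A sketch with assumed field names and an assumed per-job time expectation:

```python
# Illustrative job-run records from the scheduler; field names are assumptions.
runs = [
    {"job": "sync_orders", "status": "success", "minutes": 12, "logged": True},
    {"job": "clean_users", "status": "failed",  "minutes": 5,  "logged": True},
    {"job": "agg_daily",   "status": "success", "minutes": 95, "logged": False},
]
EXPECTED_MINUTES = 60  # timeliness expectation (assumed uniform here)

def job_alerts(runs):
    alerts = []
    for r in runs:
        if r["status"] == "failed":
            alerts.append(f"{r['job']}: execution failed")
        if r["minutes"] > EXPECTED_MINUTES:
            alerts.append(f"{r['job']}: ran {r['minutes']} min, over expectation")
        if not r["logged"]:  # a node that wrote no log entry
            alerts.append(f"{r['job']}: missing log entry")
    return alerts

for alert in job_alerts(runs):
    print(alert)
```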
5.3 Table and Field Level Data Monitoring
During ingestion and cleaning, monitor data integrity, consistency, and accuracy at the table and field level, including:
Row counts and their periodic variance.
Null-rate and uniqueness checks on key fields.
Strong or weak monitoring policies enforced per table according to its importance.
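A sketch of these table- and field-level checks on in-memory rows; the tolerance (30% deviation from the recent mean) and the strong/weak split are illustrative assumptions, where a strong rule blocks downstream jobs and a weak rule only alerts:

```python
import statistics

# Field-level checks: null rate and key uniqueness (sample rows).
rows = [{"id": 1, "email": "a@x.com"},
        {"id": 2, "email": None},
        {"id": 2, "email": "b@x.com"}]

null_rate = sum(r["email"] is None for r in rows) / len(rows)
unique_ok = len({r["id"] for r in rows}) == len(rows)

# Table-level check: today's row count vs recent history (periodic variance).
history = [1000, 1020, 990, 1010]  # illustrative daily row counts
today = 400
mean = statistics.mean(history)
variance_breach = abs(today - mean) / mean > 0.3  # assumed 30% tolerance

# Strong policy: integrity failures block downstream jobs.
if variance_breach or not unique_ok:
    print("STRONG: block downstream jobs")
# Weak policy: soft quality drift only raises an alert.
if null_rate > 0.1:
    print("WEAK: alert only")
```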
5.4 Balancing Quality, Efficiency, and Cost
Given large data volumes, high compute power improves quality but raises cost. Strategies include:
Data asset rating to prioritize monitoring (A‑grade, B‑grade, etc.).
Scheduling platform prioritizes jobs based on asset rating.
Alert channels (voice, SMS, email, DingTalk) chosen according to severity and cost.
Data lifecycle management to purge obsolete tables and free storage.
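The rating and channel-selection strategies combine naturally: route each alert by asset grade and severity, falling back to the cheapest channel. The grade/severity/channel mapping below is an illustrative assumption, not a prescribed policy:

```python
# Map (asset rating, severity) to an alert channel, trading alert cost
# against asset importance. The mapping itself is illustrative.
CHANNELS = {
    ("A", "critical"): "voice",
    ("A", "warning"):  "SMS",
    ("B", "critical"): "SMS",
    ("B", "warning"):  "DingTalk",
}

def alert_channel(rating, severity):
    # Lower-grade assets and milder issues fall back to the cheapest channel.
    return CHANNELS.get((rating, severity), "email")

print(alert_channel("A", "critical"))  # → voice
print(alert_channel("C", "warning"))   # → email
```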
6. Conclusion
Technology must ultimately provide stable business services. By establishing robust processes, standards, execution, and monitoring, we can apply the same quality framework used in application development to data development, despite different tooling.
We welcome sharing of data testing practices.
Fulu Network R&D Team
The team shares technical articles from Fulu Holdings' engineers, promoting its technology through experience summaries, knowledge consolidation, and innovation sharing.