Data Development and Testing: Process, Key Concerns, and Quality Monitoring
This article outlines the data development lifecycle, distinguishes it from application development, details the responsibilities and focus areas for data testers, and presents a comprehensive end‑to‑end quality monitoring and alert system for big‑data pipelines.
1. Background
Big data and artificial intelligence are current and future focus areas for IT departments. New technologies can break profit bottlenecks and create growth points. Data middle platforms appear frequently in corporate reports, and demand for related technical positions exceeds supply, yet dedicated data testing roles are scarce.
We believe technology creates business value; big data must deliver its product (data) with quality control. Testers can seize this opportunity to transition into data testing, develop their own methodology, and participate in this technology‑driven wave.
2. What Data Development Does
2.1 Data Development Process
Obtain source data from business developers according to requirements.
Synchronize source data into tables on the data platform using appropriate tools.
Clean the data according to the model.
Write the cleaned results into result tables.
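The four steps above can be sketched end to end. This is a minimal illustration, not a production pipeline: the table names (`ods_orders`, `dwd_orders`) and cleaning rules (drop nulls, deduplicate on the key) are assumptions for the example.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for the data platform

# 1. Source data obtained from business developers.
source = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount":   [10.0, 20.0, 20.0, 30.0, None],
})

# 2. Synchronize the source data into a table on the platform.
source.to_sql("ods_orders", conn, index=False)

# 3. Clean according to the model: drop null amounts, deduplicate keys.
cleaned = (pd.read_sql("SELECT * FROM ods_orders", conn)
             .dropna(subset=["amount"])
             .drop_duplicates(subset=["order_id"]))

# 4. Write the cleaned result into the result table.
cleaned.to_sql("dwd_orders", conn, index=False)

print(len(pd.read_sql("SELECT * FROM dwd_orders", conn)))  # → 3
```

In a real pipeline the synchronization step would use a dedicated tool (e.g., a batch sync framework) and the cleaning step would run as scheduled SQL or Spark jobs; the shape of the flow is the same.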
2.2 Differences Between Data Development and Application Development
Data volume is large, leading to high resource consumption and performance challenges.
Frequent use of algorithms and models for tasks such as user tagging, marketing prediction, and risk control.
The final deliverable is data, not code; data must be generated after code deployment.
Primary languages are SQL, Python, and Java; SQL is most used, with Python/Java for data transformation functions.
Strong and broad business knowledge is required because data development touches all business applications.
3. What Testers Need to Focus On
3.1 Starting Point of Data Testing
Testers entering data development testing should prepare:
Continuous tracking of data flow, from source acquisition through cleaning rules to final destinations, supported by metadata management and model design documents.
Deep familiarity with business domains, e.g., user profiling involves account and order systems.
Proficiency in SQL, Python, and Java to construct input data, execute queries/functions, and verify output.
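"Construct input data, execute, verify output" is the core loop of data testing. A small sketch of that loop, assuming a hypothetical cleaning function `clean_phone` that normalizes 11-digit mobile numbers:

```python
import re

# Hypothetical cleaning function under test: strips non-digits and
# returns None for values that cannot be normalized.
def clean_phone(raw):
    digits = re.sub(r"\D", "", raw or "")
    return digits if len(digits) == 11 else None

# Construct inputs covering normal, dirty, and boundary cases,
# then verify each output against the cleaning rule.
cases = {
    "138-0013-8000": "13800138000",  # punctuation stripped
    "13800138000":   "13800138000",  # already clean
    "1380013800":    None,           # too short
    "":              None,           # empty input
}
for raw, expected in cases.items():
    assert clean_phone(raw) == expected
print("all cleaning cases passed")
```

The same pattern applies when the transformation is a SQL job rather than a Python function: insert constructed rows into the source table, run the job, and assert on the result table.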
3.2 Aspects of Data Testing
Accuracy
Completeness: data should be neither missing nor excessive after each transformation.
Correctness: data should meet business requirements or follow statistical patterns (e.g., the 80/20 rule, normal distribution).
Timeliness: data should be cleaned and produced within expected time frames.
Performance
Resource usage.
API performance of data interfaces.
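The completeness and correctness checks above can be expressed as reconciliation queries. A sketch, assuming illustrative table names (`ods_users`, `dwd_users`) and a sample business rule (valid age range):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE ods_users (user_id INTEGER, age INTEGER);
CREATE TABLE dwd_users (user_id INTEGER, age INTEGER);
INSERT INTO ods_users VALUES (1, 25), (2, 31), (3, 47);
INSERT INTO dwd_users VALUES (1, 25), (2, 31), (3, 47);
""")

# Completeness: row counts must match after a 1:1 transformation,
# i.e., data is neither missing nor excessive.
src = cur.execute("SELECT COUNT(*) FROM ods_users").fetchone()[0]
dst = cur.execute("SELECT COUNT(*) FROM dwd_users").fetchone()[0]
assert src == dst, f"missing or excessive rows: {src} vs {dst}"

# Correctness: values must satisfy the business rule.
bad = cur.execute(
    "SELECT COUNT(*) FROM dwd_users WHERE age NOT BETWEEN 0 AND 120"
).fetchone()[0]
assert bad == 0, f"{bad} rows violate the age rule"
print("completeness and correctness checks passed")
```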
4. Data Development Testing Process
4.1 Importance of Documentation Standards
Testing requires documentation similar to functional testing, including:
Product requirement documents.
Model design documents describing ETL rules.
Scheduling design documents detailing job cycles, order, dependencies, and timing.
Release documents indicating whether data re‑runs are needed.
Company‑level database design specifications to align source system changes.
Synchronization of business changes that affect data extraction logic.
4.2 Separation of Development and Production Environments
Multiple independent environments (at least development and production) are needed to avoid impacting live services, with appropriate permission controls for different roles.
4.3 Data Development Testing Workflow
1. Full Process
2. Data Development Flow
Key output documents:
ETL documentation describing cleaning rules.
Scheduling design documentation.
Release operation documentation.
3. Testing Phase – produces test cases and defects.
4. Release Phase – produces release email.
5. Online Defect Repair Process – unlike application development, data issues require correcting the data itself rather than simply rolling back code.
5. End‑to‑End Data Quality Monitoring and Alert System
After code reliability is ensured through testing, production data differs in scale and coverage, necessitating a comprehensive big‑data monitoring system to guarantee timeliness and accuracy.
5.1 Monitoring Compute Resources
Data collection and cleaning consume significant compute resources. Monitoring dimensions include:
CPU usage (e.g., alert when 80 of 100 CPU units are in use).
Memory usage.
Number of waiting jobs.
Network bandwidth for data transfer.
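These four dimensions reduce to threshold alerting. A minimal sketch, where the threshold values and metric names are illustrative assumptions:

```python
# Alert thresholds per monitoring dimension (illustrative values).
THRESHOLDS = {
    "cpu_used_pct":       80,  # e.g., 80 of 100 CPU units in use
    "memory_used_pct":    85,
    "waiting_jobs":       50,
    "bandwidth_used_pct": 90,
}

def check_resources(metrics):
    """Return alert messages for every metric over its threshold."""
    return [
        f"ALERT: {name}={value} exceeds threshold {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

# Example: 85% CPU trips the 80% threshold; the rest are healthy.
alerts = check_resources({"cpu_used_pct": 85, "memory_used_pct": 40,
                          "waiting_jobs": 3, "bandwidth_used_pct": 20})
print(alerts)
```

In practice the metrics would come from the cluster manager (e.g., YARN or Kubernetes) and the alerts would be routed to the channels described in section 5.4.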
5.2 Job Execution Monitoring
After deployment, jobs are scheduled; monitoring includes:
Job status – failures, execution time exceeding expectations, and timeliness violations.
Job logs – each node writes a log entry; missing logs trigger alerts.
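Both checks, job status and missing logs, can be folded into one pass over the scheduler's run records. A sketch with assumed field names and an assumed per-job time expectation:

```python
# Illustrative job-run records from the scheduler; field names are assumptions.
runs = [
    {"job": "sync_orders", "status": "success", "minutes": 12, "logged": True},
    {"job": "clean_users", "status": "failed",  "minutes": 5,  "logged": True},
    {"job": "agg_daily",   "status": "success", "minutes": 95, "logged": False},
]
EXPECTED_MINUTES = 60  # timeliness expectation (assumed uniform here)

def job_alerts(runs):
    alerts = []
    for r in runs:
        if r["status"] == "failed":
            alerts.append(f"{r['job']}: execution failed")
        if r["minutes"] > EXPECTED_MINUTES:
            alerts.append(f"{r['job']}: ran {r['minutes']} min, over expectation")
        if not r["logged"]:  # a node that wrote no log entry
            alerts.append(f"{r['job']}: missing log entry")
    return alerts

for alert in job_alerts(runs):
    print(alert)
```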
5.3 Table and Field Level Data Monitoring
During ingestion and cleaning, monitor data integrity, consistency, and accuracy at the table and field level, including:
Row counts and their periodic variance.
Null-rate and uniqueness checks on key fields.
Strong or weak monitoring policies enforced per table according to its importance.
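A sketch of these table- and field-level checks on in-memory rows; the tolerance (30% deviation from the recent mean) and the strong/weak split are illustrative assumptions, where a strong rule blocks downstream jobs and a weak rule only alerts:

```python
import statistics

# Field-level checks: null rate and key uniqueness (sample rows).
rows = [{"id": 1, "email": "a@x.com"},
        {"id": 2, "email": None},
        {"id": 2, "email": "b@x.com"}]

null_rate = sum(r["email"] is None for r in rows) / len(rows)
unique_ok = len({r["id"] for r in rows}) == len(rows)

# Table-level check: today's row count vs recent history (periodic variance).
history = [1000, 1020, 990, 1010]  # illustrative daily row counts
today = 400
mean = statistics.mean(history)
variance_breach = abs(today - mean) / mean > 0.3  # assumed 30% tolerance

# Strong policy: integrity failures block downstream jobs.
if variance_breach or not unique_ok:
    print("STRONG: block downstream jobs")
# Weak policy: soft quality drift only raises an alert.
if null_rate > 0.1:
    print("WEAK: alert only")
```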
5.4 Balancing Quality, Efficiency, and Cost
Given large data volumes, high compute power improves quality but raises cost. Strategies include:
Data asset rating to prioritize monitoring (A‑grade, B‑grade, etc.).
Scheduling platform prioritizes jobs based on asset rating.
Alert channels (voice, SMS, email, DingTalk) chosen according to severity and cost.
Data lifecycle management to purge obsolete tables and free storage.
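The rating and channel-selection strategies combine naturally: route each alert by asset grade and severity, falling back to the cheapest channel. The grade/severity/channel mapping below is an illustrative assumption, not a prescribed policy:

```python
# Map (asset rating, severity) to an alert channel, trading alert cost
# against asset importance. The mapping itself is illustrative.
CHANNELS = {
    ("A", "critical"): "voice",
    ("A", "warning"):  "SMS",
    ("B", "critical"): "SMS",
    ("B", "warning"):  "DingTalk",
}

def alert_channel(rating, severity):
    # Lower-grade assets and milder issues fall back to the cheapest channel.
    return CHANNELS.get((rating, severity), "email")

print(alert_channel("A", "critical"))  # → voice
print(alert_channel("C", "warning"))   # → email
```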
6. Conclusion
Technology must ultimately provide stable business services. By establishing robust processes, standards, execution, and monitoring, we can apply the same quality framework used in application development to data development, despite different tooling.
We welcome sharing of data testing practices.
Fulu Network R&D Team
The team shares technical articles from Fulu Holdings' engineers, promoting its technology through experience summaries, knowledge consolidation, and innovation sharing.