Why Data Quality Is the Key to Successful Big Data Initiatives
While big data construction aims to boost organizational insight and enable innovation, its real value depends on high data quality. This article outlines the national and industry standards for evaluating quality, identifies the technical, business, and management causes of poor quality, and proposes a three-phase strategy of pre-emptive prevention, in-process monitoring, and post-improvement to keep data reliable for decision-making.
The goal of big data construction is widely understood: integrate an organization's data, strengthen insight and competitiveness, and enable business innovation and industry upgrading. Improving data quality both consolidates those gains and resolves cases where data fails to meet business needs. Data quality is not merely a technical problem; it surfaces in business and management processes as well, so understanding the industry, the organization, and the business is essential. As "Data Doctor" Jim Barker notes, simple tools and rules can solve about 80% of data quality problems; whether the remaining share justifies a complex system depends on the quality standard the organization actually needs.
Big Data Industry Background and Status
In 2014, big data was written into China's government work report for the first time, marking the start of the policy era; subsequent years brought national top-level designs and strategic plans, culminating in the 2019 "new infrastructure" push and the declared shift from being a country with big data to a country strong in data.
A 2019 Huawei survey on digital transformation found that only 5% of enterprises were still on the sidelines, while 31% were planning, 36% piloting, and 26% fully deploying; in other words, 95% had begun digital transformation in some form.
After six years of rapid growth, big data development falls into two stages: the first covers data collection, governance, and value exploration; the second, value realization. Many governments and enterprises have completed the first stage and are now moving into the second, pursuing business innovation and industry upgrading.
Nevertheless, big data initiatives still face real difficulties: lack of overall planning, insufficient support from senior leadership, departmental silos, limited technical capability, and above all unproven business value. That core issue in turn undermines leadership commitment and funding, which is why digital transformation must be value-driven.
Data Quality Issues in Big Data Development
Data quality is a prerequisite for realizing data value: when quality is not assured, decisions based on flawed data can be disastrous, eroding confidence in big data itself.
Research by Wang Zhihong of Harbin Institute of Technology shows how poor data quality misleads decisions and causes harmful outcomes:
50% of data warehouse projects are cancelled or delayed because of data quality problems.
Data errors cause economic losses of roughly 6% of U.S. GDP annually.
In the U.S., data-related medical errors contribute to approximately 98,000 deaths each year.
In telecommunications, data errors delay fault resolution, drive unnecessary equipment rentals, and produce billing mistakes that damage operators' reputations.
U.S. retail loses $2.5 billion annually to pricing errors.
Credit-card fraud enabled by data quality issues cost $4.8 billion in 2008.
Jim Barker classifies data quality problems into two types:
Type 1: Simple, obvious issues detectable by automated tools.
Type 2: Hidden issues that require specific contexts to detect and cannot be handled by tools alone.
Type 1 problems involve checking completeness, consistency, uniqueness, and validity (e.g., an invalid gender code). Type 2 problems involve timeliness, consistency, and accuracy that often require domain knowledge (e.g., a retired employee still listed as active).
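To make Type 1 concrete, here is a minimal sketch of rule-based checks for validity, completeness, and uniqueness; the column names and the allowed code set are illustrative assumptions, not taken from the article.

```python
import pandas as pd

# Illustrative employee records; the fields and codes are assumptions.
records = pd.DataFrame({
    "employee_id": [101, 102, 102, 104],
    "gender":      ["M", "F", "X", None],  # "X" is outside the agreed code set
})

VALID_GENDER_CODES = {"M", "F"}

# Type 1 checks: simple, obvious, and fully automatable.
issues = {
    # Validity: values outside the agreed code set (the invalid gender code above).
    "invalid_gender": records[records["gender"].notna()
                              & ~records["gender"].isin(VALID_GENDER_CODES)],
    # Completeness: required fields left empty.
    "missing_gender": records[records["gender"].isna()],
    # Uniqueness: duplicated primary keys.
    "duplicate_ids": records[records["employee_id"].duplicated(keep=False)],
}

for name, bad_rows in issues.items():
    print(f"{name}: {len(bad_rows)} row(s)")

# A Type 2 check (e.g., "is this employee actually retired?") cannot be written
# as a standalone rule; it needs outside context such as current HR status.
```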
Resolving data quality issues requires a complex, strategic approach that combines automation with manual effort.
According to Barker:
Type 1 issues account for about 80% of data quality problems while consuming only about 20% of the budget.
Type 2 problems need multi‑party input and expert analysis to identify and correct.
National Standard Data Quality Evaluation Indicators
The most authoritative standard is GB/T 36344-2018 (ICS 35.240.01), which defines the following evaluation dimensions:
Normativity: the degree to which data conforms to data standards, data models, business rules, metadata, or reference data.
Completeness: the degree to which required data elements are populated.
Accuracy: the degree to which data correctly represents real-world entities.
Consistency: freedom from contradiction with other data in a given context.
Timeliness: the degree to which data is current with respect to changes over time.
Accessibility: the ease with which data can be accessed.
Industry practice adds further dimensions such as Uniqueness, Stability, and Trustworthiness.
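As a rough illustration of how such dimensions can be scored in practice (GB/T 36344 defines indicators, not code), here is a minimal sketch; the dataset, field names, phone-number pattern, and freshness window are all assumptions.

```python
import re
import pandas as pd

# Illustrative customer records; all field names are assumptions.
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C003"],
    "phone":       ["13800138000", "1380013", None, "13900139000"],
    "updated_at":  pd.to_datetime(["2024-01-10", "2023-02-01",
                                   "2024-03-05", "2024-03-05"]),
})

# Completeness: share of required fields that are populated.
completeness = df["phone"].notna().mean()

# Normativity: share of values matching an agreed format (11-digit mobile number).
pattern = re.compile(r"^1\d{10}$")
normativity = df["phone"].dropna().map(lambda v: bool(pattern.match(v))).mean()

# Uniqueness: share of records whose primary key is not duplicated.
uniqueness = (~df["customer_id"].duplicated(keep=False)).mean()

# Timeliness: share of records refreshed within an assumed 12-month window.
cutoff = pd.Timestamp("2024-03-31") - pd.DateOffset(months=12)
timeliness = (df["updated_at"] >= cutoff).mean()

print(f"completeness={completeness:.2f} normativity={normativity:.2f} "
      f"uniqueness={uniqueness:.2f} timeliness={timeliness:.2f}")
```

Scores like these are most useful when tracked over time per table and field, so that a drop signals an upstream problem rather than merely describing one.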
Causes of Data Quality Problems
Big data construction involves many complex steps—business analysis, standard definition, metadata management, data modeling, data aggregation, cleaning, storage, cataloging, sharing, maintenance, and deprecation. Errors at any stage cause data quality issues, and sometimes the source data itself is erroneous.
Technical factors include:
Poor data standard definition leading to inconsistent input and biased data.
Flawed data model design causing storage chaos, duplication, incompleteness, or inaccuracy.
Source data quality issues that are not cleaned during ingestion.
Inadequate data profiling before collection.
Incorrect data collection parameters and processes.
Improper data cleaning, transformation, and loading rules.
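The data-profiling point in particular is cheap to address: a quick profile of a source extract before collection surfaces unexpected types, high null ratios, and near-constant columns while they are still easy to fix. A minimal sketch (the file path and its columns are assumptions):

```python
import pandas as pd

# Profile a source extract before collection; the path is an assumption.
src = pd.read_csv("source_extract.csv")

profile = pd.DataFrame({
    "dtype":      src.dtypes.astype(str),  # does the type match expectations?
    "null_ratio": src.isna().mean(),       # how complete is each field?
    "distinct":   src.nunique(),           # near-constant or key-like columns
})
print(profile.sort_values("null_ratio", ascending=False))
```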
Business factors include:
Insufficient business understanding.
Changes in business processes affecting all downstream data handling.
Non‑standard data entry (case, width, special characters, etc.).
Siloed information systems.
Data falsification for performance metrics.
Management factors include:
Lack of talent and internal data‑management expertise.
Incomplete process management and no unified workflow for data quality issues.
Organizational culture that undervalues data quality.
Unclear accountability and reward/punishment mechanisms for data quality.
How to Solve Quality Problems
Big data projects must treat data quality as a professional, complex engineering challenge. Directly using raw data for business processes carries high risk because a single erroneous record can cause severe business impact.
The solution is divided into three phases: pre‑emptive prevention, in‑process monitoring, and post‑improvement.
Pre‑emptive Prevention
Establish a quality‑management mechanism with defined responsibilities, authority, indicators, and remediation processes.
Define data quality standards that combine national, industry, and organizational requirements.
Develop quality‑monitoring models that reflect business needs.
Create monitoring rules covering normativity, completeness, accuracy, consistency, timeliness, accessibility, etc.
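As an illustration of how such rules can be made machine-readable rather than living only in documents, here is a minimal sketch of a declarative rule set; the rule schema, table names, fields, and thresholds are assumptions.

```python
# A declarative rule set covering several of the dimensions above.
# The schema, tables, fields, and thresholds are illustrative assumptions.
MONITORING_RULES = [
    {"table": "customer", "field": "customer_id", "dimension": "completeness",
     "check": "not_null"},
    {"table": "customer", "field": "customer_id", "dimension": "uniqueness",
     "check": "unique"},
    {"table": "customer", "field": "phone", "dimension": "normativity",
     "check": "regex", "pattern": r"^1\d{10}$"},
    {"table": "orders", "field": "amount", "dimension": "accuracy",
     "check": "range", "min": 0, "max": 1_000_000},
    {"table": "orders", "field": "updated_at", "dimension": "timeliness",
     "check": "max_age_days", "value": 1},
]
```

Keeping rules as data rather than hard-coded checks lets business and governance staff review and extend them without touching pipeline code.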
In‑process Monitoring
Monitor raw data quality at the source, separating "good" from "bad" data and feeding issues back for correction at the source (a minimal sketch of this split follows this list).
Monitor data‑center quality using checks such as null, range, logic, consistency, multi‑source comparison, data profiling, and outlier detection.
Provide timely feedback of identified issues to source owners and warehouse teams.
Assess data quality regularly to raise awareness and drive corrective actions.
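A minimal sketch of the good/bad split mentioned in the first item above; the function, fields, and code sets are assumptions.

```python
import pandas as pd

def split_good_bad(df, required, valid_codes):
    """Split records into 'good' and 'bad' using simple source-side checks."""
    bad = df[required].isna().any(axis=1)     # completeness: required fields present
    for field, codes in valid_codes.items():  # validity: values in agreed code sets
        bad |= df[field].notna() & ~df[field].isin(codes)
    return df[~bad], df[bad]

# Illustrative usage; the fields and codes are assumptions.
raw = pd.DataFrame({
    "order_id": ["O1", "O2", None],
    "status":   ["PAID", "??", "PAID"],
})
good, bad = split_good_bad(raw, required=["order_id"],
                           valid_codes={"status": {"PAID", "SHIPPED"}})
print(f"{len(good)} good, {len(bad)} bad")  # bad rows go back to the source owner
```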
Post‑Improvement
Repair identified data quality problems through manual work, tickets, or automation (a simple automated repair is sketched after this list).
Collect data‑quality requirements from business units to create a feedback loop.
Continuously refine management policies, standards, monitoring models, and rules as business evolves.
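For repairs that can be safely automated, normalizing entry artifacts (the case, width, and special-character problems listed under business factors) is a common target. A minimal sketch, with the normalization rules assumed rather than prescribed:

```python
import unicodedata

def normalize_entry(value: str) -> str:
    """Automated repair for common entry artifacts; the rules are assumptions."""
    value = unicodedata.normalize("NFKC", value)  # full-width -> half-width
    value = " ".join(value.split())               # collapse stray whitespace
    return value.upper()                          # one agreed case convention

print(normalize_entry("　ａｂｃ  123 "))  # -> "ABC 123"
```

Riskier repairs (amounts, identifiers, anything with legal weight) are better routed through the ticket-and-review path than auto-corrected.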
Conclusion
To succeed in big data initiatives, organizations must prioritize data quality by fully understanding business needs, mastering the entire data‑construction lifecycle, and adopting a high‑level perspective to identify and resolve quality issues.