
Building Data Production Pipelines with DataOps: Concepts, Practices, and a Six‑Stage Workflow

This article introduces DataOps, outlines its background and the problems it addresses, describes NetEase's big-data product ecosystem, and details a six-stage data production pipeline (coding, orchestration, testing, code review, release approval, and deployment), along with insights from two pipeline explorations.

DataFunSummit

DataOps, first proposed in 2014, was incorporated into Gartner's data-management maturity model in 2018 and gained further traction in 2022, when the China Academy of Information and Communications Technology (CAICT) launched a DataOps standards working group to promote industry development.

In March 2023, NetEase DataFun and the China Academy of Information and Communications Technology co-established a DataOps innovation incubator to define standards and explore practical applications.

Gartner: DataOps is a collaborative data‑management practice focused on improving communication, integration, and automation of data flow between data managers and consumers.

IBM: DataOps combines people, processes, and technology to rapidly deliver trustworthy, high‑quality data to data citizens.

Wikipedia: DataOps merges comprehensive, process‑oriented data viewpoints with agile software‑engineering automation to enhance quality, speed, and collaboration, fostering continuous improvement in data analytics.

DataOps aims to solve several recurring issues: high manual dependency, slow demand response, inefficient development and operations, fragmented workflows, and difficult team collaboration.

NetEase's big-data journey began in 2006 with distributed databases and search; Hadoop was adopted in 2009, and the Mammoth platform and NetEase YouShu launched in 2014. Big-data services were commercialized in 2017, a unified data mid-platform was introduced in 2018, the "Data Production Power" concept followed in 2020, and the Data Governance 2.0 solution arrived in 2022.

The current product matrix spans from low‑level data computation and storage to a full DataOps‑driven lifecycle covering data integration, development, testing, and operations, as well as governance capabilities such as data maps, quality, standards, model design, and asset management, plus reporting, machine‑learning platforms, and CDP tagging.

Why is a DataOps pipeline needed? Real-world incidents, such as production losses exceeding 300,000 CNY caused by upstream task changes, or missing configuration leading to erroneous data and unintended rewards, show that 65% of production issues stem from changes to data-development tasks.

Root causes include lack of full‑link impact analysis, missing release controls, insufficient automated testing, and fragile task dependencies.

Six stages of the DataOps pipeline:

1. Coding – Writing data‑processing code, handling new business scenarios, task modifications, and rollbacks, while leveraging code‑search, versioning, and shared resources (parameters, UDFs).

2. Orchestration – Building task‑dependency DAGs, with intelligent dependency‑recommendation based on code analysis and input‑table extraction.

3. Testing – Conducting data‑shape inspection, data comparison, sandbox isolation, code scanning, and mandatory testing to ensure data correctness before release.

4. Code Review – Senior architects or peers review business logic, warehouse standards, security, performance, and perform code diff for efficient approval.

5. Release Approval – Formal approval of SQL, scheduling, dependencies, and output tables, involving QA and data architects, with automated approval for low‑risk changes.

6. Deployment – Prioritized Yarn scheduling, baseline alerts, accelerators, intelligent diagnostics, freeze‑pool for abnormal tasks, and one‑click recovery.
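The intelligent dependency recommendation described in the orchestration stage can be sketched as follows. This is a minimal illustration under assumed structures, not NetEase's actual implementation: a hypothetical `TASKS` catalog maps each task to its output table and SQL, input tables are extracted with a crude regular-expression scan, and upstream tasks are suggested by matching inputs against other tasks' outputs.

```python
import re

# Hypothetical catalog: task name -> (output table, SQL text)
TASKS = {
    "dws_orders": ("dws.orders_daily",
                   "INSERT OVERWRITE TABLE dws.orders_daily "
                   "SELECT * FROM ods.orders o JOIN dim.users u ON o.uid = u.uid"),
    "ods_orders": ("ods.orders", "LOAD DATA ..."),
    "dim_users": ("dim.users", "INSERT OVERWRITE TABLE dim.users SELECT ..."),
}

def extract_input_tables(sql: str) -> set[str]:
    """Collect table names that appear after FROM or JOIN (a crude SQL scan;
    a production system would use a real SQL parser)."""
    return set(re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE))

def recommend_upstreams(task: str) -> set[str]:
    """Suggest tasks whose output tables the given task reads."""
    _, sql = TASKS[task]
    inputs = extract_input_tables(sql)
    return {name for name, (out, _) in TASKS.items()
            if out in inputs and name != task}

print(recommend_upstreams("dws_orders"))  # the two tasks producing its input tables
```

A real implementation would resolve views, temporary tables, and dynamic SQL, but the core idea is the same: derive DAG edges from code rather than asking developers to declare them by hand.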

Two pipeline explorations are discussed: (1) supporting multiple environments (test, production) for financial customers via a release center that packages resources, supports online publishing and pulling, and provides version comparison; (2) customizable release workflows to further improve data quality.
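The packaging and version-comparison idea behind the release center can be illustrated with a short sketch. The manifest format and function names here are hypothetical, not the release center's API: each release is reduced to a manifest of content hashes, and two manifests are diffed to classify resources as added, removed, or changed.

```python
import hashlib

def package_release(files: dict[str, str]) -> dict[str, str]:
    """Build a manifest mapping each resource path to a content hash."""
    return {path: hashlib.sha256(content.encode()).hexdigest()
            for path, content in files.items()}

def compare_versions(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify resources as added, removed, or changed between two releases."""
    return {
        "added":   sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }

v1 = package_release({"etl/orders.sql": "SELECT 1", "conf/job.yaml": "retries: 3"})
v2 = package_release({"etl/orders.sql": "SELECT 2", "udf/mask.py": "def mask(x): ..."})
print(compare_versions(v1, v2))
# {'added': ['udf/mask.py'], 'removed': ['conf/job.yaml'], 'changed': ['etl/orders.sql']}
```

Hashing content rather than comparing timestamps makes the diff deterministic, which matters when the same package is pulled into multiple environments.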

FAQ

Q1: How to test data correctness? – Use historical developer experience, cross‑validation, or configure rules in NetEase’s Data Quality Center for continuous monitoring.
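The rule-based monitoring mentioned in this answer can be sketched in a few lines. This is a toy illustration of the concept, not the Data Quality Center's actual rule schema: each configured rule checks a table snapshot for a minimum row count and per-column null ratios.

```python
# Toy rule-based data check; the rule schema is hypothetical.
def check_table(rows: list[dict], rules: dict) -> list[str]:
    """Return a list of rule violations for a table snapshot."""
    violations = []
    min_rows = rules.get("min_rows", 0)
    if len(rows) < min_rows:
        violations.append(f"row count {len(rows)} below minimum {min_rows}")
    for col, max_null_ratio in rules.get("max_null_ratio", {}).items():
        nulls = sum(1 for r in rows if r.get(col) is None)
        ratio = nulls / len(rows) if rows else 1.0
        if ratio > max_null_ratio:
            violations.append(f"{col}: null ratio {ratio:.2f} exceeds {max_null_ratio}")
    return violations

rows = [{"uid": 1, "amount": 9.9}, {"uid": 2, "amount": None}]
rules = {"min_rows": 10, "max_null_ratio": {"amount": 0.1}}
print(check_table(rows, rules))  # both rules fire: too few rows, too many nulls
```

In practice such rules run continuously after each task execution, so a regression is caught at the producing task rather than discovered downstream.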

Q2: When to use one‑click freeze? – For scenarios where upstream failures cause downstream blockage or subtle data loss that is hard to detect, freezing prevents error propagation.

Tags: Data Engineering, Big Data, data pipeline, data quality, DataOps
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
