Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration
This article explores agile data engineering, advocating code‑as‑infrastructure practices such as managing everything as code, reusing data and code strategically, and running ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.
As data becomes increasingly integral to enterprises, data technologies have advanced rapidly, yet many data developers still face painful project development cycles, including complex calculations, hundreds of SQL statements, and pipeline delays.
The root cause often lies in insufficient data engineering practices. To scale enterprise data projects, robust data engineering practices must accompany evolving data technologies.
Data engineering is essentially software engineering applied to the data development domain, combining software construction knowledge, data technologies, and tooling to build complex data products. Agile data engineering extends agile software development principles to data projects, promoting iterative, collaborative, and measurable processes.
Key "code‑as‑everything" practices include:
Configuration as code – storing configuration files in version control.
Infrastructure as code – defining infrastructure declaratively, e.g. with Kubernetes YAML manifests or Terraform HCL configurations.
Pipeline as code – defining CI/CD pipelines in code, e.g. Jenkins pipelines written in Groovy, or GitHub Actions, GitLab CI, CircleCI, and Travis CI configured with YAML.
These practices enable traceable change history, easy rollback, and seamless integration with developers' daily workflows such as code reviews.
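The payoff of keeping configuration in version control can be made concrete with a small sketch. The snippet below shows "configuration as code": a job configuration lives in a version-controlled file and is validated on load, so every change goes through code review and can be rolled back like any commit. The field names and the `EtlJobConfig` type are illustrative assumptions, not taken from any specific tool.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class EtlJobConfig:
    name: str
    schedule: str  # cron expression
    retries: int

def load_config(text: str) -> EtlJobConfig:
    """Parse and validate a job config; fails fast on bad changes before deploy."""
    raw = json.loads(text)
    cfg = EtlJobConfig(**raw)
    if cfg.retries < 0:
        raise ValueError("retries must be non-negative")
    return cfg

# In practice this JSON would live in its own file under version control.
cfg = load_config('{"name": "daily_sales", "schedule": "0 2 * * *", "retries": 3}')
print(cfg.name)  # -> daily_sales
```

Because the config is plain text in the repository, a bad change is just one `git revert` away, and reviewers see the diff exactly as they would for application code.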
In data development, many resources can be codified: infrastructure, security configurations, ETL code, ETL task configurations, data pipelines, operational scripts, and business annotations. For example, infrastructure can be managed with Terraform or Kubernetes YAML manifests; security policies can be applied via APIs; ETL logic can be enhanced with tools like Easy SQL, which supports variables, logging, assertions, debugging, and an Include directive for modularity.
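To make the ETL ergonomics concrete, here is a hedged sketch of the ideas the text attributes to tools like Easy SQL: variables substituted into SQL, a shared snippet standing in for an Include mechanism, and a runtime assertion on the result. This is not Easy SQL's actual syntax, just a plain-Python illustration of the same concepts against an in-memory SQLite database.

```python
import sqlite3

# Stands in for an included file of shared SQL fragments.
SHARED_FILTERS = {"active_only": "status = 'active'"}

def render(sql_template: str, variables: dict, includes: dict) -> str:
    """Substitute variables and shared snippets into a SQL template."""
    return sql_template.format(**variables, **includes)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "active"), (2, "inactive"), (3, "active")])

sql = render("SELECT COUNT(*) FROM {table} WHERE {active_only}",
             variables={"table": "users"}, includes=SHARED_FILTERS)
count = conn.execute(sql).fetchone()[0]
# A runtime assertion on ETL output, in the spirit of Easy SQL's assertions.
assert count > 0, "expected at least one active user"
print(count)  # -> 2
```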
Data reuse (layered data warehouses such as ODS, DWD, and dimension layers) offers high-level sharing but suffers from reduced flexibility and difficulty tracing computation across layers. Code reuse—through functions, file includes, database views, and materialized views—provides finer‑grained modularity and better traceability.
Choosing a reuse strategy depends on workload characteristics: heavy‑weight ETL jobs benefit from data‑centric reuse to control resources, while lightweight jobs can adopt code‑centric reuse for flexibility. When uncertain, prioritize code‑centric reuse.
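The contrast between the two reuse styles can be sketched in a few lines of SQLite. Data-centric reuse materializes a shared intermediate table (a DWD-style layer) that consumers read; code-centric reuse shares the defining query instead, here as a view, so each consumer's full computation remains traceable. The table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, cancelled INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, 0), (2, 5.0, 1), (3, 7.5, 0)])

# Data-centric reuse: compute the cleaned layer once, consumers read the table.
conn.execute(
    "CREATE TABLE dwd_orders AS SELECT id, amount FROM orders WHERE cancelled = 0")

# Code-centric reuse: share the definition, not the data.
conn.execute(
    "CREATE VIEW v_valid_orders AS SELECT id, amount FROM orders WHERE cancelled = 0")

total_from_table = conn.execute("SELECT SUM(amount) FROM dwd_orders").fetchone()[0]
total_from_view = conn.execute("SELECT SUM(amount) FROM v_valid_orders").fetchone()[0]
print(total_from_view)  # -> 17.5
```

Both paths return the same answer, but the trade-off differs: the materialized table pays storage and refresh cost to spare heavy recomputation, while the view keeps every consumer's logic inspectable end to end, matching the guidance above to prefer code-centric reuse for lightweight jobs.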
ETL‑level continuous integration addresses the inefficiency of monolithic pipelines. By parameterizing CI pipelines (e.g., a Jenkins parameter for the ETL file path) or building dedicated CI pipelines per ETL, teams can test and deploy only the affected ETL, reducing execution time and improving safety.
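A minimal sketch of the selection step in ETL-level CI: rather than re-running the whole monolithic pipeline, the CI job receives (or computes, e.g. from `git diff --name-only`) the set of changed files and tests or deploys only the affected ETLs. The directory layout, file extension, and function name are illustrative assumptions.

```python
from pathlib import PurePosixPath

def affected_etls(changed_files: list[str], etl_dir: str = "etl/") -> list[str]:
    """Return only the ETL files touched by a change set."""
    return sorted(
        f for f in changed_files
        if f.startswith(etl_dir) and PurePosixPath(f).suffix == ".sql"
    )

changed = ["etl/daily_sales.sql", "docs/readme.md", "etl/user_dim.sql"]
to_test = affected_etls(changed)
print(to_test)  # -> ['etl/daily_sales.sql', 'etl/user_dim.sql']
```

In a parameterized pipeline, the same effect is achieved by passing the ETL file path as a build parameter; either way, only the changed ETL's tests run, which is what shortens execution time and limits the blast radius of a deployment.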
Implementing ETL‑level CI requires version tracking for each ETL, such as writing a version file to production during deployment.
In summary, adopting agile data engineering practices—code‑as‑everything, strategic reuse, and ETL‑level CI—can significantly improve data product delivery quality, while future work includes automated ETL testing, shorter ETL files, and end‑to‑end data capability teams.