Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration
This article explores agile data engineering, advocating code‑as‑infrastructure practices such as managing everything as code, reusing data and code strategically, and running ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.
As data becomes increasingly integral to enterprises, data technologies have advanced rapidly, yet many data developers still face painful project development cycles, including complex calculations, hundreds of SQL statements, and pipeline delays.
The root cause often lies in insufficient data engineering practices. To scale enterprise data projects, robust data engineering practices must accompany evolving data technologies.
Data engineering is essentially software engineering applied to the data development domain, combining software construction knowledge, data technologies, and tooling to build complex data products. Agile data engineering extends agile software development principles to data projects, promoting iterative, collaborative, and measurable processes.
Key "code‑as‑everything" practices include:
Configuration as code – storing configuration files in version control.
Infrastructure as code – defining infrastructure declaratively, e.g. with Kubernetes YAML manifests or Terraform HCL configurations.
Pipeline as code – defining CI/CD pipelines in code, e.g. Jenkins pipelines written in Groovy, or GitHub Actions, GitLab CI, CircleCI, and Travis CI configured with YAML.
These practices enable traceable change history, easy rollback, and seamless integration with developers' daily workflows such as code reviews.
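The payoff of keeping configuration in version control can be made concrete with a small sketch. The snippet below shows "configuration as code": a job configuration lives in a version-controlled file and is validated on load, so every change goes through code review and can be rolled back like any commit. The field names and the `EtlJobConfig` type are illustrative assumptions, not taken from any specific tool.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class EtlJobConfig:
    name: str
    schedule: str  # cron expression
    retries: int

def load_config(text: str) -> EtlJobConfig:
    """Parse and validate a job config; fails fast on bad changes before deploy."""
    raw = json.loads(text)
    cfg = EtlJobConfig(**raw)
    if cfg.retries < 0:
        raise ValueError("retries must be non-negative")
    return cfg

# In practice this JSON would live in its own file under version control.
cfg = load_config('{"name": "daily_sales", "schedule": "0 2 * * *", "retries": 3}')
print(cfg.name)  # -> daily_sales
```

Because the config is plain text in the repository, a bad change is just one `git revert` away, and reviewers see the diff exactly as they would for application code.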
In data development, many resources can be codified: infrastructure, security configurations, ETL code, ETL task configurations, data pipelines, operational scripts, and business annotations. For example, infrastructure can be managed with Terraform or Kubernetes YAML manifests; security policies can be applied via APIs; ETL logic can be enhanced with tools like Easy SQL, which supports variables, logging, assertions, debugging, and an Include directive for modularity.
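To make the ETL ergonomics concrete, here is a hedged sketch of the ideas the text attributes to tools like Easy SQL: variables substituted into SQL, a shared snippet standing in for an Include mechanism, and a runtime assertion on the result. This is not Easy SQL's actual syntax, just a plain-Python illustration of the same concepts against an in-memory SQLite database.

```python
import sqlite3

# Stands in for an included file of shared SQL fragments.
SHARED_FILTERS = {"active_only": "status = 'active'"}

def render(sql_template: str, variables: dict, includes: dict) -> str:
    """Substitute variables and shared snippets into a SQL template."""
    return sql_template.format(**variables, **includes)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "active"), (2, "inactive"), (3, "active")])

sql = render("SELECT COUNT(*) FROM {table} WHERE {active_only}",
             variables={"table": "users"}, includes=SHARED_FILTERS)
count = conn.execute(sql).fetchone()[0]
# A runtime assertion on ETL output, in the spirit of Easy SQL's assertions.
assert count > 0, "expected at least one active user"
print(count)  # -> 2
```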
Data reuse (layered data warehouses such as ODS, DWD, and dimension layers) offers high-level sharing but suffers from reduced flexibility and difficulty tracing computation across layers. Code reuse—through functions, file includes, database views, and materialized views—provides finer‑grained modularity and better traceability.
Choosing a reuse strategy depends on workload characteristics: heavy‑weight ETL jobs benefit from data‑centric reuse to control resources, while lightweight jobs can adopt code‑centric reuse for flexibility. When uncertain, prioritize code‑centric reuse.
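The contrast between the two reuse styles can be sketched in a few lines of SQLite. Data-centric reuse materializes a shared intermediate table (a DWD-style layer) that consumers read; code-centric reuse shares the defining query instead, here as a view, so each consumer's full computation remains traceable. The table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, cancelled INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, 0), (2, 5.0, 1), (3, 7.5, 0)])

# Data-centric reuse: compute the cleaned layer once, consumers read the table.
conn.execute(
    "CREATE TABLE dwd_orders AS SELECT id, amount FROM orders WHERE cancelled = 0")

# Code-centric reuse: share the definition, not the data.
conn.execute(
    "CREATE VIEW v_valid_orders AS SELECT id, amount FROM orders WHERE cancelled = 0")

total_from_table = conn.execute("SELECT SUM(amount) FROM dwd_orders").fetchone()[0]
total_from_view = conn.execute("SELECT SUM(amount) FROM v_valid_orders").fetchone()[0]
print(total_from_view)  # -> 17.5
```

Both paths return the same answer, but the trade-off differs: the materialized table pays storage and refresh cost to spare heavy recomputation, while the view keeps every consumer's logic inspectable end to end, matching the guidance above to prefer code-centric reuse for lightweight jobs.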
ETL‑level continuous integration addresses the inefficiency of monolithic pipelines. By parameterizing CI pipelines (e.g., a Jenkins parameter for the ETL file path) or building dedicated CI pipelines per ETL, teams can test and deploy only the affected ETL, reducing execution time and improving safety.
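A minimal sketch of the selection step in ETL-level CI: rather than re-running the whole monolithic pipeline, the CI job receives (or computes, e.g. from `git diff --name-only`) the set of changed files and tests or deploys only the affected ETLs. The directory layout, file extension, and function name are illustrative assumptions.

```python
from pathlib import PurePosixPath

def affected_etls(changed_files: list[str], etl_dir: str = "etl/") -> list[str]:
    """Return only the ETL files touched by a change set."""
    return sorted(
        f for f in changed_files
        if f.startswith(etl_dir) and PurePosixPath(f).suffix == ".sql"
    )

changed = ["etl/daily_sales.sql", "docs/readme.md", "etl/user_dim.sql"]
to_test = affected_etls(changed)
print(to_test)  # -> ['etl/daily_sales.sql', 'etl/user_dim.sql']
```

In a parameterized pipeline, the same effect is achieved by passing the ETL file path as a build parameter; either way, only the changed ETL's tests run, which is what shortens execution time and limits the blast radius of a deployment.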
Implementing ETL‑level CI requires version tracking for each ETL, such as writing a version file to production during deployment.
In summary, adopting agile data engineering practices—code‑as‑everything, strategic reuse, and ETL‑level CI—can significantly improve data product delivery quality, while future work includes automated ETL testing, shorter ETL files, and end‑to‑end data capability teams.