DVC: Data Version Control for Machine Learning Projects
DVC is an open‑source data version control system that extends Git to manage large machine‑learning models, datasets, and pipelines, enabling reproducible experiments, low‑friction branching, metric tracking, and seamless collaboration across various storage backends.
DVC Overview
DVC (Data Version Control) was created to make machine‑learning models shareable and reproducible. It is designed to handle large files, datasets, ML models, metrics, and code.
ML Project Version Control
DVC connects code, data, and models, storing file contents in remote locations such as Amazon S3, Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or local disk.
Having complete code and data provenance helps track the full evolution of each ML model, guaranteeing reproducibility and easy switching between experiments.
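The basic versioning workflow can be sketched with a few DVC commands; the dataset path and S3 bucket name below are hypothetical:

```shell
# Initialize DVC inside an existing Git repository
git init && dvc init

# Track a large dataset with DVC; Git tracks only the small .dvc meta-file
dvc add data/images.zip
git add data/images.zip.dvc data/.gitignore
git commit -m "Track raw image dataset with DVC"

# Configure a default remote (hypothetical bucket) and upload file contents
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
```

A teammate who clones the repository runs `dvc pull` to download the exact data version recorded in the current commit.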
ML Experiment Management
Leverage the full power of Git branches to try different ideas instead of relying on ad‑hoc file suffixes or comments. Automatic metric tracking replaces paper‑and‑pencil logs.
DVC is built to keep branching as simple and fast as Git, regardless of data file size. First‑class metrics and ML pipelines give projects a cleaner structure, making it easy to compare ideas, select the best, and speed up iteration with cached intermediate artifacts.
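An experiment-per-branch workflow, sketched under the assumption that the model file and branch name are hypothetical, looks like this:

```shell
# Each experiment lives on its own Git branch; dvc checkout swaps the
# large data and model files in the workspace to match the current commit
git checkout -b try-larger-model
dvc checkout

# ...edit parameters and retrain, then record code and data state together
dvc add models/model.pkl
git add models/model.pkl.dvc
git commit -m "Experiment: larger model"
```

Switching back with `git checkout main && dvc checkout` restores the previous model instantly from the local cache, regardless of file size.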
Deployment and Collaboration
Use push/pull commands to move consistent bundles of ML models, data, and code to production, remote machines, or teammates' computers, replacing ad-hoc transfer scripts.
DVC introduces lightweight pipelines as first‑class citizens in Git. Language‑agnostic pipelines connect multiple steps into a DAG, reducing friction when moving code to production.
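A two-step pipeline could be declared as follows (the script and file names are hypothetical; any language or binary can serve as a step's command):

```shell
# Define pipeline stages with their dependencies and outputs
dvc stage add -n featurize \
    -d src/featurize.py -d data/raw.csv \
    -o data/features.csv \
    python src/featurize.py data/raw.csv data/features.csv

dvc stage add -n train \
    -d src/train.py -d data/features.csv \
    -o models/model.pkl \
    python src/train.py data/features.csv models/model.pkl

# Run the whole DAG end-to-end; stages whose inputs and code
# are unchanged are restored from cache instead of re-executed
dvc repro
```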
Key Features
Git Compatibility
DVC runs on top of any Git repository and works with standard Git servers and providers (GitHub, GitLab, etc.). Data files can be shared via network-accessible storage or any supported cloud solution, offering all the benefits of distributed version control: lock-free operation, local branching, and versioning.
Storage‑Agnostic
Supports Amazon S3, Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, and local disk, with the list of supported remote storage types continuously expanding.
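A project can register several remotes side by side; the bucket, container, and host names below are hypothetical:

```shell
# A few of the supported remote storage types
dvc remote add s3remote  s3://my-bucket/dvc-store
dvc remote add azremote  azure://my-container/dvc-store
dvc remote add gsremote  gs://my-bucket/dvc-store
dvc remote add sshremote ssh://user@example.com/srv/dvc-store
dvc remote add localnas  /mnt/nas/dvc-store

# Mark one remote as the default used by dvc push/pull
dvc remote default s3remote
```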
Reproducibility
A single dvc repro command reproduces an experiment end-to-end, because DVC consistently maintains the combination of input data, configuration, and the code originally used to run it.
Low‑Friction Branching
DVC fully supports instant Git branching even for large files. Branches naturally reflect the non‑linear, highly iterative nature of ML workflows, allowing many experiments to share the same data version while preserving full history.
Metric Tracking
Metrics are first‑class citizens in DVC. A built‑in command lists all branches and their metrics, helping track progress and select the best version.
ML Pipeline Framework
DVC provides a built‑in way to connect ML steps into a DAG and run the entire pipeline end‑to‑end. Intermediate results are cached, so unchanged inputs or code skip re‑execution.
Language & Framework Agnostic
Regardless of the programming language or library used (Python, R, Julia, Scala/Spark, custom binaries, notebooks, TensorFlow, PyTorch, etc.), reproducibility and pipelines depend only on input and output files or directories.
HDFS, Hive, and Apache Spark
DVC can include Spark and Hive jobs alongside local ML steps, or manage Spark and Hive jobs end-to-end. Breaking a heavy cluster job into smaller DVC pipeline steps shortens the feedback loop and allows each step to be iterated on independently.
Fault Tracking
Recording failed attempts preserves knowledge that can inspire future ideas and saves time; DVC tracks everything in a reproducible, accessible manner.
Use Cases
Save and Reproduce Experiments
At any time, retrieve the full content of your or a teammate’s experiment. DVC ensures all files and metrics are consistent, enabling you to copy an experiment or use it as a baseline for new iterations.
Version Control Models and Data
DVC stores meta‑files in Git (instead of Google Docs) to describe and control versions of datasets and models, supporting multiple external storage types as remote caches for large files.
Workflow for Deployment and Collaboration
DVC defines rules and processes for teams to work efficiently and consistently, serving as a protocol for collaboration, result sharing, and obtaining/running completed models in production environments.