
Version Control for Data and Models Using DVC


Motivation

As a data scientist, you’re constantly iterating on datasets, model configurations, and code. Reproducing past experiments becomes difficult without a system for data version control that keeps data and models in sync with code changes.

Git excels at versioning source code, but it’s not well suited for tracking data and models, for two main reasons:

  • Storing big binaries inflates the repository and slows down Git operations.
  • Changes in binary files can’t be meaningfully tracked.

DVC (Data Version Control) fills this gap by extending Git to handle data and models efficiently. This article shows how to use DVC to:

  • Track and store datasets alongside Git-managed code
  • Build reproducible pipelines and log models with MLflow

The source code of this article can be found here:

What Is DVC?

DVC is an open-source tool that brings data and model versioning into Git workflows. Instead of storing bulky files directly in Git, DVC saves them in external storage and tracks their metadata with lightweight .dvc files.

If you’re comfortable with Git, DVC will feel familiar.

Install it using either of the following options:

  • Using pip:
pip install dvc
  • Using uv:
uv add dvc

Getting Started

Initialize DVC inside an existing Git repository:

dvc init

After running dvc init, DVC sets up the project with the necessary configuration to start tracking data. Your directory structure will look like this:

.
├── .dvc/                 # DVC config and internal files
├── .dvcignore            # Like .gitignore but for DVC operations
├── .git/                 # Git repository
└── (your project files)

Tracking Data

Assume you have a data/ directory with your raw files.

To start tracking it:

dvc add data/

This creates a data.dvc file with metadata like:

outs:
- md5: 86451bd526f5f95760f0b7a412508746.dir
  path: data

Then commit the metadata to Git:

git add data.dvc .gitignore
git commit -m "Track dataset with DVC"

When you run dvc add data/, DVC also creates or updates a .gitignore file to prevent Git from tracking the actual data/ directory.

The directory structure after running dvc add data/ looks like this:

.
├── data/                  # Contains your actual dataset (ignored by Git)
├── data.dvc               # Metadata file tracked by Git
└── .gitignore             # Contains an entry to ignore /data/

The .gitignore file will include an entry like:

data/

This ensures that only the lightweight .dvc metadata file is versioned, while the large data files are managed separately through DVC’s external storage system.
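
Under the hood, dvc add also copies the data into a local content-addressed cache keyed by the md5 hash recorded in the .dvc file. A rough sketch of the layout (the exact paths vary by DVC version; recent versions nest the cache under files/md5):

.dvc/cache/
└── files/
    └── md5/
        └── 86/
            └── 451bd526f5f95760f0b7a412508746.dir   # listing of the files inside data/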

Storing Data Remotely

DVC supports many storage backends like S3, GCS, Azure, SSH, and Google Drive.

To use Amazon S3:

  1. Make sure your AWS credentials are configured (e.g. with aws configure)
  2. Create or choose an existing S3 bucket (e.g. my-dvc-bucket)

Then configure the remote:

dvc remote add -d myremote s3://my-dvc-bucket/path/to/data

This saves a remote entry in .dvc/config:

[core]
    remote = myremote

['remote "myremote"']
    url = s3://my-dvc-bucket/path/to/data

Commit the config:

git add .dvc/config
git commit -m "Configure S3 remote for DVC"

Then push your data:

dvc push

The actual data goes to Amazon S3. The only files that stay in your Git repo are:

  • .dvc/: a directory that stores DVC configuration, cache, and metadata files
  • data.dvc: the metadata file tracking your raw data directory

Retrieving Data

Suppose you just joined a project that uses DVC to manage datasets and model files. After cloning the Git repository, you might only see .dvc files and pipeline definitions, but not the actual data content.

For example:

.
└── data/
    └── raw.dvc

The .dvc file contains metadata pointing to the data stored in a remote location. To download and restore the full dataset locally, simply run:

dvc pull

This command downloads the required files from the configured remote storage and rebuilds the full directory structure:

.
└── data/
    ├── final/
    │   └── segmented.csv
    ├── intermediate/
    │   └── scale_features.csv
    ├── raw/
    │   └── marketing_campaign.csv
    └── raw.dvc
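
If you only need part of the data, dvc pull also accepts specific targets, so you can skip large artifacts you don’t need locally:

dvc pull data/raw.dvc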

Switching Between Versions

Without a reliable workflow, it’s easy to accidentally pair the wrong version of code with the wrong version of data, leading to results you can’t reproduce or trust.

# Example of mismatch: the code expects 'feature' values of at most 3, but the data has changed
import pandas as pd

model_input = pd.read_csv("data.csv")
assert model_input['feature'].max() <= 3, "Feature values exceed expected range"

DVC’s dvc checkout command makes it easy to switch between data and model versions tied to specific Git commits or branches.

To demonstrate, let’s track and switch between two dataset versions.

First, create the initial version of the dataset using a Python script:

# example.py
import pandas as pd

# Create version 1 of dataset
df_v1 = pd.DataFrame({"feature": [1, 2, 3], "target": [0, 1, 0]})
df_v1.to_csv("data.csv", index=False)

From the terminal, run the script and track the dataset with DVC:

python example.py
dvc add data.csv
git add data.csv.dvc .gitignore
git commit -m "Version 1 of data"

Next, simulate an updated dataset version:

# example.py
import pandas as pd

# Overwrite data.csv with version 2
df_v2 = pd.DataFrame({"feature": [10, 20, 30], "target": [1, 0, 1]})
df_v2.to_csv("data.csv", index=False)

Track the updated dataset:

dvc add data.csv
git add data.csv.dvc
git commit -m "Version 2 of data"

Now switch back to version 1:

git checkout HEAD~1
dvc checkout

This restores data.csv to its original state:

feature  target
1        0
2        1
3        0

As shown, the dataset now reflects version 1 again, keeping your data aligned with the code at that point in history.
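
A common convention (not required by DVC) is to tag dataset versions in Git, so you can return to them by name instead of counting commits back:

git tag data-v1          # after committing version 1
git checkout data-v1     # later: jump back to that commit
dvc checkout             # sync data.csv to match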

Building a DVC Pipeline

Beyond tracking data, DVC allows you to create reproducible machine learning pipelines that connect stages like preprocessing and training.

Here’s an example of a pipeline defined in dvc.yaml with two stages: process_data and train.

stages:
  process_data:
    cmd: python src/process_data.py
    deps:
    - data/raw
    - src/process_data.py
    - config
    outs:
    - data/intermediate
  train:
    cmd: python src/segment.py
    deps:
    - data/intermediate
    - src/segment.py
    - config
    outs:
    - data/final
    - model/cluster.pkl

dvc.yaml defines each pipeline stage and how DVC should execute and track it. The file includes:

  • stages: A top-level section that holds each named stage of the pipeline
  • Each stage name (e.g. process_data, train) maps to one pipeline step
  • cmd: The command DVC should run for that stage
  • deps: Dependencies the stage needs, such as data files, Python scripts, or configuration files
  • outs: Outputs the stage generates, which DVC will version and manage automatically
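
You don’t have to write dvc.yaml by hand. In recent DVC versions, the same stages can be generated from the command line with dvc stage add; a sketch of the equivalent commands:

dvc stage add -n process_data \
  -d data/raw -d src/process_data.py -d config \
  -o data/intermediate \
  python src/process_data.py

dvc stage add -n train \
  -d data/intermediate -d src/segment.py -d config \
  -o data/final -o model/cluster.pkl \
  python src/segment.py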

Run the entire pipeline with:

dvc repro

DVC will only re-run stages whose inputs or dependencies have changed.

For example, changing the code in src/segment.py will trigger only the affected stage:

# src/segment.py
import pandas as pd
from sklearn.decomposition import PCA

def get_pca_model(data: pd.DataFrame) -> PCA:
    pca = PCA(n_components=4)  # changed from 3 to 4
    pca.fit(data)
    return pca

Then run:

dvc repro

DVC output:

'data/raw.dvc' didn't change, skipping
Stage 'process_data' didn't change, skipping
Running stage 'train':
> python src/segment.py

Only the train stage is re-executed. This targeted re-execution improves pipeline efficiency and preserves consistency.
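
To preview which stages are stale before running anything, you can use dvc status, which compares the pipeline against its tracked dependencies:

dvc status

This prints output along these lines (the exact format varies by version):

train:
        changed deps:
                modified:           src/segment.py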

You can visualize the pipeline with:

dvc dag

This shows a graph of your pipeline stages and their relationships.

+--------------+ 
| data/raw.dvc | 
+--------------+ 
        *        
        *        
        *        
+--------------+ 
| process_data | 
+--------------+ 
        *        
        *        
        *        
    +-------+    
    | train |    
    +-------+    

DVC vs MLflow: Roles and Integration

DVC and MLflow both support reproducibility in machine learning projects, but they target different aspects of the workflow:

  • DVC focuses on version-controlling datasets, models, and pipelines. It ensures consistency and scalability by integrating tightly with Git and external storage.
  • MLflow is built for experiment management—logging parameters, metrics, and artifacts to compare different runs.

Key Differences

Feature          | DVC                              | MLflow
Primary Focus    | Data and model file versioning   | Experiment tracking and model registry
Storage          | External file systems (S3, etc.) | Local filesystem, S3, Azure, GCS
Git Integration  | Yes                              | Optional
Pipeline Support | Yes (dvc.yaml)                   | No
Metrics Tracking | Basic (.json, .tsv)              | Extensive (via mlflow.log_*)
Web UI           | No (third-party only)            | Yes (mlflow ui)
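
For comparison, DVC’s “basic” metrics tracking works by declaring metric files in dvc.yaml. A minimal sketch, assuming the train stage writes a metrics.json:

stages:
  train:
    cmd: python src/segment.py
    metrics:
    - metrics.json:
        cache: false

You can then inspect the values with dvc metrics show, or compare them across commits with dvc metrics diff.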


How to Integrate DVC and MLflow

You can combine DVC and MLflow for a powerful, end-to-end MLOps workflow:

  • Use DVC to manage and version datasets, models, and pipeline outputs
  • Use MLflow to log parameters, metrics, and artifacts for experiment tracking and comparison

This code shows how to integrate DVC and MLflow in a pipeline stage defined in segment.py.

import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature

# Save artifacts to disk so DVC can version and track them
save_data_and_model(data, model, config)

# Log run metadata, metrics, and models with MLflow
with mlflow.start_run():
    mlflow.log_params({"n_components": 3, "random_state": 42, "best_k": k_best})
    mlflow.log_metric("silhouette_score", silhouette_avg)
    signature = infer_signature(pca_df, pred)
    mlflow.sklearn.log_model(
        model, "kmeans_model", signature=signature, input_example=pca_df.head()
    )
    mlflow.log_artifact(config.final.path, "processed_data")

Reproduce the pipeline stages and sync the outputs to remote storage with DVC:

dvc repro
dvc push

This integration ensures data and code are reproducible via DVC, while experiment metadata is logged in MLflow.
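
To browse the logged runs, start MLflow’s built-in UI (assuming the default local tracking store):

mlflow ui

Then open http://127.0.0.1:5000 to compare parameters and metrics across runs.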

Automating DVC with Git Hooks

Without automation, it’s easy to forget critical DVC steps—like running dvc push after updating data, or dvc checkout after switching Git branches.

DVC integrates with pre-commit to streamline automation of DVC operations like syncing, pushing, and restoring data.

You can configure DVC’s pre-commit integration automatically with:

dvc install --use-pre-commit-tool

Or manually add the following to your .pre-commit-config.yaml:

repos:
- repo: https://github.com/iterative/dvc
  rev: 3.59.2
  hooks:
  - id: dvc-pre-commit
    additional_dependencies:
    - .[all]
    language_version: python3
    stages:
    - pre-commit
  - id: dvc-pre-push
    additional_dependencies:
    - .[all]
    language_version: python3
    stages:
    - pre-push
  - id: dvc-post-checkout
    additional_dependencies:
    - .[all]
    language_version: python3
    stages:
    - post-checkout
    always_run: true

The pre-commit framework only installs the pre-commit hook by default. To enable the pre-push and post-checkout hooks, run:

pre-commit install --hook-type pre-push --hook-type post-checkout --hook-type pre-commit

Try It Out

Once you’ve set up Git hooks (manually or using pre-commit), you can verify they work with this simple test:

  1. Modify a tracked file (e.g. update a notebook or a script).
  2. Stage and commit the change:
git add src/train.py
git commit -m "Update training logic"

If the pre-commit hook is installed correctly, you’ll see output showing the result of running dvc status, such as:

DVC pre-commit....................................Passed
- hook id: dvc-pre-commit
- duration: 0.4s

process_data:                                                         
        changed deps:
                modified:           config
train:
        changed deps:
                modified:           config

  3. Push the commit:
git push

If the pre-push hook is active, you’ll see the output from dvc push, such as:

DVC pre-push....................................Passed

This confirms that the pre-push hook is uploading data and models using dvc push.

Summary

DVC bridges the gap between Git and data versioning:

  • dvc add: track data or models
  • dvc repro: run the pipeline and only re-execute the changed stages
  • dvc push: store files in remote storage
  • dvc pull: retrieve files from remote
  • dvc checkout: sync data versions with Git history
