DVC: Data Version Control Tool for Your Machine Learning Projects

Motivation

As a data scientist, it’s common to experiment with various combinations of code, data, and models. To ensure that past experiments can be reproduced, it’s crucial to version control all of these elements.

Git is a great tool for version controlling code, but it is not ideal for versioning data and models due to two main issues:

  • Git is not designed to handle large files: Git is optimized for text files. Storing large binary files in Git can significantly increase repository size and slow down Git commands.
  • Difficult to track changes: When you change a binary file, Git cannot show a meaningful difference between the two versions and instead stores a completely new copy of the file. This makes it difficult to track changes to your data over time.

Wouldn’t it be nice if you could store your data on your favorite storage services such as Amazon S3, Google Drive, and Google Cloud Storage while still being able to version control your data? That is when DVC comes in handy.

Feel free to play with the code in this article here.

What is DVC?

DVC is a system for data version control. It is essentially like Git but is used for data. DVC allows you to store your original data in a separate location while keeping track of different versions of the data in Git.

Better yet, DVC syntax is just like Git! If you already know Git, learning DVC is a breeze.

To understand how to use DVC, let’s start with an example.

Start with installing the package with pip:

pip install dvc

or with conda:

conda install -c conda-forge dvc

Find instructions for other ways to install DVC here.

Get Started

After DVC is installed, initialize it inside a Git project by running:

dvc init

Here is the structure of my data directory:

Track Data

To track changes to the data directory, type:

dvc add data

This command creates a file named data.dvc, which records an MD5 hash of the directory’s contents and the path to the data directory. These details enable DVC to detect changes to the directory over time.

outs:
- md5: 86451bd526f5f95760f0b7a412508746.dir
  path: data
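
The idea behind the md5 field can be illustrated with a small sketch: a content hash changes whenever the content changes. The scratch file sample.csv below is hypothetical, and the sketch assumes the md5sum tool from GNU coreutils is available. It only hashes a single file; the way DVC derives the .dir hash for a whole directory is not reproduced here.

```shell
# Hypothetical scratch file, not part of the article's project.
workdir=$(mktemp -d)
printf 'id,label\n1,cat\n' > "$workdir/sample.csv"

# Hash the original content (md5sum prints "<hash>  <file>"; keep the hash).
hash_before=$(md5sum "$workdir/sample.csv" | cut -d ' ' -f 1)

# Append one row, then hash again.
printf '2,dog\n' >> "$workdir/sample.csv"
hash_after=$(md5sum "$workdir/sample.csv" | cut -d ' ' -f 1)

echo "before: $hash_before"
echo "after:  $hash_after"

# Clean up the scratch directory.
rm -rf "$workdir"
```

The two printed hashes differ, which is how a tool that stores only a hash can tell that the data has changed.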

Commit the data.dvc to Git to keep track of the data associated with a particular code version:

git add data.dvc
git commit -m "add data"

Store the Data Remotely

Next, we will store the data in remote storage. DVC supports various storage services, such as Google Drive, Amazon S3, Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP. In this article, we will store our data on Google Drive.

Start with creating a folder on Google Drive and get the link to the folder:

Once we have the link, we can add the folder to DVC using the command dvc remote add -d remote gdrive://<folder-id>, where <folder-id> is the last part of the link.

For example, if the link is https://drive.google.com/drive/folders/1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH, then the command to add it to DVC is:

dvc remote add -d remote gdrive://1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH
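
If you prefer not to copy the folder ID by hand, it can be cut out of the link with shell parameter expansion. This is a minimal sketch that assumes the link has the same shape as the example above; the variable names link and folder_id are made up for illustration. It only prints the resulting command rather than running it.

```shell
# The example link from this article.
link="https://drive.google.com/drive/folders/1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH"

# Keep everything after the last slash.
folder_id="${link##*/}"

# Drop a trailing query string (for example "?usp=sharing"), if present.
folder_id="${folder_id%%\?*}"

# Print the command instead of running it, so you can review it first.
echo "dvc remote add -d remote gdrive://$folder_id"
```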

The -d option sets the remote as the default. The remote’s information is saved in the .dvc/config file.

[core]
    remote = remote
['remote "remote"']
    url = gdrive://1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH

Commit the config file to keep track of the remote storage location:

git commit .dvc/config -m "Configure remote storage"

Finally, push the data to the remote storage using:

dvc push

That’s it! Now all of the data is pushed to Google Drive.

To push all committed changes to our remote Git repository, type:

git push origin <branch>

Check out the DVC documentation for ways to store your data in other storage services.

Get the Data

To get the data associated with the latest code version, first pull the changes from Git using:

git pull origin <branch>

Once you have the updated .dvc file in your local directory, download the data it points to from the remote storage to your local machine with:

dvc pull
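
The order of the two commands matters: Git has to update the .dvc file before DVC knows which data version to download. As a sketch, the two steps can be wrapped in a small shell function. The name get_latest is hypothetical and not part of Git or DVC.

```shell
# Hypothetical helper: fetch the latest code, then the data it points to.
get_latest() {
    branch="$1"
    # Update the .dvc files first; run dvc pull only if git pull succeeded.
    git pull origin "$branch" && dvc pull
}
```

For example, get_latest main pulls the main branch and then downloads the matching data.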

Switch between Different Versions

To switch to a data version that is associated with a code version, you can use Git and DVC.

First, switch to the desired code version in Git using:

git checkout <version>

Next, switch to the corresponding data version using:

dvc checkout

For example, to switch to the previous version of the data, type:

git checkout HEAD^1 data.dvc
dvc checkout
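
The same two steps work for any revision, so they can be sketched as one reusable shell function. The name switch_data_version is hypothetical, and the function assumes the tracked pointer file is data.dvc, as in this article.

```shell
# Hypothetical helper: restore data.dvc from a given Git revision,
# then sync the data directory with that pointer file.
switch_data_version() {
    rev="$1"
    # "--" separates the revision from the file path.
    git checkout "$rev" -- data.dvc && dvc checkout
}
```

With this helper, switch_data_version HEAD^1 moves the data to the previous version, and switch_data_version HEAD brings it back to the version recorded in the latest commit.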

Conclusion

Congratulations! You have just learned how to use DVC to store and version your data. In summary,

  • dvc add starts tracking a file or directory
  • dvc push uploads the data to the remote storage
  • dvc pull downloads the data from the remote storage
  • dvc checkout syncs the data in your workspace with the version recorded in the current .dvc file

Work with Khuyen Tran
