Motivation
As a data scientist, you often experiment with various combinations of code, data, and models. To ensure that past experiments can be reproduced, it is crucial to version control all of these elements.
Git is a great tool for version controlling code, but it is not ideal for versioning data and models due to two main issues:
- Git is not designed to handle large files: Git is optimized for text files. Storing large binary files in Git can significantly increase the repository size and slow down Git commands.
- Difficult to track changes: When you change a binary file, Git cannot show a meaningful difference between the two versions and instead stores a completely new copy of the file. This makes it difficult to track changes to your data over time.
Wouldn’t it be nice if you could store your data on your favorite storage services such as Amazon S3, Google Drive, and Google Cloud Storage while still being able to version control your data? That is when DVC comes in handy.
Feel free to play with the code in this article here.
What is DVC?
DVC is a system for data version control. It is essentially like Git but is used for data. DVC allows you to store your original data in a separate location while keeping track of different versions of the data in Git.
Better yet, DVC syntax is just like Git! If you already know Git, learning DVC is a breeze.
To understand how to use DVC, let’s start with an example.
Start by installing the package with pip:
pip install dvc
or with conda:
conda install -c conda-forge dvc
Find instructions for more ways to install DVC here.
Get Started
After DVC is installed, initialize it inside a Git project by running:
dvc init
Here is the structure of my data directory:
Track Data
To track changes to the data directory, type:
dvc add data
This command will create a file named data.dvc, which contains a unique identifier and the location of the data directory in the file system. These details enable DVC to track changes to the directory over time.
outs:
- md5: 86451bd526f5f95760f0b7a412508746.dir
  path: data
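To build some intuition for why a hash works as a version identifier, here is a minimal Python sketch. It is not DVC's implementation, and it hashes a single file rather than a directory. The helper `file_md5` and the file name `example.csv` are made up for illustration. The point is only that changed content gives a different identifier, while identical content always gives the same one.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example.csv"  # hypothetical file name

    data_file.write_text("id,value\n1,10\n")
    hash_v1 = file_md5(data_file)

    data_file.write_text("id,value\n1,10\n2,20\n")  # the data changes
    hash_v2 = file_md5(data_file)

    data_file.write_text("id,value\n1,10\n")  # back to the original content
    hash_v1_again = file_md5(data_file)

print(hash_v1 != hash_v2)        # changed content, different identifier
print(hash_v1 == hash_v1_again)  # same content, same identifier
```

Because the identifier is tiny compared with the data itself, a text file that records it is cheap to commit to Git.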
Commit the data.dvc file to Git to keep track of the data associated with a particular code version:
git add data.dvc .gitignore
git commit -m "add data"
Store the Data Remotely
Next, we will store the data in remote storage. DVC supports various storage services, such as Google Drive, Amazon S3, Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP. In this article, we will store our data on Google Drive.
Start by creating a folder on Google Drive and getting the link to the folder:
Once we have the link, we can add it to DVC using the command dvc remote add -d remote gdrive://<link>.
For example, if the link is https://drive.google.com/drive/folders/1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH, then the command to add it to DVC is:
dvc remote add -d remote gdrive://1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH
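If you would rather not copy the folder ID by hand, the step from link to command can be sketched in Python. This assumes the folder ID is the last segment of the link's path, as in the example link; the function name `gdrive_remote_url` is hypothetical and is not part of DVC.

```python
from urllib.parse import urlparse


def gdrive_remote_url(folder_link: str) -> str:
    """Build a gdrive:// URL from a Google Drive folder link.

    Assumes the folder ID is the last path segment of the link.
    """
    path = urlparse(folder_link).path.rstrip("/")
    folder_id = path.rsplit("/", 1)[-1]
    if not folder_id:
        raise ValueError(f"No folder ID found in: {folder_link!r}")
    return f"gdrive://{folder_id}"


link = "https://drive.google.com/drive/folders/1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH"
remote_url = gdrive_remote_url(link)

# Print the command to run in the terminal.
print(f"dvc remote add -d remote {remote_url}")
```

Using `urlparse` means that a query string appended to the link is ignored, because only the path is inspected.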
The -d option sets the remote as the default. The remote's information is saved in the .dvc/config file.
[core]
    remote = remote
['remote "remote"']
    url = gdrive://1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH
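To see how the two sections relate, here is a small Python sketch that reads the same text and looks up the URL of the default remote. It uses Python's standard `configparser` purely as an illustration and makes no claim about how DVC itself parses the file.

```python
from configparser import ConfigParser

# Contents copied from the .dvc/config file shown above.
config_text = """\
[core]
    remote = remote
['remote "remote"']
    url = gdrive://1ynNBbT-4J0ida0eKYQqZZbC93juUUbVH
"""

parser = ConfigParser()
parser.read_string(config_text)

# [core] names the default remote ...
default_remote = parser["core"]["remote"]

# ... and the section with that name holds its URL. The quotes are
# part of the section name as configparser sees it.
section = f"'remote \"{default_remote}\"'"
remote_url = parser[section]["url"]

print(default_remote, remote_url)
```

The name after -d in the dvc remote add command is the value stored under [core], which is how a plain dvc push knows where to send the data.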
Commit the config file to keep track of the remote storage location:
git commit .dvc/config -m "Configure remote storage"
Finally, push the data to the remote storage using:
dvc push
That’s it! Now all of the data is pushed to Google Drive.
To push all committed changes to our remote Git repository, type:
git push origin <branch>
Check out this documentation for more ways to store your data in other storage services.
Get the Data
To get the data associated with the latest code version, first pull the changes from Git using:
git pull origin <branch>
Once you have the updated .dvc file in your local directory, you can download the data associated with the .dvc file from the remote storage to your local machine:
dvc pull
Switch between Different Versions
To switch to a data version that is associated with a code version, you can use Git and DVC.
First, switch to the desired code version in Git using:
git checkout <version>
Next, switch to the corresponding data version using:
dvc checkout
For example, to switch to the previous version of the data, type:
git checkout HEAD^1 data.dvc
dvc checkout
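The division of labor between the two commands can be illustrated with a toy model in Python. This is a simplified sketch, not DVC's implementation: the `add` and `checkout` functions, the `cache` folder, and the `data.csv` file are all hypothetical stand-ins. Git's role is reduced to remembering a small pointer (a hash), and the checkout step restores the data that the pointer names.

```python
import hashlib
import tempfile
from pathlib import Path


def add(data_file: Path, cache_dir: Path) -> str:
    """Copy the file into the cache under its MD5 and return that hash."""
    content = data_file.read_bytes()
    md5 = hashlib.md5(content).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / md5).write_bytes(content)
    return md5


def checkout(pointer: str, data_file: Path, cache_dir: Path) -> None:
    """Rewrite the workspace file from the cached copy named by the pointer."""
    data_file.write_bytes((cache_dir / pointer).read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache = workspace / "cache"         # hypothetical stand-in for a data cache
    data_file = workspace / "data.csv"  # hypothetical data file

    data_file.write_text("version 1\n")
    pointer_v1 = add(data_file, cache)  # the small pointer Git would track

    data_file.write_text("version 2\n")
    pointer_v2 = add(data_file, cache)

    # Git would bring back the old pointer; checkout then restores the data.
    checkout(pointer_v1, data_file, cache)
    restored = data_file.read_text()

print(restored)
```

In this model, switching versions never edits the data in place. It only swaps in whichever stored copy the current pointer refers to, which is why the Git step has to come first.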
Conclusion
Congratulations! You have just learned how to use DVC to store and version your data. In summary:
- dvc add tracks the file
- dvc push pushes the data to the remote storage
- dvc pull pulls the data from the remote storage
- dvc checkout switches to other versions of your data