The Challenges of Version Controlling Notebooks
Notebooks store code, output, and metadata together in a single JSON file, which makes them hard for line-oriented version control systems like Git to manage. Specific issues include:
- Output cells: Output cells can be large and can change on every run, which buries real code changes in the diff.
- Dependencies: Notebooks often rely on external libraries and packages, which can be difficult to manage.
- Inconsistent code formatting: Inconsistent formatting can lead to unnecessary changes and make version control more complicated.
To address these issues, we’ll explore three tools that make notebooks easier to keep under version control:
- nbstripout
- nbqa
- pipreqsnb
These tools aim to provide a more effective way to manage notebook versions and simplify collaboration.
nbstripout: A Tool for Stripping Output Cells
nbstripout strips the outputs from the cells of a notebook. With the outputs gone, the diff is far less noisy, and it is easier to focus on code changes.
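To make the effect concrete, here is a minimal sketch of what stripping outputs means at the file level. It is not nbstripout's actual implementation: the helper name `strip_outputs` and the sample notebook are made up for illustration. It relies only on the fact that an .ipynb file is JSON in which each code cell carries `source`, `outputs`, and `execution_count` fields.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs cleared.

    Hypothetical helper for illustration only; nbstripout itself covers
    more cases (for example metadata and configuration options).
    """
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Two runs of the same notebook: identical source, different outputs.
run_1 = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["0.42\n"]}
            ],
            "source": ["print(random.random())"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
run_2 = copy.deepcopy(run_1)
run_2["cells"][0]["execution_count"] = 7
run_2["cells"][0]["outputs"][0]["text"] = ["0.87\n"]


def serialize(notebook: dict) -> str:
    """Serialize the way a file on disk would be compared."""
    return json.dumps(notebook, sort_keys=True, indent=1)


# Before stripping, the files differ even though the code did not change.
print(serialize(run_1) == serialize(run_2))  # False
# After stripping, they are identical, so Git has nothing to report.
print(serialize(strip_outputs(run_1)) == serialize(strip_outputs(run_2)))  # True
```

In practice you would not write this yourself. nbstripout's documentation also describes registering it as a Git filter with `nbstripout --install`, so that outputs are stripped automatically whenever a notebook is staged.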
To install nbstripout, run:
pip install nbstripout
To use nbstripout, run it on your notebook. The file is modified in place:
nbstripout my_notebook.ipynb
nbqa: A Tool for Quality Assurance and Version Control
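nbqa lets you run Python code-quality tools such as isort, black, and flake8 directly on Jupyter Notebooks, so you can lint and format notebook code the same way you would a regular Python file.
To install nbqa, run the command below. The tools it wraps, such as black, flake8, and isort, need to be installed as well: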
pip install nbqa
Let’s take an example notebook, example_notebook.ipynb, that looks like this:
import pandas as pd
import numpy as np
a = [1,2,3,4]
Format the code using nbqa:
nbqa black example_notebook.ipynb
Output:
All done! ✨ 🍰 ✨
1 file left unchanged.
Check the style and quality of the code using nbqa:
nbqa flake8 example_notebook.ipynb
Output:
example_notebook.ipynb:cell_1:1:1: F401 'pandas as pd' imported but unused
example_notebook.ipynb:cell_1:3:1: F401 'numpy as np' imported but unused
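The report above attributes each warning to a cell and to a line within that cell. As a rough sketch of how a notebook-aware check could arrive at such a report, the code below extracts the code cells from a notebook dict and flags imports whose names are never read. This is a simplified stand-in for flake8's F401 rule, not the way nbqa or flake8 is implemented. The function names, the `cell_N` labels, and the blank lines in the sample cell (chosen so that the line numbers match the report) are assumptions.

```python
import ast


def code_cells(notebook: dict) -> list[str]:
    """Return the source of every code cell, one string per cell."""
    return [
        "".join(cell["source"])
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]


def unused_imports(notebook: dict) -> list[tuple[str, int, str]]:
    """List (cell label, line number, bound name) for imports never read.

    Simplified: a name counts as used if it is read in any code cell.
    Cells containing IPython magics would fail to parse in this sketch.
    """
    imported = []  # (cell label, line number, bound name)
    used = set()
    for index, source in enumerate(code_cells(notebook), start=1):
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                for alias in node.names:
                    bound = alias.asname or alias.name.split(".")[0]
                    imported.append((f"cell_{index}", node.lineno, bound))
            elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return [entry for entry in imported if entry[2] not in used]


# Hypothetical in-memory version of example_notebook.ipynb.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo"]},
        {
            "cell_type": "code",
            "source": [
                "import pandas as pd\n",
                "\n",
                "import numpy as np\n",
                "\n",
                "a = [1,2,3,4]",
            ],
        },
    ]
}

print(unused_imports(example))
# [('cell_1', 1, 'pd'), ('cell_1', 3, 'np')]
```

Markdown cells are skipped when numbering, which is why the only code cell is labelled `cell_1` here.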
Sort the imports in the notebook using nbqa:
nbqa isort example_notebook.ipynb
Output:
Fixing /home/khuyen/book/book/Chapter7/example_notebook.ipynb
After running all of these commands, the notebook looks much cleaner:
import numpy as np
import pandas as pd
a = [1, 2, 3, 4]
To automate the process, you can use pre-commit to run nbqa every time you commit a Jupyter Notebook. Here’s an example .pre-commit-config.yaml file:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/nbQA-dev/nbQA
    rev: 0.10.0
    hooks:
      - id: nbqa-flake8
      - id: nbqa-isort
      - id: nbqa-black
pipreqsnb: A Tool for Managing Dependencies
pipreqsnb generates a requirements.txt file based on the imports in your Jupyter Notebooks. This helps you manage dependencies and keep your notebooks reproducible.
To use pipreqsnb, simply run it on your notebook directory:
pipreqsnb .
Output:
pipreqs .
INFO: Successfully saved requirements file in ./requirements.txt
The resulting requirements.txt file will look something like this:
pandas==1.3.4
numpy==1.20.3
ipython==7.30.1
scikit_learn==1.0.2
Conclusion
Version controlling notebooks can be complex, but tools like nbstripout, nbqa, and pipreqsnb simplify the process. These tools help you:
- Get more accurate and informative diffs
- Filter out output cells
- Format code consistently
- Manage dependencies effectively
With these tasks automated, you can focus on code changes and collaborate with others more efficiently.