3 Essential Tools for Version Controlling Jupyter Notebooks

The Challenges of Version Controlling Notebooks

Notebooks are complex objects that contain code, output, and metadata, making them challenging for traditional version control systems like Git to manage. Specific issues include:

  • Output cells: Outputs are stored inside the notebook file, can be large, and change on every run, so diffs fill up with changes that have nothing to do with the code.
  • Dependencies: Notebooks often rely on external libraries and packages, which can be difficult to manage.
  • Inconsistent code formatting: Inconsistent formatting can lead to unnecessary changes and make version control more complicated.
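
To make the output problem concrete, here is a minimal sketch of what a notebook looks like on disk, assuming the standard nbformat 4 JSON layout. The cell contents and the rerun helper are made up for illustration; re-running real notebooks can change more than the execution count.

```python
import json

# A hand-built, hypothetical notebook with one code cell (nbformat 4 layout).
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print('hello')"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hello\n"]}
            ],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}


def rerun(nb):
    """Simulate re-running the notebook: same code, higher execution count."""
    copy = json.loads(json.dumps(nb))  # deep copy via a JSON round trip
    for cell in copy["cells"]:
        if cell["cell_type"] == "code":
            cell["execution_count"] += 1
    return copy


before = json.dumps(notebook, indent=1)
after = json.dumps(rerun(notebook), indent=1)

# The code is identical, yet the serialized file differs.
print(before == after)  # False
```

Because Git compares the serialized file, a change like this shows up in the diff even though no code was edited.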

To address these issues, we’ll explore three tools that make notebooks easier to version control:

  • nbstripout
  • nbqa
  • pipreqsnb

These tools aim to provide a more effective way to manage notebook versions and simplify collaboration.

nbstripout: A Tool for Stripping Output Cells

nbstripout is a tool that strips the outputs from a notebook's code cells. With the outputs gone, the diff contains far less noise and it is easier to focus on code changes.
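
The idea of stripping outputs can be sketched in a few lines of Python. This is an illustration of the concept on a notebook loaded as a dict, not nbstripout's actual implementation; the strip_outputs function and the sample notebook are hypothetical.

```python
import copy


def strip_outputs(nb):
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    stripped = copy.deepcopy(nb)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Hypothetical notebook: one markdown cell and one executed code cell.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

clean = strip_outputs(notebook)
print(clean["cells"][1]["outputs"])  # []
print(clean["cells"][1]["source"])   # ['1 + 1']
```

The source of every cell is left untouched, so two stripped versions of a notebook differ only where the code or text differs.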

To install nbstripout, run:

pip install nbstripout

To use nbstripout, run it on your notebook. The file is modified in place:

nbstripout my_notebook.ipynb

nbqa: A Tool for Quality Assurance and Version Control

nbqa lets you run Python code-quality tools such as isort, black, and flake8 on Jupyter Notebooks, so you can lint and format notebook code the same way you would a regular Python file.
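
Linters and formatters expect plain Python source, so a notebook's code cells have to be pulled out of the JSON before such a tool can read them. The sketch below shows that general idea only; it is not nbqa's actual implementation, and the cells_to_source function, the "# cell_N" marker comments, and the sample notebook are hypothetical.

```python
def cells_to_source(nb):
    """Join the code cells of a notebook dict into one Python source string."""
    chunks = []
    code_index = 0
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells contain no Python to check
        code_index += 1
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        chunks.append(f"# cell_{code_index}\n{source}")
    return "\n\n".join(chunks) + "\n"


# Hypothetical notebook with one markdown cell and two code cells.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import numpy as np\n", "a = [1,2,3,4]"]},
        {"cell_type": "code", "source": "print(a)"},
    ]
}

result = cells_to_source(notebook)
print(result)
```

Keeping track of which cell each line came from is what would let a tool report problems per cell rather than per line of a temporary file.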

To install nbqa, run:

pip install nbqa

Let’s take an example notebook example_notebook.ipynb that looks like this:

import pandas as pd

import numpy as np

a = [1,2,3,4]

Format the code using nbqa:

nbqa black example_notebook.ipynb

Output:

All done! ✨ 🍰 ✨
1 file left unchanged.

Check the style and quality of the code using nbqa:

nbqa flake8 example_notebook.ipynb

Output:

example_notebook.ipynb:cell_1:1:1: F401 'pandas as pd' imported but unused
example_notebook.ipynb:cell_1:3:1: F401 'numpy as np' imported but unused

Sort the imports in the notebook using nbqa:

nbqa isort example_notebook.ipynb

Output:

Fixing /home/khuyen/book/book/Chapter7/example_notebook.ipynb

After running all of these commands, the notebook looks much cleaner:

import numpy as np
import pandas as pd

a = [1, 2, 3, 4]

To automate the process, you can use pre-commit to run nbqa every time you commit a Jupyter Notebook. Here’s an example .pre-commit-config.yaml file:

# .pre-commit-config.yaml
repos:
- repo: https://github.com/nbQA-dev/nbQA
  rev: 0.10.0
  hooks:
    - id: nbqa-flake8
    - id: nbqa-isort
    - id: nbqa-black

pipreqsnb: A Tool for Managing Dependencies

pipreqsnb is a tool that generates a requirements.txt file based on the imports in your Jupyter Notebooks. This is useful for managing dependencies and ensuring that your notebooks are reproducible.
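
To illustrate what "based on the imports" means, the sketch below collects the top-level modules imported in a notebook's code cells using Python's ast module. It is a simplified illustration, not pipreqsnb's actual implementation; the imported_modules function and the sample notebook are hypothetical.

```python
import ast


def imported_modules(nb):
    """Collect top-level module names imported in a notebook's code cells."""
    modules = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Cells with IPython magics (e.g. %matplotlib) are not plain
            # Python; this sketch simply skips them.
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    modules.add(alias.name.split(".")[0])
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return sorted(modules)


# Hypothetical notebook with three code cells, one containing a magic command.
notebook = {
    "cells": [
        {"cell_type": "code", "source": "import pandas as pd\nimport numpy as np"},
        {"cell_type": "code", "source": "from sklearn.linear_model import LinearRegression"},
        {"cell_type": "code", "source": "%matplotlib inline\nimport os"},
    ]
}

print(imported_modules(notebook))  # ['numpy', 'pandas', 'sklearn']
```

Turning import names into installable package names and pinned versions (for example, sklearn is distributed as scikit-learn) is a separate step that this sketch does not attempt.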

To install pipreqsnb, run:

pip install pipreqsnb

To use pipreqsnb, run it on the directory that contains your notebooks:

pipreqsnb .

Output:

pipreqs  .
INFO: Successfully saved requirements file in ./requirements.txt

The resulting requirements.txt file will look something like this:

pandas==1.3.4
numpy==1.20.3
ipython==7.30.1
scikit_learn==1.0.2

Conclusion

Version controlling notebooks can be complex, but tools like nbstripout, nbqa, and pipreqsnb simplify the process. These tools help you:

  • Get more accurate and informative diffs
  • Filter out output cells
  • Format code consistently
  • Manage dependencies effectively

By taking care of these tasks, these tools allow you to focus on code changes and collaborate with others more efficiently.
