3 Essential Tools for Version Controlling Jupyter Notebooks

The Challenges of Version Controlling Notebooks

Notebooks are complex objects that contain code, output, and metadata, making them challenging for traditional version control systems like Git to manage. Specific issues include:

  • Output cells: Output cells can be large and change frequently, making it difficult to track changes.
  • Dependencies: Notebooks often rely on external libraries and packages, which can be difficult to manage.
  • Inconsistent code formatting: Inconsistent formatting can lead to unnecessary changes and make version control more complicated.
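To make the output-cell problem concrete, here is a minimal sketch that builds two copies of a one-cell notebook as plain dictionaries and diffs their JSON. The helper names (`make_notebook`, `diff_lines`) and the sample output values are hypothetical, chosen only for illustration; the dictionary layout follows the nbformat 4 structure that `.ipynb` files use.

```python
import difflib
import json


def make_notebook(execution_count, output_text):
    """Build a minimal nbformat-4-style notebook with a single code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": ["import random\n", "print(random.random())"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def diff_lines(before, after):
    """Return only the added/removed lines of a unified diff of two notebooks."""
    a = json.dumps(before, indent=1, sort_keys=True).splitlines()
    b = json.dumps(after, indent=1, sort_keys=True).splitlines()
    return [
        line
        for line in difflib.unified_diff(a, b, lineterm="")
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Same code, executed twice: only the counter and the printed value differ.
first_run = make_notebook(1, "0.417\n")
second_run = make_notebook(2, "0.720\n")

changed = diff_lines(first_run, second_run)
print("\n".join(changed))
```

Every changed line comes from `execution_count` or `outputs`, even though the code in `source` is identical. This is the kind of noise that ends up in a Git diff when notebooks are committed with their outputs.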

To address these issues, we’ll explore three tools that make notebooks easier to keep under version control:

  • nbstripout
  • nbqa
  • pipreqsnb

These tools aim to provide a more effective way to manage notebook versions and simplify collaboration.

nbstripout: A Tool for Stripping Output Cells

nbstripout strips output cells from notebooks. With the outputs removed, diffs are far less noisy, so it is easier to focus on the code changes.
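As a rough illustration of what stripping outputs means at the file level, the sketch below clears the `outputs` and `execution_count` fields of every code cell in a notebook dictionary. This is not nbstripout's actual implementation, only an assumed simplification of the idea; `strip_outputs` and the sample notebook are hypothetical names introduced for this example.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of the notebook with code-cell outputs and counters cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Hypothetical one-cell notebook in nbformat-4-style layout.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["hello\n"]}
            ],
            "source": ["print('hello')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(notebook)
print(json.dumps(stripped["cells"][0], indent=1))
```

The code in `source` is untouched; only the parts that change on every run are reset, which is why the resulting diffs are limited to real edits.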

To install nbstripout, run:

pip install nbstripout

To use nbstripout, run it on your notebook. The file is modified in place:

nbstripout my_notebook.ipynb

nbqa: A Tool for Quality Assurance and Version Control

nbqa is a tool that checks the quality of the code in your Jupyter Notebook and automatically formats it. With nbqa, you can run isort, black, flake8, and more on your Jupyter Notebooks.

To install nbqa, run:

pip install nbqa

Let’s take an example notebook example_notebook.ipynb that looks like this:

import pandas as pd

import numpy as np

a = [1,2,3,4]

Format the code using nbqa:

nbqa black example_notebook.ipynb

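Output (captured on a run where the notebook was already formatted; when black changes a file, it reports the file as reformatted instead):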

All done! ✨ 🍰 ✨
1 file left unchanged.

Check the style and quality of the code using nbqa:

nbqa flake8 example_notebook.ipynb

Output:

example_notebook.ipynb:cell_1:1:1: F401 'pandas as pd' imported but unused
example_notebook.ipynb:cell_1:3:1: F401 'numpy as np' imported but unused
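To show the kind of check behind an F401 warning, here is a simplified sketch that uses Python's standard `ast` module to list imported names that are never referenced in a cell. It is a toy stand-in, not how flake8 or nbqa are implemented, and it ignores cases a real linter handles (star imports, `__all__`, names used only in other cells). The function name `unused_imports` is hypothetical.

```python
import ast


def unused_imports(source):
    """Return sorted (line, import text) pairs for imports never referenced."""
    tree = ast.parse(source)
    imported = {}  # bound name -> (line number, import as written)
    used = set()   # every bare name referenced anywhere in the source
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound = alias.asname or alias.name.split(".")[0]
                written = (
                    f"{alias.name} as {alias.asname}" if alias.asname else alias.name
                )
                imported[bound] = (node.lineno, written)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(
        (line, written)
        for bound, (line, written) in imported.items()
        if bound not in used
    )


# The example cell from this article.
cell_source = "import pandas as pd\n\nimport numpy as np\n\na = [1,2,3,4]\n"

for line, name in unused_imports(cell_source):
    print(f"cell_1:{line}: '{name}' imported but unused")
```

Both imports are reported because neither `pd` nor `np` appears anywhere else in the cell, which matches the two warnings above.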

Sort the imports in the notebook using nbqa:

nbqa isort example_notebook.ipynb

Output:

Fixing /home/khuyen/book/book/Chapter7/example_notebook.ipynb

After running black and isort, the notebook looks much cleaner (flake8 only reports problems, so the unused imports it flagged are still there):

import numpy as np
import pandas as pd

a = [1, 2, 3, 4]

To automate the process, you can configure nbqa to run every time you commit a Jupyter Notebook using pre-commit. Here’s an example .pre-commit-config.yaml file, placed at the root of the repository:

# .pre-commit-config.yaml
repos:
- repo: https://github.com/nbQA-dev/nbQA
  rev: 0.10.0
  hooks:
    - id: nbqa-flake8
    - id: nbqa-isort
    - id: nbqa-black

pipreqsnb: A Tool for Managing Dependencies

pipreqsnb is a tool that generates a requirements.txt file based on the imports in your Jupyter Notebooks. This is useful for managing dependencies and ensuring that your notebooks are reproducible.

To use pipreqsnb, simply run it on your notebook directory:

pipreqsnb .

Output:

pipreqs  .
INFO: Successfully saved requirements file in ./requirements.txt

The resulting requirements.txt file will look something like this:

pandas==1.3.4
numpy==1.20.3
ipython==7.30.1
scikit_learn==1.0.2
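The first step of generating a requirements file from imports can be sketched with the standard library alone: read each code cell, parse it, and collect the top-level module names. This is an assumed simplification, not pipreqsnb's actual code; `top_level_imports` and the sample notebook are hypothetical. A real tool still has to map module names to package names (for example, `sklearn` is installed as scikit-learn) and resolve versions.

```python
import ast


def top_level_imports(notebook):
    """Collect the top-level module names imported in a notebook's code cells."""
    modules = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a string or a list of lines
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Cells containing IPython magics are not valid Python; this sketch skips them.
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return sorted(modules)


# Hypothetical notebook with a markdown cell, a code cell, and a cell with a magic.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# import os is only text here"]},
        {
            "cell_type": "code",
            "source": [
                "import pandas as pd\n",
                "from sklearn.linear_model import LinearRegression\n",
            ],
        },
        {"cell_type": "code", "source": ["%matplotlib inline\n", "import numpy as np"]},
    ]
}

modules = top_level_imports(example)
print(modules)
```

The cell with the `%matplotlib` magic is skipped entirely, so its `numpy` import is missed. That is one reason to rely on a dedicated tool rather than a hand-rolled script for real projects.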

Conclusion

Version controlling notebooks can be complex, but tools like nbstripout, nbqa, and pipreqsnb simplify the process. These tools help you:

  • Get more accurate and informative diffs
  • Filter out output cells
  • Format code consistently
  • Manage dependencies effectively

By taking care of these tasks, these tools allow you to focus on code changes and collaborate with others more efficiently.
