
How to Structure a Data Science Project for Maintainability

Why You Need a Maintainable Data Science Project

Building a maintainable data science project is essential for keeping your work clear, extensible, and reproducible over time. Without a strong structure, projects tend to become messy, leading to duplicated code, complicated debugging, and frustration when new collaborators join.

Establishing an effective structure from the beginning addresses many of these issues. However, designing that structure from scratch can be challenging.

Introducing the Data Science Template

To solve this, I developed the data science template, a ready-to-use framework that embeds industry best practices. The template is the result of years spent refining how data science projects should be organized to be clean, transparent, and collaborative.

I created this repo years ago. I recently updated the template to include uv, a lightning-fast Python package installer, and refreshed versions of core libraries to ensure compatibility with the latest tools.

The tools used in this template are:

  • Poetry: A tool for dependency management and packaging in Python that creates isolated environments.
  • uv: An extremely fast Python package installer and resolver, an alternative to pip and poetry’s installer.
  • pip: The standard package installer for Python, used to install and manage libraries.
  • Hydra: A flexible framework to manage configurations for complex applications, particularly useful for data science and machine learning projects.
  • Pre-commit: A framework for managing and maintaining multi-language pre-commit hooks to ensure code quality.
  • Pdoc: A simple yet powerful tool for generating API documentation for Python projects.

Feel free to play with and fork the source code of this article here:

Get Started

To download the template, start with installing Cookiecutter:

pip install cookiecutter

Create a project based on the template:

cookiecutter https://github.com/khuyentran1401/data-science-template

You will then be prompted for important project details such as the project name, author, preferred dependency manager, and compatible Python versions:

You've downloaded /Users/khuyentran/.cookiecutters/data-science-template before. Is it okay to delete and re-download it? [y/n] (y): 
  [1/4] project_name (Project Name): customer_segmentation
  [2/4] author_name (Your Name): Khuyen Tran
  [3/4] Select dependency_manager
    1 - pip
    2 - poetry
    3 - uv
    Choose from [1/2/3] (1): 3
  [4/4] compatible_python_versions (>=3.9): 

A project with the specified name is now created in your current directory! If you chose uv as the dependency manager, the structure of the project will look like this:

.
├── config
│   ├── main.yaml                   # Main configuration file
│   ├── model                       # Configurations for training model
│   │   ├── model1.yaml             # First variation of parameters to train model
│   │   └── model2.yaml             # Second variation of parameters to train model
│   └── process                     # Configurations for processing data
│       ├── process1.yaml           # First variation of parameters to process data
│       └── process2.yaml           # Second variation of parameters to process data
├── data
│   ├── final                       # data after training the model
│   ├── processed                   # data after processing
│   └── raw                         # raw data
├── docs                            # documentation for your project
├── .gitignore                      # files that should not be committed to Git
├── models                          # store models
├── notebooks                       # store notebooks
├── .pre-commit-config.yaml         # configurations for pre-commit
├── .python-version                 # specify Python version for the project
├── pyproject.toml                  # project metadata and dependencies
├── README.md                       # describe your project
├── src                             # store source code
│   ├── __init__.py                 # make src a Python module
│   ├── process.py                  # process data before training model
│   ├── train_model.py              # train model
│   └── utils.py                    # store helper functions
└── tests                           # store tests
    ├── __init__.py                 # make tests a Python module
    ├── test_process.py             # test functions for process.py
    └── test_train_model.py         # test functions for train_model.py
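
The tests directory mirrors the source modules one-to-one. As a quick sketch (the helper function here is hypothetical, not something the template generates), a test in tests/test_process.py might look like this:

import pandas as pd

from src.process import drop_empty_rows  # hypothetical helper


def test_drop_empty_rows_removes_fully_null_rows():
    # One valid row and one fully null row
    df = pd.DataFrame({"col1": [1, None], "col2": [2, None]})
    result = drop_empty_rows(df)
    assert len(result) == 1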

Install Dependencies

Depending on which dependency manager you selected when setting up the project, follow the instructions below:

Tool   | Speed          | Virtual Environment Support | Ease of Use | Best For
-------|----------------|-----------------------------|-------------|---------------------------------------------------
pip    | Standard speed | No                          | Very easy   | Simple projects, maximum compatibility
Poetry | Moderate       | Yes                         | Easy        | Managing isolated environments and packaging
uv     | Extremely fast | Yes                         | Easy        | Fast installation and modern dependency management

pip

pip is the standard Python package installer.

First, create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On macOS/Linux
.venv\Scripts\activate     # On Windows

To install all dependencies, run:

pip install -r requirements-dev.txt

To install only production dependencies, run:

pip install -r requirements.txt

To add a new library, run:

pip install <library-name>

To remove a library, run:

pip uninstall <library-name>

To run a Python script, run:

python3 src/process.py

Poetry

Poetry is a dependency management tool that simplifies the creation of isolated environments and manages your packages efficiently.

To activate the virtual environment, run:

poetry shell

To install all project dependencies from pyproject.toml, run:

poetry install

To install only production dependencies, run:

poetry install --only main

To add a new library to your project, run:

poetry add <library-name>

To remove a library, run:

poetry remove <library-name>

To run a Python script, run:

poetry run python src/process.py

uv

uv is an ultra-fast Python package installer and resolver.

To install all dependencies from pyproject.toml, run:

uv sync --all-extras

To install only production dependencies, run:

uv sync

To add a new package, run:

uv add <library-name>

To remove a package, run:

uv remove <library-name>

To run a Python script, run:

uv run src/process.py

Check Issues in Your Code Before Committing

Maintaining code quality manually can be tedious and error-prone. Without automated checks, it’s easy to accidentally commit code with formatting inconsistencies, linting errors, or type mismatches. Over time, these minor issues can accumulate, making your codebase harder to maintain and collaborate on.

To avoid these problems, use Pre-commit to automate code checking before commits.

Pre-commit uses a .pre-commit-config.yaml file to define hooks that automatically run when you make a commit, checking your code for formatting, linting, and typing issues before the commit is accepted.

This template uses the following hooks:

  • Ruff: An extremely fast Python linter written in Rust. It supports over 500 lint rules, many of which are inspired by popular tools like Flake8, isort, and pyupgrade.
  • black: A code formatter for Python.
  • mypy: A static type checker for Python that catches type-related errors before runtime.

Here is the template's .pre-commit-config.yaml:

repos:
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.11.6
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/psf/black
    rev: 25.1.0
    hooks:
      - id: black
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.15.0
    hooks:
      - id: mypy

To add pre-commit to git hooks, type:

pre-commit install

Now, whenever you run git commit, your code will be automatically checked and reformatted before being committed.

git commit -m 'my commit'

Output:

ruff.......................................Failed
- hook id: ruff
- exit code: 1
Found 3 errors (3 fixed, 0 remaining).
black.......................................Passed
mypy.......................................Failed
- src/process.py:13: error: Incompatible types in assignment
  (expression has type "int", variable has type "str")

[ERROR] `mypy` failed. Fix errors and try again.

Manage Configuration Files with Hydra

It is essential to avoid hard-coding values, as doing so gives rise to several issues:

  • Maintainability: If values are scattered throughout the codebase, updating them consistently becomes harder. This can lead to errors or inconsistencies when values must be updated.
  • Reusability: Hardcoding values limits the reusability of code for different scenarios.
  • Security concerns: Hardcoding sensitive information like passwords or API keys directly into the code can be a security risk. 
  • Testing and debugging: Hardcoded values can make testing and debugging more challenging.

Configuration files solve these problems by storing parameters separately from the code, which enhances code maintainability and reproducibility.
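
For instance, here is a minimal sketch contrasting the two approaches. It loads the template's config/main.yaml (shown below) with OmegaConf, the library Hydra is built on; the decorator-based loading that the template actually uses is covered next:

from omegaconf import OmegaConf

# Hardcoded: changing the dataset means editing the source code
raw_path = "data/raw/sample.csv"

# Configured: the path lives in a YAML file, so the code stays unchanged
config = OmegaConf.load("config/main.yaml")
raw_path = config.data.raw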

Among the numerous Python libraries available for creating configuration files, Hydra stands out as my preferred configuration management tool because of its impressive set of features, including:

  • Convenient parameter access
  • Command-line configuration override

and more.

Suppose the “main.yaml” file located in the “config” folder looks like this:

# config/main.yaml
process:
  use_columns:
  - col1
  - col2
model:
  name: model1
data:
  raw: data/raw/sample.csv
  processed: data/processed/processed.csv
  final: data/final/final.csv

Within a Python script, you can effortlessly access a configuration file by applying a single decorator to your Python function. By using the dot notation (e.g., config.data.raw), you can conveniently access specific parameters from the configuration file.

import hydra
from omegaconf import DictConfig


@hydra.main(config_path="../config", config_name="main", version_base="1.2")
def process_data(config: DictConfig):
    """Function to process the data"""

    print(f"Process data using {config.data.raw}")
    print(f"Columns used: {config.process.use_columns}")


if __name__ == "__main__":
    process_data()

To override configuration from the command line, type:

python src/process.py data.raw=sample2.csv

Output:

Process data using sample2.csv
Columns used: ['col1', 'col2']

Add API Documentation

Because data scientists often collaborate with other team members, it is important to create comprehensive documentation for your projects. However, writing documentation manually is time-consuming.

This template leverages pdoc to generate API documentation based on the docstrings of your Python files and objects.
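
For example, a documented function in src/process.py might look like this (the function is hypothetical, the same helper sketched in the test example earlier); pdoc renders its docstring as browsable HTML:

import pandas as pd


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows in which every column is null.

    Args:
        df: The raw input DataFrame.

    Returns:
        A DataFrame containing only rows with at least one non-null value.
    """
    return df.dropna(how="all")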

To generate static documentation, run:

pdoc src -o docs

To start the documentation server, run:

pdoc src --http localhost:8080


Navigate to http://localhost:8080 to access the documentation in your web browser.

Conclusion

Congratulations! You have just learned how to structure your data science project using the data science template. The template is designed to be flexible, so feel free to adjust it to fit your application.

