Hydra for Python Configuration: Build Modular and Maintainable Pipelines

In data science projects, values like file names, features, split ratios, and hyperparameters often change. When these parameters are hard-coded, your code becomes inflexible and harder to maintain.

A better solution is to use a Python configuration system like Hydra, which lets you store settings in YAML files. This approach separates configuration from logic, making your code cleaner and more adaptable across different environments and experiments.

For a broader look at organizing your entire project structure—not just configs—see How to Structure a Data Science Project for Readability and Transparency.

The source code of this article can be found here:

Why You Should Avoid Hard-Coding

Here are four major problems caused by hard-coded parameters:

Maintainability

Manually updating the same parameter across different files or functions is tedious and error-prone. For example, hard-coding a value like split_ratio across multiple scripts can lead to mismatches. If one script updates the value but another doesn’t, the code runs inconsistently and is harder to debug:

# script1.py
split_ratio = 0.3  # updated value
# script2.py
split_ratio = 0.2  # outdated value, not updated
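The fix is to read the value from a single source of truth. Here is a minimal sketch using only the standard library (the params.json file name is hypothetical); both scripts load the same file, so the value can never drift between them:

```python
import json
from pathlib import Path

# A single source of truth shared by every script
# (params.json is a hypothetical file name)
params_file = Path("params.json")
params_file.write_text(json.dumps({"split_ratio": 0.3}))

# script1.py and script2.py would both read the same file,
# so updating the value in one place updates it everywhere
split_ratio = json.loads(params_file.read_text())["split_ratio"]
```

The rest of this article shows how Hydra generalizes this idea with YAML files, config groups, and command-line overrides.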

Reusability

Hardcoding values limits the reusability of code for different scenarios. For example, the script below is tied to a specific dataset through a hard-coded file path. To use a different dataset, you’d have to manually update the path every time, which is error-prone and slows down iteration.

# preprocess.py
input_file = "data/input_v1.csv"  # needs to be updated manually to "data/input_v2.csv"

Security

Hard-coding secrets like API keys, passwords, or database URLs directly into scripts can be a serious risk. The example below shows hard-coded database credentials. If this file is pushed to a shared repository, those credentials could be exposed and lead to unauthorized access to your database.

# config.py
db_user = "admin"
db_password = "pa55word"  # hard-coded database credentials

To handle secrets securely, consider storing them in environment variables or .env files. This guide explains how to manage sensitive information in Python using .env files.
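As a minimal sketch of the environment-variable approach (the DB_USER and DB_PASSWORD names are illustrative; in practice you would export them in your shell or load them from a .env file, never set them in code):

```python
import os

# For demonstration only: pretend the shell already exported these variables.
# In a real project, set them in your shell or a .env file, never in code.
os.environ.setdefault("DB_USER", "demo_user")
os.environ.setdefault("DB_PASSWORD", "demo_password")

# The script reads credentials from the environment instead of hard-coding them,
# so no secret ever appears in the repository
db_user = os.environ["DB_USER"]
db_password = os.environ["DB_PASSWORD"]
```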

Configuration Files to the Rescue

Configuration files help improve your workflow in the following ways:

Cleaner code and easier maintenance

Keeping configuration separate from logic makes scripts easier to read and maintain. You can change parameters without touching your core code.

# main.yaml
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate

cols_to_drop:
  - free sulfur dioxide

# main.py
import pandas as pd
from omegaconf import OmegaConf

config = OmegaConf.load("main.yaml")

data = pd.read_csv(config.data.raw)
data = data.drop(columns=config.cols_to_drop)

Faster experimentation

Configuration files allow you to tweak parameters like features, splits, and hyperparameters without modifying the source code, enabling rapid iteration and experimentation.

# main.yaml

# Change from this
features: [age, income, education]

# To this without touching the source code
features: [age, income, education, credit_score]

Simplified deployment

With config files, adapting to different environments like development or production is straightforward. You can swap in the right settings without editing any logic.

# conf/database/dev.yaml
name: dev
db_url: sqlite:///dev.db

# conf/database/prod.yaml
name: prod
db_url: postgresql://prod_user:secure@prod.db.example.com/prod

# Run with dev settings
python main.py database=dev

# Run with prod settings
python main.py database=prod

Introduction to Hydra

Hydra is a modern Python library that simplifies how you structure and experiment with configurations. It helps you keep your code clean, flexible, and scalable by supporting:

  • Intuitive access to parameters via dot notation
  • Quick overrides from the command line for fast iteration
  • Logical grouping of configs to manage complexity
  • Multi-run execution to automate combinations of configurations

Installation

You can install Hydra using either pip or uv:

# Option 1: pip
pip install hydra-core

# Option 2: uv (a fast, modern alternative to pip)
uv add hydra-core

Let’s explore how each of these features improves data science workflows.

Convenient Parameter Access

Suppose all configuration files are stored under the conf folder, and all Python scripts are stored under the src folder.

.
├── conf/
│   └── main.yaml
└── src/
    ├── process.py
    └── train_model.py

And the main.yaml file looks like this:

process:
  cols_to_drop:
  - free sulfur dioxide
  feature: quality
  test_size: 0.2
train:
  hyperparameters:
    svm__kernel:
    - rbf
    svm__C:
    - 0.1
    - 1
    - 10
    svm__gamma:
    - 0.1
    - 1
    - 10
  grid_search:
    cv: 2
    scoring: accuracy
    verbose: 3
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
model: models

You can load a configuration file in your Python script by decorating your main function with @hydra.main, which tells Hydra where to find and how to apply the configuration.

from omegaconf import DictConfig
import hydra

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    ...

In the code above, config is an instance of DictConfig, a flexible and hierarchical configuration object provided by OmegaConf. It behaves like both a dictionary and an object, allowing you to access parameters using dot notation (config.key) or dictionary-style (config['key']):

# src/process.py
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    print("Accessing with bracket notation:", config["process"]["cols_to_drop"])
    print("Accessing with dot notation:", config.process.cols_to_drop)


if __name__ == "__main__":
    process_data()

Running this Python script is straightforward:

python src/process.py

Or use uv, a modern Python package and project manager that can also run scripts:

uv run src/process.py

Output:

Accessing with bracket notation: ['free sulfur dioxide']
Accessing with dot notation: ['free sulfur dioxide']

This straightforward approach allows you to effortlessly retrieve the desired parameters.

Command-line configuration override

Let’s say you are experimenting with different test_size values. Repeatedly opening your configuration file and editing the value by hand is time-consuming.

# conf/main.yaml
process:
  cols_to_drop:
    - free sulfur dioxide
  feature: quality
  test_size: 0.3  # previously 0.2

Luckily, Hydra makes it easy to override the configuration directly from the command line.

Let’s try overriding a parameter at runtime. Start with the following conf/main.yaml configuration:

process:
  strategy: drop_missing
  cols_to_drop:
    - id
    - timestamp
    - customer_id
  impute_strategy: null
  feature: quality
  test_size: 0.2

Then define src/process.py as follows:

# src/process.py
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    # Converts the entire config object to a YAML string for readable output
    print(OmegaConf.to_yaml(config))

if __name__ == "__main__":
    process_data()

Now run the script, overriding test_size on the command line:

uv run src/process.py process.test_size=0.3

Output:

process:
  strategy: drop_missing
  cols_to_drop:
  - id
  - timestamp
  - customer_id
  impute_strategy: null
  feature: quality
  test_size: 0.3

We can see that test_size is now 0.3 instead of 0.2!

This confirms that the test_size value was overridden at runtime, allowing you to test different settings quickly without editing the config file.
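Overrides are not limited to a single value. As a sketch (the parameter names follow the config above, and random_state is a hypothetical new key), you can change several keys in one run, and prefix a key with + to add one that does not exist in the YAML yet:

```shell
# Override several existing values in one run
uv run src/process.py process.test_size=0.25 process.feature=quality

# Add a new key that is not in main.yaml (the + prefix tells Hydra it's new)
uv run src/process.py +process.random_state=42
```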

Grouping config files

In a data science project, you might have many ways to process your data, each with its own set of parameters. A common approach is to comment and uncomment blocks of configuration code to toggle between them, which leads to cluttered configs:

# conf/main.yaml
# process:
  # strategy: drop_missing
  # cols_to_drop: ["id", "timestamp", "customer_id"]
  # impute_strategy: null
  # feature: "quality"
  # test_size: 0.2
process:
  strategy: impute
  cols_to_drop: []
  impute_strategy: mean
  feature: "quality"
  test_size: 0.2
  
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate

Hydra supports organizing related configurations into groups, making it easier to manage variations of preprocessing steps, models, or training strategies in a clean and modular way.

Here’s how to set up and use a config group for processing options:

First, update your project structure to organize different processing strategies under a process/ config group:

.
└── conf/
    ├── main.yaml
    └── process/
        ├── drop_missing.yaml
        └── impute.yaml

Each file in the process/ folder contains parameters for a specific data preprocessing method. For example:

# conf/process/drop_missing.yaml
strategy: drop_missing
cols_to_drop: ["id", "timestamp", "customer_id"]
impute_strategy: null
feature: quality
test_size: 0.2

# conf/process/impute.yaml
strategy: impute
cols_to_drop: []
impute_strategy: mean
feature: quality
test_size: 0.2

Now, in main.yaml, reference the process group using Hydra’s defaults list:

defaults:
  - process: drop_missing
  - _self_

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
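With this defaults list, Hydra merges the selected group file into the top-level config at runtime, nesting its contents under the group's key. A sketch of the composed result for process: drop_missing, based on the group files above:

```yaml
# Composed config that Hydra resolves at runtime
process:
  strategy: drop_missing
  cols_to_drop: ["id", "timestamp", "customer_id"]
  impute_strategy: null
  feature: quality
  test_size: 0.2
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
```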

To switch between groups, simply run:

uv run src/process.py process=impute

You can also group training strategies the same way:

conf/
├── main.yaml
├── process/
│   ├── drop_missing.yaml
│   └── impute.yaml
└── train/
    ├── basic.yaml
    └── advanced.yaml

Update main.yaml to include both groups:

defaults:
  - process: drop_missing
  - train: basic
  - _self_

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate

With this setup, you can mix and match different combinations of processing and training configurations using a single command:

uv run src/train_model.py process=impute train=advanced

This approach makes it easy to organize and switch between multiple configurations for data preprocessing, without touching your Python scripts.

Multi-run

When testing multiple processing strategies, running them one at a time can slow down your workflow:

uv run src/process.py process=drop_missing
# wait for this to finish
# then run the application with another configuration
uv run src/process.py process=impute

Hydra lets you run the same application across multiple configurations in a single command, eliminating the need to execute each variation manually.

uv run src/process.py --multirun process=drop_missing,impute

Output:

[2025-05-15 11:55:20,260][HYDRA] Launching 2 jobs locally
[2025-05-15 11:55:20,260][HYDRA]        #0 : process=drop_missing
[2025-05-15 11:55:20,298][HYDRA]        #1 : process=impute

This approach streamlines the process of running an application with various parameters, ultimately saving valuable time and effort.
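Multi-run also sweeps over parameter values, not just config groups. As a sketch combining both (using the process group and test_size parameter from the configs above), Hydra launches one job per combination:

```shell
# Sweep over two processing strategies and two test sizes (2 x 2 = 4 jobs)
uv run src/process.py --multirun process=drop_missing,impute process.test_size=0.2,0.3
```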

Final Thoughts

If you’re currently hard-coding parameters in your scripts, a quick way to get started with Hydra is to move those values into a YAML config file and access them using the @hydra.main decorator.

From there, experiment with command-line overrides and modular config groups to keep your pipeline clean and flexible. A small upfront investment—like moving a few parameters to a config file—can save you time and headaches as your project grows.
