Hydra for Python Configuration: Build Modular and Maintainable Pipelines

Khuyen Tran

Why You Should Avoid Hard-Coding
Configuration Files to the Rescue
Introduction to Hydra
- Installation
Convenient Parameter Access
Command-line configuration override
Grouping config files
Multi-run
Final Thoughts

In data science projects, values like file names, features, split ratios, and hyperparameters often change. When these parameters are hard-coded, your code becomes inflexible and harder to maintain.

A better solution is to use a Python configuration system like Hydra, which lets you store settings in YAML files. This approach separates configuration from logic, making your code cleaner and more adaptable across different environments and experiments.

Why You Should Avoid Hard-Coding

Here are three major problems caused by hard-coded parameters:

Maintainability

Manually updating the same parameter across different files or functions is tedious and error-prone. For example, hard-coding a value like split_ratio across multiple scripts can lead to mismatches. If one script updates the value but another doesn’t, the code runs inconsistently and is harder to debug:

# script1.py
split_ratio = 0.3  # updated value

# script2.py
split_ratio = 0.2  # outdated value, not updated

Reusability

Hard-coding values limits how easily code can be reused in different scenarios. For example, the script below is tied to a specific dataset through a hard-coded file path. To use a different dataset, you'd have to manually update the path every time, which is error-prone and slows down iteration.

# preprocess.py
input_file = "data/input_v1.csv"  # needs to be updated manually to "data/input_v2.csv"

Security

Hard-coding secrets like API keys, passwords, or database URLs directly into scripts can be a serious risk. The example below shows hard-coded database credentials. If this file is pushed to a shared repository, those credentials could be exposed and lead to unauthorized access to your database.

# config.py
db_user = "admin"
db_password = "pa55word"  # hard-coded database credentials
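
One common way to keep credentials out of source files is to read them from environment variables at runtime. The sketch below is only an illustration: the variable names DB_USER and DB_PASSWORD and the helper load_db_credentials are hypothetical and not part of the original example. It fails loudly when a variable is missing instead of falling back to a default.

```python
import os


def load_db_credentials(env=os.environ):
    """Read database credentials from environment variables.

    DB_USER and DB_PASSWORD are hypothetical names chosen for this sketch.
    """
    missing = [name for name in ("DB_USER", "DB_PASSWORD") if name not in env]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"user": env["DB_USER"], "password": env["DB_PASSWORD"]}


# Passing a plain dict stands in for the real environment in this demo.
creds = load_db_credentials({"DB_USER": "admin", "DB_PASSWORD": "not-in-git"})
print(creds["user"])  # admin
```

With this pattern the script can be committed safely, because the secret values live in the shell or in a deployment tool rather than in the repository.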

Configuration Files to the Rescue

Configuration files help improve your workflow in the following ways:

Cleaner code and easier maintenance

Keeping configuration separate from logic makes scripts easier to read and maintain. You can change parameters without touching your core code.

# main.yaml
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate

cols_to_drop:
  - free sulfur dioxide

import pandas as pd
from omegaconf import OmegaConf

config = OmegaConf.load("main.yaml")

data = pd.read_csv(config.data.raw)
data = data.drop(columns=config.cols_to_drop)

Faster experimentation

Configuration files allow you to tweak parameters like features, splits, and hyperparameters without modifying the source code, enabling rapid iteration and experimentation.

# main.yaml

# Change from this
features: [age, income, education]

# To this without touching the source code
features: [age, income, education, credit_score]

Simplified deployment

With config files, adapting to different environments like development or production is straightforward. You can swap in the right settings without editing any logic.

# conf/database/dev.yaml
name: dev
db_url: sqlite:///dev.db

# conf/database/prod.yaml
name: prod
db_url: postgresql://prod_user:secure@prod.db.example.com/prod

# Run with dev settings
python main.py database=dev

# Run with prod settings
python main.py database=prod

Introduction to Hydra

Hydra is a modern Python library that simplifies how you structure and experiment with configurations. It helps you keep your code clean, flexible, and scalable by supporting:

- Intuitive access to parameters via dot notation
- Quick overrides from the command line for fast iteration
- Logical grouping of configs to manage complexity
- Multi-run execution to automate combinations of configurations

Installation

You can install Hydra using either pip or uv:

# Option 1: pip
pip install hydra-core

# Option 2: uv (a fast Python package and project manager)
# Run this inside a uv project, i.e. a directory with a pyproject.toml
uv add hydra-core


Let’s explore how each of these features improves data science workflows.

Convenient Parameter Access

Suppose all configuration files are stored under the conf folder, and all Python scripts are stored under the src folder.

.
├── conf/
│   └── main.yaml
└── src/
    ├── process.py
    └── train_model.py

And the main.yaml file looks like this:

process:
  cols_to_drop:
    - free sulfur dioxide
  feature: quality
  test_size: 0.2
train:
  hyperparameters:
    svm__kernel:
      - rbf
    svm__C:
      - 0.1
      - 1
      - 10
    svm__gamma:
      - 0.1
      - 1
      - 10
  grid_search:
    cv: 2
    scoring: accuracy
    verbose: 3
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
model: models

You can load a configuration file in your Python script by decorating your main function with @hydra.main, which tells Hydra where to find and how to apply the configuration.

from omegaconf import DictConfig
import hydra

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
  ...

In the code above, config is an instance of DictConfig, a flexible and hierarchical configuration object provided by OmegaConf. It behaves like both a dictionary and an object, allowing you to access parameters using dot notation (config.key) or dictionary-style (config['key']):

# src/process.py
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    print("Accessing with bracket notation:", config["process"]["cols_to_drop"])
    print("Accessing with dot notation:", config.process.cols_to_drop)


if __name__ == "__main__":
    process_data()

Running this Python script is straightforward:

python src/process.py

Or use uv, a Python package and project manager, to run the script inside the project's environment:

uv run src/process.py

Output:

Accessing with bracket notation: ['free sulfur dioxide']
Accessing with dot notation: ['free sulfur dioxide']

Both styles return the same value, so you can use whichever reads better in context.

Command-line configuration override

Let's say you are experimenting with different test_size values. Repeatedly opening your configuration file to modify test_size is time-consuming.

# conf/main.yaml
process:
  cols_to_drop:
    - free sulfur dioxide
  feature: quality
  test_size: 0.3  # previously 0.2

Luckily, Hydra makes it easy to override configuration values directly from the command line.

Let’s try overriding a parameter at runtime. Start with the following conf/main.yaml configuration:

process:
  strategy: drop_missing
  cols_to_drop:
    - id
    - timestamp
    - customer_id
  impute_strategy: null
  feature: quality
  test_size: 0.2

Then define src/process.py as follows:

# src/process.py
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    # Converts the entire config object to a YAML string for readable output
    print(OmegaConf.to_yaml(config))

if __name__ == "__main__":
    process_data()

Now run the script, overriding test_size on the command line:

uv run src/process.py process.test_size=0.3

Output:

process:
  strategy: drop_missing
  cols_to_drop:
  - id
  - timestamp
  - customer_id
  impute_strategy: null
  feature: quality
  test_size: 0.3

The output shows test_size as 0.3 instead of 0.2. The value was overridden at runtime, so you can test different settings quickly without editing the config file.
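
To make the mechanics concrete, here is a simplified sketch of what a dotted override such as process.test_size=0.3 does to a nested config. It is not Hydra's parser, which supports a much richer override grammar; apply_override is a hypothetical helper written for illustration, and it only handles existing keys with int, float, or string values.

```python
import copy


def _parse_value(text):
    """Interpret a value as int, then float, else keep it as a string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_override(config, override):
    """Return a copy of config with one 'a.b.c=value' override applied."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(config)
    node = updated
    for key in parents:
        node = node[key]  # raises KeyError if the path does not exist
    if leaf not in node:
        raise KeyError(f"Unknown config key: {dotted_key}")
    node[leaf] = _parse_value(raw_value)
    return updated


base = {"process": {"feature": "quality", "test_size": 0.2}}
overridden = apply_override(base, "process.test_size=0.3")
print(overridden["process"]["test_size"])  # 0.3
print(base["process"]["test_size"])        # 0.2, the original is untouched
```

The sketch rejects keys that are not already in the config. Hydra behaves similarly by default: overriding a key that does not exist is an error unless you add it explicitly with a + prefix, for example +process.seed=42 (seed is a made-up key here).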

Grouping config files

In a data science project, you might have many ways to process your data, each with its own set of parameters. A common approach is to comment and uncomment blocks of configuration code to toggle between them, which leads to cluttered configs:

# conf/main.yaml
# process:
  # strategy: drop_missing
  # cols_to_drop: ["id", "timestamp", "customer_id"]
  # impute_strategy: null
  # feature: "quality"
  # test_size: 0.2
process:
  strategy: impute
  cols_to_drop: []
  impute_strategy: mean
  feature: "quality"
  test_size: 0.2

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate

Hydra supports organizing related configurations into groups, making it easier to manage variations of preprocessing steps, models, or training strategies in a clean and modular way.

Here’s how to set up and use a config group for processing options:

First, update your project structure to organize different processing strategies under a process/ config group:

.
└── conf/
    ├── main.yaml
    └── process/
        ├── drop_missing.yaml
        └── impute.yaml

Each file in the process/ folder contains parameters for a specific data preprocessing method. For example:

# conf/process/drop_missing.yaml
strategy: drop_missing
cols_to_drop: ["id", "timestamp", "customer_id"]
impute_strategy: null
feature: quality
test_size: 0.2

# conf/process/impute.yaml
strategy: impute
cols_to_drop: []
impute_strategy: mean
feature: quality
test_size: 0.2

Now, in main.yaml, reference the process group using Hydra’s defaults list:

defaults:
  - process: drop_missing
  - _self_

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate

To switch between groups, simply run:

uv run src/process.py process=impute
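
It can help to see what the composed config looks like for each choice. The sketch below approximates the result with plain dictionaries: the selected option's contents are placed under the process key, and main.yaml's own keys (the _self_ entry) are applied last. compose_sketch is a hypothetical helper for illustration, not Hydra's composition logic, and it uses a shallow update because the keys here do not overlap.

```python
# Options in the process/ group, written as dicts that mirror the YAML files.
PROCESS_OPTIONS = {
    "drop_missing": {
        "strategy": "drop_missing",
        "cols_to_drop": ["id", "timestamp", "customer_id"],
        "impute_strategy": None,
        "feature": "quality",
        "test_size": 0.2,
    },
    "impute": {
        "strategy": "impute",
        "cols_to_drop": [],
        "impute_strategy": "mean",
        "feature": "quality",
        "test_size": 0.2,
    },
}

# Body of main.yaml without its defaults list.
MAIN_BODY = {
    "data": {
        "raw": "data/raw/winequality-red.csv",
        "intermediate": "data/intermediate",
    }
}


def compose_sketch(process_choice="drop_missing"):
    """Approximate the config composed for one process option."""
    if process_choice not in PROCESS_OPTIONS:
        raise ValueError(f"Unknown process option: {process_choice}")
    # The chosen option is placed under the group name ...
    composed = {"process": dict(PROCESS_OPTIONS[process_choice])}
    # ... and _self_ (main.yaml's own keys) is applied last.
    composed.update(MAIN_BODY)
    return composed


config = compose_sketch("impute")
print(config["process"]["strategy"])         # impute
print(config["process"]["impute_strategy"])  # mean
print(config["data"]["raw"])                 # data/raw/winequality-red.csv
```

In the real script, the same values are available as config.process.strategy and config.data.raw, whichever option is selected.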

You can also group training strategies the same way:

conf/
├── main.yaml
├── process/
│   ├── drop_missing.yaml
│   └── impute.yaml
└── train/
    ├── basic.yaml
    └── advanced.yaml

Update main.yaml to include both groups:

defaults:
  - process: drop_missing
  - train: basic
  - _self_

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate

With this setup, you can mix and match different combinations of processing and training configurations using a single command:

uv run src/train_model.py process=impute train=advanced

This approach makes it easy to organize and switch between multiple preprocessing and training configurations without touching your Python scripts.

Multi-run

When testing multiple processing strategies, running them one at a time can slow down your workflow:

uv run src/process.py process=drop_missing
# wait for this to finish
# then run the application with another configuration
uv run src/process.py process=impute

Hydra lets you run the same application across multiple configurations in a single command, eliminating the need to execute each variation manually.

uv run src/process.py --multirun process=drop_missing,impute

Output:

[2025-05-15 11:55:20,260][HYDRA] Launching 2 jobs locally
[2025-05-15 11:55:20,260][HYDRA]        #0 : process=drop_missing
[2025-05-15 11:55:20,298][HYDRA]        #1 : process=impute

With the default launcher, Hydra runs these jobs one after another on your machine, so you start the whole sweep with a single command instead of waiting to launch each run yourself.
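
When several comma-separated overrides are passed, Hydra's default sweeper launches one job per combination of values. A rough sketch of that expansion using itertools.product follows; expand_sweep is a hypothetical helper, and the train=basic,advanced values assume the train group from the previous section.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand ['a=1,2', 'b=x,y'] into one override list per combination."""
    choices = []
    for override in overrides:
        key, _, values = override.partition("=")
        choices.append([f"{key}={value}" for value in values.split(",")])
    return [list(combo) for combo in product(*choices)]


jobs = expand_sweep(["process=drop_missing,impute", "train=basic,advanced"])
for index, job in enumerate(jobs):
    print(f"#{index} : {' '.join(job)}")
```

Two options in each group yield four jobs, so sweeps grow quickly as you add more comma-separated values.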

Final Thoughts

If you’re currently hard-coding parameters in your scripts, a quick way to get started with Hydra is to move those values into a YAML config file and access them using the @hydra.main decorator.

From there, experiment with command-line overrides and modular config groups to keep your pipeline clean and flexible. A small upfront investment, such as moving a few parameters to a config file, can save you time and headaches as your project grows.
