In data science projects, values like file names, features, split ratios, and hyperparameters often change. When these parameters are hard-coded, your code becomes inflexible and harder to maintain.
A better solution is to use a configuration management system like Hydra, which allows you to store settings in YAML files. This approach separates configuration from logic, making your code cleaner and more adaptable across different environments and experiments.
For a broader look at organizing your entire project structure—not just configs—see How to Structure a Data Science Project for Readability and Transparency.
Why You Should Avoid Hard-Coding
Here are four major problems caused by hard-coded parameters:
Maintainability
Manually updating the same parameter across different files or functions is tedious and error-prone. For example, hard-coding a value like split_ratio across multiple scripts can lead to mismatches. If one script updates the value but another doesn’t, the code runs inconsistently and is harder to debug:
# script1.py
split_ratio = 0.3 # updated value
# script2.py
split_ratio = 0.2 # outdated value, not updated
Reusability
Hard-coding values limits the reusability of code across different scenarios. For example, the script below is tied to a specific dataset through a hard-coded file path. To use a different dataset, you’d have to update the path manually every time, which is error-prone and slows down iteration.
# preprocess.py
input_file = "data/input_v1.csv"  # needs to be updated manually to "data/input_v2.csv"
Security
Hard-coding secrets like API keys, passwords, or database URLs directly into scripts can be a serious risk. The example below shows hard-coded database credentials. If this file is pushed to a shared repository, those credentials could be exposed and lead to unauthorized access to your database.
# config.py
db_user = "admin"
db_password = "pa55word" # hard-coded database credentials
To handle secrets securely, consider storing them in environment variables or .env files. This guide explains how to manage sensitive information in Python using .env files.
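Here is a minimal sketch of that pattern, assuming python-dotenv is installed, a local .env file defines DB_USER and DB_PASSWORD, and .env is listed in .gitignore:
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from a local .env file into the environment

db_user = os.environ["DB_USER"]  # raises KeyError if the secret is missing
db_password = os.environ["DB_PASSWORD"]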
Configuration Files to the Rescue
Configuration files help improve your workflow in the following ways:
Cleaner code and easier maintenance
Keeping configuration separate from logic makes scripts easier to read and maintain. You can change parameters without touching your core code.
# main.yaml
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
cols_to_drop:
  - free sulfur dioxide
import pandas as pd
from omegaconf import OmegaConf

# Load parameters from the YAML file instead of hard-coding them
config = OmegaConf.load("main.yaml")
data = pd.read_csv(config.data.raw)
data = data.drop(columns=config.cols_to_drop)
Faster experimentation
Configuration files allow you to tweak parameters like features, splits, and hyperparameters without modifying the source code, enabling rapid iteration and experimentation.
# main.yaml
# Change from this
features: [age, income, education]
# To this without touching the source code
features: [age, income, education, credit_score]
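For instance, a training script can read the feature list from the config instead of hard-coding it. A sketch, assuming main.yaml sits next to the script and data/train.csv is a hypothetical input file:
import hydra
import pandas as pd
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="main", version_base=None)
def train(config: DictConfig):
    data = pd.read_csv("data/train.csv")  # hypothetical input file
    X = data[list(config.features)]  # the feature set follows the config, not the code
    print(X.columns.tolist())

if __name__ == "__main__":
    train()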
Simplified deployment
With config files, adapting to different environments like development or production is straightforward. You can swap in the right settings without editing any logic.
# conf/database/dev.yaml
name: dev
db_url: sqlite:///dev.db
# conf/database/prod.yaml
name: prod
db_url: postgresql://prod_user:secure@prod.db.example.com/prod
# Run with dev settings
python main.py database=dev
# Run with prod settings
python main.py database=prod
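Here is a hypothetical main.py to pair with those commands, assuming conf/main.yaml contains a defaults entry - database: dev:
# main.py (sketch): whichever group file is selected lands under config.database
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="main", version_base=None)
def main(config: DictConfig):
    print(f"Environment: {config.database.name}")
    print(f"Connecting to: {config.database.db_url}")

if __name__ == "__main__":
    main()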
Introduction to Hydra
Hydra is a modern Python library that simplifies how you structure and experiment with configurations. It helps you keep your code clean, flexible, and scalable by supporting:
- Intuitive access to parameters via dot notation
- Quick overrides from the command line for fast iteration
- Logical grouping of configs to manage complexity
- Multi-run execution to automate combinations of configurations
Installation
You can install Hydra using either pip or uv:
# Option 1: pip
pip install hydra-core
# Option 2: uv (a fast Python package and project manager)
uv add hydra-core
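To verify the installation, print the installed version:
python -c "import hydra; print(hydra.__version__)"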
Let’s explore how each of these features improves data science workflows.
Convenient Parameter Access
Suppose all configuration files are stored under the conf folder, and all Python scripts are stored under the src folder:
.
├── conf/
│   └── main.yaml
└── src/
    ├── process.py
    └── train_model.py
And the main.yaml file looks like this:
process:
  cols_to_drop:
    - free sulfur dioxide
  feature: quality
  test_size: 0.2
train:
  hyperparameters:
    svm__kernel:
      - rbf
    svm__C:
      - 0.1
      - 1
      - 10
    svm__gamma:
      - 0.1
      - 1
      - 10
  grid_search:
    cv: 2
    scoring: accuracy
    verbose: 3
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
  model: models
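The svm__ prefixes suggest a scikit-learn pipeline with a step named svm. As a sketch of how src/train_model.py might consume the train section (the pipeline itself is an assumption for illustration; the @hydra.main decorator is explained below):
# src/train_model.py (sketch)
import hydra
from omegaconf import DictConfig, OmegaConf
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def train_model(config: DictConfig):
    # The "svm" step name matches the "svm__" prefixes in the config
    pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
    # scikit-learn expects plain dicts and lists, so convert the DictConfig
    param_grid = OmegaConf.to_container(config.train.hyperparameters)
    grid_search_args = OmegaConf.to_container(config.train.grid_search)
    grid = GridSearchCV(pipeline, param_grid, **grid_search_args)
    print(grid)  # fitting would follow once the data is loaded

if __name__ == "__main__":
    train_model()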
You can load a configuration file in your Python script by decorating your main function with @hydra.main, which tells Hydra where to find the configuration and how to apply it:
from omegaconf import DictConfig
import hydra

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    ...
In the code above, config is an instance of DictConfig, a flexible, hierarchical configuration object provided by OmegaConf. It behaves like both a dictionary and an object, allowing you to access parameters using dot notation (config.key) or dictionary-style indexing (config["key"]):
# src/process.py
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    print("Accessing with bracket notation:", config["process"]["cols_to_drop"])
    print("Accessing with dot notation:", config.process.cols_to_drop)

if __name__ == "__main__":
    process_data()
Running this Python script is straightforward:
python src/process.py
Or use uv, which runs the script inside your project’s environment:
uv run src/process.py
Output:
Accessing with bracket notation: ['free sulfur dioxide']
Accessing with dot notation: ['free sulfur dioxide']
This straightforward approach allows you to effortlessly retrieve the desired parameters.
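Note that some downstream libraries expect plain Python dicts and lists rather than DictConfig objects; OmegaConf.to_container handles the conversion. A self-contained sketch:
from omegaconf import OmegaConf

config = OmegaConf.create({"process": {"cols_to_drop": ["free sulfur dioxide"]}})
plain = OmegaConf.to_container(config, resolve=True)  # plain dict/list containers
print(type(plain), type(plain["process"]["cols_to_drop"]))
# <class 'dict'> <class 'list'>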
Command-line configuration override
Let’s say you are experimenting with different test_size values. It is time-consuming to repeatedly open your configuration file and modify the test_size value:
# conf/main.yaml
process:
  cols_to_drop:
    - free sulfur dioxide
  feature: quality
  test_size: 0.3  # previously 0.2
Luckily, Hydra makes it easy to override the configuration directly from the command line.
Let’s try overriding a parameter at runtime. Start with the following conf/main.yaml configuration:
process:
  strategy: drop_missing
  cols_to_drop:
    - id
    - timestamp
    - customer_id
  impute_strategy: null
  feature: quality
  test_size: 0.2
Then define src/process.py as follows:
# src/process.py
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    # Convert the entire config object to a YAML string for readable output
    print(OmegaConf.to_yaml(config))

if __name__ == "__main__":
    process_data()
Now run the script, overriding test_size on the command line:
uv run src/process.py process.test_size=0.3
Output:
process:
  strategy: drop_missing
  cols_to_drop:
    - id
    - timestamp
    - customer_id
  impute_strategy: null
  feature: quality
  test_size: 0.3
We can see that test_size is now 0.3 instead of 0.2! The value was overridden at runtime, so you can try different settings quickly without editing the config file.
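Overrides are not limited to a single parameter; you can change several values in one command. For example (alcohol here is just an illustrative column name):
uv run src/process.py process.test_size=0.25 process.feature=alcohol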
Grouping config files
In a data science project, you might have many ways to process your data, each with its own set of parameters. A common approach is to comment and uncomment blocks of configuration code to toggle between them, which leads to cluttered configs:
# conf/main.yaml
# process:
#   strategy: drop_missing
#   cols_to_drop: ["id", "timestamp", "customer_id"]
#   impute_strategy: null
#   feature: "quality"
#   test_size: 0.2
process:
  strategy: impute
  cols_to_drop: []
  impute_strategy: mean
  feature: "quality"
  test_size: 0.2
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
Hydra supports organizing related configurations into groups, making it easier to manage variations of preprocessing steps, models, or training strategies in a clean and modular way.
Here’s how to set up and use a config group for processing options:
First, update your project structure to organize different processing strategies under a process/ config group:
.
└── conf/
    ├── main.yaml
    └── process/
        ├── drop_missing.yaml
        └── impute.yaml
Each file in the process/ folder contains the parameters for a specific data preprocessing method. For example:
# conf/process/drop_missing.yaml
strategy: drop_missing
cols_to_drop: ["id", "timestamp", "customer_id"]
impute_strategy: null
feature: quality
test_size: 0.2

# conf/process/impute.yaml
strategy: impute
cols_to_drop: []
impute_strategy: mean
feature: quality
test_size: 0.2
Now, in main.yaml, reference the process group using Hydra’s defaults list. The _self_ entry controls where main.yaml’s own values sit in the merge order; placing it last lets them override the group defaults:
defaults:
  - process: drop_missing
  - _self_

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
To switch between groups, simply run:
uv run src/process.py process=impute
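Inside process.py, the code can then branch on whichever strategy was selected; a minimal sketch:
# src/process.py (sketch): reads config.process regardless of which
# group file (drop_missing.yaml or impute.yaml) Hydra selected
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    if config.process.strategy == "drop_missing":
        print("Dropping columns:", list(config.process.cols_to_drop))
    elif config.process.strategy == "impute":
        print("Imputing with:", config.process.impute_strategy)

if __name__ == "__main__":
    process_data()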
You can also group training strategies the same way:
conf/
├── main.yaml
├── process/
│   ├── drop_missing.yaml
│   └── impute.yaml
└── train/
    ├── basic.yaml
    └── advanced.yaml
Update main.yaml to include both groups:
defaults:
  - process: drop_missing
  - train: basic
  - _self_

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
With this setup, you can mix and match different combinations of processing and training configurations using a single command:
uv run src/train_model.py process=impute train=advanced
This approach makes it easy to organize and switch between multiple configurations for data preprocessing, without touching your Python scripts.
Multi-run
When testing multiple processing strategies, running them one at a time can slow down your workflow:
uv run src/process.py process=drop_missing
# wait for this to finish
# then run the application with another configuration
uv run src/process.py process=impute
Hydra lets you run the same application across multiple configurations in a single command, eliminating the need to execute each variation manually.
uv run src/process.py --multirun process=drop_missing,impute
Output:
[2025-05-15 11:55:20,260][HYDRA] Launching 2 jobs locally
[2025-05-15 11:55:20,260][HYDRA] #0 : process=drop_missing
[2025-05-15 11:55:20,298][HYDRA] #1 : process=impute
This approach streamlines the process of running an application with various parameters, ultimately saving valuable time and effort.
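Sweeps are not limited to config groups; you can combine them with individual parameter values, and Hydra launches every combination (four jobs in this hypothetical example):
uv run src/process.py --multirun process=drop_missing,impute process.test_size=0.2,0.3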
Final Thoughts
If you’re currently hard-coding parameters in your scripts, a quick way to get started with Hydra is to move those values into a YAML config file and access them using the @hydra.main decorator.
From there, experiment with command-line overrides and modular config groups to keep your pipeline clean and flexible. A small upfront investment—like moving a few parameters to a config file—can save you time and headaches as your project grows.