Table of Contents
- Why You Should Avoid Hard-Coding
- Configuration Files to the Rescue
- Introduction to Hydra
- Convenient Parameter Access
- Command-line configuration override
- Grouping config files
- Multi-run
- Final Thoughts
In data science projects, values like file names, features, split ratios, and hyperparameters often change. When these parameters are hard-coded, your code becomes inflexible and harder to maintain.
A better solution is to use a Python configuration framework like Hydra, which lets you store settings in YAML files. This approach separates configuration from logic, making your code cleaner and more adaptable across different environments and experiments.
Why You Should Avoid Hard-Coding
Here are three major problems caused by hard-coded parameters:
Maintainability
Manually updating the same parameter across different files or functions is tedious and error-prone. For example, hard-coding a value like split_ratio across multiple scripts can lead to mismatches. If one script updates the value but another doesn’t, the code runs inconsistently and is harder to debug:
# script1.py
split_ratio = 0.3  # updated value
# script2.py
split_ratio = 0.2  # outdated value, not updated
Reusability
Hard-coding values limits the reusability of code for different scenarios. For example, the script below is tied to a specific dataset through a hard-coded file path. To use a different dataset, you’d have to manually update the path every time, which is error-prone and slows down iteration.
# preprocess.py
input_file = "data/input_v1.csv"  # needs to be updated manually to "data/input_v2.csv"
Security
Hard-coding secrets like API keys, passwords, or database URLs directly into scripts can be a serious risk. The example below shows hard-coded database credentials. If this file is pushed to a shared repository, those credentials could be exposed and lead to unauthorized access to your database.
# config.py
db_user = "admin"
db_password = "pa55word"  # hard-coded database credentials
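One common way to keep such credentials out of source code is to read them from environment variables at runtime. The sketch below assumes two hypothetical variable names, DB_USER and DB_PASSWORD; they are not part of the example above, and the right names depend on your own deployment.

```python
import os


def load_db_credentials(env=None):
    """Read database credentials from environment variables.

    DB_USER and DB_PASSWORD are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("DB_USER", "DB_PASSWORD") if name not in env]
    if missing:
        # Fail early with a clear message instead of connecting with bad values
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"user": env["DB_USER"], "password": env["DB_PASSWORD"]}
```

If you prefer to keep the reference inside a YAML file, OmegaConf also provides an oc.env resolver, so a value written as ${oc.env:DB_PASSWORD} is looked up in the environment when it is accessed.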
Configuration Files to the Rescue
Configuration files help improve your workflow in the following ways:
Cleaner code and easier maintenance
Keeping configuration separate from logic makes scripts easier to read and maintain. You can change parameters without touching your core code.
# main.yaml
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
cols_to_drop:
  - free sulfur dioxide
import pandas as pd
from omegaconf import OmegaConf
config = OmegaConf.load("main.yaml")
data = pd.read_csv(config.data.raw)
data = data.drop(columns=config.cols_to_drop)
Faster experimentation
Configuration files allow you to tweak parameters like features, splits, and hyperparameters without modifying the source code, enabling rapid iteration and experimentation.
# main.yaml
# Change from this
features: [age, income, education]
# To this without touching the source code
features: [age, income, education, credit_score]
Simplified deployment
With config files, adapting to different environments like development or production is straightforward. You can swap in the right settings without editing any logic.
# conf/database/dev.yaml
name: dev
db_url: sqlite:///dev.db
# conf/database/prod.yaml
name: prod
db_url: postgresql://prod_user:secure@prod.db.example.com/prod
# Run with dev settings
python main.py database=dev

# Run with prod settings
python main.py database=prod
Introduction to Hydra
Hydra is a modern Python library that simplifies how you structure and experiment with configurations. It helps you keep your code clean, flexible, and scalable by supporting:
- Intuitive access to parameters via dot notation
- Quick overrides from the command line for fast iteration
- Logical grouping of configs to manage complexity
- Multi-run execution to automate combinations of configurations
Key Takeaways
Here’s what you’ll learn:
- Eliminate hard-coded parameters by centralizing configuration in maintainable YAML files
- Override any configuration parameter from the command line for rapid experimentation without code changes
- Organize complex configurations into modular groups for clean separation of processing, training, and deployment settings
- Run multiple configuration combinations automatically using multi-run execution to accelerate hyperparameter testing
- Build flexible data pipelines that adapt across development and production environments seamlessly
Installation
You can install Hydra using either pip or uv:
# Option 1: pip
pip install hydra-core
# Option 2: uv (a faster alternative to pip)
uv add hydra-core
Let’s explore how each of these features improves data science workflows.
Convenient Parameter Access
Suppose all configuration files are stored under the conf folder, and all Python scripts are stored under the src folder.
.
├── conf/
│   └── main.yaml
└── src/
    ├── process.py
    └── train_model.py
And the main.yaml file looks like this:
process:
  cols_to_drop:
  - free sulfur dioxide
  feature: quality
  test_size: 0.2
train:
  hyperparameters:
    svm__kernel:
    - rbf
    svm__C:
    - 0.1
    - 1
    - 10
    svm__gamma:
    - 0.1
    - 1
    - 10
  grid_search:
    cv: 2
    scoring: accuracy
    verbose: 3
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
model: models
You can load a configuration file in your Python script by decorating your main function with @hydra.main, which tells Hydra where to find and how to apply the configuration.
from omegaconf import DictConfig
import hydra


@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    ...
In the code above, config is an instance of DictConfig, a flexible and hierarchical configuration object provided by OmegaConf. It behaves like both a dictionary and an object, allowing you to access parameters using dot notation (config.key) or dictionary-style access (config['key']):
# src/process.py
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    print("Accessing with bracket notation:", config["process"]["cols_to_drop"])
    print("Accessing with dot notation:", config.process.cols_to_drop)


if __name__ == "__main__":
    process_data()
Running this Python script is straightforward:
python src/process.py
Or use uv, a fast Python package and project manager that can also run scripts inside the project's environment:
uv run src/process.py
Output:
Accessing with bracket notation: ['free sulfur dioxide']
Accessing with dot notation: ['free sulfur dioxide']
Both notations return the same value, so you can use whichever reads better in your code.
Command-line configuration override
Let’s say you are experimenting with different test_size values. It is time-consuming to repeatedly open your configuration file and modify the test_size value.
# conf/main.yaml
process:
  cols_to_drop:
    - free sulfur dioxide
  feature: quality
  test_size: 0.3  # previously 0.2
Luckily, Hydra makes it easy to override configuration values directly from the command line.
Let’s try overriding a parameter at runtime. Start with the following conf/main.yaml configuration:
process:
  strategy: drop_missing
  cols_to_drop:
    - id
    - timestamp
    - customer_id
  impute_strategy: null
  feature: quality
  test_size: 0.2
Then define src/process.py as follows:
# src/process.py
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="../conf", config_name="main", version_base=None)
def process_data(config: DictConfig):
    # Converts the entire config object to a YAML string for readable output
    print(OmegaConf.to_yaml(config))


if __name__ == "__main__":
    process_data()
Now run the script, overriding test_size on the command line:
uv run src/process.py process.test_size=0.3
Output:
process:
  strategy: drop_missing
  cols_to_drop:
  - id
  - timestamp
  - customer_id
  impute_strategy: null
  feature: quality
  test_size: 0.3
The output shows test_size as 0.3 instead of 0.2, confirming that the value was overridden at runtime. This lets you test different settings quickly without editing the config file.
Grouping config files
In a data science project, you might have many ways to process your data, each with its own set of parameters. A common approach is to comment and uncomment blocks of configuration code to toggle between them, which leads to cluttered configs:
# conf/main.yaml
# process:
  # strategy: drop_missing
  # cols_to_drop: ["id", "timestamp", "customer_id"]
  # impute_strategy: null
  # feature: "quality"
  # test_size: 0.2
process:
  strategy: impute
  cols_to_drop: []
  impute_strategy: mean
  feature: "quality"
  test_size: 0.2
data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
Hydra supports organizing related configurations into groups, making it easier to manage variations of preprocessing steps, models, or training strategies in a clean and modular way.
Here’s how to set up and use a config group for processing options:
First, update your project structure to organize different processing strategies under a process/ config group:
.
└── conf/
    ├── main.yaml
    └── process/
        ├── drop_missing.yaml
        └── impute.yaml
Each file in the process/ folder contains parameters for a specific data preprocessing method. For example:
# conf/process/drop_missing.yaml
strategy: drop_missing
cols_to_drop: ["id", "timestamp", "customer_id"]
impute_strategy: null
feature: quality
test_size: 0.2
# conf/process/impute.yaml
strategy: impute
cols_to_drop: []
impute_strategy: mean
feature: quality
test_size: 0.2
Now, in main.yaml, reference the process group using Hydra’s defaults list:
defaults:
  - process: drop_missing
  - _self_

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
To switch between groups, simply run:
uv run src/process.py process=impute
You can also group training strategies the same way:
conf/
├── main.yaml
├── process/
│   ├── drop_missing.yaml
│   └── impute.yaml
└── train/
    ├── basic.yaml
    └── advanced.yaml
Update main.yaml to include both groups:
defaults:
  - process: drop_missing
  - train: basic
  - _self_

data:
  raw: data/raw/winequality-red.csv
  intermediate: data/intermediate
With this setup, you can mix and match different combinations of processing and training configurations using a single command:
uv run src/train_model.py process=impute train=advanced
This approach makes it easy to organize and switch between multiple configurations for data preprocessing and training, without touching your Python scripts.
Multi-run
When testing multiple processing strategies, running them one at a time can slow down your workflow:
uv run src/process.py process=drop_missing
# wait for this to finish
# then run the application with another configuration
uv run src/process.py process=impute
Hydra lets you run the same application across multiple configurations in a single command, eliminating the need to execute each variation manually.
uv run src/process.py --multirun process=drop_missing,impute
Output:
[2025-05-15 11:55:20,260][HYDRA] Launching 2 jobs locally
[2025-05-15 11:55:20,260][HYDRA]        #0 : process=drop_missing
[2025-05-15 11:55:20,298][HYDRA]        #1 : process=impute
This approach streamlines the process of running an application with various parameters, ultimately saving valuable time and effort.
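When several comma-separated overrides are passed, for example process=drop_missing,impute train=basic,advanced, Hydra's default sweeper launches one job per combination. The sketch below only enumerates those combinations with itertools.product to show how the number of jobs grows. It does not launch anything, and the train options are borrowed from the config-group layout shown earlier:

```python
from itertools import product

# Each key is a config group; each list holds the comma-separated choices
sweep = {
    "process": ["drop_missing", "impute"],
    "train": ["basic", "advanced"],
}

# Build one override string per combination of choices
jobs = [
    " ".join(f"{key}={value}" for key, value in zip(sweep, combo))
    for combo in product(*sweep.values())
]

for number, overrides in enumerate(jobs):
    print(f"#{number} : {overrides}")
```

Two groups with two options each yield four jobs, so it is worth checking the size of a sweep before starting long-running experiments.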
Final Thoughts
If you’re currently hard-coding parameters in your scripts, a quick way to get started with Hydra is to move those values into a YAML config file and access them using the @hydra.main decorator.
From there, experiment with command-line overrides and modular config groups to keep your pipeline clean and flexible. A small upfront investment—like moving a few parameters to a config file—can save you time and headaches as your project grows.