Poor data quality can have severe consequences, including incorrect conclusions and subpar model performance. Moreover, processing invalid or incorrect data is a waste of time and resources. Therefore, it is crucial to verify the consistency and reliability of data before using it.
In this blog post, we will explore Pandera, a Python library that simplifies data validation for dataframe-like objects. We will demonstrate how to define a schema, validate data, and use decorators to ensure the quality of data passed to functions.
Defining a Schema with Pandera
Pandera provides a simple way to define a schema using the DataFrameSchema class. In the example below, we create a schema for a student dataframe with three columns: name, age, and score.
import pandas as pd
import pandera as pa
student_schema = pa.DataFrameSchema(
    {
        "name": pa.Column(str),
        "age": pa.Column(int, pa.Check.between(0, 120)),
        "score": pa.Column(float, pa.Check.between(0, 100)),
    }
)

This code defines a schema using DataFrameSchema with three columns:
- name: a string column
- age: an integer column that must be between 0 and 120
- score: a float column that must be between 0 and 100
Validating Data with Pandera
Once we have defined the schema, we can validate a dataframe against it using the validate method.
student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 35],
        "score": [95.5, 88.3, 92.7],
    }
)

student_schema.validate(student_df)

| | name | age | score |
|---|---|---|---|
| 0 | John | 25 | 95.5 |
| 1 | Jane | 30 | 88.3 |
| 2 | Bob | 35 | 92.7 |
If the dataframe conforms to the schema, the validate method returns the validated dataframe. Otherwise, it raises a SchemaError with a descriptive error message.
invalid_student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 200],  # 200 violates the 0-120 range check
        "score": [95.5, 88.3, 92.7],
    }
)

try:
    student_schema.validate(invalid_student_df)
except pa.errors.SchemaError as err:
    print("SchemaError:", err)

SchemaError: Column 'age' failed element-wise validator number 0: in_range(0, 120) failure cases: 200

Using Decorators to Validate Function Inputs
Pandera provides a check_input decorator that allows us to validate the inputs of a function before calling it. In the example below, we define a function calculate_grade that takes a dataframe as input and calculates the grade based on the score.
from pandera import check_input

@check_input(student_schema)
def calculate_grade(data: pd.DataFrame):
    # Bin each score into a letter grade; include_lowest=True keeps a score of 0 in "F"
    data["grade"] = pd.cut(
        data["score"],
        bins=[0, 70, 80, 90, 100],
        labels=["F", "C", "B", "A"],
        include_lowest=True,
    )
    return data

calculate_grade(student_df)

| | name | age | score | grade |
|---|---|---|---|---|
| 0 | John | 25 | 95.5 | A |
| 1 | Jane | 30 | 88.3 | B |
| 2 | Bob | 35 | 92.7 | A |
When we call the calculate_grade function with a dataframe that conforms to the schema, it returns the dataframe with the calculated grade.
If the input dataframe does not conform to the schema, the decorator raises a SchemaError.
invalid_student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 200],  # 200 violates the 0-120 range check
        "score": [95.5, 88.3, 92.7],
    }
)

try:
    result = calculate_grade(invalid_student_df)
except pa.errors.SchemaError as err:
    print("SchemaError:", err)

SchemaError: error in check_input decorator of function 'calculate_grade': Column 'age' failed element-wise validator number 0: in_range(0, 120) failure cases: 200


