Poor data quality can have severe consequences, including incorrect conclusions and subpar model performance. Moreover, processing invalid or incorrect data is a waste of time and resources. Therefore, it is crucial to verify the consistency and reliability of data before using it.
In this blog post, we will explore Pandera, a Python library that simplifies data validation for dataframe-like objects. We will demonstrate how to define a schema, validate data, and use decorators to ensure the quality of data passed to functions.
Defining a Schema with Pandera
Pandera provides a simple way to define a schema using the DataFrameSchema class. In the example below, we create a schema for a student dataframe with three columns: name, age, and score.
import pandas as pd
import pandera as pa

student_schema = pa.DataFrameSchema(
    {
        "name": pa.Column(str),
        "age": pa.Column(int, pa.Check.between(0, 120)),
        "score": pa.Column(float, pa.Check.between(0, 100)),
    }
)
This code defines a schema using DataFrameSchema with three columns:
- name: a string column
- age: an integer column that must be between 0 and 120
- score: a float column that must be between 0 and 100
Validating Data with Pandera
Once we have defined the schema, we can validate a dataframe against it using the validate method.
student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 35],
        "score": [95.5, 88.3, 92.7],
    }
)

student_schema.validate(student_df)
| | name | age | score |
|---|---|---|---|
| 0 | John | 25 | 95.5 |
| 1 | Jane | 30 | 88.3 |
| 2 | Bob | 35 | 92.7 |
If the dataframe conforms to the schema, the validate method returns the validated dataframe. Otherwise, it raises a SchemaError with a descriptive error message.
invalid_student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 200],  # 200 is outside the allowed range 0-120
        "score": [95.5, 88.3, 92.7],
    }
)

try:
    student_schema.validate(invalid_student_df)
except pa.errors.SchemaError as err:
    print("SchemaError:", err)

SchemaError: Column 'age' failed element-wise validator number 0: in_range(0, 120) failure cases: 200
Using Decorators to Validate Function Inputs
Pandera provides a check_input decorator that allows us to validate the inputs of a function before calling it. In the example below, we define a function calculate_grade that takes a dataframe as input and calculates the grade based on the score.
from pandera import check_input

@check_input(student_schema)
def calculate_grade(data: pd.DataFrame):
    # Bin each score into a letter grade; include_lowest=True puts a score of 0 in "F"
    data["grade"] = pd.cut(
        data["score"],
        bins=[0, 70, 80, 90, 100],
        labels=["F", "C", "B", "A"],
        include_lowest=True,
    )
    return data

calculate_grade(student_df)
| | name | age | score | grade |
|---|---|---|---|---|
| 0 | John | 25 | 95.5 | A |
| 1 | Jane | 30 | 88.3 | B |
| 2 | Bob | 35 | 92.7 | A |
When we call the calculate_grade function with a dataframe that conforms to the schema, it returns the dataframe with the calculated grade.
If the input dataframe does not conform to the schema, the decorator raises a SchemaError.
invalid_student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 200],
        "score": [95.5, 88.3, 92.7],
    }
)

try:
    result = calculate_grade(invalid_student_df)
except pa.errors.SchemaError as err:
    print("SchemaError:", err)

SchemaError: error in check_input decorator of function 'calculate_grade': Column 'age' failed element-wise validator number 0: in_range(0, 120) failure cases: 200