Pandera: Data Validation Made Simple for Python DataFrames

Khuyen Tran

Poor data quality can have severe consequences, including incorrect conclusions and subpar model performance. Moreover, processing invalid or incorrect data is a waste of time and resources. Therefore, it is crucial to verify the consistency and reliability of data before using it.

In this blog post, we will explore Pandera, a Python library that simplifies data validation for dataframe-like objects. We will demonstrate how to define a schema, validate data, and use decorators to ensure the quality of data passed to functions.

Defining a Schema with Pandera

Pandera provides a simple way to define a schema using the DataFrameSchema class. In the example below, we create a schema for a student dataframe with three columns: name, age, and score.

import pandas as pd
import pandera as pa

student_schema = pa.DataFrameSchema(
    {
        "name": pa.Column(str),
        "age": pa.Column(int, pa.Check.between(0, 120)),
        "score": pa.Column(float, pa.Check.between(0, 100)),
    }
)

This code defines a schema using DataFrameSchema with three columns:

name: a string column
age: an integer column that must be between 0 and 120
score: a float column that must be between 0 and 100
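For comparison, the same three rules can be written by hand in plain pandas. This is only a sketch of what the schema replaces: check_students_manually is a hypothetical helper, not part of Pandera, and its dtype tests are a looser approximation of what pa.Column(str), pa.Column(int), and pa.Column(float) enforce.

```python
import pandas as pd
from pandas.api import types as pdt


def check_students_manually(df: pd.DataFrame) -> list:
    """Return a list of rule violations; an empty list means the frame passed."""
    problems = []

    # Every column named in the schema must be present.
    missing = [col for col in ("name", "age", "score") if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    # Dtype rules (an approximation of the schema's column types).
    if not pdt.is_string_dtype(df["name"]):
        problems.append("name is not a string column")
    if not pdt.is_integer_dtype(df["age"]):
        problems.append("age is not an integer column")
    if not pdt.is_float_dtype(df["score"]):
        problems.append("score is not a float column")

    # Range rules; Series.between is inclusive on both ends by default.
    if pdt.is_numeric_dtype(df["age"]) and not df["age"].between(0, 120).all():
        problems.append("age has values outside [0, 120]")
    if pdt.is_numeric_dtype(df["score"]) and not df["score"].between(0, 100).all():
        problems.append("score has values outside [0, 100]")

    return problems


ok_df = pd.DataFrame({"name": ["John"], "age": [25], "score": [95.5]})
bad_df = pd.DataFrame({"name": ["Bob"], "age": [200], "score": [92.7]})

manual_ok = check_students_manually(ok_df)    # passes every rule
manual_bad = check_students_manually(bad_df)  # age 200 breaks the range rule
```

Even for three columns the hand-written version is several times longer than the schema, and every new rule adds another branch to maintain.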

Validating Data with Pandera

Once we have defined the schema, we can validate a dataframe against it using the validate method.

student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 35],
        "score": [95.5, 88.3, 92.7],
    }
)

student_schema.validate(student_df)

	name	age	score
0	John	25	95.5
1	Jane	30	88.3
2	Bob	35	92.7

If the dataframe conforms to the schema, the validate method returns the validated dataframe. Otherwise, it raises a SchemaError with a descriptive error message.

invalid_student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 200],
        "score": [95.5, 88.3, 92.7],
    }
)

try:
    student_schema.validate(invalid_student_df)
except pa.errors.SchemaError as err:
    print("SchemaError:", err)

SchemaError: Column 'age' failed element-wise validator number 0: in_range(0, 120) failure cases: 200

Using Decorators to Validate Function Inputs

Pandera provides a check_input decorator that validates a function's input dataframe (by default, the first positional argument) before the function body runs. In the example below, we define a function calculate_grade that takes a dataframe as input and assigns a letter grade based on the score.

from pandera import check_input

@check_input(student_schema)
def calculate_grade(data: pd.DataFrame):
    data["grade"] = pd.cut(
        data["score"],
        bins=[0, 70, 80, 90, 100],
        labels=["F", "C", "B", "A"],
        include_lowest=True,
    )
    return data

When we call the calculate_grade function with a dataframe that conforms to the schema, it returns the dataframe with the calculated grade:

calculate_grade(student_df)

	name	age	score	grade
0	John	25	95.5	A
1	Jane	30	88.3	B
2	Bob	35	92.7	A

If the input dataframe does not conform to the schema, the decorator raises a SchemaError.

invalid_student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 200],
        "score": [95.5, 88.3, 92.7],
    }
)

try:
    result = calculate_grade(invalid_student_df)
except pa.errors.SchemaError as err:
    print("SchemaError:", err)

SchemaError: error in check_input decorator of function 'calculate_grade': Column 'age' failed element-wise validator number 0: in_range(0, 120) failure cases: 200
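To make the mechanism less magical, here is a minimal pure-Python sketch of how an input-validating decorator can be built. It is not Pandera's implementation: validate_first_arg, require_scores_in_range, and average_score are hypothetical names, and plain lists of dicts stand in for dataframes.

```python
import functools


def validate_first_arg(validator):
    """Build a decorator that runs `validator` on the first positional argument."""

    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(first, *args, **kwargs):
            # The validator raises on bad input, so func never sees it.
            validator(first)
            return func(first, *args, **kwargs)

        return wrapper

    return decorator


def require_scores_in_range(records):
    """Hypothetical validator: every score must lie in [0, 100]."""
    for record in records:
        if not 0 <= record["score"] <= 100:
            raise ValueError(f"score out of range: {record['score']}")


@validate_first_arg(require_scores_in_range)
def average_score(records):
    return sum(r["score"] for r in records) / len(records)


good_average = average_score([{"score": 90.0}, {"score": 80.0}])

try:
    average_score([{"score": 250.0}])
    rejected = False
except ValueError:
    rejected = True
```

The point of the pattern is that the validation rule lives next to the function signature rather than inside the function body, so the body can assume clean data.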

Link to Pandera.