
Pandera: Data Validation Made Simple for Python DataFrames

Poor data quality can have severe consequences, including incorrect conclusions and subpar model performance. Moreover, processing invalid or incorrect data is a waste of time and resources. Therefore, it is crucial to verify the consistency and reliability of data before using it.

In this blog post, we will explore Pandera, a Python library that simplifies data validation for dataframe-like objects. We will demonstrate how to define a schema, validate data, and use decorators to ensure the quality of data passed to functions.

Defining a Schema with Pandera

Pandera provides a simple way to define a schema using the DataFrameSchema class. In the example below, we create a schema for a student dataframe with three columns: name, age, and score.

import pandas as pd
import pandera as pa

student_schema = pa.DataFrameSchema(
    {
        "name": pa.Column(str),
        "age": pa.Column(int, pa.Check.between(0, 120)),
        "score": pa.Column(float, pa.Check.between(0, 100)),
    }
)

This code defines a schema using DataFrameSchema with three columns:

  1. name: a string column
  2. age: an integer column whose values must be between 0 and 120 (inclusive)
  3. score: a float column whose values must be between 0 and 100 (inclusive)

Validating Data with Pandera

Once we have defined the schema, we can validate a dataframe against it using the validate method.

student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 35],
        "score": [95.5, 88.3, 92.7],
    }
)

student_schema.validate(student_df)
   name  age  score
0  John   25   95.5
1  Jane   30   88.3
2   Bob   35   92.7

If the dataframe conforms to the schema, the validate method returns the validated dataframe. Otherwise, it raises a SchemaError with a descriptive error message.

invalid_student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 200],
        "score": [95.5, 88.3, 92.7],
    }
)

try:
    student_schema.validate(invalid_student_df)
except pa.errors.SchemaError as err:
    print("SchemaError:", err)
SchemaError: Column 'age' failed element-wise validator number 0: in_range(0, 120) failure cases: 200

Using Decorators to Validate Function Inputs

Pandera provides a check_input decorator that validates a function's dataframe argument before the function body runs. In the example below, we define a function calculate_grade that takes a dataframe as input and assigns a letter grade based on the score.

from pandera import check_input

@check_input(student_schema)
def calculate_grade(data: pd.DataFrame):
    data["grade"] = pd.cut(
        data["score"],
        bins=[0, 70, 80, 90, 100],
        labels=["F", "C", "B", "A"],
        include_lowest=True,
    )
    return data

calculate_grade(student_df)
   name  age  score grade
0  John   25   95.5     A
1  Jane   30   88.3     B
2   Bob   35   92.7     A

When we call the calculate_grade function with a dataframe that conforms to the schema, it returns the dataframe with the calculated grade.
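To make the grading rule concrete, here is a small pandas-only sketch of how pd.cut assigns labels with the bins used above. The boundary scores are chosen for illustration. Intervals are closed on the right by default, so a score of exactly 70 falls in the lowest bin, and include_lowest=True makes a score of 0 fall inside that bin as well.

```python
import pandas as pd

# Illustrative scores sitting on or near the bin edges.
scores = pd.Series([0.0, 70.0, 70.5, 90.0, 100.0])

grades = pd.cut(
    scores,
    bins=[0, 70, 80, 90, 100],
    labels=["F", "C", "B", "A"],
    include_lowest=True,
)

print(grades.tolist())  # ['F', 'F', 'C', 'B', 'A']
```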

If the input dataframe does not conform to the schema, the decorator raises a SchemaError.

invalid_student_df = pd.DataFrame(
    {
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 200],
        "score": [95.5, 88.3, 92.7],
    }
)

try:
    result = calculate_grade(invalid_student_df)
except pa.errors.SchemaError as err:
    print("SchemaError:", err)
SchemaError: error in check_input decorator of function 'calculate_grade': Column 'age' failed element-wise validator number 0: in_range(0, 120) failure cases: 200

Link to Pandera.
