Validating Polars DataFrames with Pandera

Pandera is a Python library for validating DataFrames against a declared schema. It started with pandas and has since added support for Polars, a fast DataFrame library written in Rust. In this example, we will demonstrate how to use Pandera to validate Polars DataFrames and LazyFrames.

Defining a Schema

To validate a Polars DataFrame, we first need to define a schema using the pandera.polars module. A schema is a class that defines the structure and constraints of the DataFrame.

import pandera.polars as pa
import polars as pl

class Schema(pa.DataFrameModel):
    state: str = pa.Field(isin=["FL", "CA"])
    city: str
    price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})

In this example, the schema defines three columns: state and city (strings) and price (an integer). Two of them carry extra constraints: state must be either "FL" or "CA", and price must be between 5 and 20.

Validating a Polars DataFrame

Once we have defined the schema, we can validate Polars data using the validate() method. Here we pass it a LazyFrame and call collect() on the result to materialize it.

lf = pl.LazyFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [8, 12, 10, 16, 20, 18],
    }
)
Schema.validate(lf).collect()
state   city              price
str     str               i64
"FL"    "Orlando"         8
"FL"    "Miami"           12
"FL"    "Tampa"           10
"CA"    "San Francisco"   16
"CA"    "Los Angeles"     20
"CA"    "San Diego"       18

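The validate() method checks whether the data conforms to the schema. If it does, the method returns the validated frame (a LazyFrame in this case, which collect() turns into a DataFrame); if it does not, Pandera raises a SchemaError.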

Using the check_types() Decorator

Pandera also provides a check_types() decorator that validates a function's Polars inputs and outputs at runtime, based on the schemas given in its type annotations.

from pandera.typing.polars import LazyFrame

@pa.check_types
def filter_state(lf: LazyFrame[Schema], state: str) -> LazyFrame[Schema]:
    return lf.filter(pl.col("state").eq(state))

filter_state(lf, "CA").collect()
state   city              price
str     str               i64
"CA"    "San Francisco"   16
"CA"    "Los Angeles"     20
"CA"    "San Diego"       18

In this example, the filter_state() function is decorated with @pa.check_types, which checks that both the input LazyFrame and the returned LazyFrame conform to the schema named in the function annotations.

Conclusion

Pandera provides a simple and efficient way to validate Polars DataFrames. By defining a schema and using the validate() method or the check_types() decorator, you can ensure that your DataFrames conform to a specific structure and set of constraints. This can help prevent errors and make your code more robust and maintainable.

Link to Pandera.
