Validating Polars DataFrames with Pandera

Validating Polars DataFrames with Pandera

Pandera is a Python library that provides a simple and efficient way to validate pandas DataFrames. Recently, Pandera has added support for Polars, a fast and lightweight DataFrame library written in Rust. In this example, we will demonstrate how to use Pandera to validate Polars DataFrames.

Defining a Schema

To validate a Polars DataFrame, we first need to define a schema using the pandera.polars module. A schema is a class that defines the structure and constraints of the DataFrame.

import pandera.polars as pa
import polars as pl

class Schema(pa.DataFrameModel):
    state: str = pa.Field(isin=["FL", "CA"])
    city: str
    price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})

In this example, the schema defines three columns: state, city, and price. The price column has an additional constraint that its values must be between 5 and 20.

Validating a Polars DataFrame

Once we have defined the schema, we can validate a Polars DataFrame using the validate() method.

lf = pl.LazyFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [8, 12, 10, 16, 20, 18],
    }
)
Schema.validate(lf).collect()
statecityprice
strstri64
“FL”“Orlando”8
“FL”“Miami”12
“FL”“Tampa”10
“CA”“San Francisco”16
“CA”“Los Angeles”20
“CA”“San Diego”18

The validate() method checks if the DataFrame conforms to the schema and returns a new DataFrame with the validated data.

Using the check_types() Decorator

Pandera also provides a check_types() decorator that can be used to validate Polars DataFrame function annotations at runtime.

from pandera.typing.polars import LazyFrame

@pa.check_types
def filter_state(lf: LazyFrame[Schema], state: str) -> LazyFrame[Schema]:
    return lf.filter(pl.col("state").eq(state))

filter_state(lf, "CA").collect()
statecityprice
strstri64
“CA”“San Francisco”16
“CA”“Los Angeles”20
“CA”“San Diego”18

In this example, the filter_state() function is decorated with @pa.check_types, which checks if the input and output DataFrames conform to the schema defined in the function annotations.

Conclusion

Pandera provides a simple and efficient way to validate Polars DataFrames. By defining a schema and using the validate() method or the check_types() decorator, you can ensure that your DataFrames conform to a specific structure and set of constraints. This can help prevent errors and make your code more robust and maintainable.

Link to Pandera.

Scroll to Top

Work with Khuyen Tran

Work with Khuyen Tran