Validating Polars DataFrames with Pandera

January 25, 2025

Validating Polars DataFrames with Pandera

Khuyen Tran

Pandera is a Python library that provides a simple and efficient way to validate pandas DataFrames. Recently, Pandera has added support for Polars, a fast and lightweight DataFrame library written in Rust. In this example, we will demonstrate how to use Pandera to validate Polars DataFrames.

New to Polars? Before diving into schema validation, check out my in-depth comparison Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames to understand how Polars differs from Pandas in terms of speed, architecture, and multi-core processing.

Defining a Schema

To validate a Polars DataFrame, we first need to define a schema using the pandera.polars module. A schema is a class that defines the structure and constraints of the DataFrame.

import pandera.polars as pa
import polars as pl

class Schema(pa.DataFrameModel):
    state: str = pa.Field(isin=["FL", "CA"])
    city: str
    price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})

In this example, the schema defines three columns: state, city, and price. The price column has an additional constraint that its values must be between 5 and 20.

Validating a Polars DataFrame

Once we have defined the schema, we can validate a Polars DataFrame using the validate() method.

lf = pl.LazyFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [8, 12, 10, 16, 20, 18],
    }
)
Schema.validate(lf).collect()

state	city	price
str	str	i64
“FL”	“Orlando”	8
“FL”	“Miami”	12
“FL”	“Tampa”	10
“CA”	“San Francisco”	16
“CA”	“Los Angeles”	20
“CA”	“San Diego”	18

The validate() method checks if the DataFrame conforms to the schema and returns a new DataFrame with the validated data.

Using the check_types() Decorator

Pandera also provides a check_types() decorator that can be used to validate Polars DataFrame function annotations at runtime.

from pandera.typing.polars import LazyFrame

@pa.check_types
def filter_state(lf: LazyFrame[Schema], state: str) -> LazyFrame[Schema]:
    return lf.filter(pl.col("state").eq(state))

filter_state(lf, "CA").collect()

state	city	price
str	str	i64
“CA”	“San Francisco”	16
“CA”	“Los Angeles”	20
“CA”	“San Diego”	18

In this example, the filter_state() function is decorated with @pa.check_types, which checks if the input and output DataFrames conform to the schema defined in the function annotations.

To better understand why Polars is an excellent choice for efficient data processing and how it compares to Pandas, check out my in-depth article: Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames.

Conclusion

Pandera provides a simple and efficient way to validate Polars DataFrames. By defining a schema and using the validate() method or the check_types() decorator, you can ensure that your DataFrames conform to a specific structure and set of constraints. This can help prevent errors and make your code more robust and maintainable.

Link to Pandera.