Pandera is a Python library that provides a simple and efficient way to validate pandas DataFrames. Recently, Pandera has added support for Polars, a fast and lightweight DataFrame library written in Rust. In this example, we will demonstrate how to use Pandera to validate Polars DataFrames.
Defining a Schema
To validate a Polars DataFrame, we first need to define a schema using the pandera.polars
module. A schema is a class that defines the structure and constraints of the DataFrame.
import pandera.polars as pa
import polars as pl
class Schema(pa.DataFrameModel):
state: str = pa.Field(isin=["FL", "CA"])
city: str
price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})
In this example, the schema defines three columns: state
, city
, and price
. The price
column has an additional constraint that its values must be between 5 and 20.
Validating a Polars DataFrame
Once we have defined the schema, we can validate a Polars DataFrame using the validate()
method.
lf = pl.LazyFrame(
{
"state": ["FL", "FL", "FL", "CA", "CA", "CA"],
"city": [
"Orlando",
"Miami",
"Tampa",
"San Francisco",
"Los Angeles",
"San Diego",
],
"price": [8, 12, 10, 16, 20, 18],
}
)
Schema.validate(lf).collect()
state | city | price |
---|---|---|
str | str | i64 |
“FL” | “Orlando” | 8 |
“FL” | “Miami” | 12 |
“FL” | “Tampa” | 10 |
“CA” | “San Francisco” | 16 |
“CA” | “Los Angeles” | 20 |
“CA” | “San Diego” | 18 |
The validate()
method checks if the DataFrame conforms to the schema and returns a new DataFrame with the validated data.
Using the check_types() Decorator
Pandera also provides a check_types()
decorator that can be used to validate Polars DataFrame function annotations at runtime.
from pandera.typing.polars import LazyFrame
@pa.check_types
def filter_state(lf: LazyFrame[Schema], state: str) -> LazyFrame[Schema]:
return lf.filter(pl.col("state").eq(state))
filter_state(lf, "CA").collect()
state | city | price |
---|---|---|
str | str | i64 |
“CA” | “San Francisco” | 16 |
“CA” | “Los Angeles” | 20 |
“CA” | “San Diego” | 18 |
In this example, the filter_state()
function is decorated with @pa.check_types
, which checks if the input and output DataFrames conform to the schema defined in the function annotations.
Conclusion
Pandera provides a simple and efficient way to validate Polars DataFrames. By defining a schema and using the validate()
method or the check_types()
decorator, you can ensure that your DataFrames conform to a specific structure and set of constraints. This can help prevent errors and make your code more robust and maintainable.