Pandera is a Python library that provides a simple and efficient way to validate pandas DataFrames. Recently, Pandera has added support for Polars, a fast and lightweight DataFrame library written in Rust. In this example, we will demonstrate how to use Pandera to validate Polars DataFrames.
New to Polars? Before diving into schema validation, check out my in-depth comparison Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames to understand how Polars differs from Pandas in terms of speed, architecture, and multi-core processing.
Defining a Schema
To validate a Polars DataFrame, we first need to define a schema using the pandera.polars
module. A schema is a class that defines the structure and constraints of the DataFrame.
import pandera.polars as pa
import polars as pl
class Schema(pa.DataFrameModel):
state: str = pa.Field(isin=["FL", "CA"])
city: str
price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})
In this example, the schema defines three columns: state
, city
, and price
. The price
column has an additional constraint that its values must be between 5 and 20.
Validating a Polars DataFrame
Once we have defined the schema, we can validate a Polars DataFrame using the validate()
method.
lf = pl.LazyFrame(
{
"state": ["FL", "FL", "FL", "CA", "CA", "CA"],
"city": [
"Orlando",
"Miami",
"Tampa",
"San Francisco",
"Los Angeles",
"San Diego",
],
"price": [8, 12, 10, 16, 20, 18],
}
)
Schema.validate(lf).collect()
state | city | price |
---|---|---|
str | str | i64 |
“FL” | “Orlando” | 8 |
“FL” | “Miami” | 12 |
“FL” | “Tampa” | 10 |
“CA” | “San Francisco” | 16 |
“CA” | “Los Angeles” | 20 |
“CA” | “San Diego” | 18 |
The validate()
method checks if the DataFrame conforms to the schema and returns a new DataFrame with the validated data.
Using the check_types() Decorator
Pandera also provides a check_types()
decorator that can be used to validate Polars DataFrame function annotations at runtime.
from pandera.typing.polars import LazyFrame
@pa.check_types
def filter_state(lf: LazyFrame[Schema], state: str) -> LazyFrame[Schema]:
return lf.filter(pl.col("state").eq(state))
filter_state(lf, "CA").collect()
state | city | price |
---|---|---|
str | str | i64 |
“CA” | “San Francisco” | 16 |
“CA” | “Los Angeles” | 20 |
“CA” | “San Diego” | 18 |
In this example, the filter_state()
function is decorated with @pa.check_types
, which checks if the input and output DataFrames conform to the schema defined in the function annotations.
To better understand why Polars is an excellent choice for efficient data processing and how it compares to Pandas, check out my in-depth article: Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames.
Conclusion
Pandera provides a simple and efficient way to validate Polars DataFrames. By defining a schema and using the validate()
method or the check_types()
decorator, you can ensure that your DataFrames conform to a specific structure and set of constraints. This can help prevent errors and make your code more robust and maintainable.
Want the full walkthrough?
Check out our in-depth guide on Polars vs Pandas: A Fast, Multi-Core Alternative for DataFrames