
Narwhals: Unified DataFrame Functions for pandas, Polars, and PySpark

Introduction

Have you ever needed to convert a function to work with pandas, Polars, DuckDB, or PySpark DataFrames? If you were a data scientist in 2022, you might have just used pandas and called it a day.

from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    "date": [datetime(2020, 1, 1), datetime(2020, 1, 8), datetime(2020, 2, 3)],
    "price": [1, 4, 3],
})

def monthly_aggregate_pandas(user_df):
    return user_df.resample("MS", on="date")[["price"]].mean()

monthly_aggregate_pandas(df)

But it’s 2025 now – if you try doing that today, you’ll quickly run into complaints:

  • Another team prefers using Polars.
  • Your lead data engineer wants to deploy using PySpark.
  • Another data engineer argues that DuckDB is all you need.
  • Your colleague would prefer using PyArrow due to its great interoperability.

Indeed, choosing a dataframe library is a common pain point. With all the dataframe libraries out there, each with its own API, how do you make a good, future-proof decision and avoid lock-in?

In particular, how do you write reusable and maintainable functions that can work with any major dataframe library?

This article will walk through the limitations of naive conversion, the complexity of maintaining separate logic for each DataFrame library, and how Narwhals offers a clean, unified way to express your DataFrame logic once and run it anywhere.

The source code of this article can be found here:

Bad solution: convert all user input to pandas

You could make your tool appear dataframe-agnostic by just converting the user input to pandas.

import polars as pl
import duckdb
import pyarrow as pa
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

def monthly_aggregate_bad(user_df):
    if isinstance(user_df, pd.DataFrame):
        df = user_df
    elif isinstance(user_df, pl.DataFrame):
        df = user_df.to_pandas()
    elif isinstance(user_df, duckdb.DuckDBPyRelation):
        df = user_df.df()
    elif isinstance(user_df, pa.Table):
        df = user_df.to_pandas()
    elif isinstance(user_df, pyspark.sql.dataframe.DataFrame):
        df = user_df.toPandas()
    else:
        raise TypeError("Unsupported DataFrame type: cannot convert to pandas")
    return df.resample("MS", on="date")[["price"]].mean()

Use the monthly_aggregate_bad function for different types of DataFrames:

data = {
    "date": [datetime(2020, 1, 1), datetime(2020, 1, 8), datetime(2020, 2, 3)],
    "price": [1, 4, 3],
}

# pandas
pandas_df = pd.DataFrame(data)
monthly_aggregate_bad(pandas_df)

# polars
polars_df = pl.DataFrame(data)
monthly_aggregate_bad(polars_df)

# duckdb
duckdb_df = duckdb.from_df(pandas_df)
monthly_aggregate_bad(duckdb_df)

# pyspark
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
monthly_aggregate_bad(spark_df)

# pyarrow
arrow_table = pa.table(data)
monthly_aggregate_bad(arrow_table)

However, converting everything to pandas is a missed opportunity, as you may lose out on:

  • Lazy evaluation and query optimization from engines like Polars, DuckDB, and PySpark.
  • Memory efficiency — the entire dataset must be materialized as an in-memory pandas DataFrame.
  • Distributed execution when the input comes from PySpark.

If you want to appease your data engineers, you’ll need to support modern data tools natively.

Unmaintainable solution: write separate code for all input libraries

Having decided that you need to support pandas, Polars, PySpark, DuckDB, and PyArrow natively, you may decide to write a separate branch for each input kind:

def monthly_aggregate_unmaintainable(user_df):
    if isinstance(user_df, pd.DataFrame):
        result = user_df.resample("MS", on="date")[["price"]].mean()
    elif isinstance(user_df, pl.DataFrame):
        result = (
            user_df.group_by(pl.col("date").dt.truncate("1mo"))
            .agg(pl.col("price").mean())
            .sort("date")
        )
    elif isinstance(user_df, pyspark.sql.dataframe.DataFrame):
        result = (
            user_df.withColumn("date_month", F.date_trunc("month", F.col("date")))
            .groupBy("date_month")
            .agg(F.mean("price").alias("price_mean"))
            .orderBy("date_month")
        )
    # TODO: more branches for DuckDB, PyArrow, Dask, etc... :sob:
    return result

Then use the monthly_aggregate_unmaintainable function for different types of DataFrames:

# pandas
monthly_aggregate_unmaintainable(pandas_df)

# polars
monthly_aggregate_unmaintainable(polars_df)

# pyspark
monthly_aggregate_unmaintainable(spark_df)

Maintaining separate code for each DataFrame library quickly becomes unmanageable. Every new library introduces more branching logic, more surface area for bugs, and more overhead when requirements change. Surely, there’s a better way?

Best solution: express your logic once using Narwhals

Narwhals is an extremely lightweight compatibility layer between dataframes and is used by Plotly, Marimo, Altair, Bokeh, and more. It allows you to express dataframe logic just once, with a unified API. Using Narwhals, the complicated code above becomes:

import narwhals as nw
from narwhals.typing import IntoFrameT

def monthly_aggregate(user_df: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(user_df)
        .group_by(nw.col("date").dt.truncate("1mo"))
        .agg(nw.col("price").mean())
        .sort("date")
        .to_native()
    )

Use the monthly_aggregate function for different types of DataFrames:

# pandas
monthly_aggregate(pandas_df)

# polars
monthly_aggregate(polars_df)

# duckdb
monthly_aggregate(duckdb_df)

# pyarrow
monthly_aggregate(arrow_table)

# pyspark
monthly_aggregate(spark_df)

Much simpler! Code written like this can accept inputs from all major dataframe libraries, without any extra required dependencies! The user brings their own dataframe and gets their result. It also addresses other pain points faced by data science tool builders:

  • Full static typing.
  • Strong backwards-compatibility promises.
  • Minimal overhead.

Careful readers may have noticed that this looks a lot like the Polars solution. Indeed, the Narwhals API is a subset of the Polars API. Check the Narwhals documentation for more examples and tutorials.

What happens when libraries evolve?

Library APIs change over time—functions get deprecated, method signatures shift, or behavior becomes inconsistent across versions. Narwhals is built to absorb that churn by staying compatible with older and newer versions of the libraries it wraps. This means that if you write a function with Narwhals today, it’s far more likely to keep working tomorrow, with no rewrites required.

If you want to go further and guard against changes in Narwhals itself, you can use its stable API, which, like Rust’s Editions, is intended to remain indefinitely backwards compatible.

Conclusion

We’ve looked at how to write reusable and maintainable data science functions that support all major dataframe libraries. Keeping code maintainable in the face of all the DataFrame libraries is a common pain point for data scientists. Rather than just converting everything to pandas, a better solution is to use Narwhals as a unified dataframe interface. Next time you write a data science function and want to avoid dataframe library lock-in, Narwhals is your friend!

This blog post was contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.
