Have you ever written a data pipeline with pandas, only to hit an OutOfMemoryError when you ran your full dataset through it? That's because pandas is an eager dataframe library: it loads all data into memory, which leads to issues with large datasets. Lazy execution avoids this by deferring computation, which reduces memory usage and can boost performance.
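To see the difference, here's a minimal sketch (assuming a hypothetical sales.csv file): the eager pandas version reads the whole file before filtering, while the lazy Polars version only builds a query plan until you call collect():
import pandas as pd
import polars as pl

# Eager: the entire file is read into memory, then filtered
df_eager = pd.read_csv("sales.csv")  # hypothetical file
high_sales_eager = df_eager[df_eager["sales"] > 1000]

# Lazy: scan_csv only builds a query plan; nothing is read until collect().
# The filter can be pushed down so only matching rows are materialized.
high_sales_lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("sales") > 1000)
    .collect(engine="streaming")
)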
You may have heard about Polars and DuckDB, but are hesitant to try them. Not only would you need to update your syntax, you’d also have to adjust your solution’s logic based on some differences between them:
- Eager dataframes (like pandas) are in-memory and have a defined row order.
- Lazy dataframes (like DuckDB or polars.LazyFrame) defer their execution and make fewer guarantees (if any!) about row order. Without a specified order, the same logic can produce inconsistent results in operations like cum_sum(), as lazy execution may shuffle rows unpredictably between runs:
import polars as pl
polars_df1 = pl.DataFrame({"store": [1, 1, 2], "date_id": [4, 5, 6]}).lazy()
polars_df2 = pl.DataFrame({"store": [1, 2], "sales": [7, 8]}).lazy()
# Run the join operation twice to see if the outputs are the same
for _ in range(2):
    print(
        polars_df1.join(polars_df2, on="store", how="left")
        # no order is specified
        .with_columns(cumulative_sales=pl.col("sales").cum_sum().over("store"))
        .collect(engine="streaming")
    )
Output:
shape: (3, 4)
┌───────┬─────────┬───────┬──────────────────┐
│ store ┆ date_id ┆ sales ┆ cumulative_sales │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═════════╪═══════╪══════════════════╡
│ 1 ┆ 4 ┆ 7 ┆ 7 │
│ 2 ┆ 6 ┆ 8 ┆ 8 │
│ 1 ┆ 5 ┆ 7 ┆ 14 │
└───────┴─────────┴───────┴──────────────────┘
shape: (3, 4)
┌───────┬─────────┬───────┬──────────────────┐
│ store ┆ date_id ┆ sales ┆ cumulative_sales │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═════════╪═══════╪══════════════════╡
│ 1 ┆ 5 ┆ 7 ┆ 7 │
│ 1 ┆ 4 ┆ 7 ┆ 14 │
│ 2 ┆ 6 ┆ 8 ┆ 8 │
└───────┴─────────┴───────┴──────────────────┘
In the code above, the same lazy pipeline is run twice—but the output differs between runs.
The good news is that, by learning one simple trick, you can rid your code of row-ordering assumptions and have it work seamlessly across eager and lazy dataframes.
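Here's a preview of that trick applied to the example above, in pure Polars: passing order_by to over() gives the cumulative sum a well-defined order, so each row gets the same cumulative_sales value on every run (though the row order of the printed output may still vary):
# Same pipeline, but the window is ordered by an explicit column
for _ in range(2):
    print(
        polars_df1.join(polars_df2, on="store", how="left")
        .with_columns(
            cumulative_sales=pl.col("sales").cum_sum().over("store", order_by="date_id")
        )
        .collect(engine="streaming")
    )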
In this article, we’ll use Narwhals to:
- Create a fill-forward solution that works with eager pandas and Polars dataframes
- Upgrade it to also support lazy dataframes like polars.LazyFrame, DuckDB, and PySpark
Eager-Only Solution
First, we’ll build a fill-forward solution compatible with eager pandas and Polars dataframes.
Suppose we have a dataset of store sales over time:
from datetime import datetime
data = {
    'sale_date': [
        datetime(2025, 5, 22),
        datetime(2025, 5, 23),
        datetime(2025, 5, 24),
        datetime(2025, 5, 22),
        datetime(2025, 5, 23),
        datetime(2025, 5, 24),
    ],
    'store': [
        'Thimphu',
        'Thimphu',
        'Thimphu',
        'Paro',
        'Paro',
        'Paro',
    ],
    'sales': [1100, None, 1450, 501, 500, None],
}
Note that the 'sales' column has some missing values. Let's write a dataframe-agnostic function to forward-fill missing sales using the previous valid value from the same store.
To avoid locking ourselves into either pandas or Polars, we’ll use Narwhals (see our previous post on Narwhals for how to get started with it):
import narwhals as nw
from narwhals.typing import IntoFrameT
def agnostic_ffill_by_store(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .with_columns(nw.col("sales").fill_null(strategy="forward").over("store"))
        .to_native()
    )
You can verify that this works for both pandas.DataFrame and polars.DataFrame:
import polars as pl
import pandas as pd
# pandas.DataFrame
df_pandas = pd.DataFrame(data)
agnostic_ffill_by_store(df_pandas)
Output:
sale_date store sales
0 2025-05-22 Thimphu 1100.0
1 2025-05-23 Thimphu 1100.0
2 2025-05-24 Thimphu 1450.0
3 2025-05-22 Paro 501.0
4 2025-05-23 Paro 500.0
5 2025-05-24 Paro 500.0
# polars.DataFrame
df_polars = pl.DataFrame(data)
print(agnostic_ffill_by_store(df_polars))
Output:
shape: (6, 3)
┌─────────────────────┬─────────┬───────┐
│ sale_date ┆ store ┆ sales │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═════════╪═══════╡
│ 2025-05-22 00:00:00 ┆ Thimphu ┆ 1100 │
│ 2025-05-23 00:00:00 ┆ Thimphu ┆ 1100 │
│ 2025-05-24 00:00:00 ┆ Thimphu ┆ 1450 │
│ 2025-05-22 00:00:00 ┆ Paro ┆ 501 │
│ 2025-05-23 00:00:00 ┆ Paro ┆ 500 │
│ 2025-05-24 00:00:00 ┆ Paro ┆ 500 │
└─────────────────────┴─────────┴───────┘
Eager and Lazy Solution
But what if we also want it to support polars.LazyFrame, duckdb.DuckDBPyRelation, and pyspark.sql.DataFrame? If we try passing any of those to agnostic_ffill_by_store, then Narwhals will raise an error:
import duckdb
duckdb_rel = duckdb.table('df_polars')
agnostic_ffill_by_store(duckdb_rel)
narwhals.exceptions.OrderDependentExprError: Order-dependent expressions are not supported for use in LazyFrame.
Hint: To make the expression valid, use `.over` with `order_by` specified.
For example, if you wrote `nw.col('price').cum_sum()` and you have a column
`'date'` which orders your data, then replace:
nw.col('price').cum_sum()
with:
nw.col('price').cum_sum().over(order_by='date')
^^^^^^^^^^^^^^^^^^^^^^
See https://narwhals-dev.github.io/narwhals/concepts/order_dependence/.
This error happens because lazy dataframes don't preserve the order of rows unless explicitly told to do so. Operations like fill_null(strategy='forward') may therefore yield unpredictable results unless you specify an ordering column.
To fix this, we must explicitly define how the rows should be ordered. Since our dataset has a sale_date
column that represents time, it’s a natural choice for ordering forward-fill operations.
def agnostic_ffill_by_store_improved(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .with_columns(
            nw.col("sales")
            .fill_null(strategy="forward")
            # Note the `order_by` argument
            .over("store", order_by="sale_date")
        )
        .to_native()
    )
And now – voilà – it also supports DuckDB and Polars LazyFrame!
agnostic_ffill_by_store_improved(duckdb_rel)
┌─────────────────────┬─────────┬───────┐
│ sale_date │ store │ sales │
│ timestamp │ varchar │ int64 │
├─────────────────────┼─────────┼───────┤
│ 2025-05-22 00:00:00 │ Paro │ 501 │
│ 2025-05-23 00:00:00 │ Paro │ 500 │
│ 2025-05-24 00:00:00 │ Paro │ 500 │
│ 2025-05-22 00:00:00 │ Thimphu │ 1100 │
│ 2025-05-23 00:00:00 │ Thimphu │ 1100 │
│ 2025-05-24 00:00:00 │ Thimphu │ 1450 │
└─────────────────────┴─────────┴───────┘
agnostic_ffill_by_store_improved(df_polars.lazy()).collect()
┌─────────────────────┬─────────┬───────┐
│ sale_date ┆ store ┆ sales │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═════════╪═══════╡
│ 2025-05-22 00:00:00 ┆ Thimphu ┆ 1100 │
│ 2025-05-23 00:00:00 ┆ Thimphu ┆ 1100 │
│ 2025-05-24 00:00:00 ┆ Thimphu ┆ 1450 │
│ 2025-05-22 00:00:00 ┆ Paro ┆ 501 │
│ 2025-05-23 00:00:00 ┆ Paro ┆ 500 │
│ 2025-05-24 00:00:00 ┆ Paro ┆ 500 │
└─────────────────────┴─────────┴───────┘
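The same function should also work with pyspark.sql.DataFrame. Here's a minimal sketch, assuming a local SparkSession (not shown above) and reusing df_pandas from earlier:
from pyspark.sql import SparkSession

# A local SparkSession is assumed here; df_pandas is the pandas frame from earlier
spark = SparkSession.builder.getOrCreate()
df_spark = spark.createDataFrame(df_pandas)
agnostic_ffill_by_store_improved(df_spark).show()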
Being forced to specify an ordering column for fill_null(strategy='forward') may feel like a restriction on the Polars API. However, if we want to write portable and maintainable code, we cannot make assumptions about whether our backend preserves row order, and so this is a restriction that enables portability!
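And the improved function still works unchanged on the eager dataframes from the first section, so one code path now covers every backend:
# The same function handles eager and lazy inputs alike
agnostic_ffill_by_store_improved(df_pandas)
print(agnostic_ffill_by_store_improved(df_polars))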
Final thoughts
Eager execution works perfectly well…until it doesn’t. It’s hard to know in advance how much data your pipeline will need to handle. By writing dataframe-agnostic code which works across eager and lazy dataframes, you can avoid dataframe lock-in and seamlessly scale as and when you need to.
And what will your teammates think when you present them with reusable functions which they can use with whichever dataframe library they prefer? They’ll probably love you for it.
This blog post was contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.