Have you ever written a data pipeline with pandas, only to hit an OutOfMemoryError when you ran your full dataset through it? That's because pandas is an eager dataframe library: it loads all data into memory up front, which becomes a problem with large datasets. Lazy execution avoids this by deferring computation, which reduces memory usage and can boost performance.
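To make the difference concrete, here is a minimal sketch in Polars (the file name data.csv is a placeholder): the eager read materializes the whole file immediately, while the lazy scan only builds a query plan and defers all work until collect() is called.

import polars as pl

# Eager: reads the entire file into memory right away
df = pl.read_csv("data.csv")

# Lazy: builds a query plan; nothing is read yet
lf = (
    pl.scan_csv("data.csv")
    .filter(pl.col("sales") > 0)
    .select("store", "sales")
)

# Only now does Polars read the file, and it can push the filter
# and column selection down into the scan itself
result = lf.collect()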
You may have heard about Polars and DuckDB but are hesitant to try them. Not only would you need to update your syntax, you'd also have to adjust your solution's logic to account for some differences between them:
- Eager dataframes (like pandas) are in-memory and have a defined row order.
- Lazy dataframes (like DuckDB or polars.LazyFrame) defer their execution and make fewer guarantees (if any!) about row order. Without a specified order, the same logic can produce inconsistent results in operations like cum_sum(), as lazy execution may shuffle rows unpredictably between runs:
import polars as pl

polars_df1 = pl.DataFrame({"store": [1, 1, 2], "date_id": [4, 5, 6]}).lazy()
polars_df2 = pl.DataFrame({"store": [1, 2], "sales": [7, 8]}).lazy()

# Run the join operation twice to see if the outputs are the same
for _ in range(2):
    print(
        polars_df1.join(polars_df2, on="store", how="left")
        # no order is specified
        .with_columns(cumulative_sales=pl.col("sales").cum_sum().over("store"))
        .collect(engine="streaming")
    )

Output:
shape: (3, 4)
┌───────┬─────────┬───────┬──────────────────┐
│ store ┆ date_id ┆ sales ┆ cumulative_sales │
│ ---   ┆ ---     ┆ ---   ┆ ---              │
│ i64   ┆ i64     ┆ i64   ┆ i64              │
╞═══════╪═════════╪═══════╪══════════════════╡
│ 1     ┆ 4       ┆ 7     ┆ 7                │
│ 2     ┆ 6       ┆ 8     ┆ 8                │
│ 1     ┆ 5       ┆ 7     ┆ 14               │
└───────┴─────────┴───────┴──────────────────┘
shape: (3, 4)
┌───────┬─────────┬───────┬──────────────────┐
│ store ┆ date_id ┆ sales ┆ cumulative_sales │
│ ---   ┆ ---     ┆ ---   ┆ ---              │
│ i64   ┆ i64     ┆ i64   ┆ i64              │
╞═══════╪═════════╪═══════╪══════════════════╡
│ 1     ┆ 5       ┆ 7     ┆ 7                │
│ 1     ┆ 4       ┆ 7     ┆ 14               │
│ 2     ┆ 6       ┆ 8     ┆ 8                │
└───────┴─────────┴───────┴──────────────────┘

In the code above, the same lazy pipeline is run twice, but the output differs between runs.
The good news is that, by learning one simple trick, you can rid your code of row-ordering assumptions and have it work seamlessly across eager and lazy dataframes.
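As a preview, here is the trick applied directly to the Polars example above (a sketch, assuming a Polars version recent enough to support over(..., order_by=...)): once we tell the engine which column defines the order, both runs agree.

for _ in range(2):
    print(
        polars_df1.join(polars_df2, on="store", how="left")
        # the order is now explicit, so the result is deterministic
        .with_columns(
            cumulative_sales=pl.col("sales").cum_sum().over("store", order_by="date_id")
        )
        .collect(engine="streaming")
    )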
In this article, we'll use Narwhals to:
- Create a fill-forward solution that works with eager pandas and Polars dataframes
- Upgrade it to also support lazy dataframes like polars.LazyFrame, DuckDB, and PySpark
The source code of this article can be found here:
Eager-Only Solution
First, we'll build a fill-forward solution compatible with eager pandas and Polars dataframes.
Suppose we have a dataset of store sales over time:
from datetime import datetime

data = {
    'sale_date': [
        datetime(2025, 5, 22),
        datetime(2025, 5, 23),
        datetime(2025, 5, 24),
        datetime(2025, 5, 22),
        datetime(2025, 5, 23),
        datetime(2025, 5, 24),
    ],
    'store': [
        'Thimphu',
        'Thimphu',
        'Thimphu',
        'Paro',
        'Paro',
        'Paro',
    ],
    'sales': [1100, None, 1450, 501, 500, None],
}

Note that the "sales" column has some missing values. Let's write a dataframe-agnostic function to forward-fill them using the previous valid value from the same store.
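Before reaching for any library, the semantics we want are simple enough to sketch in plain Python (forward_fill here is a throwaway, hypothetical helper, not part of any library): each missing value is replaced by the most recent non-missing value seen before it.

def forward_fill(values):
    # Replace each None with the last non-missing value seen so far
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled

forward_fill([1100, None, 1450])  # [1100, 1100, 1450]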
To avoid locking ourselves into either pandas or Polars, we'll use Narwhals (see our previous post on Narwhals for how to get started with it):
import narwhals as nw
from narwhals.typing import IntoFrameT

def agnostic_ffill_by_store(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .with_columns(nw.col("sales").fill_null(strategy="forward").over("store"))
        .to_native()
    )

You can verify that this works for both pandas.DataFrame and polars.DataFrame:
import polars as pl
import pandas as pd
import pyarrow as pa

# pandas.DataFrame
df_pandas = pd.DataFrame(data)
agnostic_ffill_by_store(df_pandas)

Output:
   sale_date    store   sales
0 2025-05-22  Thimphu  1100.0
1 2025-05-23  Thimphu  1100.0
2 2025-05-24  Thimphu  1450.0
3 2025-05-22     Paro   501.0
4 2025-05-23     Paro   500.0
5 2025-05-24     Paro   500.0

# polars.DataFrame
df_polars = pl.DataFrame(data)
print(agnostic_ffill_by_store(df_polars))

Output:
shape: (6, 3)
┌─────────────────────┬─────────┬───────┐
│ sale_date           ┆ store   ┆ sales │
│ ---                 ┆ ---     ┆ ---   │
│ datetime[μs]        ┆ str     ┆ i64   │
╞═════════════════════╪═════════╪═══════╡
│ 2025-05-22 00:00:00 ┆ Thimphu ┆ 1100  │
│ 2025-05-23 00:00:00 ┆ Thimphu ┆ 1100  │
│ 2025-05-24 00:00:00 ┆ Thimphu ┆ 1450  │
│ 2025-05-22 00:00:00 ┆ Paro    ┆ 501   │
│ 2025-05-23 00:00:00 ┆ Paro    ┆ 500   │
│ 2025-05-24 00:00:00 ┆ Paro    ┆ 500   │
└─────────────────────┴─────────┴───────┘
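The pyarrow import above isn't just decorative: Narwhals also supports PyArrow, so the same function should work on a pyarrow.Table without any changes. A quick sketch (assuming your Narwhals version implements forward-fill for the PyArrow backend):

# pyarrow.Table: the same function, unchanged
df_arrow = pa.table(data)
print(agnostic_ffill_by_store(df_arrow))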
Eager and Lazy Solution

But what if we also want it to support polars.LazyFrame, duckdb.DuckDBPyRelation, and pyspark.sql.DataFrame? If we try passing any of those to agnostic_ffill_by_store, then Narwhals will raise an error:
import duckdb

duckdb_rel = duckdb.table('df_polars')
agnostic_ffill_by_store(duckdb_rel)

narwhals.exceptions.OrderDependentExprError: Order-dependent expressions are not supported for use in LazyFrame.
Hint: To make the expression valid, use `.over` with `order_by` specified.

For example, if you wrote `nw.col('price').cum_sum()` and you have a column
`'date'` which orders your data, then replace:

    nw.col('price').cum_sum()

with:

    nw.col('price').cum_sum().over(order_by='date')
                             ^^^^^^^^^^^^^^^^^^^^^^

See https://narwhals-dev.github.io/narwhals/concepts/order_dependence/.

This error happens because lazy dataframes don't preserve the order of rows unless explicitly told to do so. Thus, operations like fill_null(strategy='forward') may yield unpredictable results unless you specify an ordering column explicitly.
To fix this, we must explicitly define how the rows should be ordered. Since our dataset has a sale_date column that represents time, it's a natural choice for ordering forward-fill operations.
def agnostic_ffill_by_store_improved(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .with_columns(
            nw.col("sales")
            .fill_null(strategy="forward")
            # Note the `order_by` argument
            .over("store", order_by="sale_date")
        )
        .to_native()
    )

And now, voilà, it also supports DuckDB and Polars LazyFrame!
agnostic_ffill_by_store_improved(duckdb_rel)

Output:
┌─────────────────────┬─────────┬───────┐
│      sale_date      │  store  │ sales │
│      timestamp      │ varchar │ int64 │
├─────────────────────┼─────────┼───────┤
│ 2025-05-22 00:00:00 │ Paro    │   501 │
│ 2025-05-23 00:00:00 │ Paro    │   500 │
│ 2025-05-24 00:00:00 │ Paro    │   500 │
│ 2025-05-22 00:00:00 │ Thimphu │  1100 │
│ 2025-05-23 00:00:00 │ Thimphu │  1100 │
│ 2025-05-24 00:00:00 │ Thimphu │  1450 │
└─────────────────────┴─────────┴───────┘

agnostic_ffill_by_store_improved(df_polars.lazy()).collect()

Output:
shape: (6, 3)
┌─────────────────────┬─────────┬───────┐
│ sale_date           ┆ store   ┆ sales │
│ ---                 ┆ ---     ┆ ---   │
│ datetime[μs]        ┆ str     ┆ i64   │
╞═════════════════════╪═════════╪═══════╡
│ 2025-05-22 00:00:00 ┆ Thimphu ┆ 1100  │
│ 2025-05-23 00:00:00 ┆ Thimphu ┆ 1100  │
│ 2025-05-24 00:00:00 ┆ Thimphu ┆ 1450  │
│ 2025-05-22 00:00:00 ┆ Paro    ┆ 501   │
│ 2025-05-23 00:00:00 ┆ Paro    ┆ 500   │
│ 2025-05-24 00:00:00 ┆ Paro    ┆ 500   │
└─────────────────────┴─────────┴───────┘

Being forced to specify an ordering column for fill_null(strategy='forward') may feel like a restriction on the Polars API. However, if we want to write portable and maintainable code, we cannot make assumptions about whether our backend preserves row order, and so this is a restriction that enables portability!
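For completeness, the PySpark support promised earlier works the same way. Here is a rough sketch, assuming a local PySpark installation (df_pandas is the pandas dataframe defined above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pyspark.sql.DataFrame built from the pandas dataframe defined earlier
df_spark = spark.createDataFrame(df_pandas)

# The same dataframe-agnostic function, unchanged
agnostic_ffill_by_store_improved(df_spark).show()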
Final thoughts
Eager execution works perfectly well… until it doesn't. It's hard to know in advance how much data your pipeline will need to handle. By writing dataframe-agnostic code which works across eager and lazy dataframes, you can avoid dataframe lock-in and seamlessly scale as and when you need to.
And what will your teammates think when you present them with reusable functions which they can use with whichever dataframe library they prefer? They'll probably love you for it.
This blog post was contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.

