Have you ever written a data pipeline with pandas, only to hit an OutOfMemoryError when you ran your full dataset through it? That's because pandas is an eager dataframe library: it loads all data into memory, which leads to issues with large datasets. Lazy execution avoids this by deferring computation, which reduces memory usage and can boost performance.
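To see the difference, here's a minimal sketch (assuming a hypothetical sales.csv file): the eager pandas version reads the whole file before filtering, while the lazy Polars version only builds a query plan until you call collect():
import pandas as pd
import polars as pl

# Eager: the entire file is read into memory, then filtered
df_eager = pd.read_csv("sales.csv")  # hypothetical file
high_sales_eager = df_eager[df_eager["sales"] > 1000]

# Lazy: scan_csv only builds a query plan; nothing is read until collect().
# The filter can be pushed down so only matching rows are materialized.
high_sales_lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("sales") > 1000)
    .collect(engine="streaming")
)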
You may have heard about Polars and DuckDB, but are hesitant to try them. Not only would you need to update your syntax, you’d also have to adjust your solution’s logic based on some differences between them:
- Eager dataframes (like pandas) are in-memory and have a defined row order.
- Lazy dataframes (like DuckDB or polars.LazyFrame) defer their execution and make fewer guarantees (if any!) about row order. Without a specified order, the same logic can produce inconsistent results in operations like cum_sum(), as lazy execution may shuffle rows unpredictably between runs:
import polars as pl
polars_df1 = pl.DataFrame({"store": [1, 1, 2], "date_id": [4, 5, 6]}).lazy()
polars_df2 = pl.DataFrame({"store": [1, 2], "sales": [7, 8]}).lazy()
# Run the join operation twice to see if the outputs are the same
for _ in range(2):
    print(
        polars_df1.join(polars_df2, on="store", how="left")
        # no order is specified
        .with_columns(cumulative_sales=pl.col("sales").cum_sum().over("store"))
        .collect(engine="streaming")
    )
Output:
shape: (3, 4)
┌───────┬─────────┬───────┬──────────────────┐
│ store ┆ date_id ┆ sales ┆ cumulative_sales │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═════════╪═══════╪══════════════════╡
│ 1 ┆ 4 ┆ 7 ┆ 7 │
│ 2 ┆ 6 ┆ 8 ┆ 8 │
│ 1 ┆ 5 ┆ 7 ┆ 14 │
└───────┴─────────┴───────┴──────────────────┘
shape: (3, 4)
┌───────┬─────────┬───────┬──────────────────┐
│ store ┆ date_id ┆ sales ┆ cumulative_sales │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═════════╪═══════╪══════════════════╡
│ 1 ┆ 5 ┆ 7 ┆ 7 │
│ 1 ┆ 4 ┆ 7 ┆ 14 │
│ 2 ┆ 6 ┆ 8 ┆ 8 │
└───────┴─────────┴───────┴──────────────────┘
In the code above, the same lazy pipeline is run twice—but the output differs between runs.
The good news is that, by learning one simple trick, you can rid your code of row-ordering assumptions and have it work seamlessly across eager and lazy dataframes.
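Here's a preview of that trick applied to the example above, in pure Polars: passing order_by to over() gives the cumulative sum a well-defined order, so each row gets the same cumulative_sales value on every run (though the row order of the printed output may still vary):
# Same pipeline, but the window is ordered by an explicit column
for _ in range(2):
    print(
        polars_df1.join(polars_df2, on="store", how="left")
        .with_columns(
            cumulative_sales=pl.col("sales").cum_sum().over("store", order_by="date_id")
        )
        .collect(engine="streaming")
    )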
In this article, we’ll use Narwhals to:
- Create a fill-forward solution that works with eager pandas and Polars dataframes
- Upgrade it to also support lazy dataframes like polars.LazyFrame, DuckDB, and PySpark
Eager-Only Solution
First, we’ll build a fill-forward solution compatible with eager pandas and Polars dataframes.
Suppose we have a dataset of store sales over time:
from datetime import datetime
data = {
    'sale_date': [
        datetime(2025, 5, 22),
        datetime(2025, 5, 23),
        datetime(2025, 5, 24),
        datetime(2025, 5, 22),
        datetime(2025, 5, 23),
        datetime(2025, 5, 24),
    ],
    'store': [
        'Thimphu',
        'Thimphu',
        'Thimphu',
        'Paro',
        'Paro',
        'Paro',
    ],
    'sales': [1100, None, 1450, 501, 500, None],
}
Note that the 'sales' column has some missing values. Let's write a dataframe-agnostic function to forward-fill missing sales using the previous valid value from the same store.
To avoid locking ourselves into either pandas or Polars, we’ll use Narwhals (see our previous post on Narwhals for how to get started with it):
import narwhals as nw
from narwhals.typing import IntoFrameT
def agnostic_ffill_by_store(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .with_columns(nw.col("sales").fill_null(strategy="forward").over("store"))
        .to_native()
    )
You can verify that this works for both pandas.DataFrame and polars.DataFrame:
import polars as pl
import pandas as pd
# pandas.DataFrame
df_pandas = pd.DataFrame(data)
agnostic_ffill_by_store(df_pandas)
Output:
sale_date store sales
0 2025-05-22 Thimphu 1100.0
1 2025-05-23 Thimphu 1100.0
2 2025-05-24 Thimphu 1450.0
3 2025-05-22 Paro 501.0
4 2025-05-23 Paro 500.0
5 2025-05-24 Paro 500.0
# polars.DataFrame
df_polars = pl.DataFrame(data)
print(agnostic_ffill_by_store(df_polars))
Output:
shape: (6, 3)
┌─────────────────────┬─────────┬───────┐
│ sale_date ┆ store ┆ sales │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═════════╪═══════╡
│ 2025-05-22 00:00:00 ┆ Thimphu ┆ 1100 │
│ 2025-05-23 00:00:00 ┆ Thimphu ┆ 1100 │
│ 2025-05-24 00:00:00 ┆ Thimphu ┆ 1450 │
│ 2025-05-22 00:00:00 ┆ Paro ┆ 501 │
│ 2025-05-23 00:00:00 ┆ Paro ┆ 500 │
│ 2025-05-24 00:00:00 ┆ Paro ┆ 500 │
└─────────────────────┴─────────┴───────┘
Eager and Lazy Solution
But what if we also want it to support polars.LazyFrame, duckdb.DuckDBPyRelation, and pyspark.sql.DataFrame? If we try passing any of those to agnostic_ffill_by_store, then Narwhals will raise an error:
import duckdb
duckdb_rel = duckdb.table('df_polars')
agnostic_ffill_by_store(duckdb_rel)
narwhals.exceptions.OrderDependentExprError: Order-dependent expressions are not supported for use in LazyFrame.
Hint: To make the expression valid, use `.over` with `order_by` specified.
For example, if you wrote `nw.col('price').cum_sum()` and you have a column
`'date'` which orders your data, then replace:
nw.col('price').cum_sum()
with:
nw.col('price').cum_sum().over(order_by='date')
^^^^^^^^^^^^^^^^^^^^^^
See https://narwhals-dev.github.io/narwhals/concepts/order_dependence/.
This error happens because lazy dataframes don't preserve the order of rows unless explicitly told to do so. Operations like fill_null(strategy='forward') may therefore yield unpredictable results unless you specify an ordering column.
To fix this, we must explicitly define how the rows should be ordered. Since our dataset has a sale_date
column that represents time, it’s a natural choice for ordering forward-fill operations.
def agnostic_ffill_by_store_improved(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .with_columns(
            nw.col("sales")
            .fill_null(strategy="forward")
            # Note the `order_by` argument
            .over("store", order_by="sale_date")
        )
        .to_native()
    )
And now – voilà – it also supports DuckDB and Polars LazyFrame!
agnostic_ffill_by_store_improved(duckdb_rel)
┌─────────────────────┬─────────┬───────┐
│ sale_date │ store │ sales │
│ timestamp │ varchar │ int64 │
├─────────────────────┼─────────┼───────┤
│ 2025-05-22 00:00:00 │ Paro │ 501 │
│ 2025-05-23 00:00:00 │ Paro │ 500 │
│ 2025-05-24 00:00:00 │ Paro │ 500 │
│ 2025-05-22 00:00:00 │ Thimphu │ 1100 │
│ 2025-05-23 00:00:00 │ Thimphu │ 1100 │
│ 2025-05-24 00:00:00 │ Thimphu │ 1450 │
└─────────────────────┴─────────┴───────┘
agnostic_ffill_by_store_improved(df_polars.lazy()).collect()
┌─────────────────────┬─────────┬───────┐
│ sale_date ┆ store ┆ sales │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═════════╪═══════╡
│ 2025-05-22 00:00:00 ┆ Thimphu ┆ 1100 │
│ 2025-05-23 00:00:00 ┆ Thimphu ┆ 1100 │
│ 2025-05-24 00:00:00 ┆ Thimphu ┆ 1450 │
│ 2025-05-22 00:00:00 ┆ Paro ┆ 501 │
│ 2025-05-23 00:00:00 ┆ Paro ┆ 500 │
│ 2025-05-24 00:00:00 ┆ Paro ┆ 500 │
└─────────────────────┴─────────┴───────┘
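The same function should also work with pyspark.sql.DataFrame. Here's a minimal sketch, assuming a local SparkSession (not shown above) and reusing df_pandas from earlier:
from pyspark.sql import SparkSession

# A local SparkSession is assumed here; df_pandas is the pandas frame from earlier
spark = SparkSession.builder.getOrCreate()
df_spark = spark.createDataFrame(df_pandas)
agnostic_ffill_by_store_improved(df_spark).show()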
Being forced to specify an ordering column for fill_null(strategy='forward') may feel like a restriction on the Polars API. However, if we want to write portable and maintainable code, we cannot make assumptions about whether our backend preserves row order, and so this is a restriction that enables portability!
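And the improved function still works unchanged on the eager dataframes from the first section, so one code path now covers every backend:
# The same function handles eager and lazy inputs alike
agnostic_ffill_by_store_improved(df_pandas)
print(agnostic_ffill_by_store_improved(df_polars))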
Final thoughts
Eager execution works perfectly well…until it doesn’t. It’s hard to know in advance how much data your pipeline will need to handle. By writing dataframe-agnostic code which works across eager and lazy dataframes, you can avoid dataframe lock-in and seamlessly scale as and when you need to.
And what will your teammates think when you present them with reusable functions which they can use with whichever dataframe library they prefer? They’ll probably love you for it.
This blog post was contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.