Writing Portable DataFrame Code for Eager and Lazy Execution

Have you ever written a data pipeline with pandas, only to hit an OutOfMemoryError when you ran your full dataset through it? That’s because pandas is an eager dataframe library: it loads all of your data into memory, which causes problems with large datasets. Lazy execution avoids this by deferring computation, which reduces memory usage and boosts performance.
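To make this concrete, here is a minimal sketch (assuming a hypothetical sales.csv file): pandas reads the whole file into memory up front, while Polars’ scan_csv only builds a query plan and defers the actual work until collect() is called.

import pandas as pd
import polars as pl

# Eager: the entire file is read into memory immediately
df = pd.read_csv("sales.csv")  # hypothetical file
print(df["sales"].sum())

# Lazy: scan_csv builds a query plan; no data is read yet
query = pl.scan_csv("sales.csv").select(pl.col("sales").sum())
print(query.collect())  # computation only happens here

Because the lazy version knows the full query up front, it can read just the ‘sales’ column instead of the whole file.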

You may have heard about Polars and DuckDB, but are hesitant to try them. Not only would you need to update your syntax, you’d also have to adjust your solution’s logic based on some differences between them:

  • Eager dataframes (like pandas) are in-memory and have a defined row order.
  • Lazy dataframes (like DuckDB or polars.LazyFrame) defer their execution and make fewer guarantees (if any!) about row order. Without specifying order, the same logic can produce inconsistent results in operations like cum_sum(), as lazy execution may shuffle rows unpredictably between runs:
import polars as pl

polars_df1 = pl.DataFrame({"store": [1, 1, 2], "date_id": [4, 5, 6]}).lazy()
polars_df2 = pl.DataFrame({"store": [1, 2], "sales": [7, 8]}).lazy()

# Run the join operation twice to see if the outputs are the same
for _ in range(2):
    print(
        polars_df1.join(polars_df2, on="store", how="left")
        # no order is specified
        .with_columns(cumulative_sales=pl.col("sales").cum_sum().over("store"))
        .collect(engine="streaming")
    )

Output:

shape: (3, 4)
┌───────┬─────────┬───────┬──────────────────┐
│ store ┆ date_id ┆ sales ┆ cumulative_sales │
│ ---   ┆ ---     ┆ ---   ┆ ---              │
│ i64   ┆ i64     ┆ i64   ┆ i64              │
╞═══════╪═════════╪═══════╪══════════════════╡
│ 1     ┆ 4       ┆ 7     ┆ 7                │
│ 2     ┆ 6       ┆ 8     ┆ 8                │
│ 1     ┆ 5       ┆ 7     ┆ 14               │
└───────┴─────────┴───────┴──────────────────┘
shape: (3, 4)
┌───────┬─────────┬───────┬──────────────────┐
│ store ┆ date_id ┆ sales ┆ cumulative_sales │
│ ---   ┆ ---     ┆ ---   ┆ ---              │
│ i64   ┆ i64     ┆ i64   ┆ i64              │
╞═══════╪═════════╪═══════╪══════════════════╡
│ 1     ┆ 5       ┆ 7     ┆ 7                │
│ 1     ┆ 4       ┆ 7     ┆ 14               │
│ 2     ┆ 6       ┆ 8     ┆ 8                │
└───────┴─────────┴───────┴──────────────────┘

In the code above, the same lazy pipeline is run twice, yet the output differs between runs.
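For contrast, here is a minimal sketch of the eager equivalent in pandas, where merging with how="left" preserves the left frame’s row order, so the cumulative sum is reproducible across runs:

import pandas as pd

pandas_df1 = pd.DataFrame({"store": [1, 1, 2], "date_id": [4, 5, 6]})
pandas_df2 = pd.DataFrame({"store": [1, 2], "sales": [7, 8]})

# A left merge keeps the left frame's row order, so the cumulative
# sum produces the same result on every run
result = pandas_df1.merge(pandas_df2, on="store", how="left")
result["cumulative_sales"] = result.groupby("store")["sales"].cumsum()
print(result)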

The good news is that, by learning one simple trick, you can rid your code of row-ordering assumptions and have it work seamlessly across eager and lazy dataframes.

In this article, we’ll use Narwhals to:

  • Create a fill-forward solution that works with eager pandas and Polars dataframes
  • Upgrade it to also support lazy dataframes like polars.LazyFrame, DuckDB, and PySpark

The source code of this article can be found here:

Eager-Only Solution

First, we’ll build a fill-forward solution compatible with eager pandas and Polars dataframes.

Suppose we have a dataset of store sales over time:

from datetime import datetime

data = {
    'sale_date': [
        datetime(2025, 5, 22),
        datetime(2025, 5, 23),
        datetime(2025, 5, 24),
        datetime(2025, 5, 22),
        datetime(2025, 5, 23),
        datetime(2025, 5, 24),
    ],
    'store': [
        'Thimphu',
        'Thimphu',
        'Thimphu',
        'Paro',
        'Paro',
        'Paro',
    ],
    'sales': [1100, None, 1450, 501, 500, None],
}

Note that the ‘sales’ column has some missing values. Let’s write a dataframe-agnostic function to forward-fill missing sales using the previous valid value from the same store.

To avoid locking ourselves into either pandas or Polars, we’ll use Narwhals (see our previous post on Narwhals for how to get started with it):

import narwhals as nw
from narwhals.typing import IntoFrameT


def agnostic_ffill_by_store(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .with_columns(nw.col("sales").fill_null(strategy="forward").over("store"))
        .to_native()
    )

You can verify that this works for both pandas.DataFrame and polars.DataFrame:

import polars as pl
import pandas as pd

# pandas.DataFrame
df_pandas = pd.DataFrame(data)
agnostic_ffill_by_store(df_pandas)

Output:

   sale_date    store   sales
0 2025-05-22  Thimphu  1100.0
1 2025-05-23  Thimphu  1100.0
2 2025-05-24  Thimphu  1450.0
3 2025-05-22     Paro   501.0
4 2025-05-23     Paro   500.0
5 2025-05-24     Paro   500.0

# polars.DataFrame
df_polars = pl.DataFrame(data)
print(agnostic_ffill_by_store(df_polars))

Output:

shape: (6, 3)
┌─────────────────────┬─────────┬───────┐
│ sale_date           ┆ store   ┆ sales │
│ ---                 ┆ ---     ┆ ---   │
│ datetime[μs]        ┆ str     ┆ i64   │
╞═════════════════════╪═════════╪═══════╡
│ 2025-05-22 00:00:00 ┆ Thimphu ┆ 1100  │
│ 2025-05-23 00:00:00 ┆ Thimphu ┆ 1100  │
│ 2025-05-24 00:00:00 ┆ Thimphu ┆ 1450  │
│ 2025-05-22 00:00:00 ┆ Paro    ┆ 501   │
│ 2025-05-23 00:00:00 ┆ Paro    ┆ 500   │
│ 2025-05-24 00:00:00 ┆ Paro    ┆ 500   │
└─────────────────────┴─────────┴───────┘

Eager and Lazy Solution

But what if we also want it to support polars.LazyFrame, duckdb.DuckDBPyRelation, and pyspark.sql.DataFrame? If we try passing any of those to agnostic_ffill_by_store, then Narwhals will raise an error:

import duckdb

duckdb_rel = duckdb.table('df_polars')
agnostic_ffill_by_store(duckdb_rel)

Output:

narwhals.exceptions.OrderDependentExprError: Order-dependent expressions are not supported for use in LazyFrame.

Hint: To make the expression valid, use `.over` with `order_by` specified.

For example, if you wrote `nw.col('price').cum_sum()` and you have a column
`'date'` which orders your data, then replace:

   nw.col('price').cum_sum()

 with:

   nw.col('price').cum_sum().over(order_by='date')
                            ^^^^^^^^^^^^^^^^^^^^^^

See https://narwhals-dev.github.io/narwhals/concepts/order_dependence/.

This error happens because lazy dataframes don’t preserve row order unless explicitly told to do so. Operations like fill_null(strategy='forward') depend on row order, so they may yield unpredictable results unless you specify an ordering column.

To fix this, we must explicitly define how the rows should be ordered. Since our dataset has a sale_date column that represents time, it’s a natural choice for ordering forward-fill operations.

def agnostic_ffill_by_store_improved(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .with_columns(
            nw.col("sales")
            .fill_null(strategy="forward")
            # Note the `order_by` statement
            .over("store", order_by="sale_date")
        )
        .to_native()
    )

And now, voilà, it also supports DuckDB and Polars LazyFrame!

agnostic_ffill_by_store_improved(duckdb_rel)

Output:
┌─────────────────────┬─────────┬───────┐
│      sale_date      │  store  │ sales │
│      timestamp      │ varchar │ int64 │
├─────────────────────┼─────────┼───────┤
│ 2025-05-22 00:00:00 │ Paro    │   501 │
│ 2025-05-23 00:00:00 │ Paro    │   500 │
│ 2025-05-24 00:00:00 │ Paro    │   500 │
│ 2025-05-22 00:00:00 │ Thimphu │  1100 │
│ 2025-05-23 00:00:00 │ Thimphu │  1100 │
│ 2025-05-24 00:00:00 │ Thimphu │  1450 │
└─────────────────────┴─────────┴───────┘

agnostic_ffill_by_store_improved(df_polars.lazy()).collect()

Output:
┌─────────────────────┬─────────┬───────┐
│ sale_date           ┆ store   ┆ sales │
│ ---                 ┆ ---     ┆ ---   │
│ datetime[μs]        ┆ str     ┆ i64   │
╞═════════════════════╪═════════╪═══════╡
│ 2025-05-22 00:00:00 ┆ Thimphu ┆ 1100  │
│ 2025-05-23 00:00:00 ┆ Thimphu ┆ 1100  │
│ 2025-05-24 00:00:00 ┆ Thimphu ┆ 1450  │
│ 2025-05-22 00:00:00 ┆ Paro    ┆ 501   │
│ 2025-05-23 00:00:00 ┆ Paro    ┆ 500   │
│ 2025-05-24 00:00:00 ┆ Paro    ┆ 500   │
└─────────────────────┴─────────┴───────┘
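The same function also works with PySpark. Below is a minimal sketch, assuming pyspark is installed; the input is built row by row from the same data dictionary so that the missing sales values arrive as proper nulls:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build rows so that Python None becomes a proper SQL null
rows = list(zip(data["sale_date"], data["store"], data["sales"]))
df_spark = spark.createDataFrame(
    rows, schema="sale_date timestamp, store string, sales long"
)
agnostic_ffill_by_store_improved(df_spark).show()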

Being forced to specify an ordering column for fill_null(strategy='forward') may feel like a restriction compared with the native Polars API. However, if we want to write portable and maintainable code, we cannot assume that our backend preserves row order, so this is a restriction that enables portability!

Final thoughts

Eager execution works perfectly well…until it doesn’t. It’s hard to know in advance how much data your pipeline will need to handle. By writing dataframe-agnostic code which works across eager and lazy dataframes, you can avoid dataframe lock-in and seamlessly scale as and when you need to.

And what will your teammates think when you present them with reusable functions which they can use with whichever dataframe library they prefer? They’ll probably love you for it.

This blog post was contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.
