What’s New in pandas 3.0: Expressions, Copy-on-Write, and Faster Strings

Introduction

pandas 3.0 brings some of the most significant changes to the library in years. This article covers:

  • pd.col expressions: Cleaner column operations without lambdas
  • Copy-on-Write: Predictable copy behavior by default
  • PyArrow-backed strings: Faster operations and better type safety

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Setup

pandas 3.0 is currently in pre-release. To follow along with the examples, install the release candidate:

pip install --upgrade --pre pandas==3.0.0rc1
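
To confirm the pre-release is active before running the examples, check the version string:

import pandas as pd

print(pd.__version__)  # should print 3.0.0rc1 (or newer)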

Cleaner Column Operations with pd.col

The Traditional Approaches

If you’ve ever had to modify an existing column or create a new one, you may be used to one of these approaches.

Square-bracket notation is the most common way to add a column. You reference the new column name and assign the result:

import pandas as pd

df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df['temp_f'] = df['temp_c'] * 9/5 + 32
df
temp_c temp_f
0 0 32.0
1 20 68.0
2 30 86.0
3 100 212.0

This modifies your original DataFrame in place, which means you can’t compare before and after without first making a copy.

df_original = pd.DataFrame({"temp_c": [0, 20, 30]})
df_original['temp_f'] = df_original['temp_c'] * 9/5 + 32
# df_original is now modified - no way to see the original state
df_original
temp_c temp_f
0 0 32.0
1 20 68.0
2 30 86.0

It also doesn’t return anything, so you can’t chain it with other operations. Method-chaining lets you write df.assign(...).query(...).sort_values(...) in one expression instead of multiple separate statements.

df = pd.DataFrame({"temp_c": [0, 20, 30]})

# This doesn't work - square-bracket assignment is a statement, not an
# expression, so there's nothing to chain .query() onto:
# (df['temp_f'] = df['temp_c'] * 9/5 + 32).query('temp_f > 50')  # SyntaxError

# You need separate statements instead
df['temp_f'] = df['temp_c'] * 9/5 + 32
df = df.query('temp_f > 50')
df
temp_c temp_f
1 20 68.0
2 30 86.0

Using assign solves the chaining problem by returning a new DataFrame instead of modifying in-place:

df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
    df.assign(temp_f=lambda x: x['temp_c'] * 9/5 + 32)
    .query('temp_f > 50')
)
df
temp_c temp_f
1 20 68.0
2 30 86.0
3 100 212.0

This works for chaining but relies on lambda functions. Lambda functions capture variables by reference, not by value, which can cause bugs:

df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = lambda df: df['x'] * factor

df = df.assign(**results)
df
x x_times_10 x_times_20 x_times_30
0 1 30 30 30
1 2 60 60 60
2 3 90 90 90

What went wrong: We expected x_times_10 to multiply by 10, x_times_20 by 20, and x_times_30 by 30. Instead, all three columns multiply by 30.

Why: Lambdas don’t capture values; they capture the variable itself. All three lambdas point to the same variable factor, and by the time assign() executes them, the loop has finished and factor is 30, so every lambda reads 30.
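
The classic workaround is to freeze the value with a default argument, which works but is easy to forget:

df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    # factor=factor binds the current value as a default argument,
    # so each lambda keeps its own copy instead of sharing the loop variable
    results[f'x_times_{factor}'] = lambda df, factor=factor: df['x'] * factor

df = df.assign(**results)  # now produces 10/20/30 as expected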

The pandas 3.0 Solution: pd.col

pandas 3.0 introduces pd.col, which lets you reference columns without lambda functions. The syntax is borrowed from PySpark and Polars.

Here’s the temp_f conversion rewritten with pd.col:

df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
df
temp_c temp_f
0 0 32.0
1 20 68.0
2 30 86.0
3 100 212.0

Unlike square-bracket notation, pd.col supports method-chaining. Unlike lambdas, it doesn’t capture variables by reference, so you avoid the scoping bugs shown earlier.
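
That means the assign-then-filter pipeline from earlier collapses into a single chain:

df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
    df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
    .query('temp_f > 50')
)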

Remember the lambda scoping bug? With pd.col, each multiplier is captured correctly:

df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = pd.col('x') * factor

df = df.assign(**results)
df
x x_times_10 x_times_20 x_times_30
0 1 10 20 30
1 2 20 40 60
2 3 30 60 90

Filtering with Expressions

Traditional filtering repeats df twice:

df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[df['temp_c'] >= 0]  # df appears twice
df
temp_c
1 0
2 15
3 25
4 30

With pd.col, you reference the column directly:

df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[pd.col('temp_c') >= 0]  # cleaner
df
temp_c
1 0
2 15
3 25
4 30
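
Expressions can also be combined with the usual boolean operators. Here’s a small sketch, assuming pd.col comparisons compose with & and | the same way boolean Series do (note the parentheses around each condition):

df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[(pd.col('temp_c') >= 0) & (pd.col('temp_c') <= 25)]  # keep mild temperatures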

Combining Multiple Columns

With lambdas, you need to repeat lambda x: x[...] for every column:

df = pd.DataFrame({
    "price": [100, 200, 150],
    "quantity": [2, 3, 4]
})

df = df.assign(
    total=lambda x: x["price"] * x["quantity"],
    discounted=lambda x: x["price"] * x["quantity"] * 0.9
)
df
price quantity total discounted
0 100 2 200 180.0
1 200 3 600 540.0
2 150 4 600 540.0

With pd.col, the same logic is more readable:

df = pd.DataFrame({
    "price": [100, 200, 150],
    "quantity": [2, 3, 4]
})

df = df.assign(
    total=pd.col("price") * pd.col("quantity"),
    discounted=pd.col("price") * pd.col("quantity") * 0.9
)
df
price quantity total discounted
0 100 2 200 180.0
1 200 3 600 540.0
2 150 4 600 540.0

Note that, unlike Polars and PySpark, pd.col cannot yet be used in groupby operations:

# This works in Polars: df.group_by("category").agg(pl.col("value").mean())
# But this doesn't work in pandas 3.0:
df.groupby("category").agg(pd.col("value").mean())  # Not supported yet

This limitation may be removed in future versions.
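
Until then, the standard groupby patterns work unchanged:

df = pd.DataFrame({"category": ["a", "a", "b"], "value": [1, 2, 3]})
df.groupby("category")["value"].mean()  # plain column selection + aggregation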

Copy-on-Write Is Now the Default

If you’ve used pandas, you’ve probably seen the SettingWithCopyWarning at some point. It appears when pandas can’t tell if you’re modifying a view or a copy of your data:

# This pattern caused confusion in pandas < 3.0
df2 = df[df["value"] > 10]
df2["status"] = "high"  # SettingWithCopyWarning!

Did this modify df or just df2? The answer depends on whether df2 is a view or a copy, and pandas can’t always predict which one it created. That’s what the warning is telling you.

pandas 3.0 makes the answer simple: filtering with df[...] always returns a copy. Modifying df2 never affects df.

This is called Copy-on-Write (CoW). If you just read df2, pandas shares memory with df. Only when you change df2 does pandas create a separate copy.
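
One rough way to observe this is np.shares_memory on the underlying arrays. This is a sketch, assuming Series.to_numpy() returns a zero-copy view here (it can fall back to copying for some dtypes):

import numpy as np

df = pd.DataFrame({"value": [5, 15, 25]})
df2 = df.reset_index(drop=True)  # under CoW, a lazy, shallow copy

# Both objects still point at the same underlying buffer...
print(np.shares_memory(df["value"].to_numpy(), df2["value"].to_numpy()))  # True

df2["value"] = 0  # ...until a write triggers the actual copy
print(np.shares_memory(df["value"].to_numpy(), df2["value"].to_numpy()))  # False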

Now when you filter and modify, there’s no warning and no uncertainty:

df = pd.DataFrame({"value": [5, 15, 25], "status": ["low", "low", "low"]})

# pandas 3.0: just works, no warning
df2 = df[df["value"] > 10]
df2["status"] = "high"  # Modifies df2 only, not df

df2
value status
1 15 high
2 25 high
df
value status
0 5 low
1 15 low
2 25 low

We can see that df is unchanged and no warning was raised.

Breaking Change: Chained Assignment

One pattern that breaks is chained assignment. With CoW, df["foo"] is a copy, so assigning to it only modifies the copy and doesn’t modify the original:

# This NO LONGER modifies df in pandas 3.0:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})

df["foo"][df["bar"] > 5] = 100
df
foo bar
0 1 4
1 2 6
2 3 8

Notice foo still contains [1, 2, 3]. This is because the value 100 was assigned to a copy that was immediately discarded.

Use .loc instead to modify the original DataFrame:

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
df.loc[df["bar"] > 5, "foo"] = 100
df
foo bar
0 1 4
1 100 6
2 100 8
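
If you prefer a chainable alternative, Series.mask expresses the same update as an expression:

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
# mask replaces values where the condition is True
df = df.assign(foo=df["foo"].mask(df["bar"] > 5, 100))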

A Dedicated String Dtype

pandas 2.x stores strings as object dtype, which is both slow and ambiguous. You can’t tell from the dtype alone whether a column is purely strings:

pd.options.future.infer_string = False  # pandas 2.x behavior

text = pd.Series(["hello", "world"])
messy = pd.Series(["hello", 42, {"key": "value"}])

print(f"text dtype: {text.dtype}")
print(f"messy dtype: {messy.dtype}")
text dtype: object
messy dtype: object

pandas 3.0 introduces a dedicated str dtype that only holds strings, making the type immediately clear:

pd.options.future.infer_string = True  # pandas 3.0 behavior

ser = pd.Series(["a", "b", "c"])
print(f"dtype: {ser.dtype}")
dtype: str
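
Because the dtype only holds strings, assigning a non-string now fails loudly instead of silently widening the column back to object. A small sketch (the exact error message may differ):

try:
    ser[0] = 123  # the str dtype rejects non-string values
except TypeError as e:
    print(f"TypeError: {e}")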

Performance Gains

The new string dtype is backed by PyArrow (if installed), which provides significant performance improvements:

  • String operations run 5-10x faster because PyArrow processes data in contiguous memory blocks instead of individual Python objects
  • Memory usage reduced by up to 50% since strings are stored in a compact binary format rather than as Python objects with overhead
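
A quick way to sanity-check these claims on your own machine is a micro-benchmark. This is a rough sketch; exact numbers depend on your data, hardware, and whether PyArrow is installed:

import time

words = pd.Series(["hello", "world", "pandas"] * 1_000_000)

for label, ser in [("object", words.astype("object")), ("str", words.astype("str"))]:
    start = time.perf_counter()
    ser.str.upper()  # a representative vectorized string operation
    print(f"{label:>6} dtype: {time.perf_counter() - start:.3f}s")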

Arrow Ecosystem Interoperability

DataFrames can be passed to Arrow-based tools like Polars and DuckDB without copying or converting data:

import polars as pl

pandas_df = pd.DataFrame({"name": ["alice", "bob", "charlie"]})
polars_df = pl.from_pandas(pandas_df)  # Zero-copy - data already in Arrow format
polars_df
shape: (3, 1)
┌─────────┐
│ name    │
│ ---     │
│ str     │
╞═════════╡
│ alice   │
│ bob     │
│ charlie │
└─────────┘
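
DuckDB can do the same, querying the DataFrame in place by name via its replacement scans:

import duckdb

# DuckDB resolves `pandas_df` from the surrounding Python scope
duckdb.sql("SELECT name FROM pandas_df WHERE name LIKE 'a%'").df()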

Final Thoughts

pandas 3.0 brings meaningful improvements to your daily workflow:

  • Write cleaner code with pd.col expressions instead of lambdas
  • Avoid SettingWithCopyWarning confusion with Copy-on-Write as the default
  • Get 5-10x faster string operations with the new PyArrow-backed str dtype
  • Pass DataFrames to Polars and DuckDB without data conversion

You can test the string and Copy-on-Write changes in pandas 2.3 before upgrading by enabling the future flags (pd.col itself ships only in 3.0):

import pandas as pd

# Enable PyArrow-backed strings
pd.options.future.infer_string = True

# Enable Copy-on-Write behavior
pd.options.mode.copy_on_write = True

Fix any deprecation warnings that appear, and you’ll be ready for 3.0.

The expressions section was inspired by a blog post contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →
