What’s New in pandas 3.0: Expressions, Copy-on-Write, and Faster Strings

By Khuyen Tran

Table of Contents

Introduction
Setup
Cleaner Column Operations with pd.col
Copy-on-Write Is Now the Default
A Dedicated String Dtype
Final Thoughts

Introduction
pandas 3.0 brings some of the most significant changes to the library in years. This article covers:

pd.col expressions: Cleaner column operations without lambdas
Copy-on-Write: Predictable copy behavior by default
PyArrow-backed strings: Faster operations and better type safety

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Setup
pandas 3.0 requires Python 3.11 or higher. Install it with:
pip install --upgrade pandas

To try the new string dtype and Copy-on-Write before upgrading, enable them in pandas 2.3 (pd.col itself is only available from 3.0):
import pandas as pd

pd.options.future.infer_string = True
pd.options.mode.copy_on_write = True
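If the same script has to run on both pandas 2.x and 3.0, the opt-in can be wrapped in a version check. The sketch below is one possible way to do it. The helper name enable_pandas3_preview is made up for illustration, and it assumes only the two option names shown above.

```python
import pandas as pd


def enable_pandas3_preview() -> bool:
    """Opt in to pandas 3.0 defaults on older versions (hypothetical helper).

    Returns True if at least one option was switched on, False otherwise.
    """
    major = int(pd.__version__.split(".")[0])
    if major >= 3:
        # These behaviors are already the default in 3.0; nothing to enable.
        return False

    enabled = False
    for option in ("future.infer_string", "mode.copy_on_write"):
        try:
            pd.set_option(option, True)
            enabled = True
        except pd.errors.OptionError:
            # Older releases may not know this option; skip it.
            pass
    return enabled


print(pd.__version__, enable_pandas3_preview())
```

Calling the helper once at the top of a script keeps the rest of the code identical across versions.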

Cleaner Column Operations with pd.col
The Traditional Approaches
If you’ve ever had to modify an existing column or create a new one, you may be used to one of these approaches.
Square-bracket notation is the most common way to add a column. You reference the new column name and assign the result:
import pandas as pd

df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df['temp_f'] = df['temp_c'] * 9/5 + 32
df

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0
3     100   212.0

This modifies your original DataFrame in place, which means you can’t compare before and after without first making a copy.
df_original = pd.DataFrame({"temp_c": [0, 20, 30]})
df_original['temp_f'] = df_original['temp_c'] * 9/5 + 32
# df_original is now modified – no way to see the original state
df_original

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0

Square-bracket assignment is also a statement rather than an expression, so it returns nothing you can chain on. Method chaining lets you write df.assign(…).query(…).sort_values(…) in one expression instead of several separate statements.
df = pd.DataFrame({"temp_c": [0, 20, 30]})

# This doesn't work: an assignment is a statement, so there is no result to call .query() on
# df['temp_f'] = df['temp_c'] * 9/5 + 32.query('temp_f > 50')

# You need separate statements instead
df['temp_f'] = df['temp_c'] * 9/5 + 32
df = df.query('temp_f > 50')
df

   temp_c  temp_f
1      20    68.0
2      30    86.0

Using assign solves the chaining problem by returning a new DataFrame instead of modifying in-place:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
    df.assign(temp_f=lambda x: x['temp_c'] * 9/5 + 32)
    .query('temp_f > 50')
)
df

   temp_c  temp_f
1      20    68.0
2      30    86.0
3     100   212.0

This works for chaining but relies on lambda functions. Lambda functions capture variables by reference, not by value, which can cause bugs:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = lambda df: df['x'] * factor

df = df.assign(**results)
df

   x  x_times_10  x_times_20  x_times_30
0  1          30          30          30
1  2          60          60          60
2  3          90          90          90

What went wrong: We expected x_times_10 to multiply by 10, x_times_20 by 20, and x_times_30 by 30. Instead, all three columns multiply by 30.
Why: Lambdas don’t save values, they save variable names. All three lambdas point to the same variable factor. After the loop ends, factor = 30. When assign() executes the lambdas, they all read factor and get 30.
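For code that has to keep using lambdas, the usual Python workaround is to bind the loop variable as a default argument, because defaults are evaluated when the lambda is created. This is a minimal sketch of that general Python idiom, not something specific to pandas:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    # factor=factor evaluates the default right now,
    # so each lambda keeps its own multiplier.
    results[f"x_times_{factor}"] = lambda d, factor=factor: d["x"] * factor

fixed = df.assign(**results)
print(fixed)
```

It works, but the extra factor=factor is easy to forget, which is part of the motivation for an expression-based alternative.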
The pandas 3.0 Solution: pd.col
pandas 3.0 introduces pd.col, which lets you reference columns without lambda functions. The syntax is borrowed from PySpark and Polars.
Here’s the temp_f conversion rewritten with pd.col:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
df

   temp_c  temp_f
0       0    32.0
1      20    68.0
2      30    86.0
3     100   212.0

Unlike square-bracket notation, pd.col supports method-chaining. Unlike lambdas, it doesn’t capture variables by reference, so you avoid the scoping bugs shown earlier.
Remember the lambda scoping bug? With pd.col, each multiplier is captured correctly:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = pd.col('x') * factor

df = df.assign(**results)
df

   x  x_times_10  x_times_20  x_times_30
0  1          10          20          30
1  2          20          40          60
2  3          30          60          90
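To see why an expression object avoids the scoping bug, it can help to build a tiny one by hand. The ToyCol class below is purely hypothetical and is not how pandas implements pd.col. It only illustrates the idea that the multiplier is stored when the expression is built rather than looked up later.

```python
import pandas as pd


class ToyCol:
    """Toy deferred column expression (illustration only, not pandas internals)."""

    def __init__(self, name, func=None):
        self.name = name
        # Default behavior: look the column up once a DataFrame is supplied.
        self._func = func if func is not None else (lambda df: df[name])

    def __mul__(self, other):
        previous = self._func
        # `other` is local to this call, so every expression
        # keeps its own multiplier.
        return ToyCol(self.name, lambda df: previous(df) * other)

    def __call__(self, df):
        # assign() accepts callables and passes the DataFrame in.
        return self._func(df)


df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f"x_times_{factor}"] = ToyCol("x") * factor

toy = df.assign(**results)
print(toy)
```

Each ToyCol("x") * factor call receives the current value of factor as an argument, so the three columns end up with three different multipliers.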

Filtering with Expressions
Traditional filtering mentions df twice:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[df['temp_c'] >= 0] # df appears twice
df

   temp_c
1       0
2      15
3      25
4      30

With pd.col, you reference the column directly:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[pd.col('temp_c') >= 0] # cleaner
df

   temp_c
1       0
2      15
3      25
4      30
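For code that still has to run on pandas 2.x, where pd.col is not available, two long-standing alternatives also avoid naming df twice. The sketch below shows a callable passed to .loc and a query string; both should select the same rows as the filter above.

```python
import pandas as pd

df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})

# A callable passed to .loc receives the DataFrame being indexed.
via_callable = df.loc[lambda d: d["temp_c"] >= 0]

# query() takes the condition as a string that refers to column names.
via_query = df.query("temp_c >= 0")

print(via_callable.equals(via_query))
```

The callable form is the closest relative of pd.col, since an expression is accepted in the same places a callable is.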

Combining Multiple Columns
With lambdas, you need to repeat lambda x: x[…] for every column:
df = pd.DataFrame({
    "price": [100, 200, 150],
    "quantity": [2, 3, 4]
})

df = df.assign(
    total=lambda x: x["price"] * x["quantity"],
    discounted=lambda x: x["price"] * x["quantity"] * 0.9
)
df

   price  quantity  total  discounted
0    100         2    200       180.0
1    200         3    600       540.0
2    150         4    600       540.0

With pd.col, the same logic is more readable:
df = pd.DataFrame({
    "price": [100, 200, 150],
    "quantity": [2, 3, 4]
})

df = df.assign(
    total=pd.col("price") * pd.col("quantity"),
    discounted=pd.col("price") * pd.col("quantity") * 0.9
)
df

   price  quantity  total  discounted
0    100         2    200       180.0
1    200         3    600       540.0
2    150         4    600       540.0

Note that, unlike Polars and PySpark, pd.col cannot yet be used in groupby operations:
# This works in Polars: df.group_by("category").agg(pl.col("value").mean())
# But this doesn't work in pandas 3.0:
df.groupby("category").agg(pd.col("value").mean()) # Not supported yet

This limitation may be removed in future versions.
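Until then, grouped aggregations can be written with named aggregation, which also avoids lambdas. A minimal sketch, assuming a small made-up DataFrame with category and value columns:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "value": [1.0, 3.0, 10.0],
})

# Named aggregation: output_column=(input_column, aggregation)
means = df.groupby("category").agg(value_mean=("value", "mean"))
print(means)
```

The keyword on the left names the output column, and the tuple names the input column and the aggregation to apply.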
Copy-on-Write Is Now the Default
If you’ve used pandas, you’ve probably seen the SettingWithCopyWarning at some point. It appears when pandas can’t tell if you’re modifying a view or a copy of your data:
# This pattern caused confusion in pandas < 3.0
df2 = df[df["value"] > 10]
df2["status"] = "high" # SettingWithCopyWarning!

Did this modify df or just df2? The answer depends on whether df2 is a view or a copy, and pandas can’t always predict which one it created. That’s what the warning is telling you.
pandas 3.0 makes the answer simple: any DataFrame or Series derived from another one behaves as a copy. Modifying df2 never affects df.
This is called Copy-on-Write (CoW). Where it can, pandas lets the new object share memory with the original and defers the actual copy until one of them is modified.
Now when you filter and modify, there’s no warning and no uncertainty:
df = pd.DataFrame({"value": [5, 15, 25], "status": ["low", "low", "low"]})

# pandas 3.0: just works, no warning
df2 = df[df["value"] > 10]
df2["status"] = "high" # Modifies df2 only, not df

df2

   value status
1     15   high
2     25   high

df

   value status
0      5    low
1     15    low
2     25    low

We can see that df is unchanged and no warning was raised.
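If the same code also has to run on pandas 2.x without Copy-on-Write enabled, an explicit .copy() is one way to get the same result there without the warning. This is a sketch of that portable pattern, not a requirement in 3.0:

```python
import pandas as pd

df = pd.DataFrame({"value": [5, 15, 25], "status": ["low", "low", "low"]})

# The explicit copy makes the intent clear on every pandas version.
df2 = df[df["value"] > 10].copy()
df2["status"] = "high"

print(df2)
print(df)
```

Under pandas 3.0 the .copy() call is redundant but harmless, so it can be removed once older versions no longer need to be supported.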
Breaking Change: Chained Assignment
One pattern that breaks is chained assignment. With CoW, df["foo"] behaves as a copy, so assigning into it only modifies that temporary object and never the original:
# This NO LONGER modifies df in pandas 3.0:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})

df["foo"][df["bar"] > 5] = 100
df

   foo  bar
0    1    4
1    2    6
2    3    8

Notice foo still contains [1, 2, 3]. The value 100 was assigned to a temporary copy that was immediately discarded. pandas flags this pattern with a ChainedAssignmentError warning.
Use .loc instead to modify the original DataFrame:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
df.loc[df["bar"] > 5, "foo"] = 100
df

   foo  bar
0    1    4
1  100    6
2  100    8
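When the update is part of a method chain, a conditional replacement inside assign is another option. The sketch below uses Series.mask, which replaces values where the condition is True, and returns a new DataFrame instead of modifying df:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})

# mask() swaps in 100 wherever bar > 5 and leaves other rows untouched.
updated = df.assign(foo=lambda d: d["foo"].mask(d["bar"] > 5, 100))

print(updated)
```

Because assign returns a new object, the original df keeps its values, which fits the Copy-on-Write model described above.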

A Dedicated String Dtype
pandas 2.x stores strings as object dtype, which is both slow and ambiguous. You can’t tell from the dtype alone whether a column is purely strings:
pd.options.future.infer_string = False # pandas 2.x behavior

text = pd.Series(["hello", "world"])
messy = pd.Series(["hello", 42, {"key": "value"}])

print(f"text dtype: {text.dtype}")
print(f"messy dtype: {messy.dtype}")

text dtype: object
messy dtype: object
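With object dtype, the only way to know whether a column is purely strings is to inspect the values. The helper below, is_all_strings, is a hypothetical name used for illustration; it simply checks every element with isinstance.

```python
import pandas as pd


def is_all_strings(ser: pd.Series) -> bool:
    """Return True only if every element is a Python str (hypothetical helper)."""
    return bool(ser.map(lambda v: isinstance(v, str)).all())


text = pd.Series(["hello", "world"])
messy = pd.Series(["hello", 42, {"key": "value"}])

print(is_all_strings(text))
print(is_all_strings(messy))
```

This scans the whole column, and missing values count as non-strings here, which is exactly the kind of check a dedicated dtype makes unnecessary.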

pandas 3.0 introduces a dedicated str dtype that only holds strings, making the type immediately clear:
pd.options.future.infer_string = True # pandas 3.0 behavior

ser = pd.Series(["a", "b", "c"])
print(f"dtype: {ser.dtype}")

dtype: str

Performance Gains
The new string dtype is backed by PyArrow (if installed), which provides significant performance improvements:

String operations run 5-10x faster because PyArrow processes data in contiguous memory blocks instead of individual Python objects
Memory usage reduced by up to 50% since strings are stored in a compact binary format rather than as Python objects with overhead
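The actual savings depend on the data, so it is worth measuring on your own columns. The sketch below compares memory usage for an object column and a PyArrow-backed string column, using made-up sample data and skipping the comparison if PyArrow is not installed:

```python
import pandas as pd

# Hypothetical sample: 100,000 short strings stored as Python objects.
words = pd.Series(["alpha", "beta", "gamma", "delta"] * 25_000, dtype=object)
object_bytes = words.memory_usage(deep=True)

try:
    arrow_bytes = words.astype("string[pyarrow]").memory_usage(deep=True)
except ImportError:
    arrow_bytes = None  # PyArrow is not available in this environment

print(f"object dtype: {object_bytes:,} bytes")
if arrow_bytes is not None:
    print(f"string[pyarrow]: {arrow_bytes:,} bytes")
```

Passing deep=True matters for the object column, since otherwise only the pointers are counted and not the Python string objects they refer to.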

Arrow Ecosystem Interoperability
DataFrames can be passed to Arrow-based tools like Polars and DuckDB without copying or converting data:
import polars as pl

pandas_df = pd.DataFrame({"name": ["alice", "bob", "charlie"]})
polars_df = pl.from_pandas(pandas_df) # Zero-copy – data already in Arrow format
polars_df

      name
0    alice
1      bob
2  charlie

Final Thoughts
pandas 3.0 brings meaningful improvements to your daily workflow:

Write cleaner code with pd.col expressions instead of lambdas
Avoid SettingWithCopyWarning confusion with Copy-on-Write as the default
Get 5-10x faster string operations with the new PyArrow-backed str dtype
Pass DataFrames to Polars and DuckDB without data conversion

Related Resources
For more on DataFrame tools and performance optimization:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames – Compare pandas with Polars for performance-critical workflows
Scaling Pandas Workflows with PySpark’s Pandas API – Use familiar pandas syntax on distributed data
pandas vs Polars vs DuckDB: A Data Scientist’s Guide – Choose the right tool for your data analysis needs

💡 The expressions section was inspired by a blog post contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.