Table of Contents
- Introduction
- Setup
- Cleaner Column Operations with pd.col
- Copy-on-Write Is Now the Default
- A Dedicated String Dtype
- Final Thoughts
Introduction
pandas 3.0 brings some of the most significant changes to the library in years. This article covers:
- pd.col expressions: Cleaner column operations without lambdas
- Copy-on-Write: Predictable copy behavior by default
- PyArrow-backed strings: Faster operations and better type safety
💻 Get the Code: Open the notebook in Google Colab to run it in your browser, or grab the source from GitHub.
Setup
pandas 3.0 requires Python 3.11 or higher. Install it with:
pip install --upgrade pandas
To try Copy-on-Write and the new string dtype before upgrading, enable them in pandas 2.3 (pd.col itself is only available in 3.0):
pd.options.future.infer_string = True
pd.options.mode.copy_on_write = True
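If the same script has to run on both pandas 2.3 and 3.0, one option is to put these switches behind a version check. The following is only a sketch of that idea; `enable_pandas3_previews` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def enable_pandas3_previews() -> bool:
    """Enable the pandas 3.0 preview options when running on pandas 2.3.x.

    Hypothetical helper: returns True if the options were switched on,
    False if nothing was changed.
    """
    major, minor = (int(part) for part in pd.__version__.split(".")[:2])
    if (major, minor) != (2, 3):
        # pandas 3.0+ already uses these defaults; older releases are left untouched.
        return False
    pd.options.future.infer_string = True
    pd.options.mode.copy_on_write = True
    return True


previews_enabled = enable_pandas3_previews()
print(f"pandas {pd.__version__}: previews enabled = {previews_enabled}")
```

Keeping the check in one function means the option lines can be deleted in a single place once pandas 2.x support is dropped.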
Cleaner Column Operations with pd.col
The Traditional Approaches
If you’ve ever had to modify an existing column or create a new one, you may be used to one of these approaches.
Square-bracket notation is the most common way to add a column. You reference the new column name and assign the result:
import pandas as pd
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df['temp_f'] = df['temp_c'] * 9/5 + 32
df
| | temp_c | temp_f |
|---|---|---|
| 0 | 0 | 32.0 |
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
| 3 | 100 | 212.0 |
This overwrites your original DataFrame, which means you can’t compare before and after without first making a copy.
df_original = pd.DataFrame({"temp_c": [0, 20, 30]})
df_original['temp_f'] = df_original['temp_c'] * 9/5 + 32
# df_original is now modified - no way to see the original state
df_original
| | temp_c | temp_f |
|---|---|---|
| 0 | 0 | 32.0 |
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
It also doesn’t return anything, so you can’t chain it with other operations. Method-chaining lets you write df.assign(...).query(...).sort_values(...) in one expression instead of multiple separate statements.
df = pd.DataFrame({"temp_c": [0, 20, 30]})
# This doesn't work - assignment is a statement, not an expression, so nothing can be chained onto it
# df['temp_f'] = df['temp_c'] * 9/5 + 32.query('temp_f > 50')
# You need separate statements instead
df['temp_f'] = df['temp_c'] * 9/5 + 32
df = df.query('temp_f > 50')
df
| | temp_c | temp_f |
|---|---|---|
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
Using assign solves the chaining problem by returning a new DataFrame instead of modifying in-place:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
    df.assign(temp_f=lambda x: x['temp_c'] * 9/5 + 32)
    .query('temp_f > 50')
)
df
| | temp_c | temp_f |
|---|---|---|
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
| 3 | 100 | 212.0 |
This works for chaining but relies on lambda functions. Lambda functions capture variables by reference, not by value, which can cause bugs:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = lambda df: df['x'] * factor
df = df.assign(**results)
df
| | x | x_times_10 | x_times_20 | x_times_30 |
|---|---|---|---|---|
| 0 | 1 | 30 | 30 | 30 |
| 1 | 2 | 60 | 60 | 60 |
| 2 | 3 | 90 | 90 | 90 |
What went wrong: We expected x_times_10 to multiply by 10, x_times_20 by 20, and x_times_30 by 30. Instead, all three columns multiply by 30.
Why: Lambdas don’t save values, they save variable names. All three lambdas point to the same variable factor. After the loop ends, factor = 30. When assign() executes the lambdas, they all read factor and get 30.
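The late-binding behavior described above is plain Python and can be reproduced without pandas. The sketch below also shows a common workaround, binding the current value through a default argument; the variable names are illustrative only.

```python
factors = [10, 20, 30]

# Late binding: every lambda looks up `factor` when it is called,
# and by then the loop has finished with factor == 30.
late_bound = []
for factor in factors:
    late_bound.append(lambda value: value * factor)

# Early binding: a default argument is evaluated when the lambda is
# defined, so each lambda keeps its own multiplier.
early_bound = []
for factor in factors:
    early_bound.append(lambda value, factor=factor: value * factor)

late_results = [fn(1) for fn in late_bound]
early_results = [fn(1) for fn in early_bound]
print(late_results)   # [30, 30, 30]
print(early_results)  # [10, 20, 30]
```

The default-argument trick works, but it is easy to forget and adds noise to every lambda.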
The pandas 3.0 Solution: pd.col
pandas 3.0 introduces pd.col, which lets you reference columns without lambda functions. The syntax is borrowed from PySpark and Polars.
Here’s the temp_f conversion rewritten with pd.col:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
df
| | temp_c | temp_f |
|---|---|---|
| 0 | 0 | 32.0 |
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
| 3 | 100 | 212.0 |
Unlike square-bracket notation, pd.col supports method-chaining. Unlike lambdas, it doesn’t capture variables by reference, so you avoid the scoping bugs shown earlier.
Remember the lambda scoping bug? With pd.col, each multiplier is captured correctly:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = pd.col('x') * factor
df = df.assign(**results)
df
| | x | x_times_10 | x_times_20 | x_times_30 |
|---|---|---|---|---|
| 0 | 1 | 10 | 20 | 30 |
| 1 | 2 | 20 | 40 | 60 |
| 2 | 3 | 30 | 60 | 90 |
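Since pd.col only exists in pandas 3.0 and later, code that still has to run on pandas 2.x needs a fallback. One possible pattern, shown here as a sketch, is a small wrapper that returns a pd.col expression when available and a lambda otherwise. The helper name `scaled` is hypothetical.

```python
import pandas as pd


def scaled(name: str, factor: float):
    """Return something assign() can evaluate as `frame[name] * factor`."""
    if hasattr(pd, "col"):
        # pandas 3.0+: build an expression object.
        return pd.col(name) * factor
    # Older pandas: `factor` is a function parameter here, so each call
    # gets its own binding and the loop bug does not occur.
    return lambda frame: frame[name] * factor


df = pd.DataFrame({"x": [1, 2, 3]})
new_columns = {f"x_times_{factor}": scaled("x", factor) for factor in [10, 20, 30]}
df = df.assign(**new_columns)
print(df)
```

Either branch produces the same three columns, so the wrapper can be removed once pandas 3.0 is the minimum supported version.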
Filtering with Expressions
Traditional filtering references df twice:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[df['temp_c'] >= 0] # df appears twice
df
| | temp_c |
|---|---|
| 1 | 0 |
| 2 | 15 |
| 3 | 25 |
| 4 | 30 |
With pd.col, you reference the column directly:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[pd.col('temp_c') >= 0] # cleaner
df
| | temp_c |
|---|---|
| 1 | 0 |
| 2 | 15 |
| 3 | 25 |
| 4 | 30 |
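Because the filter no longer needs the DataFrame's variable name, it can sit in the middle of a method chain. The sketch below filters and then adds a column in one expression. It uses pd.col when available and falls back to lambdas on older pandas; the helper names are hypothetical.

```python
import pandas as pd

HAS_PD_COL = hasattr(pd, "col")  # pd.col only exists in pandas 3.0+


def non_negative(name: str):
    """Row filter usable inside .loc[...]."""
    if HAS_PD_COL:
        return pd.col(name) >= 0
    return lambda frame: frame[name] >= 0


def to_fahrenheit(name: str):
    """Column expression usable inside .assign(...)."""
    if HAS_PD_COL:
        return pd.col(name) * 9 / 5 + 32
    return lambda frame: frame[name] * 9 / 5 + 32


readings = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
warm = (
    readings
    .loc[non_negative("temp_c")]             # keep rows with temp_c >= 0
    .assign(temp_f=to_fahrenheit("temp_c"))  # evaluated on the filtered rows
)
print(warm)
```

Note that the assign step sees the already filtered frame, which is exactly what a chain with `readings[...]` written out by name cannot express.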
Combining Multiple Columns
With lambdas, you need to repeat lambda x: x[...] for every column:
df = pd.DataFrame({
    "price": [100, 200, 150],
    "quantity": [2, 3, 4]
})
df = df.assign(
    total=lambda x: x["price"] * x["quantity"],
    discounted=lambda x: x["price"] * x["quantity"] * 0.9
)
df
| | price | quantity | total | discounted |
|---|---|---|---|---|
| 0 | 100 | 2 | 200 | 180.0 |
| 1 | 200 | 3 | 600 | 540.0 |
| 2 | 150 | 4 | 600 | 540.0 |
With pd.col, the same logic is more readable:
df = pd.DataFrame({
    "price": [100, 200, 150],
    "quantity": [2, 3, 4]
})
df = df.assign(
    total=pd.col("price") * pd.col("quantity"),
    discounted=pd.col("price") * pd.col("quantity") * 0.9
)
df
| | price | quantity | total | discounted |
|---|---|---|---|---|
| 0 | 100 | 2 | 200 | 180.0 |
| 1 | 200 | 3 | 600 | 540.0 |
| 2 | 150 | 4 | 600 | 540.0 |
Note that, unlike Polars and PySpark, pd.col cannot yet be used in groupby operations:
# This works in Polars: df.group_by("category").agg(pl.col("value").mean())
# But this doesn't work in pandas 3.0:
df.groupby("category").agg(pd.col("value").mean()) # Not supported yet
This limitation may be removed in future versions.
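Until then, pandas' existing named aggregation syntax covers the same use case. This is a minimal sketch with made-up sample data; the column names `category` and `value` simply mirror the snippet above.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "value": [1.0, 3.0, 10.0],
})

# Named aggregation: output_column=(input_column, aggregation)
summary = df.groupby("category").agg(value_mean=("value", "mean"))
print(summary)
```

The keyword on the left becomes the output column name, so the result stays readable without any lambdas.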
Copy-on-Write Is Now the Default
If you’ve used pandas, you’ve probably seen the SettingWithCopyWarning at some point. It appears when pandas can’t tell if you’re modifying a view or a copy of your data:
# This pattern caused confusion in pandas < 3.0
df2 = df[df["value"] > 10]
df2["status"] = "high" # SettingWithCopyWarning!
Did this modify df or just df2? The answer depends on whether df2 is a view or a copy, and pandas can’t always predict which one it created. That’s what the warning is telling you.
pandas 3.0 makes the answer simple: the result of df[...] always behaves as a copy. Modifying df2 never affects df.
This is called Copy-on-Write (CoW). Where it can, pandas lets a derived object share memory with its parent as long as you only read from it, and makes the separate copy only when one of them is modified.
Now when you filter and modify, there’s no warning and no uncertainty:
df = pd.DataFrame({"value": [5, 15, 25], "status": ["low", "low", "low"]})
# pandas 3.0: just works, no warning
df2 = df[df["value"] > 10]
df2["status"] = "high" # Modifies df2 only, not df
df2
| | value | status |
|---|---|---|
| 1 | 15 | high |
| 2 | 25 | high |
df
| | value | status |
|---|---|---|
| 0 | 5 | low |
| 1 | 15 | low |
| 2 | 25 | low |
We can see that df is unchanged and no warning was raised.
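The deferred copy can be observed directly. The sketch below selects a column, checks with NumPy whether it still shares memory with the parent DataFrame, then writes to it and checks again. It assumes pandas 2.x or later and switches Copy-on-Write on manually when running on 2.x.

```python
import numpy as np
import pandas as pd

# On pandas 2.x, opt in to Copy-on-Write; pandas 3.0+ has it on by default.
if int(pd.__version__.split(".")[0]) < 3:
    pd.options.mode.copy_on_write = True

df = pd.DataFrame({"value": [5, 15, 25]})
column = df["value"]  # no data is copied at this point

shared_before = np.shares_memory(column.to_numpy(), df["value"].to_numpy())

column.iloc[0] = 99  # first write: pandas copies the data for `column` only

shared_after = np.shares_memory(column.to_numpy(), df["value"].to_numpy())

print(f"shares memory before write: {shared_before}")
print(f"shares memory after write:  {shared_after}")
print(df["value"].tolist())  # [5, 15, 25]
```

Reads stay cheap because nothing is duplicated up front, and the write only pays for a copy of the object being changed.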
Breaking Change: Chained Assignment
One pattern that breaks is chained assignment. With CoW, df["foo"] behaves as a copy, so assigning into it modifies only that temporary object and never the original:
# This NO LONGER modifies df in pandas 3.0:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
df["foo"][df["bar"] > 5] = 100
df
| | foo | bar |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 6 |
| 2 | 3 | 8 |
Notice foo still contains [1, 2, 3]. This is because the value 100 was assigned to a copy that was immediately discarded.
Use .loc instead to modify the original DataFrame:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
df.loc[df["bar"] > 5, "foo"] = 100
df
| | foo | bar |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 100 | 6 |
| 2 | 100 | 8 |
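If you prefer to keep the update inside a method chain rather than mutate df, a conditional replacement with Series.mask inside assign is one alternative. This is a sketch of that option, not the only way to do it.

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})

# mask() replaces values where the condition is True and keeps the rest.
updated = df.assign(foo=lambda frame: frame["foo"].mask(frame["bar"] > 5, 100))

print(updated["foo"].tolist())  # [1, 100, 100]
print(df["foo"].tolist())       # [1, 2, 3]
```

Unlike the .loc version, this returns a new DataFrame and leaves the original untouched, which fits the chaining style shown earlier.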
A Dedicated String Dtype
By default, pandas 2.x stores strings as object dtype, which is both slow and ambiguous. You can’t tell from the dtype alone whether a column is purely strings:
pd.options.future.infer_string = False # pandas 2.x behavior
text = pd.Series(["hello", "world"])
messy = pd.Series(["hello", 42, {"key": "value"}])
print(f"text dtype: {text.dtype}")
print(f"messy dtype: {messy.dtype}")
text dtype: object
messy dtype: object
pandas 3.0 introduces a dedicated str dtype that only holds strings, making the type immediately clear:
pd.options.future.infer_string = True # pandas 3.0 behavior
ser = pd.Series(["a", "b", "c"])
print(f"dtype: {ser.dtype}")
dtype: str
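To see why the ambiguity of object dtype matters, consider what it takes to verify that a column really holds only strings: you have to inspect every value. The helper below is a hypothetical illustration of that check.

```python
import pandas as pd


def holds_only_strings(series: pd.Series) -> bool:
    """Scan every non-missing value, which is what object dtype forces you to do."""
    return all(isinstance(value, str) for value in series.dropna())


text = pd.Series(["hello", "world"])
messy = pd.Series(["hello", 42, {"key": "value"}])

print(holds_only_strings(text))   # True
print(holds_only_strings(messy))  # False
```

With a dedicated string dtype, the same guarantee comes from the dtype itself, so no scan over the values is needed.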
Performance Gains
The new string dtype is backed by PyArrow (if installed), which provides significant performance improvements. Without PyArrow, pandas falls back to storing Python string objects, so the gains below do not apply:
- String operations can run 5-10x faster because PyArrow processes data in contiguous memory blocks instead of individual Python objects
- Memory usage can drop by up to 50% since strings are stored in a compact binary format rather than as Python objects with overhead
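The figures above depend heavily on the data, so it is worth measuring on your own columns. The sketch below compares the memory footprint of an object-dtype Series with a PyArrow-backed one. The sample words are made up, and the PyArrow branch is skipped when PyArrow is not installed.

```python
import pandas as pd

words = ["alpha", "beta", "gamma", "delta"] * 25_000  # hypothetical sample data

# Object dtype: one Python str object per element.
as_object = pd.Series(words, dtype="object")
object_bytes = int(as_object.memory_usage(index=False, deep=True))

try:
    # Explicitly request PyArrow-backed string storage.
    as_arrow = pd.Series(words, dtype="string[pyarrow]")
    arrow_bytes = int(as_arrow.memory_usage(index=False, deep=True))
except ImportError:
    arrow_bytes = None  # PyArrow is not installed (or too old for this pandas)

print(f"object dtype:   {object_bytes:,} bytes")
if arrow_bytes is not None:
    print(f"PyArrow-backed: {arrow_bytes:,} bytes")
```

Passing deep=True matters here: without it, pandas only counts the pointers of an object column and not the strings they refer to.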
Arrow Ecosystem Interoperability
Because PyArrow-backed string columns already use the Arrow memory layout, Arrow-based tools like Polars and DuckDB can read them with little or no conversion:
import polars as pl
pandas_df = pd.DataFrame({"name": ["alice", "bob", "charlie"]})
polars_df = pl.from_pandas(pandas_df) # String data is already in Arrow format when PyArrow is installed
polars_df
| | name |
|---|---|
| 0 | alice |
| 1 | bob |
| 2 | charlie |
Final Thoughts
pandas 3.0 brings meaningful improvements to your daily workflow:
- Write cleaner code with pd.col expressions instead of lambdas
- Avoid SettingWithCopyWarning confusion with Copy-on-Write as the default
- Get 5-10x faster string operations with the new PyArrow-backed str dtype
- Pass DataFrames to Polars and DuckDB without data conversion
Related Resources
For more on DataFrame tools and performance optimization:
- Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames – Compare pandas with Polars for performance-critical workflows
- Scaling Pandas Workflows with PySpark’s Pandas API – Use familiar pandas syntax on distributed data
- pandas vs Polars vs DuckDB: A Data Scientist’s Guide – Choose the right tool for your data analysis needs
💡 The expressions section was inspired by a blog post contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.