Table of Contents
- Introduction
- Setup
- Cleaner Column Operations with pd.col
- Copy-on-Write Is Now the Default
- A Dedicated String Dtype
- Final Thoughts
Introduction
pandas 3.0 brings some of the most significant changes to the library in years. This article covers:
- pd.col expressions: Cleaner column operations without lambdas
- Copy-on-Write: Predictable copy behavior by default
- PyArrow-backed strings: Faster operations and better type safety
💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!
Setup
pandas 3.0 is currently in pre-release. To follow along with the examples, install the release candidate:
pip install --upgrade --pre pandas==3.0.0rc1
Cleaner Column Operations with pd.col
The Traditional Approaches
If you’ve ever had to modify an existing column or create a new one, you may be used to one of these approaches.
Square-bracket notation is the most common way to add a column. You reference the new column name and assign the result:
import pandas as pd
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df['temp_f'] = df['temp_c'] * 9/5 + 32
df
| | temp_c | temp_f |
|---|---|---|
| 0 | 0 | 32.0 |
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
| 3 | 100 | 212.0 |
This modifies your original DataFrame in place, which means you can't compare before and after without first making a copy.
df_original = pd.DataFrame({"temp_c": [0, 20, 30]})
df_original['temp_f'] = df_original['temp_c'] * 9/5 + 32
# df_original is now modified - no way to see the original state
df_original
| | temp_c | temp_f |
|---|---|---|
| 0 | 0 | 32.0 |
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
It also doesn’t return anything, so you can’t chain it with other operations. Method-chaining lets you write df.assign(...).query(...).sort_values(...) in one expression instead of multiple separate statements.
df = pd.DataFrame({"temp_c": [0, 20, 30]})
# This doesn't work - square-bracket assignment is a statement, so there's nothing to chain .query() onto
# df['temp_f'] = df['temp_c'] * 9/5 + 32.query('temp_f > 50')
# You need separate statements instead
df['temp_f'] = df['temp_c'] * 9/5 + 32
df = df.query('temp_f > 50')
df
| | temp_c | temp_f |
|---|---|---|
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
Using assign solves the chaining problem by returning a new DataFrame instead of modifying in-place:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
df.assign(temp_f=lambda x: x['temp_c'] * 9/5 + 32)
.query('temp_f > 50')
)
df
| | temp_c | temp_f |
|---|---|---|
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
| 3 | 100 | 212.0 |
This works for chaining but relies on lambda functions. Lambdas look up the variables they capture when they are called, not when they are defined, which can cause subtle bugs:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = lambda df: df['x'] * factor
df = df.assign(**results)
df
| | x | x_times_10 | x_times_20 | x_times_30 |
|---|---|---|---|---|
| 0 | 1 | 30 | 30 | 30 |
| 1 | 2 | 60 | 60 | 60 |
| 2 | 3 | 90 | 90 | 90 |
What went wrong: We expected x_times_10 to multiply by 10, x_times_20 by 20, and x_times_30 by 30. Instead, all three columns multiply by 30.
Why: Lambdas don’t save values, they save variable names. All three lambdas point to the same variable factor. After the loop ends, factor = 30. When assign() executes the lambdas, they all read factor and get 30.
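For completeness, the classic Python workaround is to bind the loop variable as a default argument, which freezes its current value at the moment the lambda is defined; a minimal sketch:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    # factor=factor captures the current value instead of the variable name
    results[f'x_times_{factor}'] = lambda d, factor=factor: d['x'] * factor
df = df.assign(**results)
df
This works, but it's easy to forget and clutters every lambda. pandas 3.0 offers a cleaner way.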
The pandas 3.0 Solution: pd.col
pandas 3.0 introduces pd.col, which lets you reference columns without lambda functions. The syntax is borrowed from PySpark and Polars.
Here’s the temp_f conversion rewritten with pd.col:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
df
| | temp_c | temp_f |
|---|---|---|
| 0 | 0 | 32.0 |
| 1 | 20 | 68.0 |
| 2 | 30 | 86.0 |
| 3 | 100 | 212.0 |
Unlike square-bracket notation, pd.col supports method-chaining. And unlike a lambda, a pd.col expression evaluates ordinary Python variables at the moment it is built, so it sidesteps the late-binding bug shown earlier.
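Because the expression is just a value you pass to assign, the assign-and-filter pipeline from earlier collapses into a single chain; a small sketch reusing the temp_f example:
df = pd.DataFrame({"temp_c": [0, 20, 30, 100]})
df = (
    df.assign(temp_f=pd.col('temp_c') * 9/5 + 32)
      .query('temp_f > 50')
)
df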
Remember the lambda scoping bug? With pd.col, each multiplier is captured correctly:
df = pd.DataFrame({"x": [1, 2, 3]})
results = {}
for factor in [10, 20, 30]:
    results[f'x_times_{factor}'] = pd.col('x') * factor
df = df.assign(**results)
df
| | x | x_times_10 | x_times_20 | x_times_30 |
|---|---|---|---|---|
| 0 | 1 | 10 | 20 | 30 |
| 1 | 2 | 20 | 40 | 60 |
| 2 | 3 | 30 | 60 | 90 |
Filtering with Expressions
Traditional filtering repeats df twice:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[df['temp_c'] >= 0] # df appears twice
df
| | temp_c |
|---|---|
| 1 | 0 |
| 2 | 15 |
| 3 | 25 |
| 4 | 30 |
With pd.col, you reference the column directly:
df = pd.DataFrame({"temp_c": [-10, 0, 15, 25, 30]})
df = df.loc[pd.col('temp_c') >= 0] # cleaner
df
| | temp_c |
|---|---|
| 1 | 0 |
| 2 | 15 |
| 3 | 25 |
| 4 | 30 |
Combining Multiple Columns
With lambdas, you need to repeat lambda x: x[...] for every column:
df = pd.DataFrame({
"price": [100, 200, 150],
"quantity": [2, 3, 4]
})
df = df.assign(
total=lambda x: x["price"] * x["quantity"],
discounted=lambda x: x["price"] * x["quantity"] * 0.9
)
df
| | price | quantity | total | discounted |
|---|---|---|---|---|
| 0 | 100 | 2 | 200 | 180.0 |
| 1 | 200 | 3 | 600 | 540.0 |
| 2 | 150 | 4 | 600 | 540.0 |
With pd.col, the same logic is more readable:
df = pd.DataFrame({
"price": [100, 200, 150],
"quantity": [2, 3, 4]
})
df = df.assign(
total=pd.col("price") * pd.col("quantity"),
discounted=pd.col("price") * pd.col("quantity") * 0.9
)
df
| | price | quantity | total | discounted |
|---|---|---|---|---|
| 0 | 100 | 2 | 200 | 180.0 |
| 1 | 200 | 3 | 600 | 540.0 |
| 2 | 150 | 4 | 600 | 540.0 |
Note that, unlike Polars and PySpark, pd.col cannot yet be used in groupby operations:
# This works in Polars: df.group_by("category").agg(pl.col("value").mean())
# But this doesn't work in pandas 3.0:
# df.groupby("category").agg(pd.col("value").mean())  # Not supported yet
This limitation may be removed in future versions.
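In the meantime, pandas' own named aggregation covers the same case without pd.col; a minimal sketch with a made-up category/value DataFrame:
df = pd.DataFrame({"category": ["a", "a", "b"], "value": [10, 20, 30]})
# Named aggregation works today, no pd.col needed
df.groupby("category").agg(value_mean=("value", "mean"))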
Copy-on-Write Is Now the Default
If you’ve used pandas, you’ve probably seen the SettingWithCopyWarning at some point. It appears when pandas can’t tell if you’re modifying a view or a copy of your data:
# This pattern caused confusion in pandas < 3.0
df2 = df[df["value"] > 10]
df2["status"] = "high" # SettingWithCopyWarning!
Did this modify df or just df2? The answer depends on whether df2 is a view or a copy, and pandas can’t always predict which one it created. That’s what the warning is telling you.
pandas 3.0 makes the answer simple: filtering with df[...] always returns a copy. Modifying df2 never affects df.
This is called Copy-on-Write (CoW). Wherever possible, pandas lets df2 share its underlying data with df as long as you only read it; a separate copy is made only at the moment one of them is modified.
Now when you filter and modify, there’s no warning and no uncertainty:
df = pd.DataFrame({"value": [5, 15, 25], "status": ["low", "low", "low"]})
# pandas 3.0: just works, no warning
df2 = df[df["value"] > 10]
df2["status"] = "high" # Modifies df2 only, not df
df2
| | value | status |
|---|---|---|
| 1 | 15 | high |
| 2 | 25 | high |
df
| | value | status |
|---|---|---|
| 0 | 5 | low |
| 1 | 15 | low |
| 2 | 25 | low |
We can see that df is unchanged and no warning was raised.
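The lazy-copy behavior is easiest to see with a single column: the selected Series shares its data with the parent DataFrame until the moment you write to it; a small sketch:
df = pd.DataFrame({"value": [5, 15, 25]})
ser = df["value"]       # shares data with df for now
ser.iloc[0] = 999       # triggers the copy; df is untouched
print(df["value"].tolist())  # [5, 15, 25]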
Breaking Change: Chained Assignment
One pattern that breaks is chained assignment. With CoW, df["foo"] is a copy, so assigning to it only modifies the copy and doesn’t modify the original:
# This NO LONGER modifies df in pandas 3.0:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
df["foo"][df["bar"] > 5] = 100
df
| | foo | bar |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 6 |
| 2 | 3 | 8 |
Notice foo still contains [1, 2, 3]. This is because the value 100 was assigned to a copy that was immediately discarded.
Use .loc instead to modify the original DataFrame:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 6, 8]})
df.loc[df["bar"] > 5, "foo"] = 100
df
| | foo | bar |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 100 | 6 |
| 2 | 100 | 8 |
A Dedicated String Dtype
pandas 2.x stores strings as object dtype, which is both slow and ambiguous. You can’t tell from the dtype alone whether a column is purely strings:
pd.options.future.infer_string = False # pandas 2.x behavior
text = pd.Series(["hello", "world"])
messy = pd.Series(["hello", 42, {"key": "value"}])
print(f"text dtype: {text.dtype}")
print(f"messy dtype: {messy.dtype}")
text dtype: object
messy dtype: object
pandas 3.0 introduces a dedicated str dtype that only holds strings, making the type immediately clear:
pd.options.future.infer_string = True # pandas 3.0 behavior
ser = pd.Series(["a", "b", "c"])
print(f"dtype: {ser.dtype}")
dtype: str
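The flip side is that only genuinely string data gets the new dtype: mixed data still falls back to object, so seeing str really does tell you the column is all strings. A quick sketch:
pure = pd.Series(["a", "b", "c"])
mixed = pd.Series(["a", 42])
print(pure.dtype)   # str
print(mixed.dtype)  # object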
Performance Gains
The new string dtype is backed by PyArrow (if installed), which provides significant performance improvements:
- String operations run 5-10x faster because PyArrow processes data in contiguous memory blocks instead of individual Python objects
- Memory usage reduced by up to 50% since strings are stored in a compact binary format rather than as Python objects with overhead
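If you want to sanity-check the memory difference on your own data, compare the same values stored as object versus letting pandas 3.0 infer the new dtype; a rough sketch (exact numbers depend on your data and PyArrow version):
words = ["hello", "world", "pandas"] * 100_000
as_object = pd.Series(words, dtype=object)  # pandas 2.x-style storage
as_str = pd.Series(words)                   # inferred str dtype in pandas 3.0
print(as_object.memory_usage(deep=True))
print(as_str.memory_usage(deep=True))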
Arrow Ecosystem Interoperability
DataFrames can be passed to Arrow-based tools like Polars and DuckDB without copying or converting data:
import polars as pl
pandas_df = pd.DataFrame({"name": ["alice", "bob", "charlie"]})
polars_df = pl.from_pandas(pandas_df) # Zero-copy - data already in Arrow format
polars_df
| | name |
|---|---|
| 0 | alice |
| 1 | bob |
| 2 | charlie |
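DuckDB works the same way: it can scan a pandas DataFrame in place, referring to it by its variable name. A short sketch, assuming duckdb is installed:
import duckdb
pandas_df = pd.DataFrame({"name": ["alice", "bob", "charlie"]})
duckdb.sql("SELECT name FROM pandas_df WHERE name LIKE 'a%'").df()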
Final Thoughts
pandas 3.0 brings meaningful improvements to your daily workflow:
- Write cleaner code with pd.col expressions instead of lambdas
- Avoid SettingWithCopyWarning confusion with Copy-on-Write as the default
- Get 5-10x faster string operations with the new PyArrow-backed str dtype
- Pass DataFrames to Polars and DuckDB without data conversion
You can test these features in pandas 2.3 before upgrading by enabling the future flags:
import pandas as pd
# Enable PyArrow-backed strings
pd.options.future.infer_string = True
# Enable Copy-on-Write behavior
pd.options.mode.copy_on_write = True
Fix any deprecation warnings that appear, and you’ll be ready for 3.0.
The expressions section was inspired by a blog post contributed by Marco Gorelli, Senior Software Engineer at Quansight Labs.
Related Resources
For more on DataFrame tools and performance optimization:
- Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames – Compare pandas with Polars for performance-critical workflows
- Scaling Pandas Workflows with PySpark’s Pandas API – Use familiar pandas syntax on distributed data
- pandas vs Polars vs DuckDB: A Data Scientist’s Guide – Choose the right tool for your data analysis needs
📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →