Pandas Archives

Efficient String Data Handling in pandas 2.0 with PyArrow Arrays

Pandas, Python Tips / Khuyen Tran

As of pandas 2.0, data in pandas can be stored in PyArrow arrays in addition to NumPy arrays. PyArrow arrays provide a wider range of data types than NumPy.

One significant advantage of PyArrow arrays is their string data type, which is faster and more memory-efficient than storing strings with the object dtype.

Enhancing Data Handling with scikit-learn’s DataFrame Support

Feature Engineer, Pandas / Khuyen Tran

By default, scikit-learn transformers return a NumPy array. This can pose a challenge if a pandas DataFrame is required for subsequent data processing steps.

Luckily, as of scikit-learn version 1.2, you can use the set_output method to obtain the results as a pandas DataFrame.

This method is not limited to individual transformers but can also be applied within a scikit-learn pipeline.

Read HTML Tables Using Pandas

Pandas / Khuyen Tran

If you want to quickly extract a table from a web page and turn it into a pandas DataFrame, use pd.read_html. The full post's example extracts a table from a Wikipedia page in one line of code.
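
The sketch below shows the call on a small inline HTML snippet rather than a live Wikipedia URL, so it runs without network access. The table contents are made up, and it assumes an HTML parser such as lxml is installed. In practice you would pass the page URL to pd.read_html instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML standing in for a downloaded web page.
html = """
<table>
  <thead>
    <tr><th>Country</th><th>Population</th></tr>
  </thead>
  <tbody>
    <tr><td>Iceland</td><td>380000</td></tr>
    <tr><td>Malta</td><td>520000</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))
population_df = tables[0]

print(population_df)
```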

Say Goodbye to Data Type Conversion in pandas 2.0

Pandas / Khuyen Tran

Previously in pandas, if an integer Series had missing values, its data type would be converted to float, resulting in a potential loss of precision for the original data.

With the integration of Apache Arrow in pandas 2.0, this issue is solved.

pandarallel: A Simple Tool to Parallelize Pandas Operations

Code Optimization, Pandas / Khuyen Tran

If you want to parallelize your Pandas operations on all available CPUs by adding only one line of code, try pandarallel.

Streamlining Data Transformations with Pandas’ pipe and assign Methods

Pandas / Khuyen Tran

To streamline complex data transformations and create new columns in a chainable manner, use pandas’ pipe and assign methods.

df.pipe is more generic and can handle a broader range of operations, while df.assign is tailored for creating or modifying columns.
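
A short sketch of the two methods in one chain is shown below. The orders data and the drop_missing helper are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical order data with one missing price.
orders = pd.DataFrame({"price": [10.0, None, 30.0], "quantity": [2, 1, 3]})


def drop_missing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: drop rows where `column` is missing."""
    return df.dropna(subset=[column])


result = (
    orders
    # pipe passes the DataFrame to any function that returns a DataFrame.
    .pipe(drop_missing, column="price")
    # assign adds a column computed from the intermediate result.
    .assign(total=lambda d: d["price"] * d["quantity"])
)

print(result)
```

Neither method modifies orders in place, so the original DataFrame is still available after the chain.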

Specify Datetime Columns with parse_dates

Pandas / Khuyen Tran

Use the parse_dates parameter to specify datetime columns when creating a pandas DataFrame from a CSV, rather than converting columns to datetime post-creation. This keeps the code concise and easier to read.
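
For instance, with a small in-memory CSV (the column names and values here are invented for the example), the date column arrives as a datetime dtype straight from read_csv:

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_data = StringIO(
    "order_date,amount\n"
    "2023-01-15,100\n"
    "2023-02-20,250\n"
)

# parse_dates converts the listed columns while reading,
# so no separate pd.to_datetime call is needed afterwards.
df = pd.read_csv(csv_data, parse_dates=["order_date"])

print(df.dtypes)
```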

Leverage PyArrow for Efficient Parquet Data Filtering

Pandas / Khuyen Tran

When dealing with Parquet files in pandas, it is common to first load the data into a pandas DataFrame and then apply filters.

To improve query execution speed, push down the filters to the PyArrow engine to leverage PyArrow’s processing optimizations.

In the full post's example, filtering a dataset of 100 million rows using PyArrow is approximately 113 times faster than filtering using pandas.

Enhance Readability in DataFrame Merging with Custom Suffixes

Pandas / Khuyen Tran

When merging two DataFrames with overlapping columns, the default behavior is to add suffixes “_x” and “_y” to the column names. To improve readability, you can specify custom suffixes.
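
A minimal sketch with two hypothetical yearly sales tables shows the default column names next to the custom ones:

```python
import pandas as pd

# Hypothetical sales tables that share the "sales" column name.
sales_2022 = pd.DataFrame({"product": ["A", "B"], "sales": [100, 200]})
sales_2023 = pd.DataFrame({"product": ["A", "B"], "sales": [150, 250]})

# Default suffixes: the overlapping column becomes sales_x and sales_y.
default_merge = sales_2022.merge(sales_2023, on="product")

# Custom suffixes make it clear which table each column came from.
labeled_merge = sales_2022.merge(
    sales_2023, on="product", suffixes=("_2022", "_2023")
)

print(list(default_merge.columns))
print(list(labeled_merge.columns))
```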
