Pandas Archives

Streamlining Data Transformations with Pandas’ pipe and assign Methods

To streamline complex data transformations and create new columns in a chainable manner, use pandas’ pipe and assign methods.

df.pipe is more generic and can handle a broader range of operations, while df.assign is tailored for creating or modifying columns.
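As a rough illustration of how the two methods chain together, here is a minimal sketch. The DataFrame, the column names, and the helper functions `drop_missing_prices` and `add_tax` are made up for this example and are not from the original post.

```python
import pandas as pd


def drop_missing_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only rows that have a price."""
    return df.dropna(subset=["price"])


def add_tax(df: pd.DataFrame, rate: float) -> pd.DataFrame:
    """Hypothetical helper: return a copy with a price_with_tax column."""
    return df.assign(price_with_tax=df["price"] * (1 + rate))


df = pd.DataFrame(
    {"item": ["a", "b", "c"], "price": [10.0, None, 30.0], "qty": [1, 2, 3]}
)

result = (
    df.pipe(drop_missing_prices)  # pipe accepts any function that takes a DataFrame
    .pipe(add_tax, rate=0.1)  # extra arguments are forwarded to the function
    .assign(total=lambda d: d["price_with_tax"] * d["qty"])  # assign adds a column
)
```

Both methods return a new DataFrame, so `df` itself is left unchanged and each step can be reordered or removed without touching the others.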

Specify Datetime Columns with parse_dates

Leave a Comment / Pandas / Khuyen Tran

Use the parse_dates parameter to specify datetime columns when creating a pandas DataFrame from a CSV, rather than converting columns to datetime post-creation. This keeps the code concise and easier to read.
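The difference can be sketched as follows. The CSV content and the column names are invented for this example, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up CSV content used only for illustration
csv_text = "date,value\n2023-01-01,1\n2023-01-02,2\n"

# Option 1: load first, then convert the column in a second step
df_after = pd.read_csv(io.StringIO(csv_text))
df_after["date"] = pd.to_datetime(df_after["date"])

# Option 2: let read_csv parse the column while loading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
```

In both cases the `date` column ends up with a datetime dtype; the second version simply gets there in one call.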

Leverage PyArrow for Efficient Parquet Data Filtering

Leave a Comment / Pandas / Khuyen Tran

When dealing with Parquet files in pandas, it is common to first load the data into a pandas DataFrame and then apply filters.

To improve query execution speed, push down the filters to the PyArrow engine to leverage PyArrow’s processing optimizations.

In the full post’s example, filtering a dataset of 100 million rows using PyArrow is approximately 113 times faster than filtering using pandas.

Enhance Readability in DataFrame Merging with Custom Suffixes

Leave a Comment / Pandas / Khuyen Tran

When merging two DataFrames with overlapping columns, the default behavior is to add suffixes “_x” and “_y” to the column names. To improve readability, you can specify custom suffixes.
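Here is a minimal sketch of the default and custom behavior. The two DataFrames and the suffix names `_df1` and `_df2` are made up for this example.

```python
import pandas as pd

# Two made-up DataFrames that share the non-key column "val"
df1 = pd.DataFrame({"id": [1, 2], "val": [10, 20]})
df2 = pd.DataFrame({"id": [1, 2], "val": [30, 40]})

# Default suffixes: the overlapping column becomes val_x and val_y
default = df1.merge(df2, on="id")

# Custom suffixes: the column names now say where each value came from
custom = df1.merge(df2, on="id", suffixes=("_df1", "_df2"))
```

Here `default` has the columns `id`, `val_x`, `val_y`, while `custom` has `id`, `val_df1`, `val_df2`.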

Optimize Your Pandas Code with Vectorized Operations

Leave a Comment / Pandas / Khuyen Tran

Use pandas’ vectorized operations instead of performing operations on each column individually.

This leverages pandas’ optimized C implementation for better performance, especially with large datasets.
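One way to picture the contrast is the sketch below, which squares every value in a small made-up DataFrame twice: once column by column with a Python function, and once with a single vectorized expression. The data and variable names are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Column by column: a Python-level function is called for every value
squared_loop = df.copy()
for col in df.columns:
    squared_loop[col] = df[col].apply(lambda x: x ** 2)

# Vectorized: one expression over the whole DataFrame
squared = df ** 2
```

Both versions produce the same result; the vectorized one avoids the per-value Python calls, which is where the speedup on large data would come from.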

Efficiently Generate Falsified Examples for Unit Tests with Pandera and Hypothesis

Leave a Comment / Pandas, Testing / Khuyen Tran

Generating readable edge cases for unit tests can often be a challenging task. However, with the combined power of Pandera and Hypothesis, you can efficiently detect falsified examples and write cleaner tests.

Pandera allows you to define constraints for inputs and outputs, while Hypothesis automatically identifies edge cases that match the specified schema.

Hypothesis further simplifies complex examples until it finds a smaller example that still reproduces the issue.

Divide a Large pandas DataFrame into Chunks

Leave a Comment / Code Optimization, Pandas / Khuyen Tran

Large dataframes can consume a significant amount of memory. By processing data in smaller chunks, you can avoid running out of memory and access data faster.

In the full post’s example, using chunksize=100000 is approximately 5495 times faster than not using chunksize.
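The chunked pattern can be sketched like this, assuming the data comes from a CSV read with `pd.read_csv`. The ten-row in-memory CSV and the chunk size of 4 are made up so the example stays small; it does not reproduce the timing quoted in the post.

```python
import io

import pandas as pd

# Made-up CSV with a single column holding the values 0..9
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
chunk_sizes = []

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading every row at once
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total += chunk["value"].sum()
```

Only one chunk is held in memory at a time, so the same loop works for files that are too large to load in one go.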

Align Pandas Objects for Effective Data Manipulation

Leave a Comment / Pandas / Khuyen Tran

To perform operations between two pandas objects, it’s often necessary to ensure that both have the same row or column labels.

The df.align method allows you to align two pandas objects along specified axes.
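As a small sketch of what alignment does, the example below aligns two made-up DataFrames on their row labels. The data, the outer join, and the `fill_value=0` choice are assumptions for illustration.

```python
import pandas as pd

# Two made-up DataFrames whose row labels only partly overlap
df1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
df2 = pd.DataFrame({"a": [10, 20]}, index=["y", "z"])

# Align on the rows (axis=0), keep the union of labels,
# and fill the labels missing on either side with 0
left, right = df1.align(df2, join="outer", axis=0, fill_value=0)

combined = left + right
```

After alignment both objects share the index `x`, `y`, `z`, so the addition lines up label by label without producing missing values.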

tqdm: Add Progress Bar to Your Pandas Apply

Leave a Comment / Pandas / Khuyen Tran

To stay informed about the progress of a pandas apply operation, use tqdm.
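A minimal sketch of the pattern, assuming the `tqdm` package is installed. The DataFrame, the column names, and the bar description are made up for this example.

```python
import pandas as pd
from tqdm import tqdm

# Register progress_apply on pandas objects; desc is an optional label for the bar
tqdm.pandas(desc="Squaring")

df = pd.DataFrame({"a": range(5)})

# Same as apply, but a progress bar is printed while the function runs
df["a_squared"] = df["a"].progress_apply(lambda x: x ** 2)
```

The result is identical to what `apply` would return; only the progress reporting is added.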

Highlight Your pandas DataFrame for Easier Analysis

Leave a Comment / Pandas / Khuyen Tran

Have you ever wanted to highlight your pandas DataFrame for easier analysis? For example, you might want positive values in green and negative ones in red.

You can do that with df.style.apply.
