Pandas Archives

Version Your Pandas DataFrame with Delta Lake

Delta Lake, Manage Data, Pandas / Khuyen Tran

To undo errors, avoid losing data, and reproduce results, it is crucial to implement a version control system for your data.

Delta Lake simplifies pandas DataFrame versioning and allows access to prior versions for auditing and debugging.

For example, writing a DataFrame to a Delta table and then appending to it creates two versions. Version 0 contains the original data, while Version 1 also includes the appended data.

Apply Multiple Functions to a DataFrame with Pipe

Pandas / Khuyen Tran

To increase code readability when applying multiple functions to a DataFrame, use the pandas.DataFrame.pipe method.
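A short sketch of how pipe chains several functions. The helper functions drop_missing, add_total, and keep_above and the sample data are hypothetical, chosen only to show the pattern.

```python
import pandas as pd


def drop_missing(df):
    """Remove rows that contain any missing value."""
    return df.dropna()


def add_total(df, cols):
    """Add a 'total' column holding the row-wise sum of the given columns."""
    return df.assign(total=df[cols].sum(axis=1))


def keep_above(df, col, threshold):
    """Keep only rows where df[col] is greater than threshold."""
    return df[df[col] > threshold]


df = pd.DataFrame({"a": [1, 2, None, 4], "b": [10, 20, 30, 40]})

# Each step reads top to bottom instead of as nested function calls.
result = (
    df.pipe(drop_missing)
    .pipe(add_total, cols=["a", "b"])
    .pipe(keep_above, col="total", threshold=15)
)
print(result)
```

Without pipe, the same logic would be written as keep_above(add_total(drop_missing(df), ...), ...), which has to be read inside out.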

Polars vs. Pandas for CSV Loading and Filtering

Pandas, Polars / Khuyen Tran

The read_csv function in pandas loads every row of the dataset into a DataFrame before any filter can remove the unwanted rows.

In contrast, the scan_csv function in Polars builds a lazy query: Polars can optimize the whole operation and only executes it when the collect method is called.

This approach accelerates code execution, particularly when handling large datasets.

Seamless Tracking of Changes in Pandas DataFrame with Delta Lake

Delta Lake, Manage Data, Pandas / Khuyen Tran

Maintaining a consistent record of database changes is crucial for recovering data after system failures and for investigating security breaches.

Delta Lake enables seamless tracking of changes made to a pandas DataFrame, recording details such as creation time, size, and statistics.

Efficient Data Appending in Parquet Files: Delta Lake vs. Pandas

Delta Lake, Manage Data, Pandas / Khuyen Tran

Appending data to an existing Parquet file with pandas involves loading the existing table and merging the new data into it, which can be time-consuming and memory-intensive.

With Delta Lake, you can add, remove, or modify columns without the need to recreate the entire table.

PandasAI: Gain Insights From Your pandas DataFrame With AI

LLM Tools, Pandas / Khuyen Tran

If you want to quickly gain insights from your pandas DataFrame with AI, use PandasAI.

PandasAI serves as:
✅ A tool to analyze your DataFrame
❌ Not a tool to process your DataFrame

Overwrite Partitions of a pandas DataFrame with Delta Lake

Delta Lake, Manage Data, Pandas / Khuyen Tran

If you need to modify a specific subset of your pandas DataFrame, such as yesterday’s data, pandas alone cannot overwrite only that partition. As a workaround, you have to load the entire DataFrame into memory.

Delta Lake makes it easy to overwrite partitions of a pandas DataFrame.

Raise an Exception for a Chained Assignment in pandas

Pandas / Khuyen Tran

Pandas allows chained assignments, which involve performing multiple indexing operations in a single statement, but they can lead to unexpected results or errors.

For example, a statement such as df[df["a"] > 1]["b"] = 0 fails to modify the values in df as intended, yet by default pandas only emits a warning instead of throwing an error.

Setting pd.options.mode.chained_assignment to 'raise' will cause pandas to raise an exception if a chained assignment occurs.
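A minimal sketch of this setting, using a hypothetical DataFrame with columns a and b. It assumes pandas 1.x or 2.x without Copy-on-Write, where the exception is SettingWithCopyError. Because the exception class and the fate of this option differ across pandas versions, the sketch catches a generic Exception and only records its name.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

raised = None
try:
    # option_context applies the setting only inside the with block.
    with pd.option_context("mode.chained_assignment", "raise"):
        # Chained assignment: df[mask] returns a copy, so writing to
        # column "b" of that copy never reaches df.
        df[df["a"] > 1]["b"] = 0
except Exception as exc:  # class name varies by pandas version
    raised = type(exc).__name__

print(raised)            # name of the exception, or None if nothing was raised
print(df["b"].tolist())  # df is unchanged either way

# The reliable way to make this change is a single .loc call.
fixed = df.copy()
fixed.loc[fixed["a"] > 1, "b"] = 0
print(fixed["b"].tolist())
```

Using pd.option_context keeps the stricter behavior scoped to one block; assigning to pd.options.mode.chained_assignment applies it for the rest of the session.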

Include All Rows When Merging Two DataFrames

Pandas / Khuyen Tran

By default, df.merge performs an inner join, which only includes rows with matching key values in both DataFrames. If you want to include all rows from both DataFrames, use how='outer'.
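A small sketch comparing the two join types. The DataFrames left and right and their columns are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Default (how="inner"): only keys present in both DataFrames survive.
inner = left.merge(right, on="key")

# how="outer": every key from either DataFrame is kept, and the
# columns with no match are filled with NaN.
outer = left.merge(right, on="key", how="outer")

print(inner)
print(outer)
```

Here the inner join keeps only the keys b and c, while the outer join also keeps a (missing right_val) and d (missing left_val).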

Optimizing Memory Usage in a pandas DataFrame with infer_objects

Pandas / Khuyen Tran

pandas DataFrames that contain columns of mixed data types are stored in a more general format (such as object), resulting in inefficient memory usage and slower computation times.

df.infer_objects() infers the true data types of columns in a DataFrame, which helps optimize memory usage in your code.

In the post's example, df.infer_objects() converts the data type of “col1” from object to int64, saving approximately 27 MB of memory.
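A tiny sketch of the same idea. The data here is hypothetical and far smaller than in the original post, so the savings are a few bytes rather than megabytes.

```python
import pandas as pd

# A stray string forces the whole column to the generic object dtype.
df = pd.DataFrame({"col1": ["a", 1, 2, 3]})

# After dropping the string, only integers remain, but the dtype is still object.
df = df.iloc[1:]
print(df.dtypes)

# infer_objects() re-examines object columns and picks a more specific dtype.
converted = df.infer_objects()
print(converted.dtypes)

# Compare memory usage, counting the Python objects held by object columns.
before = df.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()
print(before, after)
```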
