
Building a High-Performance Data Stack with Polars and Delta Lake

What is Polars?

Polars is a Rust-based DataFrame library designed for high-performance data manipulation and analysis. It uses SIMD instructions, parallelization, and caching to run common operations dramatically faster than single-threaded DataFrame libraries such as pandas.
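
As a rough sketch (not from the article) of the expression-based, lazy API that enables this parallelism, each aggregation below is an independent expression that Polars can evaluate on multiple cores:

import polars as pl

# Small illustrative DataFrame; the column names are arbitrary
df = pl.DataFrame({
    "Cat1": ["A", "B", "A", "C", "B"],
    "Num1": [10, 20, 30, 40, 50],
})

result = (
    df.lazy()  # build a query plan instead of executing eagerly
    .group_by("Cat1")
    .agg(
        pl.col("Num1").sum().alias("total"),  # independent expressions can run in parallel
        pl.col("Num1").mean().alias("average"),
    )
    .collect()  # optimize and execute the plan
)
print(result)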

What is Delta Lake?

Delta Lake is a storage format that offers a range of benefits, including ACID transactions, time travel, schema enforcement, and more. It’s designed to work seamlessly with big data processing engines like Apache Spark and can handle large amounts of data with ease.
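
As a rough sketch of what schema enforcement and versioning look like with the deltalake Python package (the tmp/schema_demo path and column names here are only for illustration):

import pandas as pd
from deltalake import DeltaTable, write_deltalake

demo_path = "tmp/schema_demo"  # hypothetical path for this sketch
write_deltalake(demo_path, pd.DataFrame({"Cat1": ["A"], "Num1": [1]}))

# Schema enforcement: appending data whose schema doesn't match is rejected
try:
    write_deltalake(demo_path, pd.DataFrame({"Wrong": [1.5]}), mode="append")
except Exception as err:
    print(f"Append rejected: {err}")

# Every committed write becomes a new table version;
# the failed append above never committed, so the table is still at version 0
print(DeltaTable(demo_path).version())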

How Do Polars and Delta Lake Work Together?

When you combine Polars and Delta Lake, you get a powerful data processing system. Polars does the heavy lifting of processing your data, while Delta Lake keeps everything organized and up-to-date.

Imagine you have a huge dataset with millions of rows. You want to group the data by category and calculate the sum of a certain column. With Polars and Delta Lake, you can do this quickly and easily.

First, you create a sample dataset:

import pandas as pd
import numpy as np

# Create a sample dataset
num_rows = 10_000_000
data = {
    'Cat1': np.random.choice(['A', 'B', 'C'], size=num_rows),
    'Num1': np.random.randint(low=1, high=100, size=num_rows)
}

df = pd.DataFrame(data)
df.head()
  Cat1  Num1
0    A    84
1    C    63
2    B    11
3    A    73
4    B    57

Next, you save the dataset to Delta Lake:

from deltalake.writer import write_deltalake

save_path = "tmp/data"

write_deltalake(save_path, df)
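
If you want to confirm the write succeeded, one option (not part of the original walkthrough) is to open the table directly with DeltaTable:

from deltalake import DeltaTable

dt = DeltaTable(save_path)
print(dt.version())  # 0 -- the first committed write
print(dt.schema())   # the column names and types Delta Lake will enforce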

Then, you can use Polars to read the data from Delta Lake and perform the grouping operation:

import polars as pl

pl_df = pl.read_delta(save_path)
pl_df.group_by("Cat1").sum()
shape: (3, 2)
┌──────┬───────────┐
│ Cat1 ┆ Num1      │
│ ---  ┆ ---       │
│ str  ┆ i64       │
╞══════╪═══════════╡
│ B    ┆ 166653474 │
│ C    ┆ 166660653 │
│ A    ┆ 166597835 │
└──────┴───────────┘
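
For tables that don't comfortably fit in memory, a lazy variant is worth considering: pl.scan_delta builds a query plan instead of loading the table up front, so Polars can apply optimizations such as projection pushdown before reading any data. A sketch:

lazy_result = (
    pl.scan_delta(save_path)  # no data is read yet
    .group_by("Cat1")
    .agg(pl.col("Num1").sum())  # only Cat1 and Num1 need to be read
    .collect()  # optimize the plan, then execute it
)
print(lazy_result)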

Time Travel

One of Delta Lake’s best features is time travel: every write creates a new version of the table, so you can easily go back to a previous version of your data whenever you need to.

Let’s say you want to append some new data to the existing dataset:

new_data = pd.DataFrame({"Cat1": ["B", "C"], "Num1": [2, 3]})

write_deltalake(save_path, new_data, mode="append")

Now, you can use Polars to read the updated data from Delta Lake:

updated_pl_df = pl.read_delta(save_path)
updated_pl_df.tail()
shape: (5, 2)
┌──────┬──────┐
│ Cat1 ┆ Num1 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 29   │
│ A    ┆ 41   │
│ A    ┆ 49   │
│ B    ┆ 2    │
│ C    ┆ 3    │
└──────┴──────┘

But what if you want to go back to the previous version of the data? With Delta Lake, you can easily do that by specifying the version number:

previous_pl_df = pl.read_delta(save_path, version=0)
previous_pl_df.tail()
shape: (5, 2)
┌──────┬──────┐
│ Cat1 ┆ Num1 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 90   │
│ C    ┆ 83   │
│ A    ┆ 29   │
│ A    ┆ 41   │
│ A    ┆ 49   │
└──────┴──────┘

This gives you the original dataset from before the new data was appended.
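
If you’re not sure which version number you need, one way (using the deltalake package rather than Polars) is to list the table’s commit history; the exact keys in each entry depend on your deltalake version, so treat this as a sketch:

from deltalake import DeltaTable

# Each commit corresponds to a version you can pass to pl.read_delta
for commit in DeltaTable(save_path).history():
    print(commit.get("version"), commit.get("operation"))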

Want the full walkthrough?

Check out our in-depth guide on Polars vs Pandas: A Fast, Multi-Core Alternative for DataFrames
