Manage Data

Building a High-Performance Data Stack with Polars and Delta Lake

Polars is a DataFrame library written in Rust that has blazing-fast performance. Delta Lake has helpful features including ACID transactions, time travel, schema enforcement, and more.

Combining these two tools gives you an exceptionally powerful and efficient stack for data processing and analysis.

To read a Delta table into a Polars DataFrame, use polars.read_delta.
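
A minimal sketch, assuming a local table path named "delta_table" and the deltalake package installed alongside Polars:

import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Write the DataFrame out as a Delta table
df.write_delta("delta_table")

# Read the Delta table back into a Polars DataFrame
df_delta = pl.read_delta("delta_table")
print(df_delta)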


Ensure Pandas’ Data Integrity with Delta Lake Constraints

Managing data integrity and business rules in pandas DataFrames is often difficult, especially with large datasets or multiple contributors. This can lead to inconsistent or invalid data.

Delta Lake’s constraint feature solves this by enabling table-level rule definition and enforcement. This ensures only data meeting specific criteria can be added to the table.

Let’s look at a practical example:

import pandas as pd
from deltalake.writer import write_deltalake
from deltalake import DeltaTable

# Define the path for our Delta Lake table
table_path = "delta_lake"

# Create an initial DataFrame
df1 = pd.DataFrame(
    [
        (1, "John", 5000),
        (2, "Jane", 6000),
    ],
    columns=["employee_id", "employee_name", "salary"],
)

# Write the initial data to Delta Lake
write_deltalake(table_path, df1)

# View the initial data
df1

   employee_id employee_name  salary
0            1          John    5000
1            2          Jane    6000

Now, let’s add a constraint to ensure all salaries are positive:

table = DeltaTable(table_path)
table.alter.add_constraint({"salary_gt_0": "salary > 0"})

With this constraint in place, let’s try to add a new record with a negative salary:

df2 = pd.DataFrame(
    [(3, "Alex", -200)],
    columns=["employee_id", "employee_name", "salary"],
)

write_deltalake(table, df2, mode="append", engine="rust")

Running this code results in an error:

DeltaProtocolError: Invariant violations: ["Check or Invariant (salary > 0) violated by value in row: [3, Alex, -200]"]

As we can see, the constraint we added prevented the insertion of invalid data. This is incredibly powerful because it:

Enforces data integrity at the table level.

Prevents accidental insertion of invalid data.

Maintains consistency across all operations on the table.

Link to delta-rs.

Exploring Google Trends with Pytrends API

Google Trends provides valuable insights into search patterns and public interest over time. Pytrends, an unofficial API for Google Trends, offers a simple way to access and analyze this data programmatically.

To track a keyword’s trend on Google Search, you can use pytrends. Here’s an example that shows the interest in “data science” from 2019 to 2024:

from pytrends.request import TrendReq

# Connect to Google Trends
pytrends = TrendReq(hl="en-US", tz=360)

# Build the request payload for "data science" between 2019 and 2024
pytrends.build_payload(kw_list=["data science"], timeframe="2019-01-01 2024-01-01")

# Retrieve interest over time and plot it
df = pytrends.interest_over_time()
df["data science"].plot(figsize=(20, 7))

Pytrends allows you to easily access various types of trend data, including interest by region, related topics, and queries, making it a powerful tool for researchers, marketers, and data analysts.

Link to pytrends.

Syft: Sensitive Data Collaboration Made Secure

Data owners often hesitate to share sensitive data due to risks like privacy breaches, IP theft, and blackmail, hindering important work that could benefit society.

Syft enables Data Scientists to ask questions and receive answers without accessing the actual dataset. Data Owners can establish robust privacy controls, enabling collaboration while protecting sensitive information.

Link to Syft.

The Lakehouse Model: Bridging the Gap Between Data Lakes and Warehouses

First-generation data warehouses excelled with structured data and BI tasks but had limited support for unstructured data and were costly to scale up.

Second-generation data lakes offered scalable storage for diverse data but lacked key management features, such as ACID transactions and data versioning.

Databricks’ Lakehouse architecture combines the strengths of lakes and warehouses, including:

Supporting various data types, suitable for data science and machine learning.

Enhancing management features such as ACID transactions and data versioning.

Using cost-effective object storage, like Amazon S3, with formats like Parquet.

Maintaining data integrity via a metadata layer.

Learn more about Data Lakehouse Architecture.

Delta Lake: Ensuring Schema Consistency for Clean Data

A data lake allows for flexible storage, but its schema is inferred dynamically at read time, which can lead to corrupted data or downstream issues.

Delta Lake, on the other hand, enforces schema consistency throughout the data pipeline, ensuring that all data written to a table matches the table’s predefined schema.

This results in a clean and ready-to-use data set.
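
As a minimal sketch (the table path and column names here are hypothetical), appending data whose schema does not match the table's schema raises an error instead of silently corrupting the table:

import pandas as pd
from deltalake.writer import write_deltalake

table_path = "schema_demo"

# Create the table with an integer salary column
df1 = pd.DataFrame({"employee_id": [1, 2], "salary": [5000, 6000]})
write_deltalake(table_path, df1)

# Appending a DataFrame with a mismatched schema (salary as strings) fails
df2 = pd.DataFrame({"employee_id": [3], "salary": ["six thousand"]})
try:
    write_deltalake(table_path, df2, mode="append")
except Exception as error:
    print(f"Write rejected by schema enforcement: {error}")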


Enhance Query Efficiency with Z Order in Delta Lake

Z Order in Delta Lake organizes data in storage to minimize the amount of data that needs to be scanned for certain queries, improving query performance.

For example, without Z Order optimization, a query may have to scan through 8 separate files to find rows where id = 5. With Z Order optimization, the same query only needs to scan one file to locate the desired rows.
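
A minimal sketch with the delta-rs Python API, assuming an existing Delta table at "delta_table" with an id column:

from deltalake import DeltaTable

table = DeltaTable("delta_table")

# Rewrite the table files so rows with similar "id" values are stored together,
# reducing the number of files a query filtering on "id" has to scan
table.optimize.z_order(["id"])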


Version Your Pandas DataFrame with Delta Lake

To undo errors, avoid losing data, and reproduce results, it is crucial to implement a version control system for your data.

Delta Lake simplifies pandas DataFrame versioning and allows access to prior versions for auditing and debugging.

For example, after an initial write followed by an append, Delta Lake keeps two versions of the DataFrame: version 0 contains the original data, while version 1 includes the data that was appended, as in the sketch below.
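
A minimal sketch, assuming a local table path named "delta_table":

import pandas as pd
from deltalake import DeltaTable
from deltalake.writer import write_deltalake

table_path = "delta_table"

# Version 0: write the original data
write_deltalake(table_path, pd.DataFrame({"id": [1, 2]}))

# Version 1: append new data
write_deltalake(table_path, pd.DataFrame({"id": [3]}), mode="append")

# Load a specific version for auditing or debugging
v0 = DeltaTable(table_path, version=0).to_pandas()  # original data only
v1 = DeltaTable(table_path, version=1).to_pandas()  # original plus appended data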


Optimize Query Speed with Data Partitioning

Partitioning data allows queries to target specific segments rather than scanning the entire table, which speeds up data retrieval.

Delta Lake makes it easy to load only selected partitions into a pandas DataFrame. Partitioned data loading is approximately 24.5 times faster than loading the complete dataset and then querying a particular subset.
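
A minimal sketch, assuming a hypothetical table partitioned by a country column:

import pandas as pd
from deltalake import DeltaTable
from deltalake.writer import write_deltalake

table_path = "partitioned_table"

df = pd.DataFrame({"country": ["US", "US", "VN"], "value": [1, 2, 3]})

# Write the table partitioned by the "country" column
write_deltalake(table_path, df, partition_by=["country"])

# Load only the "US" partition instead of scanning the entire table
us_only = DeltaTable(table_path).to_pandas(partitions=[("country", "=", "US")])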

