Manage Data

Efficient Data Updates and Scanning with Delta Lake

Every time new data is appended to an existing Delta table, a new Parquet file is generated. This allows data to be ingested incrementally without having to rewrite the entire dataset.

As small files accumulate, scans slow down because each read has to open more files. The compact function merges small files into larger ones, improving scan performance.

Combining incremental processing with the compact function enables efficient data updates and scans as your dataset expands.

The Best Way to Append Mismatched Data to Parquet Tables

Appending mismatched data to a Parquet table involves reading the existing data, concatenating it with the new data, and overwriting the existing Parquet file.

This approach can be expensive and may lead to schema inconsistencies.

With Delta Lake, you can append DataFrames with extra columns while preserving the table’s existing schema.

Efficient Data Appending in Parquet Files: Delta Lake vs. Pandas

Appending data to an existing Parquet file using pandas involves loading the existing table into memory, concatenating the new data with it, and writing the result back.

This process can be time-consuming and memory-intensive.

With Delta Lake, you can add, remove, or modify columns without the need to recreate the entire table.

Simplify Table Merge Operations with Delta Lake

Merging two datasets while performing both insert and update operations can be a complex task. Delta Lake makes it easy to apply several data manipulation rules in a single merge operation.

The following code demonstrates merging two datasets using Delta Lake:

✔️ If a match is found, the last_talk value in people_table is updated with the corresponding value from new_df 

✖️ If a match is not found and the last_talk value in people_table is older than 30 days, the status column is updated to ‘rejected’.

Fluke: The Easiest Way to Move Data Around

Data scientists often need to transfer data between locations, such as from a remote server to cloud storage. However, many Python libraries require a lot of boilerplate code to handle HTTP/SSH connections and iterate over directories.

This can be cumbersome for those who want to transfer files easily. Fluke offers a simple API that allows users to interact with remote data in a few lines of code. 

Link to Fluke.

DVC: A Data Version Control Tool for your Data Science Projects

Git is a powerful tool for moving back and forth between versions of your code. Is there a way to version your data as well?
That is where DVC comes in handy. With DVC, you keep the information about different versions of your data in Git while storing the original data somewhere else.
It is essentially Git for data. The code above shows how to use DVC.
Find step-by-step instructions on how to use DVC in my article.


    Work with Khuyen Tran