
Newsletter Archive

Automated newsletter archive from Klaviyo campaigns


Newsletter #246: Faster Polars Queries with Programmatic Expressions

🤝
COLLABORATION

Build Safer APIs with Buf – Free Workshop

Building APIs is simple. Scaling them across teams and systems isn’t. Ensuring consistency, compatibility, and reliability quickly becomes a challenge as projects grow.

Buf provides a toolkit that makes working with Protocol Buffers faster, safer, and more consistent. Join Buf for a live, one-hour workshop on building safer, more consistent APIs.

When: Nov 19, 2025 • 9 AM PDT | 12 PM EDT | 5 PM BST

What you’ll learn:
How Protobuf makes API development safer and simpler
API design best practices for real-world systems
How to extend Protobuf to data pipelines and streaming systems

Register for the workshop

📅
Today’s Picks

Faster Polars Queries with Programmatic Expressions

Problem:

When you apply similar transformations in a for loop, each Polars with_columns() call runs sequentially. This prevents the optimizer from seeing the full computation plan.

Solution:

Instead, generate all Polars expressions programmatically before applying them together. This enables Polars to:
See the complete computation plan upfront
Optimize across all expressions simultaneously
Parallelize operations across CPU cores

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

itertools.chain: Merge Lists Without Intermediate Copies

Problem:

Standard list merging with extend() or concatenation creates intermediate copies. This memory overhead becomes significant when processing large lists.

Solution:

itertools.chain() lazily merges multiple iterables without creating intermediate lists.
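A quick illustration (the lists are toy data):

```python
from itertools import chain

a = [1, 2, 3]
b = [4, 5]
c = [6]

# chain() yields items lazily from each iterable in turn;
# no combined intermediate list is built unless you ask for one.
merged = chain(a, b, c)
total = sum(merged)          # consume lazily
print(total)                 # 21

# Materialize only when you actually need a list:
print(list(chain(a, b, c)))  # [1, 2, 3, 4, 5, 6]
```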

Full Article:

5 Essential Itertools for Data Science

Run Code

☕️
Weekly Finds

fiftyone

ML

Open-source tool for building high-quality datasets and computer vision models

llama-stack

LLM

Composable building blocks to build Llama Apps with unified API for inference, RAG, agents, and more

grip

Python Utils

Preview GitHub README.md files locally before committing them using GitHub’s markdown API



Newsletter #245: PySpark: Avoid Double Conversions with applyInArrow

📅
Today’s Picks

PySpark: Avoid Double Conversions with applyInArrow

Problem:

applyInPandas lets you apply Pandas functions in PySpark by converting data from Arrow→Pandas→Arrow for each operation. This double conversion adds serialization overhead that slows down your transformations.

Solution:

applyInArrow (introduced in PySpark 4.0.0) works directly with PyArrow tables, eliminating the Pandas conversion step entirely. This keeps data in Arrow’s columnar format throughout the pipeline, making operations significantly faster. Trade-off: PyArrow’s syntax is less intuitive than Pandas, but it’s worth it if you’re processing large datasets where performance matters.

Full Article:

PySpark SQL Complete Guide

Run Code

View GitHub


Related Post

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem:

Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.

Solution:

Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows. As a result, they can be 10 to 100 times faster on large DataFrames.
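A sketch of the vectorized core (the function name and multiplier are illustrative; the Spark wiring assumes a running SparkSession, so it is shown only in comments):

```python
import pandas as pd

def add_markup(prices: pd.Series) -> pd.Series:
    """Vectorized transform applied to a whole batch of rows at once."""
    return prices * 1.5  # illustrative 50% markup

# With a SparkSession, wrap it as a pandas UDF so Spark sends Arrow
# batches to this function instead of looping row by row:
#   from pyspark.sql.functions import pandas_udf
#   add_markup_udf = pandas_udf(add_markup, returnType="double")
#   df.withColumn("price_with_markup", add_markup_udf("price"))

print(add_markup(pd.Series([100.0, 200.0])).tolist())  # [150.0, 300.0]
```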

Full Article:

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Run Code

View GitHub

☕️
Weekly Finds

causal-learn

ML

Python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms

POT

ML

Python Optimal Transport library providing solvers for optimization problems related to signal, image processing and machine learning

qdrant

MLOps

High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI



Newsletter #244: Handle Large Data with Polars Streaming Mode

📅
Today’s Picks

Handle Large Data with Polars Streaming Mode

Problem:

In Polars, the .collect() method executes a lazy query and loads the entire dataset into memory. This works well for smaller data, but once the dataset grows beyond your available RAM, it can easily crash your process.

Solution:

Add engine="streaming" to .collect() to process large datasets in small batches without running out of memory. How it works:
Breaks the dataset into smaller, memory-friendly chunks
Processes one batch at a time while freeing memory as it goes
Combines all partial results into a single DataFrame

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

Build Professional Python Packages with UV --package

Problem:

Python packages turn your code into reusable modules you can share across projects. But building them requires complex setup with setuptools, managing build systems, and understanding distribution mechanics.

Solution:

UV, a fast Python package installer and resolver, reduces the entire process to two simple commands:
uv init --package sets up your package structure instantly
uv build and uv publish create and distribute to PyPI

Learn More:

Production-Ready Data Science: From Prototyping to Production with Python

View GitHub

☕️
Weekly Finds

whenever

Python Utils

Modern datetime library for Python that ensures correct and type-checked datetime manipulations. It is DST-safe and way faster than standard datetime libraries.

lancedb

MLOps

Developer-friendly, embedded retrieval database for AI/ML applications. The ultimate multimodal data platform designed for fast, scalable, and production-ready vector search.

grip

Python Utils

Preview GitHub README.md files locally before committing them. A command-line server that uses GitHub’s Markdown API to render local readme files.



Newsletter #243: Turn Your ML Tests Into Plain English with Behave

📅
Today’s Picks

Turn Your ML Tests Into Plain English with Behave

Problem:

Unit testing matters in data science, but writing tests that business stakeholders can actually understand is a challenge. If they can’t read the tests, they can’t confirm the logic matches business expectations.

Solution:

Behave turns test cases into plain-English specifications using the Given/When/Then format. How to use Behave for readable tests:
Write .feature files with Given/When/Then syntax
Implement steps in Python using @given, @when, @then decorators
Run behave to execute the tests
This lets technical and business teams stay aligned without confusion.

Full Article:

Behave: Write Readable ML Tests with Behavior-Driven Development

Run Code

View GitHub

Build Powerful Data Pipelines with DuckDB + pandas

Problem:

Pandas is great for data cleaning and feature engineering, while SQL excels at complex aggregations. But moving data from pandas to a database and back can be tedious.

Solution:

DuckDB solves this by letting you run SQL directly on pandas DataFrames and return the results back into pandas for further analysis.

Full Article:

A Deep Dive into DuckDB for Data Scientists

Run Code

View GitHub

☕️
Weekly Finds

feast

MLOps

Open source feature store for machine learning that manages existing infrastructure to productionize ML models with fast data consistency and leakage prevention

git-who

Python Utils

CLI tool for industrial-scale git blaming that shows who is responsible for entire components or subsystems in your codebase, not just individual lines

organize

Python Utils

File management automation tool for safe moving, renaming, copying files with conflict resolution, duplicate detection, and Exif tag extraction



Newsletter #242: Build Faster Test Workflows with pytest Markers

📅
Today’s Picks

Build Faster Test Workflows with pytest Markers

Problem:

Large projects often contain hundreds of tests, and executing them all for every minor change quickly becomes inefficient.

Solution:

Pytest markers let you group tests by type, speed, or resource usage so you can run only what matters for your current task. Quick guide to pytest markers:
Define markers in pytest.ini
Tag tests, for example: @pytest.mark.fast
Run specific tests: pytest -m fast
Skip certain tests: pytest -m "not slow"
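A minimal sketch of the steps above (the marker names and the test are illustrative; the pytest.ini registration is shown as a comment):

```python
# pytest.ini
# [pytest]
# markers =
#     fast: quick unit tests
#     slow: long-running tests

import pytest

@pytest.mark.fast
def test_addition():
    assert 1 + 1 == 2

# Run only the fast tests:   pytest -m fast
# Skip the slow ones:        pytest -m "not slow"
```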

Learn More:

Production-Ready Data Science: From Prototyping to Production with Python

Run Code

📢
ANNOUNCEMENTS

Production-Ready Data Science Is Now on Leanpub

I am excited to share that Production-Ready Data Science is now live on Leanpub! On Leanpub, you can choose your price and get updates as more examples and chapters roll out. This book dives into the real engineering skills behind dependable data systems, including:
Testing
CI and CD
Environments and packaging
Data validation and logging
Reproducible workflows
If you want to take your data work beyond notebooks and into reliable production environments, this is for you.

Get the Book

☕️
Weekly Finds

AutoViz

Data Viz

Automatically Visualize any dataset, any size with a single line of code

cognee

LLM

Memory for AI Agents in 6 lines of code

niquests

Python Utils

Simple, yet elegant, Python HTTP library: a drop-in replacement for python-requests



Newsletter #241: Polars: Lazy CSV Loading with Query Optimization

📅
Today’s Picks

Polars: Lazy CSV Loading with Query Optimization

Problem:

Pandas loads entire CSV files into memory immediately, even when you only need filtered or aggregated results. This eager evaluation wastes memory and processing time on data you’ll never use.

Solution:

Polars’ scan_csv() uses lazy evaluation to optimize queries before loading data. How scan_csv() works:
Analyzes your entire query before loading any data
Identifies which columns you actually need
Applies filters while reading the CSV file
Loads only the relevant data into memory

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

Build Structured AI Agents with LangChain TodoList

Problem:

Complex workflows require structured planning. Without it, agents may execute subtasks out of order or miss crucial ones entirely.

Solution:

LangChain v1.0 introduces TodoListMiddleware, which gives agents automatic task planning and progress tracking. Key benefits:
Decomposes complex requests into sequential steps
Marks each task as pending, in_progress, or completed
Ensures agents follow logical execution order

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Run Code

View GitHub

☕️
Weekly Finds

RAGxplorer

LLM

Open-source tool to visualize your RAG embeddings and document chunks

nbQA

Python Utils

Run ruff, isort, pyupgrade, mypy, pylint, flake8, and more on Jupyter Notebooks

prometheus-eval

LLM

Evaluate your LLM’s response with specialized language models for reproducible assessment



Newsletter #240: Auto-Summarize Chat History with LangChain Middleware

📅
Today’s Picks

Auto-Summarize Chat History with LangChain Middleware

Problem:

Long chat histories can quickly increase token usage, leading to higher API costs and slower responses.

Solution:

LangChain v1.0 introduces SummarizationMiddleware that automatically condenses older messages when token thresholds are exceeded. Key features:
Integrates into existing LangChain agents with minimal code changes
Automatic summarization when token limits are reached
Preserves recent context with configurable message retention
Uses efficient models for summarization (e.g., gpt-4o-mini)

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Run Code

View GitHub

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem:

Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.

Solution:

Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows. As a result, they can be 10 to 100 times faster on large DataFrames.

Full Article:

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Run Code

View GitHub

☕️
Weekly Finds

lifelines

ML

Survival analysis in Python with Kaplan Meier, Cox regression, and parametric models

nb-clean

Python Utils

Clean Jupyter notebooks for version control by removing outputs, metadata, and execution counts

FuzzTypes

Python Utils

Pydantic extension for autocorrecting field values using fuzzy string matching



Newsletter #239: Delta Lake: Insert + Update in One Operation

📅
Today’s Picks

Delta Lake: Insert + Update in One Operation

Problem:

In pandas, implementing upserts means running 3 separate operations: filter existing records, update matches, and append new ones. Each step requires a full data scan, increasing both code complexity and execution time.

Solution:

Delta Lake’s MERGE replaces this 3-step process with a single transaction that updates existing records and inserts new ones. How it works:
Compares source data with existing table records
Updates matching records with new values
Inserts records that don’t exist yet
Executes all changes together with automatic rollback if any step fails

Full Article:

Delta Lake: Transform pandas Prototypes into Production

Run Code

View GitHub


Related Post

Delta Lake vs pandas: Stop Silent Data Corruption

Problem:

Pandas allows type coercion during DataFrame operations. A single string value can silently convert numeric columns to object dtype, breaking downstream systems and corrupting data integrity.

Solution:

Delta Lake prevents these issues through strict schema enforcement at write time, validating data types before ingestion to maintain table integrity. Other features of Delta Lake:
Time travel provides instant access to any historical data version
ACID transactions guarantee data consistency across all operations
Smart file skipping eliminates 95% of unnecessary data scanning
Incremental processing handles billion-row updates efficiently
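The pandas failure mode described above is easy to reproduce (illustrative data); Delta Lake's schema enforcement would reject the bad write instead of coercing the column:

```python
import pandas as pd

df = pd.DataFrame({"amount": [10, 20, 30]})
print(df["amount"].dtype)  # int64

# One stray string silently flips the whole column to object dtype,
# with no error raised.
df = pd.concat([df, pd.DataFrame({"amount": ["oops"]})], ignore_index=True)
print(df["amount"].dtype)  # object
```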

Full Article:

Delta Lake: Transform pandas Prototypes into Production

Run Code

View GitHub

☕️
Weekly Finds

Boruta-Shap

ML

A tree-based feature selection tool combining the Boruta algorithm with SHAP values to identify the most important features for machine learning models.

a2a-python

LLM

Official Python SDK for building agentic applications as A2A Servers following the Agent2Agent Protocol, with async support and optional integrations.

respx

Python Utils

A Python library for mocking HTTPX and HTTP Core with request pattern matching and customizable response side effects for testing purposes.



Newsletter #238: Build Human-in-the-Loop AI Agents with LangChain

📅
Today’s Picks

Generate Time-Sortable IDs with Python 3.14’s UUID v7

Problem:

UUID4 generates purely random identifiers that lack chronological ordering. Without embedded timestamps, you need separate timestamp fields and custom sorting logic to organize records by creation time.

Solution:

Python 3.14 introduces UUID version 7 with built-in timestamp ordering. Key features:
Determine creation order by comparing two UUIDs directly
Retrieve exact creation time by extracting the embedded timestamp
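A sketch, guarded because uuid7() only exists on Python 3.14+; the timestamp extraction assumes the RFC 9562 layout, where the top 48 bits hold Unix milliseconds:

```python
import uuid

if hasattr(uuid, "uuid7"):  # Python 3.14+
    uid = uuid.uuid7()
    assert uid.version == 7

    # Top 48 bits of the UUID are the Unix timestamp in milliseconds,
    # so UUIDs created later compare greater and sort by creation time.
    ts_ms = uid.int >> 80
    print(ts_ms)
else:
    print("uuid.uuid7() requires Python 3.14+")
```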

Build Human-in-the-Loop AI Agents with LangChain

Problem:

Without human oversight, AI agents can make irreversible mistakes by executing risky operations like database deletions.

Solution:

LangChain v1.0’s interrupt() function pauses agent execution at critical decision points for human review. How it works:
interrupt() pauses tool execution for human review
MemorySaver checkpointer enables pause/resume functionality
Human reviews proposed action and approves or rejects
Command(resume=…) continues execution after approval
This gives you full control over critical AI decisions before they execute.

Full Article:

Run Private AI Workflows with LangChain and Ollama

Run Code

View GitHub

☕️
Weekly Finds

formulas

Data Processing

Excel formulas interpreter in Python that parses and compiles Excel formula expressions and workbooks

MindsDB

ML

Open-source AI development platform that allows users to build, train and deploy machine learning models using SQL queries

TimescaleDB

Data Engineer

PostgreSQL extension for high-performance real-time analytics on time-series and event data



Newsletter #237: Build Clean Visualizations with Altair Grammar

📅
Today’s Picks

Faster Data Compression with Python 3.14 Zstandard

Problem:

Compressing large datasets with gzip is slow, and the resulting files are larger than modern codecs like Zstandard produce. Using external compression libraries adds dependency complexity to your data pipeline.

Solution:

Python 3.14 includes built-in Zstandard compression that’s 2-3x faster than gzip with better compression ratios. Key benefits:
Native Python module (no external dependencies)
Compression levels from 1-22 for speed vs. size tradeoffs
Stream-based API for memory-efficient processing
Perfect for data archival and transfer workflows
Ideal for data scientists working with large CSV files, model checkpoints, and dataset distributions.
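A sketch using the new compression.zstd module, guarded because it only ships with Python 3.14+ (the payload and compression level are illustrative):

```python
try:
    from compression import zstd  # Python 3.14+
except ImportError:
    zstd = None

data = b"repetitive data " * 1_000

if zstd is not None:
    # Higher levels trade speed for size (illustrative level 3).
    packed = zstd.compress(data, level=3)
    assert zstd.decompress(packed) == data
    print(f"{len(data)} -> {len(packed)} bytes")
else:
    print("compression.zstd requires Python 3.14+")
```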

Run Code

Build Clean Visualizations with Altair Grammar

Problem:

Matplotlib requires manual data transformation and explicit configuration for every visual element.

Solution:

Altair uses declarative syntax based on Vega-Lite for intuitive, readable visualizations. With Altair, you can describe what you want, not how to create it:
Automatic formatting with type encoding (:T, :Q, :N, :O)
Built-in aggregations: mean(), sum(), count()
No manual groupby or date conversion
Easy chart composition and layering
Interactive features with minimal code

Full Article:

Top 6 Python Libraries for Visualization: Which One to Use

Run Code

View GitHub

☕️
Weekly Finds

pyscn

Python Utils

High-performance Python code quality analyzer built with Go. Designed for the AI-assisted development era

skills

LLM

Example Skills repository to customize Claude with agent skills for workflows and automation

SeleniumBase

Python Utils

Python framework for web automation, testing, and bypassing bot-detection mechanisms

