Newsletter Archive Archives

Newsletter #242: Build Faster Test Workflows with pytest Markers

Leave a Comment / Newsletter Archive / Khuyen Tran

📅
Today’s Picks

Build Faster Test Workflows with pytest Markers

Problem:

Large projects often contain hundreds of tests, and executing them all for every minor change quickly becomes inefficient.

Solution:

Pytest markers let you group tests by type, speed, or resource usage so you can run only what matters for your current task.

Quick guide to pytest markers:
Define markers in pytest.ini
Tag tests, for example: @pytest.mark.fast
Run specific tests: pytest -m fast
Skip certain tests: pytest -m "not slow"
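
As a rough sketch of the steps above, a test module might tag its tests as shown here. The file name, the test functions, and the marker names fast and slow are illustrative assumptions, not taken from the newsletter's own example. The markers would also be registered under the markers option in pytest.ini so pytest does not warn about unknown marks.

```python
# test_pipeline.py (hypothetical file name)
import time

import pytest


@pytest.mark.fast
def test_add_small_numbers():
    # Cheap check, suitable for running on every change.
    assert 1 + 1 == 2


@pytest.mark.slow
def test_sum_after_delay():
    # Stand-in for an expensive test such as a large data load.
    time.sleep(0.1)
    assert sum(range(1_000)) == 499_500
```

With a file like this, pytest -m fast would select only the first test, and pytest -m "not slow" would deselect the second.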

Learn More:

Production-Ready Data Science: From Prototyping to Production with Python

📢
ANNOUNCEMENTS

Production-Ready Data Science Is Now on Leanpub

I am excited to share that Production-Ready Data Science is now live on Leanpub! On Leanpub, you can choose your price and get updates as more examples and chapters roll out.

This book dives into the real engineering skills behind dependable data systems, including:
Testing
CI and CD
Environments and packaging
Data validation and logging
Reproducible workflows
If you want to take your data work beyond notebooks and into reliable production environments, this is for you.

☕️
Weekly Finds

AutoViz

Data Viz

Automatically Visualize any dataset, any size with a single line of code

cognee

LLM

Memory for AI Agents in 6 lines of code

niquests

Python Utils

Simple, yet elegant, Python HTTP library: a drop-in replacement for python-requests

Newsletter #241: Polars: Lazy CSV Loading with Query Optimization

📅
Today’s Picks

Polars: Lazy CSV Loading with Query Optimization

Problem:

Pandas loads entire CSV files into memory immediately, even when you only need filtered or aggregated results. This eager evaluation wastes memory and processing time on data you’ll never use.

Solution:

Polars’ scan_csv() uses lazy evaluation to optimize queries before loading data.

How scan_csv() works:
Analyzes your entire query before loading any data
Identifies which columns you actually need
Applies filters while reading the CSV file
Loads only the relevant data into memory

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Build Structured AI Agents with LangChain TodoList

Problem:

Complex workflows require structured planning. Without it, agents may execute subtasks out of order or miss crucial ones entirely.

Solution:

LangChain v1.0 introduces TodoListMiddleware, which gives agents automatic task planning and progress tracking.

Key benefits:
Decomposes complex requests into sequential steps
Marks each task as pending, in_progress, or completed
Ensures agents follow logical execution order

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

☕️
Weekly Finds

RAGxplorer

LLM

Open-source tool to visualize your RAG embeddings and document chunks

nbQA

Python Utils

Run ruff, isort, pyupgrade, mypy, pylint, flake8, and more on Jupyter Notebooks

prometheus-eval

LLM

Evaluate your LLM’s response with specialized language models for reproducible assessment

Newsletter #240: Auto-Summarize Chat History with LangChain Middleware

📅
Today’s Picks

Auto-Summarize Chat History with LangChain Middleware

Problem:

Long chat histories can quickly increase token usage, leading to higher API costs and slower responses.

Solution:

LangChain v1.0 introduces SummarizationMiddleware that automatically condenses older messages when token thresholds are exceeded.

Key features:
Integrates into existing LangChain agents with minimal code changes
Automatic summarization when token limits are reached
Preserves recent context with configurable message retention
Uses efficient models for summarization (e.g., gpt-4o-mini)

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem:

Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.

Solution:

Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows. As a result, they can be 10 to 100 times faster on large DataFrames.

Full Article:

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

☕️
Weekly Finds

lifelines

Survival analysis in Python with Kaplan Meier, Cox regression, and parametric models

nb-clean

Python Utils

Clean Jupyter notebooks for version control by removing outputs, metadata, and execution counts

FuzzTypes

Python Utils

Pydantic extension for autocorrecting field values using fuzzy string matching

Newsletter #239: Delta Lake: Insert + Update in One Operation

📅
Today’s Picks

Delta Lake: Insert + Update in One Operation

Problem:

In pandas, implementing upserts means running 3 separate operations: filter existing records, update matches, and append new ones. Each step requires a full data scan, increasing both code complexity and execution time.

Solution:

Delta Lake’s MERGE replaces this 3-step process with a single transaction that updates existing records and inserts new ones.

How it works:
Compares source data with existing table records
Updates matching records with new values
Inserts records that don’t exist yet
Executes all changes together with automatic rollback if any step fails

Full Article:

Delta Lake: Transform pandas Prototypes into Production

⭐
Related Post

Delta Lake vs pandas: Stop Silent Data Corruption

Problem:

Pandas allows type coercion during DataFrame operations. A single string value can silently convert numeric columns to object dtype, breaking downstream systems and corrupting data integrity.

Solution:

Delta Lake prevents these issues through strict schema enforcement at write time, validating data types before ingestion to maintain table integrity.

Other features of Delta Lake:
Time travel provides instant access to any historical data version
ACID transactions guarantee data consistency across all operations
Smart file skipping eliminates 95% of unnecessary data scanning
Incremental processing handles billion-row updates efficiently

Full Article:

Delta Lake: Transform pandas Prototypes into Production

☕️
Weekly Finds

Boruta-Shap

A tree-based feature selection tool combining the Boruta algorithm with SHAP values to identify the most important features for machine learning models.

a2a-python

LLM

Official Python SDK for building agentic applications as A2A Servers following the Agent2Agent Protocol, with async support and optional integrations.

respx

Python Utils

A Python library for mocking HTTPX and HTTP Core with request pattern matching and customizable response side effects for testing purposes.

Newsletter #238: Build Human-in-the-Loop AI Agents with LangChain

📅
Today’s Picks

Generate Time-Sortable IDs with Python 3.14’s UUID v7

Problem:

UUID4 generates purely random identifiers that lack chronological ordering. Without embedded timestamps, you need separate timestamp fields and custom sorting logic to organize records by creation time.

Solution:

Python 3.14 introduces UUID version 7 with built-in timestamp ordering.

Key features:
Determine creation order by comparing two UUIDs directly
Retrieve exact creation time by extracting the embedded timestamp
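
The two features above can be sketched as follows. The helper names new_uuid7 and created_at are hypothetical. The fallback branch for interpreters older than 3.14 is an assumption added so the snippet runs anywhere; it builds the ID by hand from the UUID version 7 bit layout (48-bit Unix time in milliseconds, then version and variant bits, then random bits).

```python
import os
import time
import uuid
from datetime import datetime, timezone


def new_uuid7() -> uuid.UUID:
    """Return a version 7 UUID, using uuid.uuid7() when it exists."""
    if hasattr(uuid, "uuid7"):  # Python 3.14+
        return uuid.uuid7()
    # Fallback for older interpreters (an assumption of this sketch).
    ms = time.time_ns() // 1_000_000
    value = (ms << 80) | int.from_bytes(os.urandom(10), "big")
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # version field = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # RFC variant bits = 10
    return uuid.UUID(int=value)


def created_at(u: uuid.UUID) -> datetime:
    """Decode the millisecond timestamp stored in the top 48 bits."""
    ms = u.int >> 80
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


first = new_uuid7()
time.sleep(0.002)  # ensure the two IDs fall in different milliseconds
second = new_uuid7()

print(first < second)      # comparing the IDs follows creation order
print(created_at(first))   # creation time recovered from the ID itself
```

Because the timestamp occupies the most significant bits, sorting these IDs (as UUID objects or as their hex strings) also sorts records by creation time.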

Build Human-in-the-Loop AI Agents with LangChain

Problem:

Without human oversight, AI agents can make irreversible mistakes by executing risky operations like database deletions.

Solution:

LangChain v1.0’s interrupt() function pauses agent execution at critical decision points for human review.

How it works:
interrupt() pauses tool execution for human review
MemorySaver checkpointer enables pause/resume functionality
Human reviews proposed action and approves or rejects
Command(resume=…) continues execution after approval

This gives you full control over critical AI decisions before they execute.

Full Article:

Run Private AI Workflows with LangChain and Ollama

☕️
Weekly Finds

formulas

Data Processing

Excel formulas interpreter in Python that parses and compiles Excel formula expressions and workbooks

MindsDB

Open-source AI development platform that allows users to build, train and deploy machine learning models using SQL queries

TimescaleDB

Data Engineer

PostgreSQL extension for high-performance real-time analytics on time-series and event data

Newsletter #237: Build Clean Visualizations with Altair Grammar

📅
Today’s Picks

Faster Data Compression with Python 3.14 Zstandard

Problem:

Compressing large datasets with gzip is slow and produces larger files. Using external compression libraries adds dependency complexity to your data pipeline.

Solution:

Python 3.14 includes built-in Zstandard compression that’s 2-3x faster than gzip with better compression ratios.

Key benefits:
Native Python module (no external dependencies)
Compression levels from 1-22 for speed vs. size tradeoffs
Stream-based API for memory-efficient processing
Perfect for data archival and transfer workflows

Ideal for data scientists working with large CSV files, model checkpoints, and dataset distributions.
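
A minimal sketch of comparing the two compressors on the same payload might look like this. The payload is made up, the import guard is an assumption added so the snippet still runs on interpreters older than 3.14, and the snippet only reports sizes without making any timing claim.

```python
import gzip

try:
    from compression import zstd  # standard library in Python 3.14+
except ImportError:
    zstd = None  # older interpreter: skip the Zstandard half

# Made-up, repetitive CSV-like payload.
data = b"id,value\n" + b"".join(f"{i},{i % 7}\n".encode() for i in range(10_000))

gz_bytes = gzip.compress(data)
sizes = {"raw": len(data), "gzip": len(gz_bytes)}

if zstd is not None:
    zst_bytes = zstd.compress(data, level=3)   # higher levels trade speed for size
    assert zstd.decompress(zst_bytes) == data  # round trip is lossless
    sizes["zstd"] = len(zst_bytes)

print(sizes)
```

Actual ratios and speeds depend heavily on the data, so it is worth measuring on a sample of your own files before switching formats.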

Build Clean Visualizations with Altair Grammar

Problem:

Matplotlib requires manual data transformation and explicit configuration for every visual element.

Solution:

Altair uses declarative syntax based on Vega-Lite for intuitive, readable visualizations.

With Altair, you can describe what you want, not how to create it:
Automatic formatting with type encoding (:T, :Q, :N, :O)
Built-in aggregations: mean(), sum(), count()
No manual groupby or date conversion
Easy chart composition and layering
Interactive features with minimal code

Full Article:

Top 6 Python Libraries for Visualization: Which One to Use

☕️
Weekly Finds

pyscn

Python Utils

High-performance Python code quality analyzer built with Go. Designed for the AI-assisted development era

skills

LLM

Example Skills repository to customize Claude with agent skills for workflows and automation

SeleniumBase

Python Utils

Python framework for web automation, testing, and bypassing bot-detection mechanisms

Newsletter #236: Build Grammar Rules with PyParsing Without Regex Maintenance

📅
Today’s Picks

Build Grammar Rules with PyParsing Without Regex Maintenance

Problem:

Regular expressions can be powerful but often become verbose and hard to maintain, especially when accounting for variable whitespace or special characters.

Solution:

PyParsing offers a cleaner alternative. It lets you define grammar rules using Python classes, making the parsing logic explicit and easier to maintain.

PyParsing advantages over regex:
Whitespace: Automatically handled without extra tokens
Readability: Self-documenting code structure
Data access: Use dot notation rather than numeric groups
Scalability: Combine reusable components to build complex grammars
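
As a small illustration of the whitespace handling and dot-notation access listed above, here is a sketch that parses a made-up "key = value" line format. The grammar and the result names key and value are assumptions for this example, and parse_string requires pyparsing 3.0 or later.

```python
from pyparsing import Suppress, Word, alphas, nums

# Grammar for lines such as "timeout = 30" (a made-up format).
key = Word(alphas + "_")("key")         # results name gives dot access later
value = Word(nums)("value")
assignment = key + Suppress("=") + value

# Whitespace between tokens is skipped by default, so both lines parse.
for line in ["timeout=30", "retries   =   5"]:
    parsed = assignment.parse_string(line)
    print(parsed.key, parsed.value)
```

The small pieces (key, value) are ordinary Python objects, so they can be reused when the grammar grows.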

Full Article:

Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing

⭐
Related Post

Build Self-Documenting Regex with Pregex

Problem:

Regex patterns like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} are difficult to read and intimidating. Team members without regex expertise might struggle to understand and modify these validation patterns.

Solution:

Pregex transforms regex into readable Python code using descriptive components.

Key benefits:
Code that explains its intent without comments
Easy modification without regex expertise
Composable patterns for complex validation
Export to regex format when needed

Full Article:

Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing

☕️
Weekly Finds

superduper

LLM

End-to-end framework for building custom AI applications and agents

pgai

LLM

A Python library that transforms PostgreSQL into a robust, production-ready retrieval engine for RAG and Agentic applications

lakeFS

Data Engineer

An open-source tool that transforms your object storage into a Git-like repository, enabling you to manage your data lake the way you manage your code

Newsletter #235: Python 3.14: Type-Safe String Interpolation with t-strings

📅
Today’s Picks

Python 3.14: Type-Safe String Interpolation with t-strings

Problem:

Building SQL queries with f-strings directly embeds user input into the query string, allowing attackers to inject malicious SQL commands. Parameterized queries are secure but require you to maintain query templates and value lists separately.

Solution:

Python 3.14 introduces template string literals (t-strings). Instead of returning strings, they return Template objects that safely expose interpolated values. This lets you validate and sanitize interpolated values before building the final query.

Sync Only Changed Database Records with CloudQuery (Sponsored)

Problem:

Syncing data frequently is essential for real-time analytics and data pipelines. However, transferring large datasets between providers is resource-intensive and time-consuming, especially when syncing frequently.

Solution:

CloudQuery’s incremental sync tracks what’s already synced and fetches only the changes.

How incremental sync works:
Stores last sync timestamp in a state table
Queries the source for records modified after that timestamp
Updates only changed data in the destination database

In the newsletter’s example, the initial full sync takes 33 seconds and incremental runs complete in just 5 seconds.
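
The general timestamp-cursor idea behind these steps can be sketched in plain Python. This is not CloudQuery's API or configuration: the dictionaries standing in for the source table, destination table, and state table, and the function name incremental_sync, are all hypothetical.

```python
from datetime import datetime, timezone

# In-memory stand-ins for a source table, a destination table, and a state table.
source = {
    1: {"title": "first", "updated_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    2: {"title": "second", "updated_at": datetime(2025, 1, 2, tzinfo=timezone.utc)},
}
destination = {}
state = {"last_sync": None}


def incremental_sync() -> int:
    """Copy rows changed since the last sync and return how many were copied."""
    cursor = state["last_sync"]
    changed = {
        row_id: row
        for row_id, row in source.items()
        if cursor is None or row["updated_at"] > cursor
    }
    destination.update(changed)  # write only the changed rows
    if changed:
        # Advance the cursor to the newest modification time seen so far.
        state["last_sync"] = max(row["updated_at"] for row in changed.values())
    return len(changed)


first_run = incremental_sync()   # no cursor yet, so this is a full sync
source[2] = {
    "title": "second (edited)",
    "updated_at": datetime(2025, 1, 3, tzinfo=timezone.utc),
}
second_run = incremental_sync()  # only the edited row is copied
print(first_run, second_run)
```

A real pipeline would persist the cursor durably and handle deletes and clock skew, which this sketch leaves out.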

Full Article:

Hacker News Semantic Search: Production RAG with CloudQuery and Postgres

☕️
Weekly Finds

pyscn

Data Engineer

An Intelligent Python Code Quality Analyzer that performs structural analysis to help maintain code quality for AI-assisted development.

TradingAgents

LLM

A multi-agent trading framework that uses LLM-powered agents to collaboratively evaluate market conditions and inform trading decisions.

vulture

Data Engineer

Vulture finds unused code in Python programs to help clean up and improve code quality by identifying dead or unreachable code.

Newsletter #234: Faker: Generate Realistic Test Data with One Command

📅
Today’s Picks

Faker: Generate Realistic Test Data with One Command

Problem:

Creating realistic test data manually is time-consuming.

Solution:

Faker generates authentic-looking test data with single-line commands.

Key features:
Realistic names, emails, and addresses
50+ language locales (en_US, vi_VN, etc.)
One-line profile generation with custom fields

Full Article:

Faker: Generate Realistic Test Data in Python with One Line of Code

Persist Agent State Across Restarts with LangGraph Checkpointing

Problem:

Checkpointing is a persistence layer that maintains agent workflow state between executions. Without checkpointing, agents lose all state when systems restart, requiring users to start over with new conversations.

Solution:

With LangGraph’s checkpointing, you can persist agent state to databases, enabling:
Conversation continuity through restarts
Same conversation accessible from any application instance
Flexible persistence with PostgreSQL, SQLite, or MongoDB backends

Full Article:

Building Coordinated AI Agents with LangGraph: A Hands-On Tutorial

☕️
Weekly Finds

git-who

Data Engineer

Git blame for file trees – visualize code authorship and contributions across entire directory structures

nanochat

LLM

The best ChatGPT that $100 can buy – minimal, hackable LLM implementation with full training pipeline

ManimML

Animate and visualize machine learning concepts with Manim – create neural network visualizations and educational content

Newsletter #233: Build Self-Documenting Regex with Pregex

📅
Today’s Picks

Build Self-Documenting Regex with Pregex

Full Article:

Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing

⭐
Related Post

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem:

Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.

Solution:

RapidFuzz eliminates data cleaning overhead with intelligent fuzzy matching.

Key benefits:
Automatic handling of typos, spacing, and case variations
Production-ready C++ performance for large datasets
Full spectrum of fuzzy algorithms in one library

☕️
Weekly Finds

xlwings

Python Utils

Python library that makes it easy to call Python from Excel and vice versa, with support for Excel on Windows, macOS, and web

juvio

Python Utils

UV kernel for Jupyter with inline dependency management for notebooks

drawdb

Data Engineer

Free, simple, and intuitive online database diagram editor and SQL generator
