
Newsletter Archive

Automated newsletter archive from Klaviyo campaigns

Newsletter #241: Polars: Lazy CSV Loading with Query Optimization

📅 Today’s Picks

Polars: Lazy CSV Loading with Query Optimization

Problem:

Pandas loads entire CSV files into memory immediately, even when you only need filtered or aggregated results. This eager evaluation wastes memory and processing time on data you’ll never use.

Solution:

Polars’ scan_csv() uses lazy evaluation to optimize queries before loading data. How scan_csv() works:
Analyzes your entire query before loading any data
Identifies which columns you actually need
Applies filters while reading the CSV file
Loads only the relevant data into memory

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

Build Structured AI Agents with LangChain TodoList

Problem:

Complex workflows require structured planning. Without it, agents may execute subtasks out of order or miss crucial ones entirely.

Solution:

LangChain v1.0 introduces TodoListMiddleware, which gives agents automatic task planning and progress tracking. Key benefits:
Decomposes complex requests into sequential steps
Marks each task as pending, in_progress, or completed
Ensures agents follow logical execution order

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Run Code

View GitHub

☕️ Weekly Finds

RAGxplorer

LLM

Open-source tool to visualize your RAG embeddings and document chunks

nbQA

Python Utils

Run ruff, isort, pyupgrade, mypy, pylint, flake8, and more on Jupyter Notebooks

prometheus-eval

LLM

Evaluate your LLM’s response with specialized language models for reproducible assessment


Newsletter #240: Auto-Summarize Chat History with LangChain Middleware

📅 Today’s Picks

Auto-Summarize Chat History with LangChain Middleware

Problem:

Long chat histories can quickly increase token usage, leading to higher API costs and slower responses.

Solution:

LangChain v1.0 introduces SummarizationMiddleware that automatically condenses older messages when token thresholds are exceeded. Key features:
Integrates into existing LangChain agents with minimal code changes
Automatic summarization when token limits are reached
Preserves recent context with configurable message retention
Uses efficient models for summarization (e.g., gpt-4o-mini)

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Run Code

View GitHub

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem:

Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.

Solution:

Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows. As a result, they can be 10 to 100 times faster on large DataFrames.

Full Article:

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Run Code

View GitHub

☕️ Weekly Finds

lifelines

ML

Survival analysis in Python with Kaplan Meier, Cox regression, and parametric models

nb-clean

Python Utils

Clean Jupyter notebooks for version control by removing outputs, metadata, and execution counts

FuzzTypes

Python Utils

Pydantic extension for autocorrecting field values using fuzzy string matching


Newsletter #239: Delta Lake: Insert + Update in One Operation

📅 Today’s Picks

Delta Lake: Insert + Update in One Operation

Problem:

In pandas, implementing upserts means running 3 separate operations: filter existing records, update matches, and append new ones. Each step requires a full data scan, increasing both code complexity and execution time.

Solution:

Delta Lake’s MERGE replaces this 3-step process with a single transaction that updates existing records and inserts new ones. How it works:
Compares source data with existing table records
Updates matching records with new values
Inserts records that don’t exist yet
Executes all changes together with automatic rollback if any step fails

Full Article:

Delta Lake: Transform pandas Prototypes into Production

Run Code

View GitHub


Related Post

Delta Lake vs pandas: Stop Silent Data Corruption

Problem:

Pandas allows type coercion during DataFrame operations. A single string value can silently convert numeric columns to object dtype, breaking downstream systems and corrupting data integrity.

Solution:

Delta Lake prevents these issues through strict schema enforcement at write time, validating data types before ingestion to maintain table integrity. Other features of Delta Lake:
Time travel provides instant access to any historical data version
ACID transactions guarantee data consistency across all operations
Smart file skipping eliminates 95% of unnecessary data scanning
Incremental processing handles billion-row updates efficiently

Full Article:

Delta Lake: Transform pandas Prototypes into Production

Run Code

View GitHub

☕️ Weekly Finds

Boruta-Shap

ML

A tree-based feature selection tool combining the Boruta algorithm with SHAP values to identify the most important features for machine learning models.

a2a-python

LLM

Official Python SDK for building agentic applications as A2A Servers following the Agent2Agent Protocol, with async support and optional integrations.

respx

Python Utils

A Python library for mocking HTTPX and HTTP Core with request pattern matching and customizable response side effects for testing purposes.


Newsletter #238: Build Human-in-the-Loop AI Agents with LangChain

📅 Today’s Picks

Generate Time-Sortable IDs with Python 3.14’s UUID v7

Problem:

UUID4 generates purely random identifiers that lack chronological ordering. Without embedded timestamps, you need separate timestamp fields and custom sorting logic to organize records by creation time.

Solution:

Python 3.14 introduces UUID version 7 with built-in timestamp ordering. Key features:
Determine creation order by comparing two UUIDs directly
Retrieve exact creation time by extracting the embedded timestamp
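
On Python 3.14 this is just `uuid.uuid7()`. To show why v7 IDs sort by creation time, here is a hand-rolled sketch of the v7 bit layout from RFC 9562 (48-bit Unix-millisecond timestamp in the high bits), runnable on any recent Python:

```python
import os
import time
import uuid

def uuid7_sketch() -> uuid.UUID:
    """Illustrates the UUIDv7 layout; Python 3.14's uuid.uuid7() does this for you."""
    ts = time.time_ns() // 1_000_000                          # 48-bit ms timestamp
    rand_a = int.from_bytes(os.urandom(2), "big") & 0xFFF     # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ts & ((1 << 48) - 1)) << 80
    value |= 0x7 << 76          # version 7
    value |= rand_a << 64
    value |= 0b10 << 62         # RFC 4122 variant
    value |= rand_b
    return uuid.UUID(int=value)

first = uuid7_sketch()
time.sleep(0.002)
second = uuid7_sketch()

# The timestamp sits in the high bits, so later IDs compare greater:
print(first < second)
# Extract the embedded creation time (milliseconds since the epoch):
created_ms = first.int >> 80
```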

Build Human-in-the-Loop AI Agents with LangChain

Problem:

Without human oversight, AI agents can make irreversible mistakes by executing risky operations like database deletions.

Solution:

LangChain v1.0’s interrupt() function pauses agent execution at critical decision points for human review. How it works:
interrupt() pauses tool execution for human review
MemorySaver checkpointer enables pause/resume functionality
Human reviews proposed action and approves or rejects
Command(resume=…) continues execution after approval
This gives you full control over critical AI decisions before they execute.

Full Article:

Run Private AI Workflows with LangChain and Ollama

Run Code

View GitHub

☕️ Weekly Finds

formulas

Data Processing

Excel formulas interpreter in Python that parses and compiles Excel formula expressions and workbooks

MindsDB

ML

Open-source AI development platform that allows users to build, train and deploy machine learning models using SQL queries

TimescaleDB

Data Engineer

PostgreSQL extension for high-performance real-time analytics on time-series and event data


Newsletter #237: Build Clean Visualizations with Altair Grammar

📅 Today’s Picks

Faster Data Compression with Python 3.14 Zstandard

Problem:

Compressing large datasets with gzip is slow and produces larger files. Using external compression libraries adds dependency complexity to your data pipeline.

Solution:

Python 3.14 includes built-in Zstandard compression that’s 2-3x faster than gzip with better compression ratios. Key benefits:
Native Python module (no external dependencies)
Compression levels from 1-22 for speed vs. size tradeoffs
Stream-based API for memory-efficient processing
Perfect for data archival and transfer workflows
Ideal for data scientists working with large CSV files, model checkpoints, and dataset distributions.

Run Code

Build Clean Visualizations with Altair Grammar

Problem:

Matplotlib requires manual data transformation and explicit configuration for every visual element.

Solution:

Altair uses declarative syntax based on Vega-Lite for intuitive, readable visualizations. With Altair, you can describe what you want, not how to create it:
Automatic formatting with type encoding (:T, :Q, :N, :O)
Built-in aggregations: mean(), sum(), count()
No manual groupby or date conversion
Easy chart composition and layering
Interactive features with minimal code

Full Article:

Top 6 Python Libraries for Visualization: Which One to Use

Run Code

View GitHub

☕️ Weekly Finds

pyscn

Python Utils

High-performance Python code quality analyzer built with Go. Designed for the AI-assisted development era

skills

LLM

Example Skills repository to customize Claude with agent skills for workflows and automation

SeleniumBase

Python Utils

Python framework for web automation, testing, and bypassing bot-detection mechanisms


Newsletter #236: Build Grammar Rules with PyParsing Without Regex Maintenance

📅 Today’s Picks

Build Grammar Rules with PyParsing Without Regex Maintenance

Problem:

Regular expressions can be powerful but often become verbose and hard to maintain, especially when accounting for variable whitespace or special characters.

Solution:

PyParsing offers a cleaner alternative. It lets you define grammar rules using Python classes, making the parsing logic explicit and easier to maintain. PyParsing advantages over regex:
Whitespace: Automatically handled without extra tokens
Readability: Self-documenting code structure
Data access: Use dot notation rather than numeric groups
Scalability: Combine reusable components to build complex grammars
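
A small sketch of those advantages (the `key = value` grammar and sample input are made up for illustration):

```python
from pyparsing import Word, alphas, nums, Suppress, Group, OneOrMore

# Grammar for lines like "cpu = 85" -- each piece is a named, reusable rule.
key = Word(alphas)("key")
value = Word(nums)("value")
assignment = Group(key + Suppress("=") + value)  # "=" is matched but dropped
config = OneOrMore(assignment)

# Inconsistent whitespace is handled automatically, no \s* tokens needed.
result = config.parse_string("cpu = 85 mem= 72 disk =90")

# Named results give dot-notation access instead of numeric regex groups.
pairs = {tok.key: int(tok.value) for tok in result}
print(pairs)
```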

Full Article:

Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing

Run Code

View GitHub


Related Post

Build Self-Documenting Regex with Pregex

Problem:

Regex patterns like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} are difficult to read and intimidating. Team members without regex expertise might struggle to understand and modify these validation patterns.

Solution:

Pregex transforms regex into readable Python code using descriptive components. Key benefits:
Code that explains its intent without comments
Easy modification without regex expertise
Composable patterns for complex validation
Export to regex format when needed

Full Article:

Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing

Run Code

View GitHub

☕️ Weekly Finds

superduper

LLM

End-to-end framework for building custom AI applications and agents

pgai

LLM

A Python library that transforms PostgreSQL into a robust, production-ready retrieval engine for RAG and Agentic applications

lakeFS

Data Engineer

An open-source tool that transforms your object storage into a Git-like repository, enabling you to manage your data lake the way you manage your code


Newsletter #235: Python 3.14: Type-Safe String Interpolation with t-strings

📅 Today’s Picks

Python 3.14: Type-Safe String Interpolation with t-strings

Problem:

Building SQL queries with f-strings directly embeds user input into the query string, allowing attackers to inject malicious SQL commands. Parameterized queries are secure but require you to maintain query templates and value lists separately.

Solution:

Python 3.14 introduces template string literals (t-strings). Instead of returning strings, they return Template objects that safely expose interpolated values. This lets you validate and sanitize interpolated values before building the final query.

Run Code

Sync Only Changed Database Records with CloudQuery (Sponsored)

Problem:

Syncing data frequently is essential for real-time analytics and data pipelines.However, transferring large datasets between providers is resource-intensive and time-consuming, especially when syncing frequently.

Solution:

CloudQuery’s incremental sync tracks what’s already synced and fetches only the changes. How incremental sync works:
Stores last sync timestamp in a state table
Queries the source for records modified after that timestamp
Updates only changed data in the destination database
In the linked example, the initial full sync takes 33 seconds, while subsequent incremental runs complete in just 5 seconds.

Full Article:

Hacker News Semantic Search: Production RAG with CloudQuery and Postgres

Run Code

View GitHub

☕️ Weekly Finds

pyscn

Data Engineer

An Intelligent Python Code Quality Analyzer that performs structural analysis to help maintain code quality for AI-assisted development.

TradingAgents

LLM

A multi-agent trading framework that uses LLM-powered agents to collaboratively evaluate market conditions and inform trading decisions.

vulture

Data Engineer

Vulture finds unused code in Python programs to help clean up and improve code quality by identifying dead or unreachable code.


Newsletter #234: Faker: Generate Realistic Test Data with One Command

📅 Today’s Picks

Faker: Generate Realistic Test Data with One Command

Problem:

Creating realistic test data manually is time-consuming.

Solution:

Faker generates authentic-looking test data with single-line commands. Key features:
Realistic names, emails, and addresses
50+ language locales (en_US, vi_VN, etc.)
One-line profile generation with custom fields

Full Article:

Faker: Generate Realistic Test Data in Python with One Line of Code

Run Code

View GitHub

Persist Agent State Across Restarts with LangGraph Checkpointing

Problem:

Checkpointing is a persistence layer that maintains agent workflow state between executions. Without checkpointing, agents lose all state when systems restart, requiring users to start over with new conversations.

Solution:

With LangGraph’s checkpointing, you can persist agent state to databases, enabling:
Conversation continuity through restarts
Same conversation accessible from any application instance
Flexible persistence with PostgreSQL, SQLite, or MongoDB backends

Full Article:

Building Coordinated AI Agents with LangGraph: A Hands-On Tutorial

Run Code

View GitHub

☕️ Weekly Finds

git-who

Data Engineer

Git blame for file trees – visualize code authorship and contributions across entire directory structures

nanochat

LLM

The best ChatGPT that $100 can buy – minimal, hackable LLM implementation with full training pipeline

ManimML

ML

Animate and visualize machine learning concepts with Manim – create neural network visualizations and educational content


Newsletter #233: Build Self-Documenting Regex with Pregex

📅 Today’s Picks

Build Self-Documenting Regex with Pregex

Problem:

Regex patterns like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} are difficult to read and intimidating. Team members without regex expertise might struggle to understand and modify these validation patterns.

Solution:

Pregex transforms regex into readable Python code using descriptive components. Key benefits:
Code that explains its intent without comments
Easy modification without regex expertise
Composable patterns for complex validation
Export to regex format when needed

Full Article:

Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing

Run Code

View GitHub


Related Post

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem:

Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.

Solution:

RapidFuzz eliminates data cleaning overhead with intelligent fuzzy matching.Key benefits:
Automatic handling of typos, spacing, and case variations
Production-ready C++ performance for large datasets
Full spectrum of fuzzy algorithms in one library

Full Article:

4 Text Similarity Tools: When Regex Isn’t Enough

Run Code

View GitHub

☕️ Weekly Finds

xlwings

Python Utils

Python library that makes it easy to call Python from Excel and vice versa, with support for Excel on Windows, macOS, and web

juvio

Python Utils

UV kernel for Jupyter with inline dependency management for notebooks

drawdb

Data Engineer

Free, simple, and intuitive online database diagram editor and SQL generator


Newsletter #232: Build Data Analysis with LangChain Pandas Agent

📅 Today’s Picks

Build Data Analysis with LangChain Pandas Agent

Problem:

Do you find yourself writing the same pandas correlation, groupby, and filtering code repeatedly for data exploration? Complex, multi-step analyses often involve tedious manual calculations and comparisons, pulling data scientists away from higher-value tasks like modeling and insight generation.

Solution:

LangChain Pandas DataFrame Agent lets you analyze data using natural language, eliminating repetitive code and speeding up your workflow. Key capabilities:
Ask complex analytical questions in plain English
Multi-step analysis in single requests
Get results with automatic explanations of methodology
Select from multiple AI models based on your query complexity

Full Article:

Run Private AI Workflows with LangChain and Ollama

Run Code

View GitHub

Faster Type Checking with Ty’s Rust Engine

Problem:

Traditional type checkers like mypy are slow on large codebases, making iteration cycles longer and development less efficient.

Solution:

Ty is a Rust-based type checker that provides instant feedback on type errors. When testing the FastAPI codebase, Ty completes type checking 9x faster than mypy. Key benefits:
Significantly faster than mypy/pyright on large codebases
Auto-checks every save for immediate feedback while coding
Real-time IDE integration for VS Code and popular editors
Zero setup: run with uvx instantly, respects .gitignore automatically

View GitHub

☕️ Weekly Finds

hyperfine

Python Utils

A command-line benchmarking tool for measuring the execution time of commands with statistical analysis across multiple runs

SurfSense

LLM

Open Source Alternative to NotebookLM / Perplexity / Glean, connected to external sources such as search engines (Tavily, Linkup), Slack, Linear, Notion, YouTube, GitHub and more

stanza

ML

Stanford NLP Python library for tokenization, sentence segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and more

