Newsletter Archive Archives

Newsletter #249: prek: Faster, Leaner Pre-Commit Hooks (Rust-Powered)

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

LangChain v1.0: Automate Tool Selection for Faster Agents

Problem
Agents with many tools send every tool description with every request.
Most of those descriptions are irrelevant to the task at hand, so they waste tokens and make responses slower and more expensive.
Solution
LangChain v1.0 introduces LLMToolSelectorMiddleware that pre-filters relevant tools using a smaller model.
Key features:

Pre-filter tools using cheaper models like GPT-4o-mini
Limit tools sent to main agent (e.g., 3 most relevant)
Preserve critical tools with always_include parameter

📖 View Full Article

🧪 Run code

⭐ View GitHub

prek: Faster, Leaner Pre-Commit Hooks (Rust-Powered)

Problem
pre-commit is a framework for managing Git hooks that automatically run code quality checks before commits.
However, installing these hook environments (linters, formatters, etc.) can be slow and disk-intensive, especially in CI/CD pipelines where speed matters.
Solution
prek is a drop-in replacement for pre-commit that installs hook environments significantly faster while using 50% less disk space.
Built with Rust for maximum performance, prek reduces cache storage from 1.6GB to 810MB (benchmarked on Apache Airflow repository) without changing your workflow.
Key benefits:

Uses your existing .pre-commit-config.yaml files
Commands mirror pre-commit syntax (prek install-hooks, prek run)
Monorepo support with selector syntax for targeting specific projects or hooks
Install as a single binary with no dependencies

No configuration changes needed – just replace the command.

⭐ View GitHub

☕️ Weekly Finds

deepagents
[LLM]
– Build advanced AI agents with context isolation through sub-agent delegation. Features virtual file system for context offloading, specialized sub-agents with focused tool sets, and sophisticated agent architecture for real-world research and analysis tasks.

mcp-gateway
[MLOps]
– Docker MCP CLI plugin / MCP Gateway for production-grade AI agent stack. Enables multi-agent orchestration, intelligent interceptors, and enterprise security with Docker integration.

nbQA
[Python Utils]
– Run ruff, isort, pyupgrade, mypy, pylint, flake8, and more on Jupyter Notebooks. Command-line tool to run linters and formatters over Python code in Jupyter notebooks.

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #248: Build Mathematical Animations with Manim in Python – Fixed

📅 Today’s Picks

Build Mathematical Animations with Manim in Python

Problem
Static slides can only go so far when you’re explaining complex concepts.
Dynamic visuals make abstract ideas clearer, more engaging, and easier to understand.
Solution
Manim gives you the power to create professional mathematical animations in Python, just like the ones you see in 3Blue1Brown’s videos.
In the code below, Manim transforms equations into smooth visual steps:

Define equation steps using MathTex with LaTeX notation
Animate equation transformations with the Transform class
Control animation flow with play() and wait() methods
Render the output with a single command: manim -p -ql script.py

📖 View Full Article

⭐ View GitHub

☕️ Weekly Finds

fast-langdetect
[Python Utils]
– 80x faster and 95% accurate language identification with Fasttext

FuncToWeb
[Python Utils]
– Transform any Python function into a web interface automatically

graphic-walker
[Data Viz]
– An open source alternative to Tableau for data exploration and visualization

Newsletter #247: whenever: Simple Python Timezone Conversion

🤝 COLLABORATION

Build Safer APIs with Buf – Free Workshop
Building APIs is simple. Scaling them across teams and systems isn’t. Ensuring consistency, compatibility, and reliability quickly becomes a challenge as projects grow.
Buf provides a toolkit that makes working with Protocol Buffers faster, safer, and more consistent.
Join Buf for a live, one-hour workshop on building safer, more consistent APIs.
When: Nov 19, 2025 • 9 AM PST | 12 PM EST | 5 PM GMT
What you’ll learn:

How Protobuf makes API development safer and simpler
API design best practices for real-world systems
How to extend Protobuf to data pipelines and streaming systems

→ Register for the workshop

📅 Today’s Picks

whenever: Simple Python Timezone Conversion

Problem
Adding 8 hours to 10pm shouldn’t give you the wrong morning time, but with Python’s datetime, it can.
The standard library fails during DST transitions, returning incorrect offsets when clocks change for daylight saving.
Solution
Whenever provides simple, explicit timezone conversion methods with clear semantics.
Key benefits:

DST-safe arithmetic with automatic offset adjustment
Type safety prevents naive/aware datetime bugs
Clean timezone conversions with .to_tz()
Nanosecond precision for deltas and timestamps
Pydantic integration for serialization
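The pitfall described above can be reproduced with the standard library alone. This is a minimal sketch, assuming the Europe/Paris zone and the 26 March 2023 spring-forward date as an illustrative case (neither comes from the newsletter). It shows wall-clock arithmetic skipping over the lost hour, and a manual UTC round-trip as one workaround; whenever's DST-safe arithmetic is meant to make this step unnecessary.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

# 10pm on the night before clocks jump from 02:00 to 03:00 (26 March 2023).
start = datetime(2023, 3, 25, 22, 0, tzinfo=paris)

# Standard-library arithmetic on an aware datetime uses wall-clock time,
# so the skipped hour is ignored and only 7 real hours have passed.
wall_clock = start + timedelta(hours=8)

# Workaround: do the arithmetic in UTC, then convert back to local time.
elapsed = (start.astimezone(timezone.utc) + timedelta(hours=8)).astimezone(paris)

print(wall_clock.isoformat())  # 2023-03-26T06:00:00+02:00
print(elapsed.isoformat())     # 2023-03-26T07:00:00+02:00
```

The two results differ by exactly the hour that was skipped during the transition.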

🧪 Run code

⭐ View GitHub

Build Readable Scatter Plots with adjustText Auto-Positioning

Problem
Text labels in matplotlib scatter plots frequently overlap with each other and data points, creating unreadable visualizations.
Manually repositioning each label to avoid overlaps is tedious and time-consuming.
Solution
adjustText automatically repositions labels to eliminate overlaps while connecting them to data points with arrows.
All you need is to collect your text objects and call adjust_text() with optional arrow styling.

🧪 Run code

⭐ View GitHub

📢 ANNOUNCEMENTS

Featured on LeanPub: Production-Ready Data Science
My book Production-Ready Data Science was featured on the LeanPub home page!
LeanPub is a leading platform for publishing and selling self-published technical books, so it’s truly an honor to see my work highlighted there.
The book shares everything I’ve learned about turning data science prototypes into reliable, production-ready systems, from managing dependencies to automating workflows.
Thank you to everyone who has purchased or shared it. Your support means everything.
The book is currently on sale for 58% off until November 16.

→ Get Your Copy Now (58% Off)

☕️ Weekly Finds

featuretools
[ML]
– An open source python library for automated feature engineering

datachain
[Data Processing]
– ETL, Analytics, Versioning for Unstructured Data – AI-data warehouse to enrich, transform and analyze data from cloud storages

logfire
[Python Utils]
– Uncomplicated Observability for Python and beyond – an observability platform built on OpenTelemetry from the team behind Pydantic

Newsletter #246: Faster Polars Queries with Programmatic Expressions

📅 Today’s Picks

Faster Polars Queries with Programmatic Expressions

Problem
When you apply similar transformations in a for loop, each Polars with_columns() call runs on its own, one after another.
This prevents the optimizer from seeing the full computation plan.
Solution
Instead, generate all Polars expressions programmatically before applying them together.
This enables Polars to:

See the complete computation plan upfront
Optimize across all expressions simultaneously
Parallelize operations across CPU cores

📖 View Full Article

🧪 Run code

⭐ View GitHub

itertools.chain: Merge Lists Without Intermediate Copies

Problem
Merging lists with + concatenation builds a new list at every step, and collecting everything with extend() still materializes one large combined list.
This memory overhead becomes significant when processing large lists.
Solution
itertools.chain() lazily merges multiple iterables without creating intermediate lists.
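A short illustration using only the standard library. The list names and values are made up for the example.

```python
from itertools import chain

batch_a = [1, 2, 3]
batch_b = [4, 5]
batch_c = [6]

# Concatenation allocates a brand-new list holding every element.
merged_list = batch_a + batch_b + batch_c

# chain() returns a lazy iterator that yields items straight from the
# original lists, so no merged list is ever allocated.
merged_iter = chain(batch_a, batch_b, batch_c)
total = sum(merged_iter)
print(total)  # 21

# chain.from_iterable() does the same for a list of lists.
flat = list(chain.from_iterable([batch_a, batch_b, batch_c]))
print(flat)  # [1, 2, 3, 4, 5, 6]
```

Note that a chain object can be consumed only once; call chain() again if you need a second pass.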

📖 View Full Article

🧪 Run code

☕️ Weekly Finds

fiftyone
[ML]
– Open-source tool for building high-quality datasets and computer vision models

llama-stack
[LLM]
– Composable building blocks to build Llama Apps with unified API for inference, RAG, agents, and more

grip
[Python Utils]
– Preview GitHub README.md files locally before committing them using GitHub’s markdown API

Newsletter #245: PySpark: Avoid Double Conversions with applyInArrow

📅 Today’s Picks

PySpark: Avoid Double Conversions with applyInArrow

Problem
applyInPandas lets you apply Pandas functions in PySpark by converting data from Arrow→Pandas→Arrow for each operation.
This double conversion adds serialization overhead that slows down your transformations.
Solution
applyInArrow (introduced in PySpark 4.0.0) works directly with PyArrow tables, eliminating the Pandas conversion step entirely.
This keeps data in Arrow’s columnar format throughout the pipeline, making operations significantly faster.
Trade-off: PyArrow’s syntax is less intuitive than Pandas, but it’s worth it if you’re processing large datasets where performance matters.

📖 View Full Article

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

causal-learn
[ML]
– Python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms

POT
[ML]
– Python Optimal Transport library providing solvers for optimization problems related to signal, image processing and machine learning

qdrant
[MLOps]
– High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI

Newsletter #244: Handle Large Data with Polars Streaming Mode

📅 Today’s Picks

Handle Large Data with Polars Streaming Mode

Problem
In Polars, the .collect() method executes a lazy query and loads the entire dataset into memory. This works well for smaller data, but once the dataset grows beyond your available RAM, it can easily crash your process.
Solution
Add engine="streaming" to .collect() to process large datasets in small batches without running out of memory.
How it works:

Breaks the dataset into smaller, memory-friendly chunks
Processes one batch at a time while freeing memory as it goes
Combines all partial results into a single DataFrame

📖 View Full Article

🧪 Run code

⭐ View GitHub

Build Professional Python Packages with UV --package

Problem
Python packages turn your code into reusable modules you can share across projects.
But building them requires complex setup with setuptools, managing build systems, and understanding distribution mechanics.
Solution
UV, a fast Python package installer and resolver, reduces the entire process to three commands:

uv init --package sets up your package structure instantly
uv build creates the distribution files, and uv publish uploads them to PyPI

📖 Learn more

⭐ View GitHub

☕️ Weekly Finds

whenever
[Python Utils]
– Modern datetime library for Python that ensures correct and type-checked datetime manipulations. It is DST-safe and way faster than standard datetime libraries.

lancedb
[MLOps]
– Developer-friendly, embedded retrieval database for AI/ML applications. The ultimate multimodal data platform designed for fast, scalable, and production-ready vector search.

grip
[Python Utils]
– Preview GitHub README.md files locally before committing them. A command-line server that uses GitHub’s Markdown API to render local readme files.

Newsletter #243: Turn Your ML Tests Into Plain English with Behave

📅 Today’s Picks

Turn Your ML Tests Into Plain English with Behave

Problem
Unit testing matters in data science, but writing tests that business stakeholders can actually understand is a challenge.
If they can’t read the tests, they can’t confirm the logic matches business expectations.
Solution
Behave turns test cases into plain-English specifications using the Given/When/Then format.
How to use Behave for readable tests:

Write .feature files with Given/When/Then syntax
Implement steps in Python using @given, @when, @then decorators
Run “behave” to execute tests

This lets technical and business teams stay aligned without confusion.

📖 View Full Article

🧪 Run code

⭐ View GitHub

Build Powerful Data Pipelines with DuckDB + pandas

Problem
Pandas is great for data cleaning and feature engineering, while SQL excels at complex aggregations.
But moving data from pandas to a database and back can be tedious.
Solution
DuckDB solves this by letting you run SQL directly on pandas DataFrames and return the results back into pandas for further analysis.

📖 View Full Article

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

feast
[MLOps]
– Open source feature store for machine learning that manages existing infrastructure to productionize ML models with fast data consistency and leakage prevention

git-who
[Python Utils]
– CLI tool for industrial-scale git blaming that shows who is responsible for entire components or subsystems in your codebase, not just individual lines

organize
[Python Utils]
– File management automation tool for safe moving, renaming, copying files with conflict resolution, duplicate detection, and Exif tag extraction

Newsletter #242: Build Faster Test Workflows with pytest Markers

📅 Today’s Picks

Build Faster Test Workflows with pytest Markers

Problem
Large projects often contain hundreds of tests, and executing them all for every minor change quickly becomes inefficient.
Solution
Pytest markers let you group tests by type, speed, or resource usage so you can run only what matters for your current task.
Quick guide to pytest markers:

Define markers in pytest.ini
Tag tests, for example: @pytest.mark.fast
Run specific tests: pytest -m fast
Skip certain tests: pytest -m "not slow"
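The steps above might look like this in a test file. The normalize() function and both tests are hypothetical, and the pytest.ini contents are shown as a comment so the example stays in one block.

```python
# pytest.ini (registers the markers so pytest does not warn about them):
#
#   [pytest]
#   markers =
#       fast: quick unit tests
#       slow: long-running tests

import time

import pytest


def normalize(values):
    """Scale values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


@pytest.mark.fast
def test_normalize_sums_to_one():
    assert sum(normalize([1, 1, 2])) == pytest.approx(1.0)


@pytest.mark.slow
def test_normalize_large_input():
    time.sleep(0.1)  # stand-in for expensive work
    result = normalize(list(range(1, 100_001)))
    assert len(result) == 100_000
```

With this file in place, pytest -m fast runs only the first test, and pytest -m "not slow" skips the second.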

📖 Learn more

🧪 Run code

📢 ANNOUNCEMENTS

Production-Ready Data Science Is Now on Leanpub
I am excited to share that Production-Ready Data Science is now live on Leanpub!
On Leanpub, you can choose your price and get updates as more examples and chapters roll out.
This book dives into the real engineering skills behind dependable data systems, including:

Testing
CI and CD
Environments and packaging
Data validation and logging
Reproducible workflows

If you want to take your data work beyond notebooks and into reliable production environments, this is for you.

→ Get the Book

☕️ Weekly Finds

AutoViz
[Data Viz]
– Automatically Visualize any dataset, any size with a single line of code

cognee
[LLM]
– Memory for AI Agents in 6 lines of code

niquests
[Python Utils]
– Simple, yet elegant, Python HTTP library: a drop-in replacement for python-requests

Newsletter #241: Polars: Lazy CSV Loading with Query Optimization

📅 Today’s Picks

Polars: Lazy CSV Loading with Query Optimization

Problem
Pandas loads entire CSV files into memory immediately, even when you only need filtered or aggregated results.
This eager evaluation wastes memory and processing time on data you’ll never use.
Solution
Polars’ scan_csv() uses lazy evaluation to optimize queries before loading data.
How scan_csv() works:

Analyzes your entire query before loading any data
Identifies which columns you actually need
Applies filters while reading the CSV file
Loads only the relevant data into memory

📖 View Full Article

🧪 Run code

⭐ View GitHub

Build Structured AI Agents with LangChain TodoList

Problem
Complex workflows require structured planning. Without it, agents may execute subtasks out of order or miss crucial ones entirely.
Solution
LangChain v1.0 introduces TodoListMiddleware, which gives agents automatic task planning and progress tracking.
Key benefits:

Decomposes complex requests into sequential steps
Marks each task as pending, in_progress, or completed
Ensures agents follow logical execution order

📖 View Full Article

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

RAGxplorer
[LLM]
– Open-source tool to visualize your RAG embeddings and document chunks

nbQA
[Python Utils]
– Run ruff, isort, pyupgrade, mypy, pylint, flake8, and more on Jupyter Notebooks

prometheus-eval
[LLM]
– Evaluate your LLM’s response with specialized language models for reproducible assessment

Newsletter #240: Auto-Summarize Chat History with LangChain Middleware

📅 Today’s Picks

Auto-Summarize Chat History with LangChain Middleware

Problem
Long chat histories can quickly increase token usage, leading to higher API costs and slower responses.
Solution
LangChain v1.0 introduces SummarizationMiddleware that automatically condenses older messages when token thresholds are exceeded.
Key features:

Integrates into existing LangChain agents with minimal code changes
Automatic summarization when token limits are reached
Preserves recent context with configurable message retention
Uses efficient models for summarization (e.g., gpt-4o-mini)

📖 View Full Article

🧪 Run code

⭐ View GitHub

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem
Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.
Solution
Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows.
As a result, they can be 10 to 100 times faster on large DataFrames.

📖 View Full Article

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

lifelines
[ML]
– Survival analysis in Python with Kaplan Meier, Cox regression, and parametric models

nb-clean
[Python Utils]
– Clean Jupyter notebooks for version control by removing outputs, metadata, and execution counts

FuzzTypes
[Python Utils]
– Pydantic extension for autocorrecting field values using fuzzy string matching
