python Archives

Newsletter #250: Extract Text from Any Document Format with Docling

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Build Schema-Flexible Pipelines with Polars Selectors

Problem
Hard-coding column names can break your code when the schema changes.
When columns of the same type are added or removed, you must update your code manually.
Solution
The Polars col() function accepts data types, so it selects all matching columns automatically.
This keeps your code flexible and robust to schema changes.

📖 View Full Article

🧪 Run code

⭐ View GitHub

Extract Text from Any Document Format with Docling

Problem
Have you ever needed to pull text from PDFs, Word files, slide decks, or images for a project? Writing a different parser for each format is slow and error-prone.
Solution
Docling’s DocumentConverter takes care of that by detecting the file type and applying the right parsing method for PDF, DOCX, PPTX, HTML, and images.
Other features of Docling:

AI-powered image descriptions for searchable diagrams
Export to pandas DataFrames, JSON, or Markdown
Structure-preserving output optimized for RAG pipelines
Built-in chunking strategies for vector databases
Parallel processing handles large document batches efficiently

📖 View Full Article

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

evals
[LLM]
– Framework for evaluating large language models (LLMs) or systems built using LLMs with existing registry of evals and ability to write custom evals

sklearn-bayes
[ML]
– Python package for Bayesian Machine Learning with scikit-learn API

databonsai
[Data Processing]
– Python library that uses LLMs to perform data cleaning tasks for categorization, transformation and curation

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.


Newsletter #249: prek: Faster, Leaner Pre-Commit Hooks (Rust-Powered)

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

LangChain v1.0: Automate Tool Selection for Faster Agents

Problem
Agents with many tools send every tool description with each request.
This wastes tokens on irrelevant tool descriptions, making responses slower and more expensive.
Solution
LangChain v1.0 introduces LLMToolSelectorMiddleware that pre-filters relevant tools using a smaller model.
Key features:

Pre-filter tools using cheaper models like GPT-4o-mini
Limit tools sent to main agent (e.g., 3 most relevant)
Preserve critical tools with always_include parameter

📖 View Full Article

🧪 Run code

⭐ View GitHub

prek: Faster, Leaner Pre-Commit Hooks (Rust-Powered)

Problem
pre-commit is a framework for managing Git hooks that automatically run code quality checks before commits.
However, installing these hook environments (linters, formatters, etc.) can be slow and disk-intensive, especially in CI/CD pipelines where speed matters.
Solution
prek is a drop-in replacement for pre-commit that installs hook environments significantly faster while using 50% less disk space.
Built with Rust for maximum performance, prek reduces cache storage from 1.6GB to 810MB (benchmarked on Apache Airflow repository) without changing your workflow.
Key benefits:

Uses your existing .pre-commit-config.yaml files
Commands mirror pre-commit syntax (prek install-hooks, prek run)
Monorepo support with selector syntax for targeting specific projects or hooks
Install as a single binary with no dependencies

No configuration changes needed – just replace the command.

⭐ View GitHub

☕️ Weekly Finds

deepagents
[LLM]
– Build advanced AI agents with context isolation through sub-agent delegation. Features virtual file system for context offloading, specialized sub-agents with focused tool sets, and sophisticated agent architecture for real-world research and analysis tasks.

mcp-gateway
[MLOps]
– Docker MCP CLI plugin / MCP Gateway for production-grade AI agent stack. Enables multi-agent orchestration, intelligent interceptors, and enterprise security with Docker integration.

nbQA
[Python Utils]
– Run ruff, isort, pyupgrade, mypy, pylint, flake8, and more on Jupyter Notebooks. Command-line tool to run linters and formatters over Python code in Jupyter notebooks.


Code example: Build Mathematical Animations with Manim in Python

Newsletter #248: Build Mathematical Animations with Manim in Python – Fixed

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Build Mathematical Animations with Manim in Python

Problem
Static slides can only go so far when you’re explaining complex concepts.
Dynamic visuals make abstract ideas clearer, more engaging, and easier to understand.
Solution
Manim gives you the power to create professional mathematical animations in Python, just like the ones you see in 3Blue1Brown’s videos.
Manim transforms equations into smooth visual steps:

Define equation steps using MathTex with LaTeX notation
Animate equation transformations with the Transform class
Control animation flow with play() and wait() methods
Render output with simple command: manim -p -ql script.py

📖 View Full Article

⭐ View GitHub

☕️ Weekly Finds

fast-langdetect
[Python Utils]
– 80x faster and 95% accurate language identification with Fasttext

FuncToWeb
[Python Utils]
– Transform any Python function into a web interface automatically

graphic-walker
[Data Viz]
– An open source alternative to Tableau for data exploration and visualization


Manim: Create Mathematical Animations Like 3Blue1Brown Using Python

Leave a Comment / Blog, Python Utilities / Khuyen Tran

Table of Contents

Motivation
What is Manim?
Create a Blue Square that Grows from the Center
Turn a Square into a Circle
Customize Manim
Write Mathematical Equations with a Moving Frame
Moving and Zooming Camera
Graph
Move Objects Together
Trace Path
Recap

Motivation
Have you ever struggled with math concepts in a machine learning algorithm and turned to 3Blue1Brown as a learning resource? 3Blue1Brown is a popular math YouTube channel created by Grant Sanderson, known for exceptional explanations and stunning animations.
What if you could create similar animations to explain data science concepts to your teammates, managers, or followers?
Grant developed a Python package called Manim that enables you to create mathematical animations or pictures using Python. In this article, you will learn how to create beautiful mathematical animations using Manim.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


What is Manim?
Manim is an animation engine for creating precise, explanatory math videos. Note that there are two versions of Manim:

The original version created by Grant Sanderson
The community-maintained version

Since the Manim Community version is updated more frequently and better tested, we’ll use it for this tutorial.
To install the dependencies for the package, visit the Documentation. After the dependencies are installed, run:
pip install manim

Create a Blue Square that Grows from the Center
We’ll start by creating a blue square that grows from the center:
from manim import *

class GrowingSquare(Scene):
    def construct(self):
        square = Square(color=BLUE, fill_opacity=0.5)
        self.play(GrowFromCenter(square))
        self.wait()

The code leverages four key Manim elements to produce the animation.

Scene: Base class providing animation infrastructure via construct() and play() methods
Square(color, fill_opacity): Mobject for creating squares with specified stroke color and fill transparency
GrowFromCenter: Growth animation starting from the object’s center
wait(): 1-second pause in animation timeline

Save the script above as start.py. Now run the command below to generate a video for the script:
manim -p -ql start.py GrowingSquare

A video called GrowingSquare.mp4 will be saved inside the media folder that Manim creates in your working directory. You should see a blue square growing from the center.

Explanation of the options:

-p: play the video once it finishes generating
-ql: generate a video with low quality

To generate a video with high quality, use -qh instead.
To create a GIF instead of a video, add --format=gif to the command:
manim -p -ql --format=gif start.py GrowingSquare

Turn a Square into a Circle
Creating a square alone is not that exciting. Let’s transform the square into a circle using the Transform animation.
from manim import *

class SquareToCircle(Scene):
    def construct(self):
        circle = Circle()
        circle.set_fill(PINK, opacity=0.5)

        square = Square()
        square.rotate(PI / 4)

        self.play(Create(square))
        self.play(Transform(square, circle))
        self.play(FadeOut(square))

This code demonstrates creating shapes, styling them, and applying different animation effects.

Circle: Creates a circular shape
set_fill(color, opacity): Sets the fill color and transparency (0.5 = 50% transparent)
rotate(angle): Rotates the object by an angle (PI/4 = 45 degrees)
Create: Animation that draws the shape onto the screen
Transform: Smoothly morphs one shape into another
FadeOut: Makes the object gradually disappear

Find a comprehensive list of shapes in the Manim geometry documentation.
Customize Manim
If you don’t want the background to be black, you can change it to white using self.camera.background_color:
from manim import *

class CustomBackground(Scene):
    def construct(self):
        self.camera.background_color = WHITE

        square = Square(color=BLUE, fill_opacity=0.5)
        circle = Circle(color=RED, fill_opacity=0.5)

        self.play(Create(square))
        self.wait()
        self.play(Transform(square, circle))
        self.wait()

Find other ways to customize Manim in the configuration documentation.
Write Mathematical Equations with a Moving Frame
You can create animations that write mathematical equations using the MathTex class:
from manim import *

class WriteEquation(Scene):
    def construct(self):
        equation = MathTex(r"e^{i\pi} + 1 = 0")

        self.play(Write(equation))
        self.wait()

This code demonstrates writing mathematical equations using LaTeX.

MathTex: Creates mathematical equations using LaTeX notation (the r prefix indicates a raw string for LaTeX commands)
Write: Animation that makes text appear as if being written by hand

Or show step-by-step solutions to equations:
from manim import *

class EquationSteps(Scene):
    def construct(self):
        step1 = MathTex(r"2x + 5 = 13")
        step2 = MathTex(r"2x = 8")
        step3 = MathTex(r"x = 4")

        self.play(Write(step1))
        self.wait()
        self.play(Transform(step1, step2))
        self.wait()
        self.play(Transform(step1, step3))
        self.wait()

Let’s break down this code:

Three equation stages: Separate MathTex objects for each step of the solution
Progressive transformation: Transform morphs the first equation through intermediate and final stages
Timed pauses: wait() creates natural breaks between steps for a teaching pace

You can also highlight specific parts of equations using frames:
from manim import *

class MovingFrame(Scene):
    def construct(self):
        # Write equations
        equation = MathTex("2x^2-5x+2", "=", "(x-2)(2x-1)")

        # Create animation
        self.play(Write(equation))

        # Add moving frames
        framebox1 = SurroundingRectangle(equation[0], buff=.1)
        framebox2 = SurroundingRectangle(equation[2], buff=.1)

        # Create animations
        self.play(Create(framebox1))

        self.wait()
        # Replace frame 1 with frame 2
        self.play(ReplacementTransform(framebox1, framebox2))

        self.wait()

In this code, we use the following elements:

MathTex with multiple arguments: Breaks the equation into separate parts that can be individually accessed and highlighted
SurroundingRectangle(mobject, buff): Creates a box around an equation part (buff=0.1 sets the space between the box and the content)
ReplacementTransform: Smoothly moves and morphs one frame into another, replacing the first frame with the second

Moving and Zooming Camera
You can move the camera and choose which part of an equation to zoom in on by using a class that inherits from MovingCameraScene:
from manim import *

class MovingCamera(MovingCameraScene):
    def construct(self):
        equation = MathTex(
            r"\frac{d}{dx}(x^2) = 2x"
        )

        self.play(Write(equation))
        self.wait()

        # Zoom in on the derivative
        self.play(
            self.camera.frame.animate.scale(0.5).move_to(equation[0])
        )
        self.wait()

This code shows how to zoom in on specific parts of your animation.

MovingCameraScene: A special scene type that allows the camera to move and zoom
self.camera.frame: Represents the camera’s view that you can move and zoom
animate.scale(0.5): Zooms in by making the camera view 50% smaller (objects appear twice as big)
move_to(equation[0]): Centers the camera on the first part of the equation

Graph
You can use Manim to create annotated graphs:
from manim import *

class Graph(Scene):
    def construct(self):
        axes = Axes(
            x_range=[-3, 3, 1],
            y_range=[-5, 5, 1],
            x_length=6,
            y_length=6,
        )

        # Add labels
        axes_labels = axes.get_axis_labels(x_label="x", y_label="f(x)")

        # Create a function graph
        graph = axes.plot(lambda x: x**2, color=BLUE)
        graph_label = axes.get_graph_label(graph, label="x^2")

        self.add(axes, axes_labels)
        self.play(Create(graph))
        self.play(Write(graph_label))
        self.wait()

This code shows how to create mathematical function graphs with labeled axes.

Axes: Creates a coordinate grid with specified ranges (e.g., x from -3 to 3, y from -5 to 5) and sizes
get_axis_labels: Adds “x” and “f(x)” labels to the coordinate axes
plot(lambda x: x**2): Plots a mathematical function (here, x squared) as a curve on the axes
get_graph_label: Adds a label showing the function’s equation next to the plotted curve
self.add: Instantly displays objects on screen without animating them in

If you want to get an image of the last frame of a scene, add -s to the command:
manim -p -qh -s more.py Graph

Without -s, the same command renders the full animation as a video:
manim -p -qh more.py Graph

Move Objects Together
You can use VGroup to group different Manim objects and move them together:
from manim import *

class MoveObjectsTogether(Scene):
    def construct(self):
        square = Square(color=BLUE)
        circle = Circle(color=RED)

        # Group objects
        group = VGroup(square, circle)
        group.arrange(RIGHT, buff=1)

        self.play(Create(group))
        self.wait()

        # Move the entire group
        self.play(group.animate.shift(UP * 2))
        self.wait()

This code shows how to group objects and move them together as one unit.

VGroup: Groups multiple objects together so you can control them all at once (like selecting multiple items)
arrange(RIGHT, buff=1): Lines up the objects horizontally from left to right with 1 unit of space between them
shift(UP * 2): Moves the entire group upward by 2 units (multiplying by 2 makes it move twice as far)

You can also create multiple groups and manipulate them independently or together:
from manim import *

class GroupCircles(Scene):
    def construct(self):
        # Create circles
        circle_green = Circle(color=GREEN)
        circle_blue = Circle(color=BLUE)
        circle_red = Circle(color=RED)

        # Set initial positions
        circle_green.shift(LEFT)
        circle_blue.shift(RIGHT)

        # Create 2 different groups
        gr = VGroup(circle_green, circle_red)
        gr2 = VGroup(circle_blue)
        self.add(gr, gr2)
        self.wait()

        # Shift 2 groups down
        self.play((gr + gr2).animate.shift(DOWN))

        # Move only 1 group
        self.play(gr.animate.shift(RIGHT))
        self.play(gr.animate.shift(UP))

        # Shift 2 groups to the right
        self.play((gr + gr2).animate.shift(RIGHT))
        self.play(circle_red.animate.shift(RIGHT))
        self.wait()

Trace Path
You can use TracedPath to create a trace of a moving object:
from manim import *

class TracePath(Scene):
    def construct(self):
        dot = Dot(color=RED)

        # Create traced path
        path = TracedPath(dot.get_center, stroke_color=BLUE, stroke_width=4)
        self.add(path, dot)

        # Move the dot in a circular pattern
        self.play(
            MoveAlongPath(dot, Circle(radius=2)),
            rate_func=linear,
            run_time=4
        )
        self.wait()

This code shows how to create a trail that follows a moving object.

Dot: A small circular point that you can move around
TracedPath(dot.get_center): Creates a line that draws itself following the dot’s path (like a pen trail)
get_center: Gets the current position of the dot so TracedPath knows where to draw
MoveAlongPath(dot, Circle(radius=2)): Moves the dot along a circular path with radius 2
rate_func=linear: Makes the dot move at a constant speed (no speeding up or slowing down)
run_time=4: The animation takes 4 seconds to complete
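To make the role of rate_func more concrete, here is a plain-Python sketch of the idea (it does not use Manim). A rate function maps the elapsed fraction of the animation to progress along the path. The helper names and the eased function are hypothetical and chosen only for comparison; the circle parameterization is also an assumption for illustration.

```python
import math

def linear(t: float) -> float:
    """Constant speed: progress equals the elapsed fraction of the animation."""
    return t

def ease_in_out(t: float) -> float:
    """Hypothetical eased rate function (smoothstep), shown only for contrast."""
    return t * t * (3 - 2 * t)

def point_on_circle(progress: float, radius: float = 2.0) -> tuple[float, float]:
    """Point reached after covering `progress` (0 to 1) of a full circle."""
    angle = 2 * math.pi * progress
    return (radius * math.cos(angle), radius * math.sin(angle))

run_time = 4.0
for seconds in (0.0, 1.0, 2.0, 3.0, 4.0):
    t = seconds / run_time
    x, y = point_on_circle(linear(t))
    print(f"t={seconds:.0f}s  linear progress={linear(t):.2f}  "
          f"eased progress={ease_in_out(t):.2f}  position=({x:.2f}, {y:.2f})")
```

With the linear function, each second covers exactly a quarter of the circle. The eased variant covers less ground at the start and end and more in the middle.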

For a more complex example, you can create a rolling circle that traces a cycloid pattern:
from manim import *

class RollingCircleTrace(Scene):
    def construct(self):
        # Create circle and dot
        circ = Circle(color=BLUE).shift(4*LEFT)
        dot = Dot(color=BLUE).move_to(circ.get_start())

        # Group dot and circle
        rolling_circle = VGroup(circ, dot)
        trace = TracedPath(circ.get_start)

        # Rotate the circle
        rolling_circle.add_updater(lambda m: m.rotate(-0.3))

        # Add trace and rolling circle to the scene
        self.add(trace, rolling_circle)

        # Shift the circle to 8*RIGHT
        self.play(rolling_circle.animate.shift(8*RIGHT), run_time=4, rate_func=linear)

This code creates a rolling wheel animation that draws a curved pattern as it moves.

Setup: Creates a blue circle at the left side and places a blue dot at the circle’s edge
Grouping: Uses VGroup to combine the circle and dot so they move together
Path tracking: TracedPath watches the circle’s starting point and draws a line following it
Continuous rotation: add_updater makes the circle rotate automatically every frame, creating the rolling effect
Rolling motion: As the group shifts right, the rotation updater makes it look like a wheel rolling across the screen
Cycloid pattern: The traced path creates a wave-like curve showing the mathematical path a point on a rolling wheel makes
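For reference, the idealized cycloid has a simple parametric form: a point on the rim of a wheel of radius r that has rolled through angle t sits at x = r(t - sin t), y = r(1 - cos t). The sketch below computes a few points in plain Python. It is an illustration of the curve itself, not of Manim's internals, and the animation above only approximates it because its rotation speed and horizontal shift are set independently.

```python
import math

def cycloid_point(t: float, radius: float = 1.0) -> tuple[float, float]:
    """Position of a rim point after the wheel has rolled through angle t (radians)."""
    x = radius * (t - math.sin(t))
    y = radius * (1 - math.cos(t))
    return (x, y)

# Sample one full revolution of the wheel at nine evenly spaced angles
points = [cycloid_point(2 * math.pi * k / 8) for k in range(9)]
for x, y in points:
    print(f"({x:.3f}, {y:.3f})")
```

The point starts on the ground, reaches its highest position (twice the radius) halfway through the revolution, and touches the ground again after one full turn.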

Recap
Congratulations! You have just learned how to use Manim and what it can do. To recap, there are three kinds of objects that Manim provides:

Mobjects: Objects that can be displayed on the screen, such as Circle, Square, Matrix, Angle, etc.
Scenes: Canvas for animations such as Scene, MovingCameraScene, etc.
Animations: Animations applied to Mobjects such as Write, Create, GrowFromCenter, Transform, etc.

There is so much more Manim can do that I cannot cover here. The best way to learn is through practice, so I encourage you to try the examples in this article and check out Manim’s tutorial.

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


Newsletter #247: whenever: Simple Python Timezone Conversion

Leave a Comment / Newsletter Archive / Khuyen Tran

🤝 COLLABORATION

Build Safer APIs with Buf – Free Workshop
Building APIs is simple. Scaling them across teams and systems isn’t. Ensuring consistency, compatibility, and reliability quickly becomes a challenge as projects grow.
Buf provides a toolkit that makes working with Protocol Buffers faster, safer, and more consistent.
Join Buf for a live, one-hour workshop on building safer, more consistent APIs.
When: Nov 19, 2025 • 9 AM PDT | 12 PM EDT | 5 PM BST
What you’ll learn:

How Protobuf makes API development safer and simpler
API design best practices for real-world systems
How to extend Protobuf to data pipelines and streaming systems

→ Register for the workshop

📅 Today’s Picks

whenever: Simple Python Timezone Conversion

Problem
Adding 8 hours to 10pm shouldn’t give you the wrong morning time, but with Python’s datetime, it can.
The standard library fails during DST transitions, returning incorrect offsets when clocks change for daylight saving.
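The pitfall can be reproduced with the standard library alone. This sketch assumes the Europe/Paris zone and the 2023 spring transition as an illustrative case (neither is from the original newsletter), and it needs Python 3.9+ with time zone data available.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

# 10 pm on the night before clocks jump forward (02:00 -> 03:00 on 2023-03-26)
start = datetime(2023, 3, 25, 22, 0, tzinfo=paris)

# Aware datetime + timedelta works on the wall clock and ignores the DST jump
wall_clock = start + timedelta(hours=8)
print(wall_clock)  # wall-clock result

# Measured in UTC, less than 8 real hours separate the two moments
elapsed = wall_clock.astimezone(timezone.utc) - start.astimezone(timezone.utc)
print(elapsed)

# One workaround: do the arithmetic in UTC, then convert back to local time
exact = (start.astimezone(timezone.utc) + timedelta(hours=8)).astimezone(paris)
print(exact)
```

The wall-clock result lands on 06:00 even though only 7 hours have really passed, while the UTC round trip lands on 07:00.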
Solution
Whenever provides simple, explicit timezone conversion methods with clear semantics.
Key benefits:

DST-safe arithmetic with automatic offset adjustment
Type safety prevents naive/aware datetime bugs
Clean timezone conversions with .to_tz()
Nanosecond precision for deltas and timestamps
Pydantic integration for serialization

🧪 Run code

⭐ View GitHub

Build Readable Scatter Plots with adjustText Auto-Positioning

Problem
Text labels in matplotlib scatter plots frequently overlap with each other and data points, creating unreadable visualizations.
Manually repositioning each label to avoid overlaps is tedious and time-consuming.
Solution
adjustText automatically repositions labels to eliminate overlaps while connecting them to data points with arrows.
All you need is to collect your text objects and call adjust_text() with optional arrow styling.

🧪 Run code

⭐ View GitHub

📢 ANNOUNCEMENTS

Featured on LeanPub: Production-Ready Data Science
My book Production-Ready Data Science was featured on the LeanPub home page!
LeanPub is a leading platform for publishing and selling self-published technical books, so it’s truly an honor to see my work highlighted there.
The book shares everything I’ve learned about turning data science prototypes into reliable, production-ready systems, from managing dependencies to automating workflows.
Thank you to everyone who has purchased or shared it. Your support means everything.
The book is currently on sale for 58% off until November 16.

→ Get Your Copy Now (58% Off)

☕️ Weekly Finds

featuretools
[ML]
– An open source python library for automated feature engineering

datachain
[Data Processing]
– ETL, Analytics, Versioning for Unstructured Data – AI-data warehouse to enrich, transform and analyze data from cloud storages

logfire
[Python Utils]
– Uncomplicated Observability for Python and beyond – an observability platform built on OpenTelemetry from the team behind Pydantic


Newsletter #246: Faster Polars Queries with Programmatic Expressions

Leave a Comment / Newsletter Archive / Khuyen Tran


📅 Today’s Picks

Faster Polars Queries with Programmatic Expressions

Problem
When you apply similar transformations in a for loop, each Polars with_columns() call runs sequentially.
This prevents the optimizer from seeing the full computation plan.
Solution
Instead, generate all Polars expressions programmatically before applying them together.
This enables Polars to:

See the complete computation plan upfront
Optimize across all expressions simultaneously
Parallelize operations across CPU cores

📖 View Full Article

🧪 Run code

⭐ View GitHub

itertools.chain: Merge Lists Without Intermediate Copies

Problem
Standard list merging with extend() or concatenation creates intermediate copies.
This memory overhead becomes significant when processing large lists.
Solution
itertools.chain() lazily merges multiple iterables without creating intermediate lists.
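A short sketch with made-up lists shows the difference in practice. The variable names are illustrative only.

```python
from itertools import chain

batch_a = [1, 2, 3]
batch_b = [4, 5]
batch_c = [6]

# chain() returns a lazy iterator; no combined list is built here
merged = chain(batch_a, batch_b, batch_c)

# Items are pulled from each source list one at a time as sum() consumes them
total = sum(merged)
print(total)

# chain.from_iterable() does the same when the inputs are already in a list
flattened = list(chain.from_iterable([batch_a, batch_b, batch_c]))
print(flattened)
```

Note that a chain object can be consumed only once. Wrap it in list() only when you actually need the merged result in memory.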

📖 View Full Article

🧪 Run code

☕️ Weekly Finds

fiftyone
[ML]
– Open-source tool for building high-quality datasets and computer vision models

llama-stack
[LLM]
– Composable building blocks to build Llama Apps with unified API for inference, RAG, agents, and more

grip
[Python Utils]
– Preview GitHub README.md files locally before committing them using GitHub’s markdown API


Newsletter #245: PySpark: Avoid Double Conversions with applyInArrow

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

PySpark: Avoid Double Conversions with applyInArrow

Problem
applyInPandas lets you apply Pandas functions in PySpark by converting data from Arrow→Pandas→Arrow for each operation.
This double conversion adds serialization overhead that slows down your transformations.
Solution
applyInArrow (introduced in PySpark 4.0.0) works directly with PyArrow tables, eliminating the Pandas conversion step entirely.
This keeps data in Arrow’s columnar format throughout the pipeline, making operations significantly faster.
Trade-off: PyArrow’s syntax is less intuitive than Pandas, but it’s worth it if you’re processing large datasets where performance matters.

📖 View Full Article

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

causal-learn
[ML]
– Python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms

POT
[ML]
– Python Optimal Transport library providing solvers for optimization problems related to signal, image processing and machine learning

qdrant
[MLOps]
– High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI


Newsletter #244: Handle Large Data with Polars Streaming Mode

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Handle Large Data with Polars Streaming Mode

Problem
In Polars, the .collect() method executes a lazy query and loads the entire dataset into memory. This works well for smaller data, but once the dataset grows beyond your available RAM, it can easily crash your process.
Solution
Add engine="streaming" to .collect() to process large datasets in small batches without running out of memory.
How it works:

Breaks the dataset into smaller, memory-friendly chunks
Processes one batch at a time while freeing memory as it goes
Combines all partial results into a single DataFrame

📖 View Full Article

🧪 Run code

⭐ View GitHub

Build Professional Python Packages with UV --package

Problem
Python packages turn your code into reusable modules you can share across projects.
But building them requires complex setup with setuptools, managing build systems, and understanding distribution mechanics.
Solution
UV, a fast Python package installer and resolver, reduces the entire process to two simple steps:

uv init --package sets up your package structure instantly
uv build and uv publish create the distribution and upload it to PyPI

📖 Learn more

⭐ View GitHub

☕️ Weekly Finds

whenever
[Python Utils]
– Modern datetime library for Python that ensures correct and type-checked datetime manipulations. It is DST-safe and way faster than standard datetime libraries.

lancedb
[MLOps]
– Developer-friendly, embedded retrieval database for AI/ML applications. The ultimate multimodal data platform designed for fast, scalable, and production-ready vector search.

grip
[Python Utils]
– Preview GitHub README.md files locally before committing them. A command-line server that uses GitHub’s Markdown API to render local readme files.

Newsletter #243: Turn Your ML Tests Into Plain English with Behave

Leave a Comment / Newsletter Archive / Khuyen Tran

📅 Today’s Picks

Turn Your ML Tests Into Plain English with Behave

Problem
Unit testing matters in data science, but writing tests that business stakeholders can actually understand is a challenge.
If they can’t read the tests, they can’t confirm the logic matches business expectations.
Solution
Behave turns test cases into plain-English specifications using the Given/When/Then format.
How to use Behave for readable tests:

Write .feature files with Given/When/Then syntax
Implement steps in Python using @given, @when, @then decorators
Run “behave” to execute tests

This lets technical and business teams stay aligned without confusion.

📖 View Full Article

🧪 Run code

⭐ View GitHub

Build Powerful Data Pipelines with DuckDB + pandas

Problem
Pandas is great for data cleaning and feature engineering, while SQL excels at complex aggregations.
But moving data from pandas to a database and back can be tedious.
Solution
DuckDB solves this by letting you run SQL directly on pandas DataFrames and return the results back into pandas for further analysis.

📖 View Full Article

🧪 Run code

⭐ View GitHub

☕️ Weekly Finds

feast
[MLOps]
– Open source feature store for machine learning that manages existing infrastructure to productionize ML models with fast data consistency and leakage prevention

git-who
[Python Utils]
– CLI tool for industrial-scale git blaming that shows who is responsible for entire components or subsystems in your codebase, not just individual lines

organize
[Python Utils]
– File management automation tool for safe moving, renaming, copying files with conflict resolution, duplicate detection, and Exif tag extraction

Behave: Write Readable ML Tests with Behavior-Driven Development

Leave a Comment / Blog, MLOps, Python Utilities / Khuyen Tran

Table of Contents

Motivation
What is behave?
Invariance Testing
Directional Testing
Minimum Functionality Testing
Behave’s Trade-offs
Conclusion

Motivation
Imagine you create an ML model to predict customer sentiment based on reviews. Upon deploying it, you realize that the model incorrectly labels certain positive reviews as negative when they’re rephrased using negative words.

This is just one example of how an extremely accurate ML model can fail without proper testing. Thus, testing your model for accuracy and reliability is crucial before deployment.
But how do you test your ML model? One straightforward approach is to write a unit test:
from textblob import TextBlob

def test_sentiment_the_same_after_paraphrasing():
    sent = "The hotel room was great! It was spacious, clean and had a nice view of the city."
    sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had a decent view of the city."

    sentiment_original = TextBlob(sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(sent_paraphrased).sentiment.polarity

    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative

This approach works but can be challenging for non-technical or business participants to understand. Wouldn’t it be nice if you could incorporate project objectives and goals into your tests, expressed in natural language?
Feature: Sentiment Analysis
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment

That is when behave comes in handy.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

What is behave?
behave is a Python framework for behavior-driven development (BDD). BDD is a software development methodology that:

Emphasizes collaboration between stakeholders (such as business analysts, developers, and testers)
Enables users to define requirements and specifications for a software application

Since behave provides a common language and format for expressing requirements and specifications, it can be ideal for defining and validating the behavior of machine learning models.
To install behave, type:
pip install behave

Let’s use behave to perform various tests on machine learning models.

📚 For comprehensive unit testing strategies and best practices, check out Production-Ready Data Science.

Invariance Testing
Invariance testing tests whether an ML model produces consistent results under different conditions.
An example of invariance testing involves verifying if a model is invariant to paraphrasing. An ideal model should maintain consistent sentiment scores even when a positive review is rephrased using negative words like “wasn’t bad” instead of “was good.”

Feature File
To use behave for invariance testing, create a directory called features. Under that directory, create a file called invariant_test_sentiment.feature.
└── features/
    └── invariant_test_sentiment.feature

Within the invariant_test_sentiment.feature file, we will specify the project requirements:
Feature: Sentiment Analysis
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment

The “Given,” “When,” and “Then” parts of this file present the actual steps that will be executed by behave during the test.
The Feature section serves as living documentation to provide context but does not trigger test execution.
Python Step Implementation
To implement the steps used in the scenarios with Python, start with creating the features/steps directory and a file called invariant_test_sentiment.py within it:
└── features/
    ├── invariant_test_sentiment.feature
    └── steps/
        └── invariant_test_sentiment.py

The invariant_test_sentiment.py file contains the following code, which tests whether the sentiment produced by the TextBlob model is consistent between the original text and its paraphrased version.
from behave import given, then, when
from textblob import TextBlob

@given("a text")
def step_given_positive_sentiment(context):
    context.sent = "The hotel room was great! It was spacious, clean and had a nice view of the city."

@when("the text is paraphrased")
def step_when_paraphrased(context):
    context.sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had a decent view of the city."

@then("both text should have the same sentiment")
def step_then_sentiment_analysis(context):
    # Get sentiment of each sentence
    sentiment_original = TextBlob(context.sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(context.sent_paraphrased).sentiment.polarity

    # Print sentiment
    print(f"Sentiment of the original text: {sentiment_original:.2f}")
    print(f"Sentiment of the paraphrased sentence: {sentiment_paraphrased:.2f}")

    # Assert that both sentences have the same sentiment
    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative

Explanation of the code above:

The steps are identified using decorators matching the feature’s predicate: given, when, and then.
The decorator accepts a string containing the rest of the phrase in the matching scenario step.
The context variable allows you to share values between steps.
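
To see how a shared context carries values from one step to the next, here is a small sketch that runs without behave. It uses types.SimpleNamespace as a stand-in for behave's context object (an assumption for illustration; the real object does more), and the step functions are plain functions without decorators.

```python
from types import SimpleNamespace

# Stand-in for the context object that behave passes to every step
context = SimpleNamespace()


def given_a_text(context):
    # Store a value on the context for later steps
    context.sent = "The hotel room was great!"


def when_paraphrased(context):
    # Add a second value; the first one is still attached to the context
    context.sent_paraphrased = "The hotel room wasn't bad."


def then_report(context):
    # Values set in earlier steps are visible here
    return context.sent, context.sent_paraphrased


# Run the steps in order, passing the same context object each time
for step in (given_a_text, when_paraphrased):
    step(context)

original, paraphrased = then_report(context)
print(original)
print(paraphrased)
```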

Run the Test
To run the invariant_test_sentiment.feature test, type the following command:
behave features/invariant_test_sentiment.feature

Output:
Feature: Sentiment Analysis # features/invariant_test_sentiment.feature:1
As a data scientist
I want to ensure that my model is invariant to paraphrasing
So that my model can produce consistent results in real-world scenarios.
Scenario: Paraphrased text
Given a text
When the text is paraphrased
Then both text should have the same sentiment
Traceback (most recent call last):
assert both_positive or both_negative
AssertionError

Captured stdout:
Sentiment of the original text: 0.66
Sentiment of the paraphrased sentence: -0.38

Failing scenarios:
features/invariant_test_sentiment.feature:6 Paraphrased text

0 features passed, 1 failed, 0 skipped
0 scenarios passed, 1 failed, 0 skipped
2 steps passed, 1 failed, 0 skipped, 0 undefined

The output shows that the first two steps passed and the last step failed, indicating that the model is affected by paraphrasing.
Directional Testing
Directional testing is a statistical method used to assess whether the impact of an independent variable on a dependent variable is in a particular direction, either positive or negative.
An example of directional testing is to check whether the presence of a specific word has a positive or negative effect on the sentiment score of a given text.

To use behave for directional testing, we will create two files directional_test_sentiment.feature and directional_test_sentiment.py.
└── features/
    ├── directional_test_sentiment.feature
    └── steps/
        └── directional_test_sentiment.py

Feature File
The code in directional_test_sentiment.feature specifies the requirements of the project as follows:
Feature: Sentiment Analysis with Specific Word
  As a data scientist
  I want to ensure that the presence of a specific word
  has a positive or negative effect on the sentiment score of a text

  Scenario: Sentiment analysis with specific word
    Given a sentence
    And the same sentence with the addition of the word 'awesome'
    When I input the new sentence into the model
    Then the sentiment score should increase

Notice the “And” step in the scenario. Because the preceding step starts with “Given,” behave treats “And” as another “Given” step, so it is implemented with the @given decorator.
Python Step Implementation
The code in directional_test_sentiment.py implements the test scenario, which checks whether the presence of the word “awesome” positively affects the sentiment score generated by the TextBlob model.
from behave import given, then, when
from textblob import TextBlob

@given("a sentence")
def step_given_positive_word(context):
    context.sent = "I love this product"

@given("the same sentence with the addition of the word '{word}'")
def step_given_a_positive_word(context, word):
    context.new_sent = f"I love this {word} product"

@when("I input the new sentence into the model")
def step_when_use_model(context):
    context.sentiment_score = TextBlob(context.sent).sentiment.polarity
    context.adjusted_score = TextBlob(context.new_sent).sentiment.polarity

@then("the sentiment score should increase")
def step_then_positive(context):
    assert context.adjusted_score > context.sentiment_score

The second step uses the parameter syntax {word}. When the .feature file is run, the value specified for {word} in the scenario is automatically passed to the corresponding step function.
This means that if the scenario states that the same sentence should include the word “awesome,” behave will automatically replace {word} with “awesome.”

This parameterization is useful when you want to try different values for {word}: you only edit the .feature file, and the .py file stays unchanged.
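
The placeholder matching can be approximated with the standard library. The sketch below is a rough imitation for illustration only, not behave's actual matcher, and match_step is a hypothetical helper name.

```python
import re


def match_step(pattern: str, step_text: str):
    """Return the placeholder values if step_text fits pattern, else None."""
    # Escape the literal text, then turn each escaped {name} into a named group
    regex = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>.+)", re.escape(pattern))
    match = re.fullmatch(regex, step_text)
    return match.groupdict() if match else None


pattern = "the same sentence with the addition of the word '{word}'"

params = match_step(pattern, "the same sentence with the addition of the word 'awesome'")
print(params)  # {'word': 'awesome'}

# A step that does not fit the pattern yields no parameters
print(match_step(pattern, "a sentence"))  # None
```

The extracted dictionary plays the role of the keyword arguments that the step function receives, such as word="awesome".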

Run the Test
behave features/directional_test_sentiment.feature

Output:
Feature: Sentiment Analysis with Specific Word
As a data scientist
I want to ensure that the presence of a specific word has a positive or negative effect on the sentiment score of a text
Scenario: Sentiment analysis with specific word
Given a sentence
And the same sentence with the addition of the word 'awesome'
When I input the new sentence into the model
Then the sentiment score should increase

1 feature passed, 0 failed, 0 skipped
1 scenario passed, 0 failed, 0 skipped
4 steps passed, 0 failed, 0 skipped, 0 undefined

Since all the steps passed, we can infer that the sentiment score increases due to the new word’s presence.
Minimum Functionality Testing
Minimum functionality testing is a type of testing that verifies if the system or product meets the minimum requirements and is functional for its intended use.
One example of minimum functionality testing is to check whether the model can handle different types of inputs, such as numerical, categorical, or textual data. To test with diverse inputs, generate test data using Faker for more comprehensive validation.

To use minimum functionality testing for input validation, create two files minimum_func_test_input.feature and minimum_func_test_input.py.
└── features/
    ├── minimum_func_test_input.feature
    └── steps/
        └── minimum_func_test_input.py

Feature File
The code in minimum_func_test_input.feature specifies the project requirements as follows:
Feature: Test my_ml_model

  Scenario: Test integer input
    Given I have an integer input of 42
    When I run the model
    Then the output should be an array of one number

  Scenario: Test float input
    Given I have a float input of 3.14
    When I run the model
    Then the output should be an array of one number

  Scenario: Test list input
    Given I have a list input of [1, 2, 3]
    When I run the model
    Then the output should be an array of three numbers

Python Step Implementation
The code in minimum_func_test_input.py implements the requirements, checking if the output generated by predict for a specific input type meets the expectations.
import ast
from typing import Union

import numpy as np
from behave import given, then, when
from sklearn.linear_model import LinearRegression

def predict(input_data: Union[int, float, str, list]):
    """Fit a toy linear model (y = 2x) and predict for the input data"""

    # Reshape the input into a 2-D column, as scikit-learn expects
    if isinstance(input_data, (int, float, list)):
        input_array = np.array(input_data).reshape(-1, 1)
    else:
        raise ValueError("Input type not supported")

    # Create a linear regression model
    model = LinearRegression()

    # Train the model on a sample dataset
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([2, 4, 6, 8, 10])
    model.fit(X, y)

    # Predict the output using the input array
    return model.predict(input_array)

@given("I have an integer input of {input_value}")
def step_given_integer_input(context, input_value):
    context.input_value = int(input_value)

@given("I have a float input of {input_value}")
def step_given_float_input(context, input_value):
    context.input_value = float(input_value)

@given("I have a list input of {input_value}")
def step_given_list_input(context, input_value):
    # literal_eval parses "[1, 2, 3]" without executing arbitrary code
    context.input_value = ast.literal_eval(input_value)

@when("I run the model")
def step_when_run_model(context):
    context.output = predict(context.input_value)

@then("the output should be an array of one number")
def step_then_check_one_number(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 1

@then("the output should be an array of three numbers")
def step_then_check_three_numbers(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 3

Run the Test
behave features/minimum_func_test_input.feature

Output:
Feature: Test my_ml_model

Scenario: Test integer input
Given I have an integer input of 42
When I run the model
Then the output should be an array of one number

Scenario: Test float input
Given I have a float input of 3.14
When I run the model
Then the output should be an array of one number

Scenario: Test list input
Given I have a list input of [1, 2, 3]
When I run the model
Then the output should be an array of three numbers

1 feature passed, 0 failed, 0 skipped
3 scenarios passed, 0 failed, 0 skipped
9 steps passed, 0 failed, 0 skipped, 0 undefined

Since all the steps passed, we can conclude that the model outputs match our expectations.
Behave’s Trade-offs
This section will outline some drawbacks of using behave compared to pytest, and explain why it may still be worth considering the tool.
Learning Curve
Using Behavior-Driven Development (BDD) with behave may involve a steeper learning curve than the more traditional testing approach used by pytest.

Counter argument: The focus on collaboration in BDD can lead to better alignment between business requirements and software development, resulting in a more efficient development process overall.

Slower performance
behave tests can be slower than pytest tests because behave must parse the feature files and map them to step definitions before running the tests.

Counter argument: behave’s focus on well-defined steps can lead to tests that are easier to understand and modify, reducing the overall effort required for test maintenance.

Less flexibility
behave is more rigid in its syntax, while pytest allows more flexibility in defining tests and fixtures.

Counter argument: behave’s rigid structure can help ensure consistency and readability across tests, making them easier to understand and maintain over time.

Conclusion
You’ve learned how to use behave to write readable tests for a data science project.
Key takeaways:
How behave works:

Feature files serve as living documentation: They communicate test intent in natural language while driving actual test execution
Step decorators bridge features and code: @given, @when, and @then decorators map feature file steps to Python test implementations

Three essential test types:

Invariance testing: Ensures your model produces consistent results when inputs are paraphrased or slightly modified
Directional testing: Validates that specific changes have the expected positive or negative impact on predictions
Minimum functionality testing: Verifies your model handles different input types correctly

Despite trade-offs like a steeper learning curve and slower performance compared to pytest, behave excels where it matters most for ML testing: making model behavior transparent and testable by both technical and non-technical team members.
Related Tutorials

Configuration Management: Hydra for Python Configuration for managing test specifications with YAML-like syntax
