Newsletter Archive

Automated newsletter archive from Klaviyo campaigns

Newsletter #300: Browser-Use: Automate Any Browser Task with Plain English

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

mem0: Give Your AI Memory Without Building a Vector DB

Problem
When you build an AI app using the OpenAI or Anthropic API, every conversation starts from scratch with no built-in memory between sessions.
Adding memory yourself with a vector database like ChromaDB requires writing custom extraction, deduplication, and scoping logic on top of the storage layer.
Solution
mem0 handles all of that in a single function call. Just pass in conversations and retrieve relevant memories when needed.
Key features:

Automatic fact extraction from raw conversations via memory.add()
Cross-session retrieval with memory.search() in any future conversation
Automatic conflict resolution when user preferences change over time

🧪 Run code

Browser-Use: Automate Any Browser Task with Plain English

Problem
Most data collection tasks go beyond simple extraction. You need to log in, apply filters, navigate pagination, and then gather results.
Selenium can handle navigation, but it requires maintaining CSS selectors that can easily break when a site changes.
Solution
Browser Use simplifies the entire process. Describe what you want, and it navigates, clicks, types, and extracts automatically.
Key features:

Natural language task descriptions
Works with GPT-4, Claude, Gemini, and local models via Ollama
Structured output with Pydantic models

📚 Latest Deep Dives

How to Test GitHub Actions Locally with act

Debugging GitHub Actions is painfully slow. Every YAML change requires a commit, a push, and a 2-5 minute wait just to find out you missed a colon.
This article introduces act, a CLI tool that runs GitHub Actions workflows locally in Docker containers.
You’ll set up an ML testing pipeline and learn to pass secrets, run specific jobs, and validate workflows in seconds.

📖 View Full Article

☕️ Weekly Finds

open-webui
[LLM]
– A self-hosted AI platform with built-in RAG, model builder, and support for Ollama and OpenAI-compatible APIs

ragflow
[RAG]
– An open-source RAG engine with deep document understanding for unstructured data in any format

vllm
[MLOps]
– A high-throughput, memory-efficient inference and serving engine for LLMs with multi-GPU support

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #299: latexify_py: Turn Python Functions into LaTeX with One Decorator

Grab your coffee. Here are this week’s highlights.

🤝 COLLABORATION

Give Your AI Agent Live Web Access with Bright Data MCP
With basic search APIs, agents often miss critical context from sources like social platforms, forums, news, and answer engines. That leads to incomplete or outdated responses.
Bright Data’s MCP server unifies all web data access into one interface your AI agent can use directly.
With Bright Data MCP, your AI agent can access:

Search engines (Google, Bing, more)
Social media (Twitter/X, Reddit, Instagram, TikTok)
Web archives (historical web data, years deep)
Answer engines (ChatGPT, Perplexity, Gemini)

All through one connection.

🔗 Try Bright Data MCP

📅 Today’s Picks

latexify_py: Turn Python Functions into LaTeX with One Decorator

Problem
Non-programmers cannot easily read Python logic, but manually converting it to LaTeX is slow, and the result quickly becomes outdated as the code changes.
Solution
latexify_py solves this with a single decorator, generating LaTeX directly from your function so the math stays readable and always in sync with the code.
Key capabilities:

Three decorators for different outputs: expressions, full equations, or pseudocode
Displays rendered LaTeX directly in Jupyter cells
Functions still work normally when called

📖 View Full Article

🧪 Run code

act: Run GitHub Actions Locally with Docker

Problem
GitHub Actions has no local execution mode. You can’t test a step, inspect an environment variable, or reproduce a runner-specific failure on your own machine.
Each change requires a commit and a wait for the cloud runner. A small mistake like a missing secret means starting the loop again.
Solution
With act, you can execute workflows locally using Docker. Failures surface immediately, making it easier to iterate and commit only when the workflow passes.

☕️ Weekly Finds

json_repair
[LLM]
– A Python module to repair invalid JSON, especially from LLM outputs, with schema validation support

pyrsistent
[Python Utilities]
– Persistent, immutable, and functional data structures for Python

prek
[Code Quality]
– A faster, Rust-based reimagining of pre-commit with monorepo support and parallel hook execution

Newsletter #298: Chronos: Forecast Any Time Series Without Training a Model

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

altimate-code: The Missing AI Layer for Data Engineering Teams

Problem
General AI tools can write SQL and catch obvious mistakes. But they cannot systematically detect anti-patterns, trace lineage, or keep warehouse costs under control.
That gap can lead to inefficient queries, broken dependencies, and hidden compliance risks building up over time.
Solution
I recently tried altimate-code, an open-source agent with 100+ tools purpose-built for data engineers, and built a demo repo to test it.
From a single prompt, it generated a full dbt project with staging, intermediate, and mart layers, added automated tests, and built an interactive dashboard.
What makes it different:

100+ tools that analyze SQL through structural parsing, not text guessing
Works across your stack including Snowflake, BigQuery, Databricks, DuckDB, and more
Model-agnostic. Compatible with OpenAI, Anthropic, Gemini, Ollama, and others

Chronos: Forecast Any Time Series Without Training a Model

Problem
Traditional forecasting requires domain-specific data, feature engineering, and multiple rounds of model tuning.
Solution
Chronos is a family of pretrained time series forecasting models from Amazon Science that deliver zero-shot predictions out of the box.
Simply load a pretrained model and generate forecasts on any time series data, with no fine-tuning required.
If zero-shot accuracy isn’t enough, you can fine-tune on your data with AutoGluon in a few lines.

🧪 Run code

📚 Latest Deep Dives

uv vs pixi: Which Python Environment Manager Should You Use for Data Science?

What if one tool could manage both your Python packages and compiled system libraries?
uv installs Python packages from PyPI, but it doesn’t support compiled C/C++ libraries.
The typical workaround is to install system libraries separately using an OS package manager, then manually align versions with your Python dependencies.
Since these system dependencies aren’t captured in project files, reproducing the environment across machines can be unreliable.
pixi solves this by managing both Python packages from PyPI and compiled system libraries from conda-forge in a single tool.
Quick comparison:

uv: fast, reliable lockfiles, Python-only
conda: system libraries supported, but slower and no lockfiles
pixi: fast, unified, with system libraries, lockfiles, and a built-in task runner

In this article, I compare uv and pixi on a real ML project so you can see how they perform in practice.

📖 View Full Article

☕️ Weekly Finds

timesfm
[Machine Learning]
– Pretrained time series foundation model by Google Research for zero-shot forecasting

darts
[Machine Learning]
– A Python library for user-friendly forecasting and anomaly detection on time series

orbit
[Machine Learning]
– A Python package for Bayesian time series forecasting with probabilistic models under the hood

Newsletter #297: Polars scan_csv: Merge CSVs with Different Schemas in One Call

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

Polars scan_csv: Merge CSVs with Different Schemas in One Call

Problem
Polars’ scan_csv lets you load multiple CSV files lazily, reading data only when needed.
But before v1.39.0, every file had to share the same columns, or you’d get a SchemaError.
Solution
Polars v1.39.0 introduces missing_columns="insert" in scan_csv, allowing you to combine multiple files in one call while null-filling any missing columns.
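
To make the null-filling behavior concrete, here is a minimal plain-Python sketch that uses only the standard library. It is not Polars code and not how Polars implements the feature; the merge_csvs helper and the sample data are hypothetical and only illustrate what "insert missing columns as nulls" means when files have different headers.

```python
import csv
import io


def merge_csvs(sources):
    """Concatenate CSV texts, filling columns a file lacks with None."""
    columns, rows = [], []
    for text in sources:
        reader = csv.DictReader(io.StringIO(text))
        # Collect the unified column list in first-seen order.
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)
    # row.get() returns None for any column that file did not have.
    return [{col: row.get(col) for col in columns} for row in rows]


# Hypothetical sample files: the second one has an extra "region" column.
jan = "id,amount\n1,10\n2,20\n"
feb = "id,amount,region\n3,30,EU\n"

merged = merge_csvs([jan, feb])
for row in merged:
    print(row)
```

Rows from the first file end up with region set to None, while the row from the second file keeps its value. According to the tip above, scan_csv with missing_columns="insert" gives you the same result lazily and in a single call.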

📖 View Full Article

🧪 Run code

Build Professional Python Packages with uv --package

Problem
Python packages turn your code into reusable modules you can share across projects.
But building them requires complex setup with setuptools, managing build systems, and understanding distribution mechanics.
Solution
uv, a fast Python package installer and resolver, reduces the entire process to a few commands:

uv init --package sets up your package structure instantly
uv build and uv publish create the distribution and upload it to PyPI

📖 Learn more

☕️ Weekly Finds

datachain
[Data Processing]
– Process and curate unstructured data from cloud storages using local ML models and Python

label-studio
[Data Processing]
– Open source data labeling and annotation tool with standardized output format for ML workflows

qsv
[Command Line]
– Blazingly fast CSV command-line toolkit for slicing, dicing, and analyzing tabular data

Newsletter #277: Handle Messy Data with RapidFuzz Fuzzy Matching

📅 Today’s Picks

Swap AI Prompts Instantly with MLflow Prompt Registry

Problem
Finding the right prompt often takes experimentation: tweaking wording, adjusting tone, testing different instructions.
But with prompts hardcoded in your codebase, each test requires a code change and redeployment.
Solution
MLflow Prompt Registry solves this with aliases. Your code references an alias like “production” instead of a version number, so you can swap versions without changing it.
Here’s how it works:

Every prompt edit creates a new immutable version with a commit message
Register prompts once, then assign aliases to specific versions
Deploy to different environments by creating aliases like “staging” and “production”
Track full version history with metadata and tags for each prompt
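
The alias idea is easier to see in a toy example. The sketch below is a hypothetical in-memory registry written with the standard library only; the PromptRegistry class and its methods are made up for illustration and are not the MLflow API. It shows the pattern described above: versions are append-only, and application code looks up an alias rather than a version number.

```python
class PromptRegistry:
    """Toy in-memory registry: immutable versions plus movable aliases."""

    def __init__(self):
        self._versions = {}  # name -> list of (template, commit_message)
        self._aliases = {}   # (name, alias) -> 1-based version number

    def register(self, name, template, commit_message):
        versions = self._versions.setdefault(name, [])
        versions.append((template, commit_message))  # never edited afterwards
        return len(versions)

    def set_alias(self, name, alias, version):
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self._aliases[(name, alias)] = version

    def load(self, name, alias):
        version = self._aliases[(name, alias)]
        return self._versions[name][version - 1][0]


registry = PromptRegistry()
v1 = registry.register("summarize", "Summarize: {text}", "initial prompt")
v2 = registry.register(
    "summarize", "Summarize in two sentences: {text}", "tighter output"
)
registry.set_alias("summarize", "production", v1)
registry.set_alias("summarize", "staging", v2)


def summarize_prompt(text):
    # Application code only mentions the alias, never a version number.
    return registry.load("summarize", "production").format(text=text)


before = summarize_prompt("hello")
registry.set_alias("summarize", "production", v2)  # promote v2, no code change
after = summarize_prompt("hello")
print(before)
print(after)
```

Moving the "production" alias from version 1 to version 2 changes what summarize_prompt returns without touching the function itself, which is the swap-without-redeploy behavior the tip describes.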

⭐ View GitHub

🔄 Worth Revisiting

Automate LLM Evaluation at Scale with MLflow make_judge()

Problem
When you ship LLM features without evaluating them, models might hallucinate, violate safety guidelines, or return incorrectly formatted responses.
Manual review doesn’t scale. Reviewers might miss subtle issues when evaluating thousands of outputs, and scoring standards often vary between people.
Solution
MLflow make_judge() applies the same evaluation standards to every output, whether you’re checking 10 or 10,000 responses.
Key capabilities:

Define evaluation criteria once, reuse everywhere
Automatic rationale explaining each judgment
Built-in judges for safety, toxicity, and hallucination detection
Typed outputs that never return unexpected formats

⭐ View GitHub

☕️ Weekly Finds

gspread
[Data Processing]
– Google Sheets Python API for reading, writing, and formatting spreadsheets

zeppelin
[Data Analysis]
– Web-based notebook for interactive data analytics with SQL, Scala, and more

vectorbt
[Data Science]
– Fast engine for backtesting, algorithmic trading, and research in Python

Newsletter #296: Scrapling: Adaptive Web Scraping in Python

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

Scrapling: Adaptive Web Scraping in Python

Problem
Traditional scraping with BeautifulSoup uses hardcoded CSS selectors to find elements on a page.
If the site updates its layout, those selectors no longer match and the scraper ends up returning empty data.
Solution
Instead of relying only on selectors, Scrapling records how elements appear during the initial scrape.
If the site is redesigned later, it can use that stored structure to find the same elements again.
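
Here is a rough sketch of the general idea, using only the standard library. It is not Scrapling's actual algorithm or API: the describe and relocate helpers, the dictionary representation of elements, and the sample page are all assumptions made for illustration. The sketch saves a description of an element from the first scrape, then picks the most similar element on the redesigned page using difflib string similarity.

```python
from difflib import SequenceMatcher


def describe(element):
    """Flatten an element's tag, attributes, and text into one string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element["attrs"].items()))
    return f"{element['tag']} {attrs} {element['text']}"


def relocate(saved, candidates):
    """Return the candidate whose description is most similar to the saved one."""
    target = describe(saved)
    return max(
        candidates,
        key=lambda el: SequenceMatcher(None, target, describe(el)).ratio(),
    )


# Hypothetical element recorded during the initial scrape.
saved = {"tag": "span", "attrs": {"class": "price"}, "text": "$19.99"}

# Hypothetical elements after a redesign: the old "span.price" selector
# no longer matches anything on this page.
redesigned_page = [
    {"tag": "h1", "attrs": {"class": "title"}, "text": "Blue Widget"},
    {"tag": "div", "attrs": {"class": "product-price"}, "text": "$21.49"},
    {"tag": "a", "attrs": {"class": "nav-link"}, "text": "Home"},
]

match = relocate(saved, redesigned_page)
print(match)
```

Even though the tag and class name changed, the price element is still the closest match to what was stored, so the scraper can recover it instead of returning empty data.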

Ibis: One Python API for 25+ Database Backends

Problem
Many data workflows begin with pandas for quick experimentation, while production pipelines might run on databases like PostgreSQL or BigQuery.
Moving from prototype to production usually means rewriting the same transformation logic in SQL. That translation takes time and can easily introduce errors.
Solution
Ibis solves this by letting you define transformations once in Python and compiling them into native SQL for 25+ backends automatically.

📖 View Full Article

☕️ Weekly Finds

Kronos
[Machine Learning]
– A decoder-only foundation model pre-trained on K-line sequences for financial market forecasting

pixi
[Environment Management]
– Fast, cross-platform package manager built on the Conda ecosystem, written in Rust

MinerU
[OCR/PDF Processing]
– One-stop tool for converting PDFs, webpages, and e-books into machine-readable markdown and JSON

Newsletter #295: Marker: Smart PDF Extraction with Hybrid LLM Mode

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

Marker: Smart PDF Extraction with Hybrid LLM Mode

Problem
Standard OCR pipelines often miss inline math, split tables across pages, and lose the relationships between form fields.
Sending the full document to an LLM can improve accuracy, but it’s slow and expensive at scale.
Solution
Marker’s hybrid mode takes a more targeted approach:

Its deep learning pipeline handles the bulk of conversion
Then an LLM steps in only for the hard parts: table merging, LaTeX formatting, and form extraction

Marker supports OpenAI, Gemini, Claude, Ollama, and Azure out of the box.

📖 View Full Article

Qdrant: Fast Vector Search in Rust with a Python API

Problem
Building semantic search typically starts with storing vectors in Python lists and computing cosine similarity manually.
But brute-force comparison scales linearly with your dataset, making every query slower as your data grows.
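
For reference, the brute-force baseline looks roughly like the sketch below. It uses only the standard library, and the function names and sample vectors are hypothetical. Every query scores against every stored vector, so the work per query grows linearly with the size of the dataset.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query, vectors, top_k=2):
    """Score the query against every stored vector: O(n) work per query."""
    scored = [
        (cosine_similarity(query, vec), doc_id)
        for doc_id, vec in vectors.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

results = brute_force_search([1.0, 0.0, 0.0], vectors)
print(results)
```

This is fine for a few thousand vectors, but the full scan inside brute_force_search is exactly the step a vector index is meant to avoid.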
Solution
Qdrant is a vector search engine built in Rust that indexes your vectors for fast retrieval.
Key features:

In-memory mode for local prototyping with no server setup
Seamlessly scale to millions of vectors in production with the same Python API
Built-in support for cosine, dot product, and Euclidean distance
Sub-second query times even for millions of vectors

🧪 Run code

📚 Latest Deep Dives

PDF Table Extraction: Docling vs Marker vs LlamaParse Compared

Extracting tables from PDFs can be surprisingly difficult. A table that looks neatly structured in a PDF is actually stored as text placed at specific coordinates on the page, which makes it hard to preserve the original layout during extraction.
This article introduces three Python tools that attempt to solve this problem: Docling, Marker, and LlamaParse.

📖 View Full Article

☕️ Weekly Finds

Dify
[LLM]
– Open-source LLM app development platform with AI workflow, RAG pipeline, and agent capabilities

PageIndex
[RAG]
– Document index for vectorless, reasoning-based RAG

MCP Server Chart
[Data Visualization]
– A visualization MCP server for generating 25+ visual charts using AntV

Newsletter #294: pandas 3.0: 5-10x Faster String Operations with PyArrow

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

pandas 3.0: 5-10x Faster String Operations with PyArrow

Problem
Traditionally, pandas stores strings as object dtype, where each string is a separate Python object scattered across memory.
This makes string operations slow and the dtype ambiguous, since both pure string columns and mixed-type columns show up as object.
Solution
pandas 3.0 introduces a dedicated str dtype backed by PyArrow, which stores strings in contiguous memory blocks instead of individual Python objects.
Key benefits:

5-10x faster string operations because data is stored contiguously
50% lower memory usage by eliminating Python object overhead
Clear distinction between string and mixed-type columns
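To see why a contiguous layout saves memory, here is a toy comparison using only the standard library. It is a sketch in the spirit of an offsets-plus-data string layout, not the actual pandas or PyArrow implementation, and the sample values are made up.

```python
import sys
from array import array

words = [f"user_{i}" for i in range(1000)]

# Object layout: a list of pointers plus one Python str object per value
object_bytes = sys.getsizeof(words) + sum(sys.getsizeof(w) for w in words)

# Contiguous layout: one UTF-8 data buffer plus an array of offsets
encoded = [w.encode("utf-8") for w in words]
data = b"".join(encoded)
offsets = array("q", [0])
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))
contiguous_bytes = len(data) + offsets.itemsize * len(offsets)


def get_value(i):
    """Read the i-th string back out of the contiguous buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


print(object_bytes, contiguous_bytes)
print(get_value(42))  # user_42
```

The exact byte counts depend on the Python build, but the per-object header is what the contiguous layout avoids paying for each value.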

📖 View Full Article

🧪 Run code

Build Self-Documenting Regex with Pregex

Problem
Regex patterns like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} are difficult to read and intimidating.
Team members without regex expertise might struggle to understand and modify these validation patterns.
Solution
Pregex transforms regex into readable Python code using descriptive components.
Key benefits:

Code that explains its intent without comments
Easy modification without regex expertise
Composable patterns for complex validation
Export to regex format when needed
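The idea of composing a pattern from named pieces can be sketched with the standard library alone. This is not Pregex's API. The constant names below are hypothetical, and the pattern is the email regex from the Problem section split into labeled parts.

```python
import re

# Named building blocks (hypothetical names, standard library only)
LOCAL_PART = r"[a-zA-Z0-9._%+-]+"
DOMAIN = r"[a-zA-Z0-9.-]+"
TOP_LEVEL = r"[a-zA-Z]{2,}"

# Compose the pieces into the same pattern shown in the Problem section
EMAIL = re.compile(rf"{LOCAL_PART}@{DOMAIN}\.{TOP_LEVEL}")

text = "Contact alice@example.com or bob.smith@mail.example.org for details."
emails = EMAIL.findall(text)
print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
```

Naming the parts already makes the intent easier to follow. A library like Pregex goes further by replacing the character classes themselves with descriptive components.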

📖 View Full Article

🧪 Run code

📚 Latest Deep Dives

PDF Table Extraction: Docling vs Marker vs LlamaParse Compared

PDF files do not store tables as structured data. Instead, they position text at specific coordinates on the page.
Table extraction tools must reconstruct the structure by determining which values belong in which rows and columns.
The problem becomes even harder when tables include multi-level headers, merged cells, or complex layouts.
To explore this problem, I experimented with three tools designed for PDF table extraction: LlamaParse, Marker, and Docling. Each tool takes a different approach.
Performance overview:

Docling: Fastest local option, but struggles with complex tables
Marker: Handles complex layouts well and runs locally, but is much slower
LlamaParse: Most accurate on complex tables and fastest overall, but requires a cloud API

In this article, I share the code, examples, and results from testing each tool.
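To make the reconstruction problem concrete, here is a toy sketch that rebuilds rows and columns from positioned text. The fragments and the `rebuild_table` helper are hypothetical, and none of the three tools is claimed to work this way. Real extractors also have to handle merged cells and multi-level headers, which this naive grouping ignores.

```python
from collections import defaultdict

# Hypothetical text fragments as (x, y, text), similar to what a PDF stores
fragments = [
    (200, 100, "Price"), (50, 100, "Item"),
    (50, 120, "Apple"), (200, 120, "1.20"),
    (200, 141, "0.50"), (50, 140, "Banana"),
]


def rebuild_table(fragments, row_tolerance=5):
    """Group fragments into rows by y position, then order each row by x."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so slightly misaligned text lands in one row
        rows[round(y / row_tolerance)].append((x, text))
    return [
        [text for _, text in sorted(cells)]
        for _, cells in sorted(rows.items())
    ]


table = rebuild_table(fragments)
for row in table:
    print(row)
```

Even this small case needs a tolerance for misaligned text, which hints at why complex layouts are hard to recover reliably.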
📖 View Full Article

☕️ Weekly Finds

Lance
[Data Processing]
– Modern columnar data format for ML with 100x faster random access than Parquet

Mathesar
[Dashboard]
– Spreadsheet-like interface for PostgreSQL that lets anyone view, edit, and query data

dotenvx
[DevOps]
– A better dotenv with encryption, multiple environments, and cross-platform support

Newsletter #294: pandas 3.0: 5-10x Faster String Operations with PyArrow Read More »

Newsletter #293: act: Run GitHub Actions Locally with Docker

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

act: Run GitHub Actions Locally with Docker

Problem
GitHub Actions has no local execution mode. You can’t test a step, inspect an environment variable, or reproduce a runner-specific failure on your own machine.
Each change requires a commit and a wait for the cloud runner. A small mistake like a missing secret means starting the loop again.
Solution
With act, you can execute workflows locally using Docker. Failures surface immediately, making it easier to iterate and commit only when the workflow passes.
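As a rough sketch, here is how an `act` invocation could be assembled from Python. The `build_act_command` helper, the job name, and the secret are hypothetical. It relies on act's `-j` (job) and `-s` (secret) flags and only builds the command, so running it still requires act and Docker to be installed.

```python
def build_act_command(event="push", job=None, secrets=None):
    """Build an `act` invocation for one event, optionally limited to a job."""
    cmd = ["act", event]
    if job:
        cmd += ["-j", job]  # run a single job instead of the whole workflow
    for name, value in (secrets or {}).items():
        cmd += ["-s", f"{name}={value}"]  # supply a secret locally
    return cmd


# Hypothetical job name and secret, for illustration only
cmd = build_act_command(job="test", secrets={"API_TOKEN": "dummy"})
print(" ".join(cmd))  # act push -j test -s API_TOKEN=dummy
```

The printed command can be pasted into a terminal, or passed to `subprocess.run(cmd)` from the repository root.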

ScrapeGraphAI: Research Multiple Sites with One Prompt

Problem
With BeautifulSoup, every site needs its own selectors, and you need to manually combine the results into a unified format.
When any site redesigns its layout, those selectors break and you are back to fixing code.
Solution
ScrapeGraphAI’s SearchGraph fixes this by replacing selectors with a natural language prompt.
Here’s what it handles:

Automatic web search for relevant pages
AI-powered scraping that adapts to any layout
Structured output with source URLs for verification
Works with any LLM provider (OpenAI, Ollama, etc.)

📖 View Full Article

🎓 Latest Interactive Course

Python Data Modeling with Dataclasses and Pydantic

Choosing between dict, NamedTuple, dataclass, and Pydantic comes down to how much safety you need. In this free interactive course, you’ll learn when to use each:

Dictionary: Flexible, but no built-in field checks. Typos and missing keys only show up at runtime.
NamedTuple: Immutable with fixed fields, helping catch mistakes early.
dataclass: Mutable data containers with defaults and optional validation logic.
Pydantic: Strong type validation, automatic coercion, and detailed error reporting.

All exercises run directly in your browser. No installation required.
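Here is a small sketch of the first three options side by side, using only the standard library. The `User` fields are hypothetical. Pydantic is left out because it is a third-party package.

```python
from dataclasses import dataclass
from typing import NamedTuple

# Dictionary: a typo in a key is only caught when the code runs
user_dict = {"name": "Ana", "age": 30}
try:
    user_dict["nmae"]
except KeyError as exc:
    dict_error = f"KeyError: {exc}"


# NamedTuple: fixed fields, and instances cannot be modified
class UserTuple(NamedTuple):
    name: str
    age: int


user_tuple = UserTuple("Ana", 30)
try:
    user_tuple.age = 31
except AttributeError:
    tuple_is_immutable = True


# dataclass: mutable, with defaults and optional validation
@dataclass
class User:
    name: str
    age: int = 0

    def __post_init__(self):
        if self.age < 0:
            raise ValueError("age must be non-negative")


user = User("Ana")
user.age = 31  # allowed: dataclasses are mutable by default
print(dict_error, tuple_is_immutable, user)
```

Note that the dataclass check only runs at construction time. Assigning a negative age afterwards is not caught unless you add more logic.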

☕️ Weekly Finds

agent-browser
[Agents]
– Headless browser automation CLI for AI agents, built on Playwright

pyscn
[Code Quality]
– Intelligent Python code quality analyzer with dead code detection and complexity analysis

pyupgrade
[Code Quality]
– Automatically upgrade Python syntax to newer versions of the language

Newsletter #293: act: Run GitHub Actions Locally with Docker Read More »

Newsletter #292: SQLFluff: Auto-Fix Messy SQL with One Command

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

Evaluate LLM Apps in One Line with PydanticAI

Problem
Testing LLM apps means validating multiple factors at once: is the answer correct, properly structured, fast enough, and natural sounding?
Rewriting this logic for every project is inefficient and error-prone.
Solution
pydantic-ai includes pydantic-evals, which provides these capabilities out of the box. Simply choose the evaluators you need and add them to your evaluation suite.
Built-in evaluators:

Deterministic: validate that outputs are correct, properly typed, and fast enough
LLM-as-judge: have another LLM grade qualities like helpfulness or tone
Report-level: generate classification metrics across all cases automatically
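To show what the deterministic checks amount to, here is a hand-rolled sketch in plain Python. It does not use the pydantic-evals API. The `evaluate_case` helper and the fake task are hypothetical stand-ins for an LLM call.

```python
import time


def evaluate_case(task, inputs, expected, expected_type=str, max_seconds=1.0):
    """Run one case and apply three deterministic checks."""
    start = time.perf_counter()
    output = task(inputs)
    elapsed = time.perf_counter() - start
    return {
        "correct": output == expected,
        "right_type": isinstance(output, expected_type),
        "fast_enough": elapsed <= max_seconds,
    }


# Stand-in for an LLM call (hypothetical task, for illustration only)
def fake_capital_lookup(question):
    return {"What is the capital of France?": "Paris"}.get(question, "unknown")


report = evaluate_case(fake_capital_lookup, "What is the capital of France?", "Paris")
print(report)
```

Writing this once is easy. Maintaining it across projects, along with LLM-as-judge and report-level metrics, is the repetitive part a shared evaluation library is meant to remove.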

🧪 Run code

SQLFluff: Auto-Fix Messy SQL with One Command

Problem
Consistent SQL style matters. It improves readability, speeds up code reviews, and makes bugs easier to identify.
Manual reviews can catch formatting issues, but they’re time-consuming and often inconsistent.
Solution
SQLFluff solves this with automated linting and formatting across 30+ SQL dialects. It identifies violations, applies consistent standards, and auto-corrects many problems.
SQLFluff also supports the following templates:

Jinja
SQL placeholders (e.g. SQLAlchemy parameters)
Python format strings
dbt (requires plugin)

🧪 Run code

🎓 Latest Interactive Course

Python Data Modeling with Dataclasses and Pydantic

Choosing between dict, NamedTuple, dataclass, and Pydantic comes down to how much safety you need. In this free interactive course, you’ll learn when to use each:

Dictionary: Flexible, but no built-in field checks. Typos and missing keys only show up at runtime.
NamedTuple: Immutable with fixed fields, helping catch mistakes early.
dataclass: Mutable data containers with defaults and optional validation logic.
Pydantic: Strong type validation, automatic coercion, and detailed error reporting.

All exercises run directly in your browser. No installation required.

☕️ Weekly Finds

spec-kit
[Dev Tools]
– Toolkit for Spec-Driven Development that helps define specs, generate plans and tasks, and implement code with AI coding tools

ty
[Code Quality]
– Extremely fast Python type checker and language server written in Rust, by the creators of uv and Ruff

nbQA
[Code Quality]
– Run ruff, isort, pyupgrade, mypy, pylint, flake8, and more on Jupyter Notebooks

Newsletter #292: SQLFluff: Auto-Fix Messy SQL with One Command Read More »

Work with Khuyen Tran