Newsletter Archive

Newsletter #306: TimescaleDB: Turn PostgreSQL into a Time-Series Engine with One Extension

Leave a Comment / Newsletter Archive / Khuyen Tran

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

TimescaleDB: Turn PostgreSQL into a Time-Series Engine with One Extension (Sponsored)

Problem
In standard PostgreSQL, all rows live in one table. As time-series data grows into the millions, queries cannot skip irrelevant data, so even recent lookups scan far more than needed.
Solution
TimescaleDB solves this with hypertables, which automatically partition data into time-based chunks.
Queries only touch the relevant chunks, leaving the rest untouched.
Other capabilities:

Shrink storage by up to 95% with columnar compression that stays fully queryable
Faster queries with continuous aggregates that refresh only new data
Built-in retention policies to automatically remove old data
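
To make the chunking idea concrete, here is a minimal pure-Python sketch. It is not how TimescaleDB is implemented: `ChunkedTable` and its methods are hypothetical names used only to show why a time-range query can skip most of the data once rows are grouped by time interval.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class ChunkedTable:
    """Toy time-partitioned table: one bucket ("chunk") per fixed interval."""

    def __init__(self, chunk_interval: timedelta = timedelta(days=1)):
        self.chunk_interval = chunk_interval
        self.chunks = defaultdict(list)  # chunk index -> list of (timestamp, value)

    def _chunk_index(self, ts: datetime) -> int:
        # Map a timestamp to the index of the interval that contains it.
        return int(ts.timestamp() // self.chunk_interval.total_seconds())

    def insert(self, ts: datetime, value: float) -> None:
        self.chunks[self._chunk_index(ts)].append((ts, value))

    def query(self, start: datetime, end: datetime):
        """Return rows with start <= ts < end and the number of chunks scanned."""
        first = self._chunk_index(start)
        # The end bound is exclusive, so step back slightly before mapping it.
        last = self._chunk_index(end - timedelta(microseconds=1))
        rows, scanned = [], 0
        for idx in range(first, last + 1):
            if idx not in self.chunks:
                continue  # no data was ever written for this interval
            scanned += 1
            rows.extend(r for r in self.chunks[idx] if start <= r[0] < end)
        return rows, scanned


# 30 days of hourly readings, then a lookup of only the most recent day.
origin = datetime(2024, 1, 1, tzinfo=timezone.utc)
table = ChunkedTable()
for hour in range(30 * 24):
    table.insert(origin + timedelta(hours=hour), float(hour))

rows, scanned = table.query(origin + timedelta(days=29), origin + timedelta(days=30))
print(len(rows), "rows from", scanned, "of", len(table.chunks), "chunks")
# 24 rows from 1 of 30 chunks
```

The recent-day lookup reads one chunk and never touches the other 29, which is the behavior hypertables aim for at database scale.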

Guidance: One Function for Clean LLM Labels

Problem
Classification tasks with LLMs can get messy. Instead of a clean label, you might get “Option A”, “The answer is A”, or a full explanation.
Cleaning this up requires extra parsing, retries, and validation that can make your system fragile.
Solution
With Guidance, the select() function constrains the model to return exactly one option from your list.
Key benefits:

Guarantees output matches one of your predefined options
Eliminates parsing code and regex patterns
Works with any list of valid choices
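
The idea behind constrained selection can be sketched without a model. The snippet below is an illustration only: `constrained_select` and the scores are made up for this example and are not part of Guidance, which applies the constraint during generation itself.

```python
import math


def constrained_select(option_scores: dict[str, float], options: list[str]) -> str:
    """Pick the highest-scoring label, considering only the allowed options."""
    if not options:
        raise ValueError("options must not be empty")
    # Anything outside the allowed list is ignored, so free-form text can never win.
    return max(options, key=lambda opt: option_scores.get(opt, -math.inf))


# Hypothetical model scores: the unconstrained favorite is a verbose answer.
scores = {"The answer is A": -0.2, "A": -1.1, "B": -2.3, "C": -4.0}
label = constrained_select(scores, ["A", "B", "C"])
print(label)  # A
```

Because the result is always one of the listed options, downstream code can compare labels directly instead of parsing text.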

📖 View Full Article

☕️ Weekly Finds

TimesFM
[ML]
– Pretrained time-series foundation model by Google Research for zero-shot forecasting

timesketch
[Data Processing]
– Collaborative forensic timeline analysis tool

Orbit
[ML]
– Bayesian time series forecasting with an intuitive initialize-fit-predict interface

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

Newsletter #305: dotenvx: Commit .env Files to Git Without Leaking Secrets

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

dotenvx: Commit .env Files to Git Without Leaking Secrets

Problem
A .env file stores configuration like API keys and database URLs in plain text.
Because of that, committing it to git would leak every secret. Teams usually gitignore the file and distribute credentials manually through Slack or password managers instead.
Over time, this leads to secrets being scattered across different places without a clear source of truth.
Solution
dotenvx changes this by encrypting .env files with public-key cryptography.
You can commit the encrypted file to git, and your team only needs a private key (kept in a gitignored .env.keys file) to decrypt it when running the application.
Key capabilities:

Works with Python, Node, Go, Ruby, Rust, and more via a single CLI
Encrypts .env files using the same cryptography as Bitcoin (secp256k1)
Separates environments with .env.production, .env.staging, and .env.ci
Requires zero infrastructure (no Vault, no KMS, no cloud setup)
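
On the Python side, very little has to change. Assuming the app is started with `dotenvx run -- python app.py`, the decrypted values arrive as ordinary environment variables. The sketch below shows one way to read them; `load_settings` and the variable names are hypothetical, and a plain dict stands in for the real environment.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("DATABASE_URL", "API_KEY")  # hypothetical variable names


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect required settings from the environment, failing fast if any are missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}


# Stand-in for the environment the process would receive at launch.
demo_env = {"DATABASE_URL": "postgres://localhost/app", "API_KEY": "example-key"}
settings = load_settings(demo_env)
print(sorted(settings))  # ['API_KEY', 'DATABASE_URL']
```

Failing fast on missing keys makes a forgotten private key show up at startup rather than deep inside a request.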

Bandit: Find Python Security Flaws with One Pre-Commit Hook

Problem
AI code generators can produce working code in seconds, but they often introduce risky patterns like hardcoding passwords or API keys directly in the source.
These issues can easily slip through a quick review.
Solution
Bandit is a Python security linter that automatically detects vulnerability patterns in your code, from hardcoded secrets to unsafe function calls.
Key capabilities:

Detects hardcoded passwords, tokens, and API keys
Flags risky calls like eval, exec, and pickle
Seamlessly integrates into pre-commit hooks, CI workflows, and editors
Generates severity-ranked reports so you can prioritize fixes
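
To show the kind of check an AST-based linter performs, here is a toy scanner built on Python's standard `ast` module. It is a sketch, not Bandit's rule set: `scan`, `RISKY_CALLS`, and `SECRET_HINTS` are names invented for this example, and the patterns are deliberately simplistic.

```python
import ast

RISKY_CALLS = {"eval", "exec"}
SECRET_HINTS = ("password", "token", "api_key")


def scan(source: str) -> list[tuple[int, str]]:
    """Return (line, message) findings for a few illustrative risky patterns."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Direct calls such as eval(...) or exec(...).
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in RISKY_CALLS:
                findings.append((node.lineno, f"call to {node.func.id}()"))
        # String literals assigned to names that look like secrets.
        elif isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            if isinstance(node.value.value, str):
                for target in node.targets:
                    if isinstance(target, ast.Name) and any(
                        hint in target.id.lower() for hint in SECRET_HINTS
                    ):
                        findings.append((node.lineno, f"hardcoded secret in {target.id}"))
    return sorted(findings)


sample = 'db_password = "hunter2"\nresult = eval(user_input)\n'
for line, message in scan(sample):
    print(line, message)
# 1 hardcoded secret in db_password
# 2 call to eval()
```

The source is only parsed, never executed, so scanning untrusted code this way is safe.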

☕️ Weekly Finds

vulture
[Code Quality]
– Find dead Python code with confidence-scored static analysis

responses
[Testing]
– A utility library for mocking out the Python Requests library

beartype
[Code Quality]
– Unbearably fast near-real-time pure-Python runtime type-checker

Newsletter #304: Ibis: Write Once, Query 22+ SQL Databases

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

mem0: Auto-Update LLM Memory When Facts Change

Problem
Developers commonly store conversation embeddings in ChromaDB or Pinecone, then retrieve similar chunks before each LLM response.
But these systems do not handle changing information. When facts evolve, they simply accumulate, leaving your AI with conflicting context and no way to resolve it.
Solution
mem0 introduces a smarter memory layer. It identifies contradictions, updates existing knowledge, and ensures only the latest facts are retained.
Key capabilities:

Extracts structured facts directly from raw conversations
Resolves conflicts by replacing outdated facts instead of duplicating them
Isolates memory across users, sessions, and agents
Retrieves context semantically, not just by similarity
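
The difference between accumulating facts and updating them can be shown with a hand-coded toy. This is not mem0's API or its extraction logic (mem0 decides what to update for you); `FactMemory` is a hypothetical class that only illustrates replace-instead-of-append and per-user isolation.

```python
from collections import defaultdict


class FactMemory:
    """Toy memory that keeps one current value per (user, attribute) pair."""

    def __init__(self):
        self._facts = defaultdict(dict)  # user_id -> {attribute: value}

    def add(self, user_id: str, attribute: str, value: str) -> None:
        # A newer fact about the same attribute overwrites the old one
        # instead of being stored next to it.
        self._facts[user_id][attribute] = value

    def get_all(self, user_id: str) -> dict:
        # Each user only ever sees their own facts.
        return dict(self._facts.get(user_id, {}))


memory = FactMemory()
memory.add("alice", "city", "Chicago")
memory.add("alice", "city", "Austin")  # Alice moved; the old fact is replaced
memory.add("bob", "city", "Seattle")
print(memory.get_all("alice"))  # {'city': 'Austin'}
```

A plain vector store would keep both "Chicago" and "Austin" and return whichever chunk happened to be most similar to the query.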

Ibis: Write Once, Query 22+ SQL Databases

Problem
Running queries across multiple databases often means rewriting the same logic for each backend’s SQL dialect. A query that works in DuckDB may require syntax changes for PostgreSQL, and another rewrite for BigQuery.
Solution
Ibis removes that friction by compiling Python expressions into each backend’s native SQL. Swap the connection, and the same code runs across 22+ databases.
Key features:

Write once, run on DuckDB, PostgreSQL, BigQuery, Snowflake, and 18+ more
Lazy execution that builds and optimizes the query plan before sending it to the database
Intuitive chaining syntax similar to Polars
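
A toy version of the compile-to-each-dialect idea, in plain Python. This is not Ibis code: `Query`, `QUOTES`, and `to_sql` are hypothetical, and the only dialect difference modeled here is identifier quoting (double quotes versus BigQuery's backticks).

```python
from dataclasses import dataclass

# Identifier quote characters per dialect (BigQuery uses backticks).
QUOTES = {"duckdb": '"', "postgres": '"', "bigquery": "`"}


@dataclass(frozen=True)
class Query:
    """Backend-neutral description of SELECT cols FROM table WHERE col > value."""

    table: str
    columns: tuple
    filter_column: str
    min_value: float

    def to_sql(self, dialect: str) -> str:
        q = QUOTES[dialect]

        def quote(name: str) -> str:
            return f"{q}{name}{q}"

        cols = ", ".join(quote(c) for c in self.columns)
        return (
            f"SELECT {cols} FROM {quote(self.table)} "
            f"WHERE {quote(self.filter_column)} > {self.min_value}"
        )


query = Query("orders", ("id", "amount"), "amount", 100)
print(query.to_sql("postgres"))
# SELECT "id", "amount" FROM "orders" WHERE "amount" > 100
print(query.to_sql("bigquery"))
# SELECT `id`, `amount` FROM `orders` WHERE `amount` > 100
```

The query is described once and rendered per backend, so switching databases changes only the `dialect` argument.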

📖 View Full Article

☕️ Weekly Finds

doccano
[Data Processing]
– Open source annotation tool for text classification, sequence labeling, and sequence-to-sequence tasks

Data Formulator
[Data Visualization]
– AI-powered data visualization tool that transforms and explores data with drag-and-drop charts and AI agents

qsv
[Data Processing]
– Blazing-fast CLI toolkit for querying, transforming, and analyzing CSV data at scale

Newsletter #303: Autoresearch: Run ML Experiments on Autopilot with Git-Backed Rollback

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

gws: One CLI for Drive, Gmail, Calendar, and Sheets

Problem
Managing Workspace through the browser means clicking through multiple apps just to pull a spreadsheet, check your calendar, and send a follow-up email.
That manual loop adds up fast when you repeat it daily or weekly.
Solution
gws is a CLI that unifies every Workspace service behind simple terminal commands with structured JSON output ready for scripting.
Key capabilities:

Single interface for Drive, Gmail, Calendar, Sheets, Docs, and more
JSON output that pipes directly into your existing scripts and workflows
100+ AI agent skills that let LLMs orchestrate Workspace tasks programmatically

Autoresearch: Run ML Experiments on Autopilot with Git-Backed Rollback

Problem
Running experiments manually means adjusting one hyperparameter, waiting for training to finish, checking results, and repeating for hours.
Progress stops the moment you step away, and you only explore the narrow set of ideas you thought of.
Solution
Autoresearch is an open-source framework that solves this with an autonomous loop. An AI agent commits each change to git, trains for 5 minutes, and checks whether the model actually improved.
If the metric improves, the change stays. If not, the agent reverts to the last good state automatically.
Key benefits:

Git-backed snapshots before every experiment for instant rollback
Structured results log that survives crashes and tracks every attempt
Continuous looping with no human confirmation needed
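
The keep-or-revert loop can be sketched in a few lines. This is a simplified stand-in, not Autoresearch's code: there is no git or real training here, and `run_experiments`, `evaluate`, and `propose` are hypothetical names. A copied config plays the role of the snapshot, and a cheap formula plays the role of the training metric.

```python
import random


def run_experiments(evaluate, propose, initial_config, rounds=20, seed=0):
    """Keep a change only if the metric improves; otherwise roll back."""
    rng = random.Random(seed)
    best_config = dict(initial_config)
    best_score = evaluate(best_config)
    log = []  # one entry per attempt, whether kept or reverted
    for step in range(rounds):
        candidate = propose(dict(best_config), rng)  # edit a copy (the "snapshot")
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            best_config, best_score = candidate, score
        # If not kept, best_config is untouched, which acts as the rollback.
        log.append({"step": step, "score": score, "kept": kept})
    return best_config, best_score, log


# Hypothetical stand-ins for a real training run and an agent's edit.
def evaluate(config):
    return -(config["lr"] - 0.01) ** 2  # pretend the best learning rate is 0.01


def propose(config, rng):
    config["lr"] *= rng.choice([0.5, 0.8, 1.25, 2.0])
    return config


best, score, log = run_experiments(evaluate, propose, {"lr": 0.1})
print(best, len(log))
```

Every attempt is logged whether or not it was kept, so a record of failed ideas remains after the loop finishes.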

☕️ Weekly Finds

PyMC
[ML]
– Bayesian statistical modeling with advanced MCMC and variational inference algorithms

lifelines
[ML]
– Survival analysis in Python, including Kaplan-Meier, Nelson-Aalen, and regression

causal-learn
[ML]
– Causal discovery with constraint-based, score-based, and functional causal model methods

Newsletter #302: Type Check Your Python Codebase 15x Faster with Pyrefly

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

Codon: One Decorator to Turn Python into C Speed

Problem
Slow Python functions in large codebases are painful to optimize. You might try Numba, but it only works for numerical code with NumPy arrays.
You might try Cython, but it needs .pyx files, variable type annotations, and build setup. That’s hours of refactoring before you see any speedup.
Solution
Codon solves this with a single @codon.jit decorator that compiles your Python to machine code.
Key benefits:

Works on any Python code, not just NumPy arrays
No type annotations required since types are inferred automatically
Compiled functions are cached for instant repeated calls
Zero code changes beyond adding the decorator

The example below shows the real performance:

Pure Python: 0.240s
Codon first call: 0.324s (one-time compilation)
Codon cached calls: 0.006s (37x faster)

🧪 Run code

Datadog: Trace Bad Data from Dashboard to Root Cause in One View (Sponsored)

Problem
If your pipeline isn’t connected end to end, debugging means jumping between tools and manually tracing the issue. It’s slow and error-prone.
Solution
Instead of jumping between tools, Datadog Data Observability gives you one connected view from ingestion to dashboards.
It does this through:

Quality Monitoring: catches anomalies like missing rows or stale data automatically
Jobs Monitoring: gives visibility into Spark and Airflow runs, including failures and cost
Data and code lineage: traces problems upstream to the source and downstream to every affected dashboard, model, and report

📖 View Full Article

Type Check Your Python Codebase 15x Faster with Pyrefly

Problem
Tools like MyPy and Pyright process files sequentially, so larger codebases lead to longer wait times.
Solution
Pyrefly, Meta’s Rust-based type checker, runs checks in parallel, keeping performance nearly constant as your codebase grows.
Key features:

Re-checks only changed modules for faster incremental runs
Automatically infers types for variables and return values

On the PyTorch codebase, Pyrefly completes a full check in 2.4 seconds, about 15x faster than Pyright and 20x faster than MyPy.

📚 Latest Deep Dives

browser-use: Turn Plain English Prompts into Working Browser Automation

Traditional tools like Playwright rely on CSS selectors, tightly coupling your scraper to a site’s HTML. When the site changes, everything breaks and needs to be rewritten.
browser-use takes a different approach. You describe the goal in plain English, and an LLM decides what to click, type, and extract.
In this article, I tested browser-use on two real tasks:

Finding AI stories on Hacker News and synthesizing themes
Scraping Newegg for gaming laptops with specific constraints

I share the actual outputs, cost per run, and an honest breakdown of what worked and what didn’t so you can decide if it fits your use case.

📖 View Full Article

☕️ Weekly Finds

bandit
[Code Quality]
– A security linter that scans Python code for common vulnerabilities by building and analyzing abstract syntax trees.

scalene
[Code Quality]
– High-performance CPU, GPU, and memory profiler for Python with AI-powered optimization proposals.

vulture
[Code Quality]
– Finds unused code in Python programs, including dead functions, classes, variables, and unreachable code blocks.

Newsletter #301: Chandra OCR: From Handwritten Notes to Structured Text in Seconds

Grab your coffee. Here are this week’s highlights.

🤝 Collaboration

Give Your AI Agent Live Web Access with Bright Data MCP
With basic search APIs, agents often miss critical context from sources like social platforms, forums, news, and answer engines. That leads to incomplete or outdated responses.
Bright Data’s MCP server unifies all web data access into one interface your AI agent can use directly.
With Bright Data MCP, your AI agent can access:

Search engines (Google, Bing, more)
Social media (Twitter/X, Reddit, Instagram, TikTok)
Web archives (historical web data, years deep)
Answer engines (ChatGPT, Perplexity, Gemini)

All through one connection.

🔗 Try Bright Data MCP

📅 Today’s Picks

Chandra OCR: From Handwritten Notes to Structured Text in Seconds

Problem
Most OCR tools are designed for printed text and struggle with handwritten notes, especially when they include diagrams, equations, and free-form writing.
Solution
Chandra OCR is built for this exact use case. It extracts text, images, and diagrams from handwritten notes and reconstructs them into clean Markdown or HTML.
How it compares to other OCR tools:

85.9% overall on the olmOCR benchmark, outperforming olmOCR 2 (82.4%), GPT-4o (69.9%), Gemini Flash 2 (63.8%), and Mistral OCR (72.0%)
Scores 89.3% on handwritten math, where most OCR tools struggle
Supports 90+ languages out of the box

Worktrunk: Give Every AI Agent Its Own Branch in Seconds

Problem
Git worktrees give each agent its own isolated copy of the repo, so multiple agents can edit files simultaneously without conflicts.
But the native commands are verbose and stop at creating the directory. Launching agents, installing dependencies, and cleaning up after merge are all separate manual steps.
Solution
Worktrunk is a CLI that makes git worktrees as easy as branches with just three core commands: switch, list, and remove.
Three commands cover the full lifecycle:

switch: Create a worktree, run hooks for dependency setup, and launch an agent
list: See diff status, commit counts, CI state, and AI-generated summaries per branch
merge: Squash, rebase, or fast-forward to main with automatic worktree and branch cleanup

☕️ Weekly Finds

niquests
[Python Utils]
– Drop-in replacement for Requests with automatic HTTP/1.1, HTTP/2, and HTTP/3 support, plus WebSocket and SSE built in.

codon
[Python Utils]
– A high-performance Python compiler that produces native machine code with 10-100x speedups and built-in multithreading support.

whenever
[Python Utils]
– Modern, type-safe datetime library for Python with a Rust extension for performance, inspired by Temporal.

Newsletter #300: Browser-Use: Automate Any Browser Task with Plain English

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

mem0: Give Your AI Memory Without Building a Vector DB

Problem
When you build an AI app using the OpenAI or Anthropic API, every conversation starts from scratch with no built-in memory between sessions.
Adding memory yourself with a vector database like ChromaDB requires writing custom extraction, deduplication, and scoping logic on top of the storage layer.
Solution
mem0 handles all of that in a single function call. Just pass in conversations and retrieve relevant memories when needed.
Key features:

Automatic fact extraction from raw conversations via memory.add()
Cross-session retrieval with memory.search() in any future conversation
Automatic conflict resolution when user preferences change over time

🧪 Run code

Browser-Use: Automate Any Browser Task with Plain English

Problem
Most data collection tasks go beyond simple extraction. You need to log in, apply filters, navigate pagination, and then gather results.
Selenium can handle navigation, but it requires maintaining CSS selectors that can easily break when a site changes.
Solution
Browser Use simplifies the entire process. Describe what you want, and it navigates, clicks, types, and extracts automatically.
Key features:

Natural language task descriptions
Works with GPT-4, Claude, Gemini, and local models via Ollama
Structured output with Pydantic models

📚 Latest Deep Dives

How to Test GitHub Actions Locally with act

Debugging GitHub Actions is painfully slow. Every YAML change requires a commit, a push, and a 2-5 minute wait just to find out you missed a colon.
This article introduces act, a CLI tool that runs GitHub Actions workflows locally in Docker containers.
You’ll set up an ML testing pipeline and learn to pass secrets, run specific jobs, and validate workflows in seconds.

📖 View Full Article

☕️ Weekly Finds

open-webui
[LLM]
– A self-hosted AI platform with built-in RAG, model builder, and support for Ollama and OpenAI-compatible APIs

ragflow
[RAG]
– An open-source RAG engine with deep document understanding for unstructured data in any format

vllm
[MLOps]
– A high-throughput, memory-efficient inference and serving engine for LLMs with multi-GPU support

Newsletter #299: latexify_py: Turn Python Functions into LaTeX with One Decorator

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

latexify_py: Turn Python Functions into LaTeX with One Decorator

Problem
Non-programmers cannot easily read Python logic. However, manually converting it to LaTeX is slow and quickly becomes outdated as the code changes.
Solution
latexify_py solves this with a single decorator, generating LaTeX directly from your function so the math stays readable and always in sync with the code.
Key capabilities:

Three decorators for different outputs: expressions, full equations, or pseudocode
Displays rendered LaTeX directly in Jupyter cells
Functions still work normally when called

📖 View Full Article

🧪 Run code

act: Run GitHub Actions Locally with Docker

Problem
GitHub Actions has no local execution mode. You can’t test a step, inspect an environment variable, or reproduce a runner-specific failure on your own machine.
Each change requires a commit and a wait for the cloud runner. A small mistake like a missing secret means starting the loop again.
Solution
With act, you can execute workflows locally using Docker. Failures surface immediately, making it easier to iterate and commit only when the workflow passes.

📚 Latest Deep Dives

How to Test GitHub Actions Locally with act

📖 View Full Article

☕️ Weekly Finds

json_repair
[LLM]
– A Python module to repair invalid JSON, especially from LLM outputs, with schema validation support

pyrsistent
[Python Utilities]
– Persistent, immutable, and functional data structures for Python

prek
[Code Quality]
– A faster, Rust-based reimagining of pre-commit with monorepo support and parallel hook execution

Newsletter #298: Chronos: Forecast Any Time Series Without Training a Model

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

altimate-code: The Missing AI Layer for Data Engineering Teams

Problem
General AI tools can write SQL and catch obvious mistakes. But they cannot systematically detect anti-patterns, trace lineage, or keep warehouse costs under control.
That gap can lead to inefficient queries, broken dependencies, and hidden compliance risks building up over time.
Solution
I recently tried altimate-code, an open-source agent with 100+ tools purpose-built for data engineers, and built a demo repo to test it.
From a single prompt, it generated a full dbt project with staging, intermediate, and mart layers, added automated tests, and built an interactive dashboard.
What makes it different:

100+ tools that analyze SQL through structural parsing, not text guessing
Works across your stack including Snowflake, BigQuery, Databricks, DuckDB, and more
Model-agnostic. Compatible with OpenAI, Anthropic, Gemini, Ollama, and others

Chronos: Forecast Any Time Series Without Training a Model

Problem
Traditional forecasting requires domain-specific data, feature engineering, and multiple rounds of model tuning.
Solution
Chronos is a family of pretrained time series forecasting models from Amazon Science that deliver zero-shot predictions out of the box.
Simply load a pretrained model and generate forecasts on any time series data, with no fine-tuning required.
If zero-shot accuracy isn’t enough, you can fine-tune on your data with AutoGluon in a few lines.

🧪 Run code

📚 Latest Deep Dives

uv vs pixi: Which Python Environment Manager Should You Use for Data Science?

What if one tool could manage both your Python packages and compiled system libraries?
uv installs Python packages from PyPI, but it doesn’t manage compiled C/C++ system libraries that live outside PyPI.
The typical workaround is to install system libraries separately using an OS package manager, then manually align versions with your Python dependencies.
Since these system dependencies aren’t captured in project files, reproducing the environment across machines can be unreliable.
pixi solves this by managing both Python packages from PyPI and compiled system libraries from conda-forge in a single tool.
Quick comparison:

uv: fast, reliable lockfiles, Python-only
conda: system libraries supported, but slower and no lockfiles
pixi: fast, unified, with system libraries, lockfiles, and a built-in task runner

In this article, I compare uv and pixi on a real ML project so you can see how they perform in practice.

📖 View Full Article

☕️ Weekly Finds

timesfm
[Machine Learning]
– Pretrained time series foundation model by Google Research for zero-shot forecasting

darts
[Machine Learning]
– A Python library for user-friendly forecasting and anomaly detection on time series

orbit
[Machine Learning]
– A Python package for Bayesian time series forecasting with probabilistic models under the hood

Newsletter #297: Polars scan_csv: Merge CSVs with Different Schemas in One Call

Grab your coffee. Here are this week’s highlights.

📅 Today’s Picks

Polars scan_csv: Merge CSVs with Different Schemas in One Call

Problem
Polars’ scan_csv lets you load multiple CSV files lazily, reading data only when needed.
But before v1.39.0, every file had to share the same columns, or you’d get a SchemaError.
Solution
Polars v1.39.0 introduces missing_columns="insert" in scan_csv, allowing you to combine multiple files in one call while null-filling any missing columns.
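
To show what null-filling missing columns means in practice, here is a standard-library sketch of the same behavior. It does not use Polars: `merge_csvs` is a hypothetical helper that reads everything eagerly, while the `scan_csv` option described above does this lazily inside Polars.

```python
import csv
import io


def merge_csvs(sources):
    """Concatenate CSV texts row-wise, inserting None for columns a file lacks."""
    columns, rows = [], []
    for text in sources:
        reader = csv.DictReader(io.StringIO(text))
        # Build the union of column names in first-seen order.
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)
    # row.get() returns None for any column that a file did not contain.
    return [{name: row.get(name) for name in columns} for row in rows]


jan = "id,amount\n1,10\n"
feb = "id,amount,region\n2,20,EU\n"
merged = merge_csvs([jan, feb])
for row in merged:
    print(row)
# {'id': '1', 'amount': '10', 'region': None}
# {'id': '2', 'amount': '20', 'region': 'EU'}
```

The January file has no region column, so its rows receive None there instead of raising a schema error.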

📖 View Full Article

🧪 Run code

Build Professional Python Packages with UV --package

Problem
Python packages turn your code into reusable modules you can share across projects.
But building them requires complex setup with setuptools, managing build systems, and understanding distribution mechanics.
Solution
UV, a fast Python package installer and resolver, reduces the entire process to a few simple commands:

uv init --package sets up your package structure instantly
uv build and uv publish create the distribution and upload it to PyPI

📖 Learn more

📚 Latest Deep Dives

uv vs pixi: Which Python Environment Manager Should You Use for Data Science?

uv: fast, reliable lockfiles, Python-only
conda: system libraries supported, but slower and no lockfiles
pixi: fast, unified, with system libraries, lockfiles, and a built-in task runner

In this article, I compare uv and pixi on a real ML project so you can see how they perform in practice.

📖 View Full Article

☕️ Weekly Finds

datachain
[Data Processing]
– Process and curate unstructured data from cloud storages using local ML models and Python

label-studio
[Data Processing]
– Open source data labeling and annotation tool with standardized output format for ML workflows

qsv
[Command Line]
– Blazingly fast CSV command-line toolkit for slicing, dicing, and analyzing tabular data
