
Newsletter Archive

Automated newsletter archive from Klaviyo campaigns

Newsletter #251: PySpark 4.0: Native Plotting API for DataFrames

📅 Today’s Picks

PySpark 4.0: Native Plotting API for DataFrames

Problem:

Visualizing PySpark DataFrames typically requires converting to Pandas first, adding memory overhead and extra processing steps.

Solution:

PySpark 4.0 adds native Plotly-powered plotting, enabling direct .plot() calls on DataFrames without Pandas conversion.
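
A minimal sketch of the new API (assuming PySpark 4.0+ with plotly installed; the data and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Jan", 100), ("Feb", 120), ("Mar", 90)],
    ["month", "sales"],
)

# .plot works directly on the PySpark DataFrame -- no toPandas() round-trip
fig = df.plot.line(x="month", y="sales")  # returns a Plotly figure
fig.show()
```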

Full Article:

PySpark 4.0: What’s New

Run Code

View GitHub


Related Post

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem:

Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.

Solution:

Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows. As a result, they can be 10 to 100 times faster on large DataFrames.
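
A minimal sketch of a vectorized Pandas UDF (the column and data are made up):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])

@pandas_udf("double")
def double_it(v: pd.Series) -> pd.Series:
    # Receives a whole batch as a pandas Series, not one row at a time
    return v * 2

df.withColumn("doubled", double_it("value")).show()
```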

Full Article:

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Run Code

View GitHub

☕️ Weekly Finds

rembg

Python Utils

Rembg is a tool to remove image backgrounds

pyupgrade

Python Utils

A tool (and pre-commit hook) to automatically upgrade syntax for newer versions of the language

py-shiny

Data Viz

Shiny for Python is the best way to build fast, beautiful web applications in Python

Newsletter #250: Extract Text from Any Document Format with Docling

📅 Today’s Picks

Build Schema-Flexible Pipelines with Polars Selectors

Problem:

Hard-coding column names can break your code when the schema changes. When columns of the same type are added or removed, you must update your code manually.

Solution:

Polars’ col() function accepts data types to select all matching columns automatically. This keeps your code flexible and robust to schema changes.
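
For example, a sketch with made-up columns:

```python
import polars as pl

df = pl.DataFrame({
    "name": ["a", "b"],
    "price": [1.5, 2.5],
    "qty": [3, 4],
})

# Select every float and integer column, regardless of name;
# the selection adapts automatically if the schema changes
numeric = df.select(pl.col(pl.Float64, pl.Int64))
```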

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

Extract Text from Any Document Format with Docling

Problem:

Have you ever needed to pull text from PDFs, Word files, slide decks, or images for a project? Writing a different parser for each format is slow and error-prone.

Solution:

Docling’s DocumentConverter takes care of that by detecting the file type and applying the right parsing method for PDF, DOCX, PPTX, HTML, and images (see the sketch after this list). Other features of Docling:
AI-powered image descriptions for searchable diagrams
Export to pandas DataFrames, JSON, or Markdown
Structure-preserving output optimized for RAG pipelines
Built-in chunking strategies for vector databases
Parallel processing handles large document batches efficiently
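
A minimal usage sketch (the input file is hypothetical):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
# Docling detects the format (PDF, DOCX, PPTX, HTML, images) automatically
result = converter.convert("report.pdf")  # hypothetical input file
print(result.document.export_to_markdown())
```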

Full Article:

Transform Any PDF into Searchable AI Data with Docling

Run Code

View GitHub

☕️ Weekly Finds

evals

LLM

Framework for evaluating large language models (LLMs) or LLM-based systems, with an existing registry of evals and the ability to write custom evals

sklearn-bayes

ML

Python package for Bayesian Machine Learning with scikit-learn API

databonsai

Data Processing

Python library that uses LLMs to perform data cleaning tasks for categorization, transformation and curation

Newsletter #249: prek: Faster, Leaner Pre-Commit Hooks (Rust-Powered)

📅 Today’s Picks

LangChain v1.0: Automate Tool Selection for Faster Agents

Problem:

Agents with many tools send every tool description with each request, wasting tokens on irrelevant tools and making responses slower and more expensive.

Solution:

LangChain v1.0 introduces LLMToolSelectorMiddleware, which pre-filters relevant tools using a smaller model (see the sketch after this list). Key features:
Pre-filter tools using cheaper models like GPT-4o-mini
Limit tools sent to main agent (e.g., 3 most relevant)
Preserve critical tools with always_include parameter
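
A sketch of how this might look; the import path, create_agent usage, and tool definitions below are assumptions based on the description above, not verbatim LangChain documentation:

```python
from langchain.agents import create_agent
from langchain.agents.middleware import LLMToolSelectorMiddleware  # assumed import path
from langchain.tools import tool

@tool
def search(query: str) -> str:
    """Search the web for a query."""
    return f"results for {query}"  # stub standing in for a real tool

@tool
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

agent = create_agent(
    model="openai:gpt-4o",  # main model (assumed identifier format)
    tools=[search, add],
    middleware=[
        LLMToolSelectorMiddleware(
            model="openai:gpt-4o-mini",  # cheaper model pre-filters tools
            max_tools=3,                 # cap the tools sent to the main agent
            always_include=["search"],   # keep critical tools regardless
        )
    ],
)
```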

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Run Code

View GitHub

prek: Faster, Leaner Pre-Commit Hooks (Rust-Powered)

Problem:

pre-commit is a framework for managing Git hooks that automatically run code quality checks before commits. However, installing these hook environments (linters, formatters, etc.) can be slow and disk-intensive, especially in CI/CD pipelines where speed matters.

Solution:

prek is a drop-in replacement for pre-commit that installs hook environments significantly faster while using 50% less disk space. Built with Rust for maximum performance, prek reduces cache storage from 1.6 GB to 810 MB (benchmarked on the Apache Airflow repository) without changing your workflow. Key benefits:
Uses your existing .pre-commit-config.yaml files
Commands mirror pre-commit syntax (prek install-hooks, prek run)
Monorepo support with selector syntax for targeting specific projects or hooks
Install as a single binary with no dependencies
No configuration changes needed – just replace the command.

View GitHub

☕️ Weekly Finds

deepagents

LLM

Build advanced AI agents with context isolation through sub-agent delegation. Features virtual file system for context offloading, specialized sub-agents with focused tool sets, and sophisticated agent architecture for real-world research and analysis tasks.

mcp-gateway

MLOps

Docker MCP CLI plugin / MCP Gateway for production-grade AI agent stack. Enables multi-agent orchestration, intelligent interceptors, and enterprise security with Docker integration.

nbQA

Python Utils

Command-line tool to run linters and formatters (ruff, isort, pyupgrade, mypy, pylint, flake8, and more) over Python code in Jupyter notebooks

Newsletter #248: Build Mathematical Animations with Manim in Python

📅 Today’s Picks

Build Mathematical Animations with Manim in Python

Problem:

Static slides can only go so far when you’re explaining complex concepts. Dynamic visuals make abstract ideas clearer, more engaging, and easier to understand.

Solution:

Manim gives you the power to create professional mathematical animations in Python, just like the ones you see in 3Blue1Brown’s videos. In the code below, Manim transforms equations into smooth visual steps:
Define equation steps using MathTex with LaTeX notation
Animate equation transformations with the Transform class
Control animation flow with play() and wait() methods
Render the output with a simple command: manim -p -ql script.py
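
A minimal sketch of such a scene (the equations are made up; Manim needs a LaTeX installation for MathTex):

```python
from manim import MathTex, Scene, Transform, Write

class EquationSteps(Scene):
    def construct(self):
        # Each derivation step is a LaTeX expression
        step1 = MathTex(r"x^2 - 4 = 0")
        step2 = MathTex(r"x^2 = 4")

        self.play(Write(step1))             # draw the first equation
        self.wait(1)
        self.play(Transform(step1, step2))  # morph it into the next step
        self.wait(1)
```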

Full Article:

Manim: Create Mathematical Animations Like 3Blue1Brown Using Python

View GitHub


Related Post

Build Clean Visualizations with Altair Grammar

Problem:

Matplotlib requires manual data transformation and explicit configuration for every visual element.

Solution:

Altair uses declarative syntax based on Vega-Lite for intuitive, readable visualizations. With Altair, you describe what you want, not how to create it (see the sketch after this list):
Automatic formatting with type encoding (:T, :Q, :N, :O)
Built-in aggregations: mean(), sum(), count()
No manual groupby or date conversion
Easy chart composition and layering
Interactive features with minimal code
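
For instance, a sketch with made-up data:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90),
    "sales": range(90),
})

# :T marks a temporal field, :Q a quantitative one;
# mean() aggregates per month with no manual groupby or date conversion
chart = alt.Chart(df).mark_line().encode(
    x="yearmonth(date):T",
    y="mean(sales):Q",
)
chart.save("sales_by_month.html")
```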

Full Article:

Top 6 Python Libraries for Visualization: Which One to Use

Run Code

View GitHub

☕️ Weekly Finds

fast-langdetect

Python Utils

80x faster and 95% accurate language identification with Fasttext

FuncToWeb

Python Utils

Transform any Python function into a web interface automatically

graphic-walker

Data Viz

An open source alternative to Tableau for data exploration and visualization

Newsletter #247: whenever: Simple Python Timezone Conversion

🤝 COLLABORATION

Build Safer APIs with Buf – Free Workshop

Building APIs is simple. Scaling them across teams and systems isn’t. Ensuring consistency, compatibility, and reliability quickly becomes a challenge as projects grow. Buf provides a toolkit that makes working with Protocol Buffers faster, safer, and more consistent. Join Buf for a live, one-hour workshop on building safer, more consistent APIs. When: Nov 19, 2025 • 9 AM PDT | 12 PM EDT | 5 PM BST. What you’ll learn:
How Protobuf makes API development safer and simpler
API design best practices for real-world systems
How to extend Protobuf to data pipelines and streaming systems

Register for the workshop

📅 Today’s Picks

whenever: Simple Python Timezone Conversion

Problem:

Adding 8 hours to 10 pm shouldn’t give you the wrong morning time, but with Python’s datetime, it can. The standard library fails during DST transitions, returning incorrect offsets when clocks change for daylight saving.

Solution:

Whenever provides simple, explicit timezone conversion methods with clear semantics (see the sketch after this list). Key benefits:
DST-safe arithmetic with automatic offset adjustment
Type safety prevents naive/aware datetime bugs
Clean timezone conversions with .to_tz()
Nanosecond precision for deltas and timestamps
Pydantic integration for serialization
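
A small sketch (the dates are made up; March 9, 2024 is the night before US clocks spring forward):

```python
from whenever import ZonedDateTime

# 10 pm on the night the US skips from 2 am to 3 am
dt = ZonedDateTime(2024, 3, 9, 22, 0, tz="America/New_York")

later = dt.add(hours=8)  # DST-safe: 8 real hours later is 7 am EDT, not 6 am
print(later)
print(later.to_tz("Europe/London"))  # explicit, readable conversion
```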

Run Code

View GitHub

Build Readable Scatter Plots with adjustText Auto-Positioning

Problem:

Text labels in matplotlib scatter plots frequently overlap with each other and with data points, creating unreadable visualizations. Manually repositioning each label to avoid overlaps is tedious and time-consuming.

Solution:

adjustText automatically repositions labels to eliminate overlaps while connecting them to data points with arrows. All you need to do is collect your text objects and call adjust_text() with optional arrow styling.
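
A minimal sketch with made-up points and labels:

```python
import matplotlib.pyplot as plt
from adjustText import adjust_text

xs, ys = [1.0, 1.05, 2.0], [2.0, 2.02, 1.0]
labels = ["alpha", "beta", "gamma"]

fig, ax = plt.subplots()
ax.scatter(xs, ys)

# Collect the text objects, then let adjust_text reposition them
texts = [ax.text(x, y, label) for x, y, label in zip(xs, ys, labels)]
adjust_text(texts, arrowprops=dict(arrowstyle="->", color="gray"))

plt.show()
```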

Run Code

View GitHub

📢 ANNOUNCEMENTS

Featured on LeanPub: Production-Ready Data Science

My book Production-Ready Data Science was featured on the LeanPub home page! LeanPub is a leading platform for publishing and selling self-published technical books, so it’s truly an honor to see my work highlighted there. The book shares everything I’ve learned about turning data science prototypes into reliable, production-ready systems, from managing dependencies to automating workflows. Thank you to everyone who has purchased or shared it. Your support means everything. The book is currently on sale for 58% off until November 16.

Get Your Copy Now (58% Off)

☕️ Weekly Finds

featuretools

ML

An open source python library for automated feature engineering

datachain

Data Processing

ETL, analytics, and versioning for unstructured data – an AI data warehouse to enrich, transform, and analyze data from cloud storage

logfire

Python Utils

Uncomplicated Observability for Python and beyond – an observability platform built on OpenTelemetry from the team behind Pydantic

Newsletter #246: Faster Polars Queries with Programmatic Expressions

🤝 COLLABORATION

Build Safer APIs with Buf – Free Workshop

Building APIs is simple. Scaling them across teams and systems isn’t. Ensuring consistency, compatibility, and reliability quickly becomes a challenge as projects grow. Buf provides a toolkit that makes working with Protocol Buffers faster, safer, and more consistent. Join Buf for a live, one-hour workshop on building safer, more consistent APIs. When: Nov 19, 2025 • 9 AM PDT | 12 PM EDT | 5 PM BST. What you’ll learn:
How Protobuf makes API development safer and simpler
API design best practices for real-world systems
How to extend Protobuf to data pipelines and streaming systems

Register for the workshop

📅 Today’s Picks

Faster Polars Queries with Programmatic Expressions

Problem:

When you use a for loop to apply similar transformations, each Polars with_columns() call processes sequentially. This prevents the optimizer from seeing the full computation plan.

Solution:

Instead, generate all Polars expressions programmatically before applying them together (see the sketch after this list). This enables Polars to:
See the complete computation plan upfront
Optimize across all expressions simultaneously
Parallelize operations across CPU cores
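
A sketch of the pattern (the columns are made up):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Build every expression first...
exprs = [(pl.col(name) * 2).alias(f"{name}_doubled") for name in ["a", "b", "c"]]

# ...then apply them in a single with_columns call so the optimizer
# sees the whole plan at once
result = df.with_columns(exprs)
```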

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

itertools.chain: Merge Lists Without Intermediate Copies

Problem:

Standard list merging with extend() or concatenation creates intermediate copies. This memory overhead becomes significant when processing large lists.

Solution:

itertools.chain() lazily merges multiple iterables without creating intermediate lists.
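
For example:

```python
from itertools import chain

evens = [0, 2, 4]
odds = [1, 3, 5]

# chain yields items lazily; no merged intermediate list is built
for n in chain(evens, odds):
    print(n)

# Materialize only when you actually need a list
merged = list(chain(evens, odds))
```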

Full Article:

5 Essential Itertools for Data Science

Run Code

☕️ Weekly Finds

fiftyone

ML

Open-source tool for building high-quality datasets and computer vision models

llama-stack

LLM

Composable building blocks to build Llama Apps with unified API for inference, RAG, agents, and more

grip

Python Utils

Preview GitHub README.md files locally before committing them using GitHub’s markdown API

Newsletter #245: PySpark: Avoid Double Conversions with applyInArrow

📅 Today’s Picks

PySpark: Avoid Double Conversions with applyInArrow

Problem:

applyInPandas lets you apply Pandas functions in PySpark by converting data from Arrow→Pandas→Arrow for each operation. This double conversion adds serialization overhead that slows down your transformations.

Solution:

applyInArrow (introduced in PySpark 4.0.0) works directly with PyArrow tables, eliminating the Pandas conversion step entirely. This keeps data in Arrow’s columnar format throughout the pipeline, making operations significantly faster. Trade-off: PyArrow’s syntax is less intuitive than Pandas, but it’s worth it if you’re processing large datasets where performance matters.
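
A minimal sketch (assuming PySpark 4.0+ with pyarrow installed; the data is made up):

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

def scale(table: pa.Table) -> pa.Table:
    # Operate on the Arrow table directly -- no Pandas round-trip
    idx = table.schema.get_field_index("value")
    return table.set_column(idx, "value", pc.multiply(table.column("value"), 10))

df.groupBy("key").applyInArrow(scale, schema="key string, value long").show()
```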

Full Article:

PySpark SQL Complete Guide

Run Code

View GitHub


Related Post

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem:

Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.

Solution:

Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows. As a result, they can be 10 to 100 times faster on large DataFrames.

Full Article:

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Run Code

View GitHub

☕️ Weekly Finds

causal-learn

ML

Python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms

POT

ML

Python Optimal Transport library providing solvers for optimization problems related to signal, image processing and machine learning

qdrant

MLOps

High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI

Newsletter #244: Handle Large Data with Polars Streaming Mode

📅 Today’s Picks

Handle Large Data with Polars Streaming Mode

Problem:

In Polars, the .collect() method executes a lazy query and loads the entire dataset into memory. This works well for smaller data, but once the dataset grows beyond your available RAM, it can easily crash your process.

Solution:

Add engine="streaming" to .collect() to process large datasets in small batches without running out of memory (see the sketch after this list). How it works:
Breaks the dataset into smaller, memory-friendly chunks
Processes one batch at a time while freeing memory as it goes
Combines all partial results into a single DataFrame
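
A minimal sketch (the file and columns are hypothetical):

```python
import polars as pl

# Lazily scan a CSV that may be larger than RAM
query = (
    pl.scan_csv("transactions.csv")  # hypothetical file
    .group_by("category")
    .agg(pl.col("amount").sum())
)

# The streaming engine runs the plan in memory-friendly batches
result = query.collect(engine="streaming")
```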

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

Build Professional Python Packages with UV --package

Problem:

Python packages turn your code into reusable modules you can share across projects. But building them requires complex setup with setuptools, managing build systems, and understanding distribution mechanics.

Solution:

UV, a fast Python package installer and resolver, reduces the entire process to two simple commands:
uv init --package sets up your package structure instantly
uv build and uv publish create and distribute to PyPI

Learn More:

Production-Ready Data Science: From Prototyping to Production with Python

View GitHub

☕️ Weekly Finds

whenever

Python Utils

Modern datetime library for Python that ensures correct and type-checked datetime manipulations. It is DST-safe and way faster than standard datetime libraries.

lancedb

MLOps

Developer-friendly, embedded retrieval database for AI/ML applications. The ultimate multimodal data platform designed for fast, scalable, and production-ready vector search.

grip

Python Utils

Preview GitHub README.md files locally before committing them. A command-line server that uses GitHub’s Markdown API to render local readme files.

Newsletter #243: Turn Your ML Tests Into Plain English with Behave

📅 Today’s Picks

Turn Your ML Tests Into Plain English with Behave

Problem:

Unit testing matters in data science, but writing tests that business stakeholders can actually understand is a challenge. If they can’t read the tests, they can’t confirm the logic matches business expectations.

Solution:

Behave turns test cases into plain-English specifications using the Given/When/Then format (see the sketch below). How to use Behave for readable tests:
Write .feature files with Given/When/Then syntax
Implement steps in Python using @given, @when, @then decorators
Run “behave” to execute tests
This lets technical and business teams stay aligned without confusion.
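
A minimal sketch of the step implementations (the scenario and threshold are made up):

```python
# features/steps/risk_steps.py -- pairs with a feature file such as:
#
#   Scenario: Model flags high-risk transactions
#     Given a transaction of 10000 dollars
#     When the model scores the transaction
#     Then the risk label is "high"
#
from behave import given, when, then

@given("a transaction of {amount:d} dollars")
def given_transaction(context, amount):
    context.amount = amount

@when("the model scores the transaction")
def score_transaction(context):
    # Toy rule standing in for a real model
    context.label = "high" if context.amount >= 5000 else "low"

@then('the risk label is "{expected}"')
def check_label(context, expected):
    assert context.label == expected
```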

Full Article:

Behave: Write Readable ML Tests with Behavior-Driven Development

Run Code

View GitHub

Build Powerful Data Pipelines with DuckDB + pandas

Problem:

Pandas is great for data cleaning and feature engineering, while SQL excels at complex aggregations. But moving data from pandas to a database and back can be tedious.

Solution:

DuckDB solves this by letting you run SQL directly on pandas DataFrames and returning the results back to pandas for further analysis.
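
For example, with made-up data:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"], "amount": [10, 20, 30]})

# DuckDB queries the in-memory DataFrame by its variable name,
# and .df() hands the result straight back to pandas
totals = duckdb.sql(
    "SELECT category, SUM(amount) AS total FROM df GROUP BY category"
).df()
```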

Full Article:

A Deep Dive into DuckDB for Data Scientists

Run Code

View GitHub

☕️ Weekly Finds

feast

MLOps

Open source feature store for machine learning that manages existing infrastructure to productionize ML models with fast data consistency and leakage prevention

git-who

Python Utils

CLI tool for industrial-scale git blaming that shows who is responsible for entire components or subsystems in your codebase, not just individual lines

organize

Python Utils

File management automation tool for safely moving, renaming, and copying files, with conflict resolution, duplicate detection, and Exif tag extraction

Newsletter #242: Build Faster Test Workflows with pytest Markers

📅 Today’s Picks

Build Faster Test Workflows with pytest Markers

Problem:

Large projects often contain hundreds of tests, and executing them all for every minor change quickly becomes inefficient.

Solution:

Pytest markers let you group tests by type, speed, or resource usage so you can run only what matters for your current task (see the sketch after this list). Quick guide to pytest markers:
Define markers in pytest.ini
Tag tests, for example: @pytest.mark.fast
Run specific tests: pytest -m fast
Skip certain tests: pytest -m "not slow"
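
A minimal sketch (the marker names are made up):

```python
# pytest.ini
#
#   [pytest]
#   markers =
#       fast: quick unit tests
#       slow: long-running tests

import pytest

@pytest.mark.fast
def test_addition():
    assert 1 + 1 == 2

@pytest.mark.slow
def test_full_pipeline():
    ...  # stands in for an expensive integration test

# Run only the fast tests:  pytest -m fast
# Skip the slow ones:       pytest -m "not slow"
```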

Learn More:

Production-Ready Data Science: From Prototyping to Production with Python

Run Code

📢 ANNOUNCEMENTS

Production-Ready Data Science Is Now on Leanpub

I am excited to share that Production-Ready Data Science is now live on Leanpub! On Leanpub, you can choose your price and get updates as more examples and chapters roll out. This book dives into the real engineering skills behind dependable data systems, including:
Testing
CI and CD
Environments and packaging
Data validation and logging
Reproducible workflows
If you want to take your data work beyond notebooks and into reliable production environments, this is for you.

Get the Book

☕️ Weekly Finds

AutoViz

Data Viz

Automatically visualize any dataset of any size with a single line of code

cognee

LLM

Memory for AI Agents in 6 lines of code

niquests

Python Utils

Simple, yet elegant, Python HTTP library: a drop-in replacement for python-requests
