
Newsletter Archive

Automated newsletter archive from Klaviyo campaigns


Newsletter #246: Faster Polars Queries with Programmatic Expressions

🤝
COLLABORATION

Build Safer APIs with Buf – Free Workshop

Building APIs is simple. Scaling them across teams and systems isn’t. Ensuring consistency, compatibility, and reliability quickly becomes a challenge as projects grow.

Buf provides a toolkit that makes working with Protocol Buffers faster, safer, and more consistent. Join Buf for a live, one-hour workshop on building safer, more consistent APIs.

When: Nov 19, 2025 • 9 AM PDT | 12 PM EDT | 5 PM BST

What you’ll learn:
How Protobuf makes API development safer and simpler
API design best practices for real-world systems
How to extend Protobuf to data pipelines and streaming systems

Register for the workshop

📅
Today’s Picks

Faster Polars Queries with Programmatic Expressions

Problem:

When you apply similar transformations in a for loop, each Polars with_columns() call runs sequentially. This prevents the optimizer from seeing the full computation plan.

Solution:

Instead, generate all Polars expressions programmatically before applying them together. This enables Polars to:
See the complete computation plan upfront
Optimize across all expressions simultaneously
Parallelize operations across CPU cores

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

itertools.chain: Merge Lists Without Intermediate Copies

Problem:

Standard list merging with extend() or concatenation creates intermediate copies. This memory overhead becomes significant when processing large lists.

Solution:

itertools.chain() lazily merges multiple iterables without creating intermediate lists.
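A quick illustration (the lists are toy data):

```python
from itertools import chain

a = [1, 2, 3]
b = [4, 5]
c = [6]

# chain() yields items lazily from each iterable in turn;
# no combined intermediate list is built unless you ask for one.
merged = chain(a, b, c)
total = sum(merged)          # consume lazily
print(total)                 # 21

# Materialize only when you actually need a list:
print(list(chain(a, b, c)))  # [1, 2, 3, 4, 5, 6]
```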

Full Article:

5 Essential Itertools for Data Science

Run Code

☕️
Weekly Finds

fiftyone

ML

Open-source tool for building high-quality datasets and computer vision models

llama-stack

LLM

Composable building blocks to build Llama Apps with unified API for inference, RAG, agents, and more

grip

Python Utils

Preview GitHub README.md files locally before committing them using GitHub’s markdown API



Newsletter #245: PySpark: Avoid Double Conversions with applyInArrow

📅
Today’s Picks

PySpark: Avoid Double Conversions with applyInArrow

Problem:

applyInPandas lets you apply Pandas functions in PySpark by converting data from Arrow→Pandas→Arrow for each operation. This double conversion adds serialization overhead that slows down your transformations.

Solution:

applyInArrow (introduced in PySpark 4.0.0) works directly with PyArrow tables, eliminating the Pandas conversion step entirely. This keeps data in Arrow’s columnar format throughout the pipeline, making operations significantly faster. Trade-off: PyArrow’s syntax is less intuitive than Pandas, but it’s worth it if you’re processing large datasets where performance matters.

Full Article:

PySpark SQL Complete Guide

Run Code

View GitHub


Related Post

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem:

Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.

Solution:

Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows. As a result, they can be 10 to 100 times faster on large DataFrames.
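A sketch of the vectorized core (the function name and multiplier are illustrative; the Spark wiring assumes a running SparkSession, so it is shown only in comments):

```python
import pandas as pd

def add_markup(prices: pd.Series) -> pd.Series:
    """Vectorized transform applied to a whole batch of rows at once."""
    return prices * 1.5  # illustrative 50% markup

# With a SparkSession, wrap it as a pandas UDF so Spark sends Arrow
# batches to this function instead of looping row by row:
#   from pyspark.sql.functions import pandas_udf
#   add_markup_udf = pandas_udf(add_markup, returnType="double")
#   df.withColumn("price_with_markup", add_markup_udf("price"))

print(add_markup(pd.Series([100.0, 200.0])).tolist())  # [150.0, 300.0]
```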

Full Article:

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Run Code

View GitHub

☕️
Weekly Finds

causal-learn

ML

Python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms

POT

ML

Python Optimal Transport library providing solvers for optimization problems related to signal, image processing and machine learning

qdrant

MLOps

High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI



Newsletter #244: Handle Large Data with Polars Streaming Mode

📅
Today’s Picks

Handle Large Data with Polars Streaming Mode

Problem:

In Polars, the .collect() method executes a lazy query and loads the entire dataset into memory. This works well for smaller data, but once the dataset grows beyond your available RAM, it can easily crash your process.

Solution:

Add engine="streaming" to .collect() to process large datasets in small batches without running out of memory. How it works:
Breaks the dataset into smaller, memory-friendly chunks
Processes one batch at a time while freeing memory as it goes
Combines all partial results into a single DataFrame

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

Build Professional Python Packages with UV --package

Problem:

Python packages turn your code into reusable modules you can share across projects. But building them requires complex setup with setuptools, managing build systems, and understanding distribution mechanics.

Solution:

UV, a fast Python package installer and resolver, reduces the entire process to two simple commands:
uv init --package sets up your package structure instantly
uv build and uv publish create and distribute to PyPI

Learn More:

Production-Ready Data Science: From Prototyping to Production with Python

View GitHub

☕️
Weekly Finds

whenever

Python Utils

Modern datetime library for Python that ensures correct and type-checked datetime manipulations. It is DST-safe and way faster than standard datetime libraries.

lancedb

MLOps

Developer-friendly, embedded retrieval database for AI/ML applications. The ultimate multimodal data platform designed for fast, scalable, and production-ready vector search.

grip

Python Utils

Preview GitHub README.md files locally before committing them. A command-line server that uses GitHub’s Markdown API to render local readme files.



Newsletter #243: Turn Your ML Tests Into Plain English with Behave

📅
Today’s Picks

Turn Your ML Tests Into Plain English with Behave

Problem:

Unit testing matters in data science, but writing tests that business stakeholders can actually understand is a challenge. If they can’t read the tests, they can’t confirm the logic matches business expectations.

Solution:

Behave turns test cases into plain-English specifications using the Given/When/Then format. How to use Behave for readable tests:
Write .feature files with Given/When/Then syntax
Implement steps in Python using @given, @when, @then decorators
Run behave to execute the tests
This lets technical and business teams stay aligned without confusion.

Full Article:

Behave: Write Readable ML Tests with Behavior-Driven Development

Run Code

View GitHub

Build Powerful Data Pipelines with DuckDB + pandas

Problem:

Pandas is great for data cleaning and feature engineering, while SQL excels at complex aggregations. But moving data from pandas to a database and back can be tedious.

Solution:

DuckDB solves this by letting you run SQL directly on pandas DataFrames and return the results back into pandas for further analysis.

Full Article:

A Deep Dive into DuckDB for Data Scientists

Run Code

View GitHub

☕️
Weekly Finds

feast

MLOps

Open source feature store for machine learning that manages existing infrastructure to productionize ML models with fast data consistency and leakage prevention

git-who

Python Utils

CLI tool for industrial-scale git blaming that shows who is responsible for entire components or subsystems in your codebase, not just individual lines

organize

Python Utils

File management automation tool for safe moving, renaming, copying files with conflict resolution, duplicate detection, and Exif tag extraction



Newsletter #242: Build Faster Test Workflows with pytest Markers

📅
Today’s Picks

Build Faster Test Workflows with pytest Markers

Problem:

Large projects often contain hundreds of tests, and executing them all for every minor change quickly becomes inefficient.

Solution:

Pytest markers let you group tests by type, speed, or resource usage so you can run only what matters for your current task. Quick guide to pytest markers:
Define markers in pytest.ini
Tag tests, for example: @pytest.mark.fast
Run specific tests: pytest -m fast
Skip certain tests: pytest -m "not slow"
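A minimal sketch of the steps above (the marker names and the test are illustrative; the pytest.ini registration is shown as a comment):

```python
# pytest.ini
# [pytest]
# markers =
#     fast: quick unit tests
#     slow: long-running tests

import pytest

@pytest.mark.fast
def test_addition():
    assert 1 + 1 == 2

# Run only the fast tests:   pytest -m fast
# Skip the slow ones:        pytest -m "not slow"
```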

Learn More:

Production-Ready Data Science: From Prototyping to Production with Python

Run Code

📢
ANNOUNCEMENTS

Production-Ready Data Science Is Now on Leanpub

I am excited to share that Production-Ready Data Science is now live on Leanpub! On Leanpub, you can choose your price and get updates as more examples and chapters roll out. This book dives into the real engineering skills behind dependable data systems, including:
Testing
CI and CD
Environments and packaging
Data validation and logging
Reproducible workflows
If you want to take your data work beyond notebooks and into reliable production environments, this is for you.

Get the Book

☕️
Weekly Finds

AutoViz

Data Viz

Automatically Visualize any dataset, any size with a single line of code

cognee

LLM

Memory for AI Agents in 6 lines of code

niquests

Python Utils

Simple, yet elegant, Python HTTP library: a drop-in replacement for python-requests



Newsletter #241: Polars: Lazy CSV Loading with Query Optimization

📅
Today’s Picks

Polars: Lazy CSV Loading with Query Optimization

Problem:

Pandas loads entire CSV files into memory immediately, even when you only need filtered or aggregated results. This eager evaluation wastes memory and processing time on data you’ll never use.

Solution:

Polars’ scan_csv() uses lazy evaluation to optimize queries before loading data. How scan_csv() works:
Analyzes your entire query before loading any data
Identifies which columns you actually need
Applies filters while reading the CSV file
Loads only the relevant data into memory

Full Article:

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames

Run Code

View GitHub

Build Structured AI Agents with LangChain TodoList

Problem:

Complex workflows require structured planning. Without it, agents may execute subtasks out of order or miss crucial ones entirely.

Solution:

LangChain v1.0 introduces TodoListMiddleware, which gives agents automatic task planning and progress tracking. Key benefits:
Decomposes complex requests into sequential steps
Marks each task as pending, in_progress, or completed
Ensures agents follow logical execution order

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Run Code

View GitHub

☕️
Weekly Finds

RAGxplorer

LLM

Open-source tool to visualize your RAG embeddings and document chunks

nbQA

Python Utils

Run ruff, isort, pyupgrade, mypy, pylint, flake8, and more on Jupyter Notebooks

prometheus-eval

LLM

Evaluate your LLM’s response with specialized language models for reproducible assessment



Newsletter #240: Auto-Summarize Chat History with LangChain Middleware

📅
Today’s Picks

Auto-Summarize Chat History with LangChain Middleware

Problem:

Long chat histories can quickly increase token usage, leading to higher API costs and slower responses.

Solution:

LangChain v1.0 introduces SummarizationMiddleware that automatically condenses older messages when token thresholds are exceeded. Key features:
Integrates into existing LangChain agents with minimal code changes
Automatic summarization when token limits are reached
Preserves recent context with configurable message retention
Uses efficient models for summarization (e.g., gpt-4o-mini)

Full Article:

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Run Code

View GitHub

Batch Process DataFrames with PySpark Pandas UDF Vectorization

Problem:

Traditional UDFs (User-Defined Functions) run your custom Python function on each row individually, which can significantly slow down DataFrame operations.

Solution:

Pandas UDFs solve this by batching data into chunks and applying vectorized pandas transformations across entire columns, rather than looping through rows. As a result, they can be 10 to 100 times faster on large DataFrames.

Full Article:

The Complete PySpark SQL Guide: DataFrames, Aggregations, Window Functions, and Pandas UDFs

Run Code

View GitHub

☕️
Weekly Finds

lifelines

ML

Survival analysis in Python with Kaplan Meier, Cox regression, and parametric models

nb-clean

Python Utils

Clean Jupyter notebooks for version control by removing outputs, metadata, and execution counts

FuzzTypes

Python Utils

Pydantic extension for autocorrecting field values using fuzzy string matching



Newsletter #239: Delta Lake: Insert + Update in One Operation

📅
Today’s Picks

Delta Lake: Insert + Update in One Operation

Problem:

In pandas, implementing upserts means running 3 separate operations: filter existing records, update matches, and append new ones. Each step requires a full data scan, increasing both code complexity and execution time.

Solution:

Delta Lake’s MERGE replaces this 3-step process with a single transaction that updates existing records and inserts new ones. How it works:
Compares source data with existing table records
Updates matching records with new values
Inserts records that don’t exist yet
Executes all changes together with automatic rollback if any step fails

Full Article:

Delta Lake: Transform pandas Prototypes into Production

Run Code

View GitHub


Related Post

Delta Lake vs pandas: Stop Silent Data Corruption

Problem:

Pandas allows type coercion during DataFrame operations. A single string value can silently convert numeric columns to object dtype, breaking downstream systems and corrupting data integrity.

Solution:

Delta Lake prevents these issues through strict schema enforcement at write time, validating data types before ingestion to maintain table integrity. Other features of Delta Lake:
Time travel provides instant access to any historical data version
ACID transactions guarantee data consistency across all operations
Smart file skipping eliminates 95% of unnecessary data scanning
Incremental processing handles billion-row updates efficiently
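The pandas failure mode described above is easy to reproduce (illustrative data); Delta Lake's schema enforcement would reject the bad write instead of coercing the column:

```python
import pandas as pd

df = pd.DataFrame({"amount": [10, 20, 30]})
print(df["amount"].dtype)  # int64

# One stray string silently flips the whole column to object dtype,
# with no error raised.
df = pd.concat([df, pd.DataFrame({"amount": ["oops"]})], ignore_index=True)
print(df["amount"].dtype)  # object
```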

Full Article:

Delta Lake: Transform pandas Prototypes into Production

Run Code

View GitHub

☕️
Weekly Finds

Boruta-Shap

ML

A tree-based feature selection tool combining the Boruta algorithm with SHAP values to identify the most important features for machine learning models.

a2a-python

LLM

Official Python SDK for building agentic applications as A2A Servers following the Agent2Agent Protocol, with async support and optional integrations.

respx

Python Utils

A Python library for mocking HTTPX and HTTP Core with request pattern matching and customizable response side effects for testing purposes.



Newsletter #238: Build Human-in-the-Loop AI Agents with LangChain

📅
Today’s Picks

Generate Time-Sortable IDs with Python 3.14’s UUID v7

Problem:

UUID4 generates purely random identifiers that lack chronological ordering. Without embedded timestamps, you need separate timestamp fields and custom sorting logic to organize records by creation time.

Solution:

Python 3.14 introduces UUID version 7 with built-in timestamp ordering. Key features:
Determine creation order by comparing two UUIDs directly
Retrieve exact creation time by extracting the embedded timestamp
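A sketch, guarded because uuid7() only exists on Python 3.14+; the timestamp extraction assumes the RFC 9562 layout, where the top 48 bits hold Unix milliseconds:

```python
import uuid

if hasattr(uuid, "uuid7"):  # Python 3.14+
    uid = uuid.uuid7()
    assert uid.version == 7

    # Top 48 bits of the UUID are the Unix timestamp in milliseconds,
    # so UUIDs created later compare greater and sort by creation time.
    ts_ms = uid.int >> 80
    print(ts_ms)
else:
    print("uuid.uuid7() requires Python 3.14+")
```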

Build Human-in-the-Loop AI Agents with LangChain

Problem:

Without human oversight, AI agents can make irreversible mistakes by executing risky operations like database deletions.

Solution:

LangChain v1.0’s interrupt() function pauses agent execution at critical decision points for human review. How it works:
interrupt() pauses tool execution for human review
MemorySaver checkpointer enables pause/resume functionality
Human reviews proposed action and approves or rejects
Command(resume=…) continues execution after approval
This gives you full control over critical AI decisions before they execute.

Full Article:

Run Private AI Workflows with LangChain and Ollama

Run Code

View GitHub

☕️
Weekly Finds

formulas

Data Processing

Excel formulas interpreter in Python that parses and compiles Excel formula expressions and workbooks

MindsDB

ML

Open-source AI development platform that allows users to build, train and deploy machine learning models using SQL queries

TimescaleDB

Data Engineer

PostgreSQL extension for high-performance real-time analytics on time-series and event data



Newsletter #237: Build Clean Visualizations with Altair Grammar

📅
Today’s Picks

Faster Data Compression with Python 3.14 Zstandard

Problem:

Compressing large datasets with gzip is slow, and the resulting files are larger than modern codecs like Zstandard produce. Using external compression libraries adds dependency complexity to your data pipeline.

Solution:

Python 3.14 includes built-in Zstandard compression that’s 2-3x faster than gzip with better compression ratios. Key benefits:
Native Python module (no external dependencies)
Compression levels from 1-22 for speed vs. size tradeoffs
Stream-based API for memory-efficient processing
Perfect for data archival and transfer workflows
Ideal for data scientists working with large CSV files, model checkpoints, and dataset distributions.
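A sketch using the new compression.zstd module, guarded because it only ships with Python 3.14+ (the payload and compression level are illustrative):

```python
try:
    from compression import zstd  # Python 3.14+
except ImportError:
    zstd = None

data = b"repetitive data " * 1_000

if zstd is not None:
    # Higher levels trade speed for size (illustrative level 3).
    packed = zstd.compress(data, level=3)
    assert zstd.decompress(packed) == data
    print(f"{len(data)} -> {len(packed)} bytes")
else:
    print("compression.zstd requires Python 3.14+")
```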

Run Code

Build Clean Visualizations with Altair Grammar

Problem:

Matplotlib requires manual data transformation and explicit configuration for every visual element.

Solution:

Altair uses declarative syntax based on Vega-Lite for intuitive, readable visualizations. With Altair, you can describe what you want, not how to create it:
Automatic formatting with type encoding (:T, :Q, :N, :O)
Built-in aggregations: mean(), sum(), count()
No manual groupby or date conversion
Easy chart composition and layering
Interactive features with minimal code

Full Article:

Top 6 Python Libraries for Visualization: Which One to Use

Run Code

View GitHub

☕️
Weekly Finds

pyscn

Python Utils

High-performance Python code quality analyzer built with Go. Designed for the AI-assisted development era

skills

LLM

Example Skills repository to customize Claude with agent skills for workflows and automation

SeleniumBase

Python Utils

Python framework for web automation, testing, and bypassing bot-detection mechanisms

