Newsletter Archive Archives

Newsletter #212: Delta Lake: Never Lose Data to Failed Writes Again

Leave a Comment / Newsletter Archive / Khuyen Tran

📅
Today’s Picks

Delta Lake: Never Lose Data to Failed Writes Again

Problem:

Have you ever had a pandas operation fail midway through writing data, leaving you with corrupted datasets? Partial writes create inconsistent data states that can break downstream analysis and reporting workflows.

Solution:

Delta Lake provides ACID transactions that guarantee all-or-nothing writes with automatic rollback on failures.

ACID properties:
Atomicity: A transaction either succeeds completely or is rolled back automatically
Consistency: Every committed write leaves the table in a valid state
Isolation: Concurrent operations run safely without interfering with each other
Durability: Committed writes persist, with version history available through time travel

Full Article:

Delta Lake: Never Lose Data to Failed Writes Again

☕️
Weekly Finds

TinyDB

Database

Lightweight, document-oriented database written in pure Python with no external dependencies. Designed to be simple and developer-friendly, storing data in JSON format by default.

ollama-python

LLM

Python library that provides the easiest way to integrate Python 3.8+ projects with Ollama, an open-source large language model platform. Offers both synchronous and asynchronous client interfaces for seamless AI model interaction.

PyMC

Python package for Bayesian statistical modeling that focuses on advanced Markov chain Monte Carlo (MCMC) and variational inference (VI) algorithms. Enables researchers and data scientists to build sophisticated Bayesian models with minimal algorithmic complexity.

⭐
Related Post

From pandas Full Reloads to Delta Lake Incremental Updates

Problem:

Processing entire datasets when you only need to add a few new records wastes time and memory. pandas lacks built-in incremental append capabilities, so data updates require a full dataset reload.

Solution:

Delta Lake’s append mode processes only new data without touching existing records.

Key advantages:
Append new records without full dataset reload
Memory usage scales with new data size, not total dataset size
Automatic data protection prevents corruption during updates
Time travel enables rollback to previous dataset versions
Perfect for production data pipelines that need reliable incremental updates.

Full Article:

From pandas Full Reloads to Delta Lake Incremental Updates

Newsletter #211: Secure Database Queries with DuckDB Parameters

📅
Today’s Picks

Secure Database Queries with DuckDB Parameters

Problem:

F-strings create SQL injection vulnerabilities by inserting values directly into queries.

Solution:

DuckDB’s parameterized queries use placeholders to safely pass parameters and prevent SQL injection attacks.

Other key features of DuckDB:
In-Process Analytics – No external database needed
Fast Performance – Columnar storage for speed
Zero Setup – Works instantly in Python
DataFrame Integration – Native pandas support
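
Here is a minimal sketch of the placeholder idea. To keep it runnable with only the standard library, it uses sqlite3 as a stand-in rather than DuckDB. DuckDB's Python API also accepts ? placeholders with a list of parameters, but the table, rows, and attack string below are made up for illustration.

```python
import sqlite3

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "viewer")],
)

# Input crafted to change the meaning of the query.
malicious = "nobody' OR '1'='1"

# Unsafe: the f-string pastes the input into the SQL text,
# so the WHERE clause becomes always true.
unsafe_sql = f"SELECT name FROM users WHERE name = '{malicious}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the placeholder passes the input as a value, never as SQL.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

print(len(unsafe_rows))  # every row leaks
print(safe_rows)         # no user has that literal name
```

The unsafe query returns every row, while the parameterized one treats the whole string as a name and matches nothing.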

Full Article:

Secure Database Queries with DuckDB Parameters

Build Semantic Text Matching with Sentence Transformers

Problem:

RapidFuzz, which I introduced in my previous post, excels at lightning-fast string matching. However, it cannot understand semantic relationships: it scores ‘running shoes’ vs ‘athletic footwear’ at only 0.267 despite both describing similar product categories. RapidFuzz compares characters, not meaning, so different words describing identical concepts get low scores.

Solution:

Sentence Transformers comprehends conceptual similarity by analyzing word meanings.

Sentence Transformers follows this process:
Creates embedding vectors that represent word concepts
Similar meanings produce similar embedding patterns
Compares these concept embeddings to identify semantically similar text
Recognizes synonyms and related terminology automatically

Full Article:

Build Semantic Text Matching with Sentence Transformers

☕️
Weekly Finds

tenacity

Testing & Reliability

Apache 2.0 licensed general-purpose retrying library for Python to simplify adding retry behavior to just about anything

ParadeDB

Database & Search

Modern Elasticsearch alternative built on Postgres for real-time, update-heavy workloads with full-text search capabilities

responses

Testing & Mocking

Utility library for mocking out the Python Requests library, making it easy to test HTTP API interactions

⭐
Related Post

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem:

Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.

Solution:

RapidFuzz eliminates data cleaning overhead with intelligent fuzzy matching.

Key benefits:
Automatic handling of typos, spacing, and case variations
Production-ready C++ performance for large datasets
Full spectrum of fuzzy algorithms in one library

Full Article:

Handle Messy Data with RapidFuzz Fuzzy Matching

Newsletter #210: MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

📅
Today’s Picks

Feature Engineering Without Complex Nested Loops

Problem:

Nested loops for sequence permutations create exponential complexity that becomes unmanageable as data grows.

Solution:

The itertools.permutations() function automatically generates all ordered arrangements of items from your sequences. Perfect for generating interaction features that preserve temporal or logical ordering in your feature set.
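
As a rough sketch of the idea, the snippet below builds ordered feature pairs with itertools.permutations(). The feature names and the "_then_" naming scheme are assumptions made up for this example.

```python
from itertools import permutations

# Hypothetical feature names; replace with your own columns.
features = ["price", "quantity", "discount"]

# All ordered pairs: ("price", "quantity") and ("quantity", "price")
# are distinct, which is what preserves ordering information.
ordered_pairs = list(permutations(features, 2))

# Turn each ordered pair into an interaction feature name.
interaction_names = [f"{first}_then_{second}" for first, second in ordered_pairs]

print(len(ordered_pairs))
print(interaction_names[0])
```

Three features give 3 × 2 = 6 ordered pairs, with no nested loops to maintain.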

Full Article:

Feature Engineering Without Complex Nested Loops

MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

Problem:

Have you ever wanted to convert PDFs to text for analysis and search but found it hard to do so? While there are many tools to convert PDFs to text, they often lose structure and readability.

Solution:

Microsoft MarkItDown preserves document structure while converting PDFs to clean markdown format.

The library handles multiple file types and maintains formatting hierarchy:
Clean markdown output with preserved headers and structure
Support for PDFs, Word docs, PowerPoint, and Excel files
Simple three-line implementation for any document type
Seamless integration with existing RAG pipelines

Full Article:

MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

☕️
Weekly Finds

scalene

Performance & Profiling

A high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals

bandit

Security & Code Quality

A tool designed to find common security issues in Python code through static code analysis

river

Machine Learning

Online machine learning in Python – enabling incremental learning algorithms for streaming data

⭐
Related Post

Transform PDFs to Pandas with Docling’s Complete Pipeline

Problem:

Most PDF processing tools force you to stitch together multiple solutions – one for extraction, another for parsing, and yet another for chunking. Each step introduces potential data loss and format incompatibilities, making document processing complex and error-prone.

Solution:

Docling handles the entire workflow from raw PDFs to structured, searchable content in a single solution.

Key features:
Universal format support for PDF, DOCX, PPTX, HTML, and images
AI-powered extraction with TableFormer and Vision models
Direct export to pandas DataFrames, JSON, and Markdown
RAG-ready output maintains context and structure

Full Article:

Transform PDFs to Pandas with Docling’s Complete Pipeline

Newsletter #209: Transform PDFs to Pandas with Docling’s Complete Pipeline

📅
Today’s Picks

Transform PDFs to Pandas with Docling’s Complete Pipeline

Problem:

Most PDF processing tools force you to stitch together multiple solutions – one for extraction, another for parsing, and yet another for chunking. Each step introduces potential data loss and format incompatibilities, making document processing complex and error-prone.

Solution:

Full Article:

Transform PDFs to Pandas with Docling’s Complete Pipeline

☕️
Weekly Finds

semantic-kernel

AI Orchestration

Model-agnostic SDK that empowers developers to build, orchestrate, and deploy AI agents and multi-agent systems with enterprise-grade reliability.

transformers

Machine Learning

The model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

whisper

Speech Recognition

Robust Speech Recognition via Large-Scale Weak Supervision. A multitasking model for multilingual speech recognition, translation, and language identification.

⭐
Related Post

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem:

Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.

Solution:

Full Article:

Handle Messy Data with RapidFuzz Fuzzy Matching

Newsletter #208: Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

📅
Today’s Picks

Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

Problem:

Data prototyping typically requires loading entire datasets into memory first before sampling. A 1-million-row dataset consumes 7.6 MB of memory even when you only need 10 rows for initial feature exploration, creating unnecessary resource overhead.

Solution:

Use itertools.islice() to extract slices from iterators without loading full datasets into memory first.

Key benefits:
Memory-efficient data sampling
Faster prototyping workflows
Less computational load on laptops
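
A small sketch of the sampling pattern, assuming the data arrives as an iterator. The generate_rows() generator and its row layout are hypothetical stand-ins for a large file reader or database cursor; the counter exists only to show how many rows were actually produced.

```python
from itertools import islice

# Tracks how many rows the generator really produced.
counter = {"produced": 0}

def generate_rows(total_rows):
    """Hypothetical lazy row source standing in for a large dataset."""
    for i in range(total_rows):
        counter["produced"] += 1
        yield {"id": i, "value": i * 2}

# Take only the first 10 rows; the rest are never generated.
sample = list(islice(generate_rows(1_000_000), 10))

print(len(sample))
print(counter["produced"])
```

Because the generator is lazy, islice() stops pulling rows once it has the 10 it needs.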

Full Article:

Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

From pandas Full Reloads to Delta Lake Incremental Updates

Problem:

Processing entire datasets when you only need to add a few new records wastes time and memory. pandas lacks built-in incremental append capabilities, so data updates require a full dataset reload.

Solution:

Full Article:

From pandas Full Reloads to Delta Lake Incremental Updates

☕️
Weekly Finds

Semantic Kernel

AI Framework

Model-agnostic SDK that empowers developers to build, orchestrate, and deploy AI agents and multi-agent systems with enterprise-grade reliability

Ray

Distributed Computing

AI compute engine with core distributed runtime and AI Libraries for accelerating ML workloads from laptop to cluster

Apache Airflow

Workflow Orchestration

Platform for developing, scheduling, and monitoring workflows with powerful data pipeline orchestration capabilities

Newsletter #207: Build Automated Chart Analysis with Hugging Face SmolVLM

📅
Today’s Picks

Build Automated Chart Analysis with Hugging Face SmolVLM

Problem:

Data teams spend hours manually analyzing charts and extracting insights from complex visualizations. Manual chart analysis creates bottlenecks in decision-making workflows and reduces time available for strategic insights.

Solution:

Hugging Face’s SmolVLM transforms this workflow by instantly generating insights, allowing analysts to focus on validation, strategic context, and decision-making rather than basic pattern recognition.

The complete workflow could look like this:
Automated chart interpretation using vision language models
Expert review and validation of AI findings
Strategic context addition by domain specialists

Full Article:

Build Automated Chart Analysis with Hugging Face SmolVLM

Hydra Multi-run: Test All Parameters in One Command

Problem:

When you run a Python script with different preprocessing strategies and hyperparameter combinations, waiting for each variation to complete before testing the next is time-consuming.

Solution:

Hydra multi-run executes all parameter combinations in a single command, saving you time and effort.

Plus, Hydra offers:
YAML-based configuration management
Override parameters from the command line
Compose configs from multiple files
Environment-specific configuration switching

Full Article:

Hydra Multi-run: Test All Parameters in One Command

☕️
Weekly Finds

Scrapegraph-ai

Data Extraction

Python scraper based on AI

Marker

Document Processing

Convert PDF to markdown quickly with high accuracy

EdgeDB

Database

A graph-relational database with declarative schema, built-in migration system, and a next-generation query language

Newsletter #206: Handle Messy Data with RapidFuzz Fuzzy Matching

📅
Today’s Picks

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem:

Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.

Solution:

Full Article:

Handle Messy Data with RapidFuzz Fuzzy Matching

⭐
Related Post

Build Fuzzy Text Matching with difflib Over regex

Problem:

Have you ever spent hours cleaning text data with regex, only to find that “iPhone 14 Pro Max” still doesn’t match “iPhone 14 Prro Max”? Regex preprocessing still ends in exact matching, so it fails on typos and other character-level variations.

Solution:

difflib provides similarity scoring that tolerates typos and character variations, enabling approximate matching where regex fails.

The library calculates similarity ratios between strings:
Handles typos like “Prro” vs “Pro” automatically
Returns similarity scores from 0.0 to 1.0 for ranking matches
Works with character-level variations without preprocessing
Enables fuzzy matching for real-world messy data
Perfect for product matching, name deduplication, and any scenario where exact matches aren’t realistic.
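
A minimal sketch of the similarity scoring described above, using the product names from the problem statement. The catalog list and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# The typo from the example above still scores close to 1.0.
score = similarity("iPhone 14 Pro Max", "iPhone 14 Prro Max")

# Hypothetical product catalog to match a messy entry against.
catalog = ["iPhone 14 Pro Max", "iPhone 14", "Galaxy S23 Ultra"]
best_match = get_close_matches("iPhone 14 Prro Max", catalog, n=1, cutoff=0.8)

print(round(score, 3))
print(best_match)
```

get_close_matches() ranks candidates by the same ratio, so it can pick the most likely catalog entry without any regex preprocessing.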

Full Article:

Build Fuzzy Text Matching with difflib Over regex

Newsletter #205: Build Debuggable Tests: One Assertion Per Function

📅
Today’s Picks

Ruff: Stop AI Code Complexity Before It Hits Production

Problem:

AI agents often create overengineered code with multiple nested if/else and try/except blocks, increasing technical debt and making functions difficult to test. However, it is time-consuming to check each function manually.

Solution:

Ruff’s C901 complexity check automatically flags overly complex functions before they enter your codebase. This tool counts decision points (if/else, loops) that create multiple execution paths in your code.

Key benefits:
Automatic detection of complex functions during development
Configurable complexity thresholds for your team standards
Integration with pre-commit hooks for automated validation
Clear error messages showing exact complexity scores
No more manual code reviews to catch overengineered functions.
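
To make "counting decision points" concrete, here is a simplified approximation written with the standard ast module. It is not Ruff's implementation, and the set of node types counted and the sample function are assumptions for illustration only.

```python
import ast

# Node types treated as decision points in this simplified count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def approximate_complexity(source):
    """Return 1 plus the number of decision points found in the source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES) for node in ast.walk(tree))

# Hypothetical function to measure.
SAMPLE = '''
def classify(value):
    if value is None:
        return "missing"
    for threshold in (10, 100):
        if value < threshold:
            return f"below {threshold}"
    try:
        return str(int(value))
    except ValueError:
        return "invalid"
'''

score = approximate_complexity(SAMPLE)
print(score)
```

The sample has two if statements, one loop, and one except handler, so the approximate score is 5. A complexity threshold works the same way: any function scoring above the configured limit gets flagged.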

Full Article:

Ruff: Stop AI Code Complexity Before It Hits Production

Build Debuggable Tests: One Assertion Per Function

Problem:

Tests with multiple assertions make debugging harder. When a test fails, you can’t tell which assertion broke without examining the code.

Solution:

Create multiple specific test functions for different scenarios of the same function.

Follow these practices for focused test functions:
One assertion per test function for clear failure points
Use descriptive test names that explain the expected behavior
Maintain consistent naming patterns across your test suite
This approach makes your test suite more maintainable and failures easier to diagnose.
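
A short sketch of what this can look like in a pytest-style test file. The normalize_email() function under test is hypothetical and exists only to give the tests something to check.

```python
def normalize_email(raw):
    """Hypothetical function under test: trim whitespace and lowercase."""
    return raw.strip().lower()

# One assertion per test, with names that describe the expected behavior.
def test_normalize_email_strips_surrounding_whitespace():
    assert normalize_email("  user@example.com  ") == "user@example.com"

def test_normalize_email_lowercases_address():
    assert normalize_email("User@Example.COM") == "user@example.com"

def test_normalize_email_keeps_clean_address_unchanged():
    assert normalize_email("user@example.com") == "user@example.com"
```

If lowercasing breaks, only test_normalize_email_lowercases_address fails, and the test name alone tells you what went wrong.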

Full Article:

Build Debuggable Tests: One Assertion Per Function

Newsletter #204: Build Fuzzy Text Matching with difflib Over regex

📅
Today’s Picks

Build Fuzzy Text Matching with difflib Over regex

Problem:

Solution:

Full Article:

Build Fuzzy Text Matching with difflib Over regex

Build Portable Python Scripts with uv PEP 723

Problem:

Python scripts break when moved between environments because dependencies are scattered across requirements.txt files, virtual environments, or undocumented assumptions.

Solution:

uv enables PEP 723 inline script dependencies, embedding all requirements directly in the script header for true portability. Use uv add --script script.py dependency to automatically add metadata to any Python file.

Key benefits:
Self-contained scripts with zero external files
Easy command-line dependency management
Perfect for sharing data analysis code across teams
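
To show what the embedded header looks like, the sketch below stores a hypothetical script as a string and extracts its metadata block. The regular expression is adapted from the one given in PEP 723; the example script, its pandas dependency, and the extract_script_metadata() helper are assumptions for illustration.

```python
import re

# Hypothetical script contents with a PEP 723 inline metadata header.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "pandas",
# ]
# ///

import pandas as pd

print(pd.__version__)
'''

# Pattern adapted from the PEP 723 reference implementation.
BLOCK_PATTERN = (
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s"
    r"(?P<content>(^#(| .*)$\s)+)^# ///$"
)

def extract_script_metadata(source):
    """Return the TOML text of the 'script' block, or None if absent."""
    match = re.search(BLOCK_PATTERN, source)
    if match is None or match.group("type") != "script":
        return None
    lines = match.group("content").splitlines(keepends=True)
    # Drop the leading "# " (or a bare "#") from every metadata line.
    return "".join(
        line[2:] if line.startswith("# ") else line[1:] for line in lines
    )

metadata = extract_script_metadata(EXAMPLE_SCRIPT)
print(metadata)
```

The header is plain comments, so the file stays a valid Python script while carrying its own dependency list.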

Full Article:

Build Portable Python Scripts with uv PEP 723
