Newsletter Archive

Automated newsletter archive from Klaviyo campaigns

Code example: Milvus: Unified Search Across Text, Images, and Audio (Sponsored)

Newsletter #216: Milvus: Unified Search Across Text, Images, and Audio

📅
Today’s Picks

Create Compelling Animated Visualizations with Matplotlib Animation

Problem:

Static charts can’t reveal how data patterns and relationships change over time.

Solution:

With Matplotlib’s animation module, you can transform static plots into dynamic, interactive data stories.

Some use cases of Matplotlib animation:
Time series data visualization showing trends over periods
Machine learning model convergence and training progress
Scientific simulations and mathematical function behavior
Business metrics dashboards with real-time updates

Full Article:

Top 6 Python Libraries for Visualization: Which One to Use

Milvus: Unified Search Across Text, Images, and Audio

Problem:

It is a pain to search across text documents, images, and audio files in different search systems. Traditional search engines excel at text but struggle with visual content, while media-specific tools can’t understand textual context.

Solution:

Milvus supports multi-modal search by storing embeddings from different data types in a single collection. This allows you to query text, images, and audio simultaneously.

Here’s how Milvus works:
Generate embeddings for text, images, and audio using specialized models
Store all embeddings in a unified Milvus collection with metadata
Execute similarity searches across all content types simultaneously
Return ranked results regardless of original data format
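
The steps above can be illustrated with a toy sketch in plain Python. This is not the Milvus API: the in-memory list `collection`, the three-dimensional vectors, and the `search` helper are all hypothetical stand-ins, and the sketch assumes the embeddings for every modality already live in one shared vector space.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical unified collection: one record per item, whatever its modality.
# The vectors are made-up toy values, not real model outputs.
collection = [
    {"id": "doc-1", "modality": "text", "embedding": [0.9, 0.1, 0.0]},
    {"id": "img-1", "modality": "image", "embedding": [0.8, 0.2, 0.1]},
    {"id": "aud-1", "modality": "audio", "embedding": [0.0, 0.1, 0.9]},
]

def search(query_embedding, k=2):
    """Rank every record by similarity to the query, ignoring modality."""
    ranked = sorted(
        collection,
        key=lambda record: cosine(query_embedding, record["embedding"]),
        reverse=True,
    )
    return [(record["id"], record["modality"]) for record in ranked[:k]]

results = search([1.0, 0.0, 0.0])
print(results)
```

A real deployment would replace the list with a Milvus collection and the sort with an indexed similarity search, but the ranking idea is the same.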

☕️
Weekly Finds

phoenix

MLOps

Open-source AI observability platform for experimentation, evaluation, and troubleshooting of LLM applications

mesop

Python Utils

Python-based UI framework for rapidly building web apps and ML/AI demos

crawlee-python

Python Utils

Web scraping and browser automation library

Code example: All or Nothing: DuckDB Transaction Guarantee

Newsletter #215: All or Nothing: DuckDB Transaction Guarantee

🤝
COLLABORATION

Beyond Analytics: Get Apache Airflow® 3 certified (for free)

On September 16, Beyond Analytics kicks off with a live Airflow 3 Certification Crash Course, where you can ask questions and prepare for the Airflow 3 certification exam.

Join “Data with Marc’s” creator Marc Lamberti for a live session where you will:
Learn about the Airflow 3 features that will be covered in the exam, such as scheduling, DAG versioning, and backfills
Get your certification questions answered live
Receive a $150 voucher for the official Airflow 3 certification exam

Register here

📅
Today’s Picks

All or Nothing: DuckDB Transaction Guarantee

Problem:

Data operations can fail partway through, leaving databases in inconsistent states. Money transfers, inventory updates, and other critical operations need guaranteed atomicity.

Solution:

DuckDB uses ACID transactions to maintain data integrity. Operations either complete fully or roll back completely using BEGIN, COMMIT, and ROLLBACK commands.

Why ACID transactions matter:
Atomicity: prevents half-completed operations
Consistency: maintains database integrity rules
Isolation: stops concurrent operations from conflicting
Durability: ensures committed data survives system failures

Full Article:

A Deep Dive into DuckDB for Data Scientists

Related Post

Secure Database Queries with DuckDB Parameters

Problem:

F-strings create SQL injection vulnerabilities by inserting values directly into queries.

Solution:

DuckDB’s parameterized queries use placeholders to safely pass parameters and prevent SQL injection attacks.

Other key features of DuckDB:
In-Process Analytics – No external database needed
Fast Performance – Columnar storage for speed
Zero Setup – Works instantly in Python
DataFrame Integration – Native pandas support

Full Article:

A Deep Dive into DuckDB for Data Scientists

☕️
Weekly Finds

gpt-migrate

AI Tools

Easily migrate your codebase from one framework or language to another using AI

lmql

LLM

A query language for programming large language models with structured outputs

respx

Python Utils

Mock HTTPX with awesome request patterns and response side effects for testing

Code example: Create Compelling Animated Visualizations with Matplotlib Animation

Newsletter #214: Create Compelling Animated Visualizations with Matplotlib Animation

📅
Today’s Picks

Create Compelling Animated Visualizations with Matplotlib Animation

Problem:

Static charts can’t reveal how data patterns and relationships change over time.

Solution:

With Matplotlib’s animation module, you can transform static plots into dynamic, interactive data stories.

Some use cases of Matplotlib animation:
Time series data visualization showing trends over periods
Machine learning model convergence and training progress
Scientific simulations and mathematical function behavior
Business metrics dashboards with real-time updates

Full Article:

Create Compelling Animated Visualizations with Matplotlib Animation

☕️
Weekly Finds

implicit

Machine Learning

Fast Python implementations of several different popular recommendation algorithms for implicit feedback datasets

developer

AI Development

AI-powered code generation tool designed to automate software development processes and build entire codebases with prompts

datasets-server

Data Infrastructure

Backend API for visualizing and exploring all types of datasets – computer vision, speech, text, and tabular – stored on Hugging Face Hub

Code example: Query GitHub Issues with Natural Language Using LangChain

Newsletter #213: Query GitHub Issues with Natural Language Using LangChain

📅
Today’s Picks

Query GitHub Issues with Natural Language Using LangChain

Problem:

Have you ever spent hours clicking through GitHub pages to understand project status, track bugs, or review recent changes? Manual repository analysis wastes development time that could be spent building features.

Solution:

LangChain’s GitHubIssuesLoader converts repository issues and PRs into searchable content that responds to natural language questions about bugs, features, and project status. This method integrates seamlessly with LangChain workflows.

Full Article:

Run Private AI Workflows with LangChain and Ollama

Mock External APIs for Fast, Reliable Tests

Problem:

Testing with real APIs and databases is slow, expensive, and unreliable. External dependencies create flaky tests that can fail due to network issues, rate limits, or service downtime rather than code problems.

Solution:

The patch decorator replaces external calls with controllable mock objects for isolated testing.

Key benefits:
Reproducible results across different machines
Fast, reliable tests that focus on your logic
Test edge cases and error conditions that are hard to trigger naturally

Test your data processing logic without waiting for external services or consuming API quotas.
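
A minimal sketch of the idea using only the standard library. The `get_price` function, the `api.example.com` URL, and the JSON payload are hypothetical; the point is only that `patch` swaps `urllib.request.urlopen` for a mock, so no network call is made.

```python
import json
import urllib.request
from unittest.mock import MagicMock, patch

def get_price(symbol):
    """Hypothetical function that would normally hit a real HTTP API."""
    url = f"https://api.example.com/prices/{symbol}"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())["price"]

@patch("urllib.request.urlopen")
def check_get_price(mock_urlopen):
    # Build a fake response and hand it back from the mocked context manager.
    fake_response = MagicMock()
    fake_response.read.return_value = b'{"price": 42.5}'
    mock_urlopen.return_value.__enter__.return_value = fake_response

    price = get_price("ABC")

    # Confirm the code asked for the expected URL exactly once.
    mock_urlopen.assert_called_once_with("https://api.example.com/prices/ABC")
    return price

result = check_get_price()
print(result)
```

Because the response is fully controlled, the same test can also feed malformed JSON or raise an exception from the mock to exercise error handling.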

Full Article:

Pytest for Data Scientists

☕️
Weekly Finds

filprofiler

Performance Profiling

A Python memory profiler for data processing applications with native Jupyter support

organize

Automation

The file management automation tool for sorting, renaming, and organizing files

plotnine

Data Visualization

A Grammar of Graphics for Python based on ggplot2 for data visualization

Code example: Delta Lake: Never Lose Data to Failed Writes Again

Newsletter #212: Delta Lake: Never Lose Data to Failed Writes Again

📅
Today’s Picks

Delta Lake: Never Lose Data to Failed Writes Again

Problem:

Have you ever had a pandas operation fail midway through writing data, leaving you with corrupted datasets? Partial writes create inconsistent data states that can break downstream analysis and reporting workflows.

Solution:

Delta Lake provides ACID transactions that guarantee all-or-nothing writes with automatic rollback on failures.

ACID properties:
Atomicity: Complete transaction success or automatic rollback
Consistency: Data consistency guaranteed
Isolation: Safe concurrent operations
Durability: Version history with time travel

Full Article:

Delta Lake: Never Lose Data to Failed Writes Again

☕️
Weekly Finds

TinyDB

Database

Lightweight, document-oriented database written in pure Python with no external dependencies. Designed to be simple and developer-friendly, storing data in JSON format by default.

ollama-python

LLM

Python library that provides the easiest way to integrate Python 3.8+ projects with Ollama, an open-source large language model platform. Offers both synchronous and asynchronous client interfaces for seamless AI model interaction.

PyMC

ML

Python package for Bayesian statistical modeling that focuses on advanced Markov chain Monte Carlo (MCMC) and variational inference (VI) algorithms. Enables researchers and data scientists to build sophisticated Bayesian models with minimal algorithmic complexity.


Related Post

From pandas Full Reloads to Delta Lake Incremental Updates

Problem:

Processing entire datasets when you only need to add a few new records wastes time and memory. Pandas lacks incremental append capabilities, requiring full dataset reload for data updates.

Solution:

Delta Lake’s append mode processes only new data without touching existing records.

Key advantages:
Append new records without full dataset reload
Memory usage scales with new data size, not total dataset size
Automatic data protection prevents corruption during updates
Time travel enables rollback to previous dataset versions

Perfect for production data pipelines that need reliable incremental updates.

Full Article:

From pandas Full Reloads to Delta Lake Incremental Updates

Code example: Secure Database Queries with DuckDB Parameters

Newsletter #211: Secure Database Queries with DuckDB Parameters

📅
Today’s Picks

Secure Database Queries with DuckDB Parameters

Problem:

F-strings create SQL injection vulnerabilities by inserting values directly into queries.

Solution:

DuckDB’s parameterized queries use placeholders to safely pass parameters and prevent SQL injection attacks.

Other key features of DuckDB:
In-Process Analytics – No external database needed
Fast Performance – Columnar storage for speed
Zero Setup – Works instantly in Python
DataFrame Integration – Native pandas support

Full Article:

Secure Database Queries with DuckDB Parameters

Build Semantic Text Matching with Sentence Transformers

Problem:

RapidFuzz, which I introduced in my previous post, excels at lightning-fast string matching. However, it cannot understand semantic relationships. It scores ‘running shoes’ vs ‘athletic footwear’ at only 0.267 despite describing similar product categories. RapidFuzz compares characters, not meaning, so different words describing identical concepts get low scores.

Solution:

Sentence Transformers comprehends conceptual similarity by analyzing word meanings.

Sentence Transformers follows this process:
Creates embedding vectors that represent word concepts
Similar meanings produce similar embedding patterns
Compares these concept embeddings to identify semantically similar text
Recognizes synonyms and related terminology automatically

Full Article:

Build Semantic Text Matching with Sentence Transformers

☕️
Weekly Finds

tenacity

Testing & Reliability

Apache 2.0 licensed general-purpose retrying library for Python to simplify adding retry behavior to just about anything

ParadeDB

Database & Search

Modern Elasticsearch alternative built on Postgres for real-time, update-heavy workloads with full-text search capabilities

responses

Testing & Mocking

Utility library for mocking out the Python Requests library, making it easy to test HTTP API interactions


Related Post

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem:

Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.

Solution:

RapidFuzz eliminates data cleaning overhead with intelligent fuzzy matching.

Key benefits:
Automatic handling of typos, spacing, and case variations
Production-ready C++ performance for large datasets
Full spectrum of fuzzy algorithms in one library

Full Article:

Handle Messy Data with RapidFuzz Fuzzy Matching

Code example: MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

Newsletter #210: MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

📅
Today’s Picks

Feature Engineering Without Complex Nested Loops

Problem:

Nested loops for sequence permutations create exponential complexity that becomes unmanageable as data grows.

Solution:

The itertools.permutations() function automatically generates all ordered arrangements of items from your sequences. Perfect for generating interaction features that preserve temporal or logical ordering in your feature set.
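
A minimal sketch of the idea. The event names in `steps` and the `_then_` naming scheme are hypothetical, chosen only to show how ordered pairs become feature names.

```python
from itertools import permutations

# Hypothetical user events whose order matters.
steps = ["view", "cart", "purchase"]

# Every ordered pair of distinct events, with no nested loops.
ordered_pairs = list(permutations(steps, 2))

# Turn each ordered pair into an interaction feature name.
feature_names = [f"{first}_then_{second}" for first, second in ordered_pairs]

print(len(ordered_pairs))
print(feature_names)
```

Since order is preserved, `view_then_cart` and `cart_then_view` are separate features, which is the property that distinguishes permutations from combinations.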

Full Article:

Feature Engineering Without Complex Nested Loops

MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

Problem:

Have you ever wanted to convert PDFs to text for analysis and search but found it hard to do so? While there are many tools to convert PDFs to text, they often lose structure and readability.

Solution:

Microsoft MarkItDown preserves document structure while converting PDFs to clean markdown format.

The library handles multiple file types and maintains formatting hierarchy:
Clean markdown output with preserved headers and structure
Support for PDFs, Word docs, PowerPoint, and Excel files
Simple three-line implementation for any document type
Seamless integration with existing RAG pipelines

Full Article:

MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

☕️
Weekly Finds

scalene

Performance & Profiling

A high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals

bandit

Security & Code Quality

A tool designed to find common security issues in Python code through static code analysis

river

Machine Learning

Online machine learning in Python – enabling incremental learning algorithms for streaming data


Related Post

Transform PDFs to Pandas with Docling’s Complete Pipeline

Problem:

Most PDF processing tools force you to stitch together multiple solutions – one for extraction, another for parsing, and yet another for chunking. Each step introduces potential data loss and format incompatibilities, making document processing complex and error-prone.

Solution:

Docling handles the entire workflow from raw PDFs to structured, searchable content in a single solution.

Key features:
Universal format support for PDF, DOCX, PPTX, HTML, and images
AI-powered extraction with TableFormer and Vision models
Direct export to pandas DataFrames, JSON, and Markdown
RAG-ready output maintains context and structure

Full Article:

Transform PDFs to Pandas with Docling’s Complete Pipeline

Code example: Transform PDFs to Pandas with Docling's Complete Pipeline

Newsletter #209: Transform PDFs to Pandas with Docling’s Complete Pipeline

📅
Today’s Picks

Transform PDFs to Pandas with Docling’s Complete Pipeline

Problem:

Most PDF processing tools force you to stitch together multiple solutions – one for extraction, another for parsing, and yet another for chunking. Each step introduces potential data loss and format incompatibilities, making document processing complex and error-prone.

Solution:

Docling handles the entire workflow from raw PDFs to structured, searchable content in a single solution.

Key features:
Universal format support for PDF, DOCX, PPTX, HTML, and images
AI-powered extraction with TableFormer and Vision models
Direct export to pandas DataFrames, JSON, and Markdown
RAG-ready output maintains context and structure

Full Article:

Transform PDFs to Pandas with Docling’s Complete Pipeline

☕️
Weekly Finds

semantic-kernel

AI Orchestration

Model-agnostic SDK that empowers developers to build, orchestrate, and deploy AI agents and multi-agent systems with enterprise-grade reliability.

transformers

Machine Learning

The model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

whisper

Speech Recognition

Robust Speech Recognition via Large-Scale Weak Supervision. A multitasking model for multilingual speech recognition, translation, and language identification.


Related Post

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem:

Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.

Solution:

RapidFuzz eliminates data cleaning overhead with intelligent fuzzy matching.

Key benefits:
Automatic handling of typos, spacing, and case variations
Production-ready C++ performance for large datasets
Full spectrum of fuzzy algorithms in one library

Full Article:

Handle Messy Data with RapidFuzz Fuzzy Matching

Code example: Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

Newsletter #208: Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

📅
Today’s Picks

Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

Problem:

Data prototyping typically requires loading entire datasets into memory before sampling. A 1-million-row dataset consumes 7.6 MB of memory even when you only need 10 rows for initial feature exploration, creating unnecessary resource overhead.

Solution:

Use itertools.islice() to extract slices from iterators without loading full datasets into memory first.

Key benefits:
Memory-efficient data sampling
Faster prototyping workflows
Less computational load on laptops
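
A minimal sketch of lazy sampling. The `row_stream` generator is a hypothetical stand-in for any row iterator, such as a file reader, and the `produced` counter exists only to show how many rows were actually generated.

```python
from itertools import islice

produced = 0  # counts rows actually generated, to show the laziness

def row_stream(n):
    """Hypothetical data source that yields one row at a time."""
    global produced
    for i in range(n):
        produced += 1
        yield {"id": i, "value": i * i}

# Take the first 10 rows; the other 999,990 are never created.
sample = list(islice(row_stream(1_000_000), 10))

print(len(sample), produced)
print(sample[-1])
```

The same pattern works with any iterator, so a file object or a `csv.reader` can be sampled without reading the whole file.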

Full Article:

Stop Loading Full Datasets: Use itertools.islice() for Smart Sampling

From pandas Full Reloads to Delta Lake Incremental Updates

Problem:

Processing entire datasets when you only need to add a few new records wastes time and memory. Pandas lacks incremental append capabilities, requiring full dataset reload for data updates.

Solution:

Delta Lake’s append mode processes only new data without touching existing records.

Key advantages:
Append new records without full dataset reload
Memory usage scales with new data size, not total dataset size
Automatic data protection prevents corruption during updates
Time travel enables rollback to previous dataset versions

Perfect for production data pipelines that need reliable incremental updates.

Full Article:

From pandas Full Reloads to Delta Lake Incremental Updates

☕️
Weekly Finds

Semantic Kernel

AI Framework

Model-agnostic SDK that empowers developers to build, orchestrate, and deploy AI agents and multi-agent systems with enterprise-grade reliability

Ray

Distributed Computing

AI compute engine with core distributed runtime and AI Libraries for accelerating ML workloads from laptop to cluster

Apache Airflow

Workflow Orchestration

Platform for developing, scheduling, and monitoring workflows with powerful data pipeline orchestration capabilities

Code example: Build Automated Chart Analysis with Hugging Face SmolVLM

Newsletter #207: Build Automated Chart Analysis with Hugging Face SmolVLM

📅
Today’s Picks

Build Automated Chart Analysis with Hugging Face SmolVLM

Problem:

Data teams spend hours manually analyzing charts and extracting insights from complex visualizations. Manual chart analysis creates bottlenecks in decision-making workflows and reduces time available for strategic insights.

Solution:

Hugging Face’s SmolVLM transforms this workflow by instantly generating insights, allowing analysts to focus on validation, strategic context, and decision-making rather than basic pattern recognition.

The complete workflow could look like this:
Automated chart interpretation using vision language models
Expert review and validation of AI findings
Strategic context addition by domain specialists

Full Article:

Build Automated Chart Analysis with Hugging Face SmolVLM

Hydra Multi-run: Test All Parameters in One Command

Problem:

When you run a Python script with different preprocessing strategies and hyperparameter combinations, waiting for each variation to complete before testing the next is time-consuming.

Solution:

Hydra multi-run executes all parameter combinations in a single command, saving you time and effort.

Plus, Hydra offers:
YAML-based configuration management
Override parameters from the command line
Compose configs from multiple files
Environment-specific configuration switching

Full Article:

Hydra Multi-run: Test All Parameters in One Command

☕️
Weekly Finds

Scrapegraph-ai

Data Extraction

Python scraper based on AI

Marker

Document Processing

Convert PDF to markdown quickly with high accuracy

EdgeDB

Database

A graph-relational database with declarative schema, built-in migration system, and a next-generation query language
