Newsletter Archive

Automated newsletter archive from Klaviyo campaigns

Newsletter #206: Handle Messy Data with RapidFuzz Fuzzy Matching

📅
Today’s Picks

Handle Messy Data with RapidFuzz Fuzzy Matching

Problem:

Traditional regex approaches require hours of preprocessing but still break with common data variations like missing spaces, typos, or inconsistent formatting.

Solution:

RapidFuzz eliminates data cleaning overhead with intelligent fuzzy matching.
Key benefits:
Automatic handling of typos, spacing, and case variations
Production-ready C++ performance for large datasets
Full spectrum of fuzzy algorithms in one library

Full Article:

Handle Messy Data with RapidFuzz Fuzzy Matching

View GitHub


Related Post

Build Fuzzy Text Matching with difflib Over regex

Problem:

Have you ever spent hours cleaning text data with regex, only to find that “iPhone 14 Pro Max” still doesn’t match “iPhone 14 Prro Max”? Regex preprocessing can only normalize strings for exact matching, so it fails on typos and other character-level variations.

Solution:

difflib provides similarity scoring that tolerates typos and character variations, enabling approximate matching where regex fails. The library calculates similarity ratios between strings:
Handles typos like “Prro” vs “Pro” automatically
Returns similarity scores from 0.0 to 1.0 for ranking matches
Works with character-level variations without preprocessing
Enables fuzzy matching for real-world messy data
Perfect for product matching, name deduplication, and any scenario where exact matches aren’t realistic.

Full Article:

Build Fuzzy Text Matching with difflib Over regex

Newsletter #205: Build Debuggable Tests: One Assertion Per Function

📅
Today’s Picks

Ruff: Stop AI Code Complexity Before It Hits Production

Problem:

AI agents often create overengineered code with multiple nested if/else and try/except blocks, increasing technical debt and making functions difficult to test. However, checking each function manually is time-consuming.

Solution:

Ruff’s C901 complexity check automatically flags overly complex functions before they enter your codebase. This check counts decision points (if/else, loops) that create multiple execution paths in your code.
Key benefits:
Automatic detection of complex functions during development
Configurable complexity thresholds for your team standards
Integration with pre-commit hooks for automated validation
Clear error messages showing exact complexity scores
No more manual code reviews to catch overengineered functions.
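
To make "counting decision points" concrete, here is a minimal sketch that approximates the idea with Python's built-in ast module. It is not Ruff's implementation: the set of node types counted and the `rough_complexity` and `NESTED` names are assumptions made for illustration only.

```python
import ast

# Node types treated as decision points in this simplified count (an assumption,
# not the exact rule set Ruff uses).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def rough_complexity(source: str) -> int:
    """Return 1 plus the number of branching nodes found in `source`."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


# Hypothetical function with nested if/else and try/except blocks.
NESTED = '''
def load(path, retries):
    for attempt in range(retries):
        try:
            if path.endswith(".csv"):
                return "csv"
            elif path.endswith(".json"):
                return "json"
            else:
                return "unknown"
        except OSError:
            if attempt == retries - 1:
                raise
'''

print(rough_complexity(NESTED))  # 6: one loop, three ifs, one except, plus 1
```

In Ruff itself, the threshold that triggers C901 is set with the max-complexity option in the mccabe lint settings, so a team can decide what score counts as too complex.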

Full Article:

Ruff: Stop AI Code Complexity Before It Hits Production

Build Debuggable Tests: One Assertion Per Function

Problem:

Tests with multiple assertions make debugging harder. When a test fails, you can’t tell which assertion broke without examining the code.

Solution:

Create multiple specific test functions for different scenarios of the same function. Follow these practices for focused test functions:
One assertion per test function for clear failure points
Use descriptive test names that explain the expected behavior
Maintain consistent naming patterns across your test suite
This approach makes your test suite more maintainable and failures easier to diagnose.
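
A minimal sketch of these practices follows. The `normalize_name` function is a hypothetical function under test, invented here for illustration; each test holds a single assertion and its name states the expected behavior.

```python
def normalize_name(name: str) -> str:
    """Hypothetical function under test: trim whitespace and title-case a name."""
    return name.strip().title()


# Each test covers one scenario, so a failure points at exactly one behavior.
def test_normalize_name_strips_surrounding_whitespace():
    assert normalize_name("  alice ") == "Alice"


def test_normalize_name_capitalizes_each_word():
    assert normalize_name("alice smith") == "Alice Smith"


def test_normalize_name_returns_empty_string_for_blank_input():
    assert normalize_name("   ") == ""
```

If the whitespace handling regresses, only the first test fails, and its name already says what broke. A test runner such as pytest would collect these functions by their test_ prefix.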

Full Article:

Build Debuggable Tests: One Assertion Per Function

Newsletter #204: Build Fuzzy Text Matching with difflib Over regex

📅
Today’s Picks

Build Fuzzy Text Matching with difflib Over regex

Problem:

Have you ever spent hours cleaning text data with regex, only to find that “iPhone 14 Pro Max” still doesn’t match “iPhone 14 Prro Max”? Regex preprocessing can only normalize strings for exact matching, so it fails on typos and other character-level variations.

Solution:

difflib provides similarity scoring that tolerates typos and character variations, enabling approximate matching where regex fails. The library calculates similarity ratios between strings:
Handles typos like “Prro” vs “Pro” automatically
Returns similarity scores from 0.0 to 1.0 for ranking matches
Works with character-level variations without preprocessing
Enables fuzzy matching for real-world messy data
Perfect for product matching, name deduplication, and any scenario where exact matches aren’t realistic.
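
As a rough illustration of the points above, the sketch below scores the “Prro” typo with difflib's SequenceMatcher and picks the closest catalog entry with get_close_matches. The catalog contents, the `similarity` helper, and the 0.6 cutoff are assumptions chosen for the example, not values from the article.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical product catalog and a query containing a typo.
catalog = ["iPhone 14 Pro Max", "Samsung Galaxy S23", "Google Pixel 8"]
query = "iPhone 14 Prro Max"

# Score the typo against the intended product name.
score = similarity(query, catalog[0])
print(f"{score:.2f}")

# Rank catalog entries and keep the best one above the cutoff.
best = get_close_matches(query, catalog, n=1, cutoff=0.6)
print(best)
```

The cutoff controls how tolerant the match is: raising it rejects weaker candidates, lowering it accepts noisier input at the cost of more false matches.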

Full Article:

Build Fuzzy Text Matching with difflib Over regex

Build Portable Python Scripts with uv PEP 723

Problem:

Python scripts break when moved between environments because dependencies are scattered across requirements.txt files, virtual environments, or undocumented assumptions.

Solution:

uv enables PEP 723 inline script dependencies, embedding all requirements directly in the script header for true portability. Use uv add --script script.py dependency to automatically add metadata to any Python file.
Key benefits:
Self-contained scripts with zero external files
Easy command-line dependency management
Perfect for sharing data analysis code across teams
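
To show what that embedded metadata looks like, here is a minimal sketch. The `SCRIPT` text is a hypothetical file of the kind that adding pandas with uv would produce, and `read_inline_metadata` is an illustrative helper built on the regular expression from the PEP 723 reference implementation. It is not how uv itself is implemented.

```python
import re

# Hypothetical script contents with a PEP 723 inline metadata header.
SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "pandas",
# ]
# ///
import pandas as pd

print(pd.__version__)
'''

# Regular expression taken from the PEP 723 reference implementation.
METADATA_BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_inline_metadata(source: str) -> str:
    """Return the TOML text of the 'script' metadata block, or '' if absent."""
    match = METADATA_BLOCK.search(source)
    if match is None or match.group("type") != "script":
        return ""
    lines = match.group("content").splitlines()
    # Drop the leading "# " (or bare "#") comment prefix from every line.
    return "\n".join(
        line[2:] if line.startswith("# ") else line[1:] for line in lines
    )


metadata = read_inline_metadata(SCRIPT)
print(metadata)
```

Because the requirements travel inside the file as comments, the script stays valid Python while a PEP 723-aware runner can read the block and install the dependencies before executing it.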

Full Article:

Build Portable Python Scripts with uv PEP 723

Newsletter #203: Semantic Search Without Complex Setup Headaches

📅
Today’s Picks

Semantic Search Without Complex Setup Headaches

Problem:

Have you ever found yourself looking up SQL syntax when you just want to query your database? Complex joins and subqueries create friction between you and your data insights.

Solution:

The semantic search workflow connects natural language questions to your existing PostgreSQL tables. The complete workflow includes:
Database setup with PostgreSQL and pgvector extension
Content preprocessing for optimal embeddings
Embedding pipeline using Ollama models
Vector storage with SQLAlchemy integration
Query interface for natural language searches
Response generation combining retrieval and LLMs
Query your database with plain English instead of SQL syntax.

Full Article:

Semantic Search Without Complex Setup Headaches

Newsletter #202: Automate Code Quality Without Manual Checking

📅
Today’s Picks

Automate Code Quality Without Manual Checking

Problem:

Code quality is essential for data science projects, but manual checking consumes valuable time that could be spent on analysis and insights.

Solution:

Pre-commit automates code quality validation before every commit.
Key benefits:
Automatic formatting validation
Comprehensive linting checks
Type checking before commits
And all you need is a simple .pre-commit-config.yaml configuration file.

Full Article:

Automate Code Quality Without Manual Checking

Deploy ML Models Without Docker Hub Costs

Problem:

Docker Hub forces you into an expensive choice: pay mounting fees for private repositories or risk exposing your proprietary code publicly. Plus, Docker transfers entire multi-gigabyte images even for small code changes, wasting time and bandwidth.

Solution:

Unregistry eliminates registries entirely with docker pussh, which pushes images directly to remote servers over SSH.
Key benefits:
Smart transfers: only sends changed parts, not the whole image
No registry infrastructure to set up or maintain
Works with existing SSH connections
Faster deployments by avoiding duplicate data transfers

Full Article:

Deploy ML Models Without Docker Hub Costs

Newsletter #201: itertools.combinations() for Feature Interactions

📅
Today’s Picks

itertools.combinations() for Feature Interactions

Problem:

Writing nested loops for all feature pair combinations gets messy as the number of features grows and easily introduces bugs.

Solution:

itertools.combinations() automatically generates all unique pairs without the complexity and bugs.
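
A minimal sketch of the idea, assuming a single row of numeric features: the feature names, the values, and the choice of multiplication as the interaction are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical feature names and one row of values.
features = ["age", "income", "tenure"]
row = {"age": 30, "income": 50_000, "tenure": 4}

# Every unique unordered pair, with no self-pairs and no repeats.
pairs = list(combinations(features, 2))
print(pairs)
# [('age', 'income'), ('age', 'tenure'), ('income', 'tenure')]

# One interaction term per pair (here, a simple product).
interactions = {f"{a}_x_{b}": row[a] * row[b] for a, b in pairs}
print(interactions)
# {'age_x_income': 1500000, 'age_x_tenure': 120, 'income_x_tenure': 200000}
```

Adding a fourth feature only changes the `features` list; the pairing logic stays the same, which is where hand-written nested loops tend to pick up index bugs.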

Full Article:

itertools.combinations() for Feature Interactions

Production-Ready RAG Evaluation Workflow

Problem:

Many teams deploy RAG systems without systematic evaluation, missing critical quality issues that only become visible with real users.

Solution:

MLflow evaluation framework validates RAG systems through systematic checks:
Faithfulness metrics: check that answers align with the retrieved documents
Answer relevancy scoring: checks that responses address the user's query
Context recall: verifies that all relevant information was retrieved from the documents

Full Article:

Production-Ready RAG Evaluation Workflow

    Work with Khuyen Tran
