Newsletter #294: pandas 3.0: 5-10x Faster String Operations with PyArrow

Grab your coffee. Here are this week’s highlights.


📅 Today’s Picks

pandas 3.0: 5-10x Faster String Operations with PyArrow

Problem

Traditionally, pandas stores strings as object dtype, where each string is a separate Python object scattered across memory.

This makes string operations slow and the dtype ambiguous, since both pure string columns and mixed-type columns show up as object.

Solution

pandas 3.0 introduces a dedicated str dtype backed by PyArrow, which stores strings in contiguous memory blocks instead of individual Python objects.

Key benefits:

  • 5-10x faster string operations because data is stored contiguously
  • 50% lower memory by eliminating Python object overhead
  • Clear distinction between string and mixed-type columns
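To see the difference on your own machine, here is a minimal sketch. It assumes pandas is installed and makes no assumption about the version: on pandas 3.0 the inferred dtype should be the new str dtype, while older versions will still report object. The sample data and variable names are made up for illustration.

```python
import pandas as pd

# Made-up sample data for illustration
names = ["alice", "bob", "carol"] * 1000

# Force the legacy representation: one Python object per value
legacy = pd.Series(names, dtype=object)

# Let pandas infer the dtype (expected to be str on pandas 3.0,
# object on earlier versions)
inferred = pd.Series(names)

print("pandas version:", pd.__version__)
print("legacy dtype:  ", legacy.dtype)
print("inferred dtype:", inferred.dtype)

# The .str accessor works the same way on both representations
upper_legacy = legacy.str.upper()
upper_inferred = inferred.str.upper()

# Compare memory footprints, including the string payloads
print("legacy bytes:  ", legacy.memory_usage(deep=True))
print("inferred bytes:", inferred.memory_usage(deep=True))
```

The actual speed and memory gains depend on your data and on whether PyArrow is installed, so treat the printed numbers as a local check rather than a benchmark.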

Build Self-Documenting Regex with Pregex

Problem

Regex patterns like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} are difficult to read and intimidating.

Team members without regex expertise might struggle to understand and modify these validation patterns.

Solution

Pregex transforms regex into readable Python code using descriptive components.

Key benefits:

  • Code that explains its intent without comments
  • Easy modification without regex expertise
  • Composable patterns for complex validation
  • Export to regex format when needed
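To illustrate the idea of building a pattern from named, composable parts, here is a rough sketch that uses only Python's built-in re module. It is a stand-in for the concept, not Pregex's actual API: the helper functions one_or_more and at_least and all variable names are hypothetical, chosen for this example.

```python
import re

# Named building blocks for the email pattern quoted above
# (all names here are hypothetical, not part of Pregex)
LOCAL_CHARS = r"[a-zA-Z0-9._%+-]"
DOMAIN_CHARS = r"[a-zA-Z0-9.-]"
LETTER = r"[a-zA-Z]"


def one_or_more(part: str) -> str:
    """Repeat a sub-pattern one or more times."""
    return f"(?:{part})+"


def at_least(part: str, n: int) -> str:
    """Repeat a sub-pattern at least n times."""
    return f"(?:{part}){{{n},}}"


# Compose the full pattern from readable pieces
local_part = one_or_more(LOCAL_CHARS)
domain = one_or_more(DOMAIN_CHARS)
tld = at_least(LETTER, 2)

email_pattern = local_part + "@" + domain + re.escape(".") + tld
email_re = re.compile(email_pattern)

print(email_pattern)
print(bool(email_re.fullmatch("ada@example.com")))
print(bool(email_re.fullmatch("not-an-email")))
```

Each piece can be renamed, swapped, or reused without touching the rest, and the final string is still a plain regex you can pass to any tool that expects one.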

📚 Latest Deep Dives

PDF Table Extraction: Docling vs Marker vs LlamaParse Compared

PDF files do not store tables as structured data. Instead, they position text at specific coordinates on the page.

Table extraction tools must reconstruct the structure by determining which values belong in which rows and columns.

The problem becomes even harder when tables include multi-level headers, merged cells, or complex layouts.
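To make the reconstruction step concrete, here is a toy sketch in plain Python. It is not how Docling, Marker, or LlamaParse work internally. The fragment data, the rebuild_rows helper, and the y_tolerance parameter are all hypothetical, and the sketch assumes y grows down the page.

```python
from collections import defaultdict

# Hypothetical text fragments as (x, y, text), in arbitrary order,
# the way a PDF might position them on a page
fragments = [
    (50, 100, "Name"), (200, 100, "Qty"),
    (50, 120, "Apple"), (200, 120, "3"),
    (200, 140, "5"), (50, 140, "Pear"),
]


def rebuild_rows(fragments, y_tolerance=5):
    """Group fragments into rows by vertical position, then order
    each row left to right by horizontal position."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Fragments whose y values fall in the same band share a row
        band = round(y / y_tolerance)
        rows[band].append((x, text))
    return [
        [text for _, text in sorted(cells)]
        for _, cells in sorted(rows.items())
    ]


table = rebuild_rows(fragments)
for row in table:
    print(row)
```

This simple banding works for a clean grid, but it has no notion of merged cells or multi-level headers, which is where dedicated extraction tools earn their keep.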

To explore this problem, I experimented with three tools designed for PDF table extraction: LlamaParse, Marker, and Docling. Each tool takes a different approach.

Performance overview:

  • Docling: Fastest local option, but struggles with complex tables
  • Marker: Handles complex layouts well and runs locally, but is much slower
  • LlamaParse: Most accurate on complex tables and fastest overall, but requires a cloud API

In this article, I share the code, examples, and results from testing each tool.

📖 View Full Article


โ˜•๏ธ Weekly Finds

Lance [Data Processing] – Modern columnar data format for ML with 100x faster random access than Parquet

Mathesar [Dashboard] – Spreadsheet-like interface for PostgreSQL that lets anyone view, edit, and query data

dotenvx [DevOps] – A better dotenv with encryption, multiple environments, and cross-platform support

Looking for a specific tool? Explore 70+ Python tools →

Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.


Work with Khuyen Tran
