Newsletter #281: MarkItDown: From Images to Searchable Text in Seconds

Grab your coffee. Here are this week’s highlights.


๐Ÿค COLLABORATION

What Data Engineers Really Think About Airflow (5.8K Surveyed)

Astronomer analyzed 5.8K+ responses from data engineers on how they are navigating Airflow today, and the findings might surprise you.

You’ll learn:

  • How early adopters are using Airflow 3 features in production
  • Which teams are bringing AI into production and what’s holding others back
  • How many respondents believe Airflow is beneficial to their career (35.6%)

📅 Today’s Picks

Query Multiple Databases at Once with DuckDB

Problem

Working with data across PostgreSQL, MySQL, and SQLite often means managing multiple database connections and additional integration overhead.

That overhead adds up quickly when your goal is simply to analyze data across sources.

Solution

DuckDB removes the friction by allowing you to join tables across databases with a single query.

Key benefits:

  • Join SQLite, PostgreSQL, MySQL, and Parquet files in a single SQL statement
  • Automatic connection handling across all sources
  • Filters run at the source database, so only matching rows are transferred
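
To make the idea concrete, here is a minimal sketch that attaches two SQLite files to one DuckDB session and joins them in a single statement. The file names, table layouts, sample rows, and the aliases `sales_db` and `crm_db` are all made up for illustration. It assumes the `duckdb` Python package is installed and that DuckDB can download its `sqlite` extension on first use.

```python
import sqlite3
from pathlib import Path

# One query that joins tables living in two separate database files.
# The aliases sales_db and crm_db are hypothetical names chosen for this sketch.
CROSS_DB_QUERY = """
SELECT c.name, SUM(o.amount) AS total
FROM sales_db.orders AS o
JOIN crm_db.customers AS c ON c.id = o.customer_id
WHERE o.amount > 20
GROUP BY c.name
ORDER BY c.name
"""


def create_sample_databases(directory):
    """Create two small SQLite files with made-up rows and return their paths."""
    sales_path = Path(directory) / "sales.db"
    crm_path = Path(directory) / "crm.db"

    sales = sqlite3.connect(sales_path)
    sales.execute(
        "CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)"
    )
    sales.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1, 30.0), (2, 1, 15.0), (3, 2, 50.0), (4, 2, 25.0)],
    )
    sales.commit()
    sales.close()

    crm = sqlite3.connect(crm_path)
    crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    crm.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    crm.commit()
    crm.close()

    return sales_path, crm_path


def run_with_duckdb(sales_path, crm_path):
    """Attach both files to one in-memory DuckDB session and run the join."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()
    try:
        con.execute("INSTALL sqlite")  # fetches the extension on first use
        con.execute("LOAD sqlite")
        con.execute(f"ATTACH '{sales_path}' AS sales_db (TYPE sqlite)")
        con.execute(f"ATTACH '{crm_path}' AS crm_db (TYPE sqlite)")
        return con.execute(CROSS_DB_QUERY).fetchall()
    finally:
        con.close()
```

Calling `create_sample_databases` on an empty directory and passing the two returned paths to `run_with_duckdb` returns one row per customer. PostgreSQL and MySQL sources are attached the same way through their own extensions, using `(TYPE postgres)` or `(TYPE mysql)` and a connection string in place of the file path.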

MarkItDown: From Images to Searchable Text in Seconds

Problem

Charts, diagrams, and screenshots in your documents need text descriptions to be searchable and processable.

But writing descriptions manually is slow and produces inconsistent results across large document sets.

Solution

MarkItDown, an open-source library from Microsoft, integrates with OpenAI to automatically generate detailed descriptions of images.

Key capabilities:

  • Generate consistent descriptions across hundreds of images
  • Process images from documents like PowerPoint and PDF files
  • Customize the description prompt for your specific needs
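
A rough sketch of that workflow is below. It assumes the `markitdown` and `openai` packages are installed and that an OpenAI API key is available in the environment. The prompt text, the model name, and the helper names `build_converter` and `describe_files` are illustrative choices, not part of the library.

```python
# Hypothetical prompt for this sketch; adjust it to your own documents.
DESCRIPTION_PROMPT = (
    "Describe this image for a search index. "
    "Name the chart type, axes, and the main trend in two sentences."
)


def build_converter(model="gpt-4o", prompt=DESCRIPTION_PROMPT):
    """Create a MarkItDown converter that asks an OpenAI model for descriptions."""
    from markitdown import MarkItDown  # third-party: pip install markitdown
    from openai import OpenAI  # third-party: pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    return MarkItDown(llm_client=client, llm_model=model, llm_prompt=prompt)


def describe_files(converter, paths):
    """Convert each file and collect the extracted text keyed by path."""
    descriptions = {}
    for path in paths:
        result = converter.convert(str(path))
        descriptions[str(path)] = result.text_content
    return descriptions
```

Usage would look like `describe_files(build_converter(), ["chart.png", "slides.pptx"])`, where both file names are placeholders. Because every file goes through the same prompt and model, the descriptions stay consistent in style across a large batch.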

โ˜•๏ธ Weekly Finds

Skill_Seekers [LLM] – Convert documentation websites, GitHub repositories, and PDFs into Claude AI skills with automatic conflict detection

sqlit [Data] – A user-friendly TUI for SQL databases supporting SQL Server, MySQL, PostgreSQL, SQLite, Turso and more

giskard [ML] – Open-source CI/CD platform for ML teams to eliminate AI bias and deliver quality ML products faster

Looking for a specific tool? Explore 70+ Python tools →

📚 Latest Deep Dives

From CSS Selectors to Natural Language: Web Scraping with ScrapeGraphAI – Web scraping without selector maintenance. ScrapeGraphAI uses LLMs to extract data from any site using plain English prompts and Pydantic schemas.


Stay Current with CodeCut

Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.

    Work with Khuyen Tran