Newsletter #210: MarkItDown: Convert PDFs to Clean Markdown in 3 Lines


🤝 COLLABORATION

How to write better DAGs in Airflow

DAGs (Directed Acyclic Graphs) are Airflow’s workflow definition format. They specify how data tasks connect and execute in sequence.

Well-designed DAGs handle edge cases, scale with data volume changes, and remain maintainable as your pipeline complexity grows.

What you’ll learn:

  • Design DAGs that are easier to read, test, and maintain
  • Make your pipelines adapt to your data at runtime with dynamic task mapping
  • Avoid common pitfalls that can cause performance issues
  • Create data-aware pipelines with XComs and event-driven scheduling
  • Apply proven DAG-writing best practices, including Airflow 3’s latest features

The session covers practical patterns for building production-ready workflows that handle failures gracefully and scale with your data infrastructure needs.

Speakers:

  • Kenten Danas – Senior Manager, Developer Relations at Astronomer
  • Tamara Fingerlin – Developer Advocate at Astronomer

Register here


📅 Today’s Picks

Feature Engineering Without Complex Nested Loops

Problem

Hand-written nested loops for sequence permutations need one loop level per position, so the code becomes harder to read and maintain as the arrangements get longer.

Solution

The itertools.permutations() function automatically generates all ordered arrangements of items from your sequences.

This is useful for generating interaction features that preserve temporal or logical ordering in your feature set.
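
As a minimal sketch of the idea, the snippet below builds ordered pair features first with nested loops and then with itertools.permutations(). The event names and the "_then_" naming scheme are hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Hypothetical event names, used only for illustration
events = ["view", "cart", "purchase"]

# Nested-loop version: one loop per position, plus a manual check to skip repeats
nested = []
for first in events:
    for second in events:
        if first != second:
            nested.append((first, second))

# itertools.permutations yields the same ordered pairs without the nesting
pairs = list(permutations(events, 2))

# Turn each ordered pair into a feature name that keeps the order
feature_names = [f"{a}_then_{b}" for a, b in pairs]
print(feature_names)
```

The second argument sets the length of each arrangement, so moving from pairs to triples means changing 2 to 3 rather than adding another loop. The number of arrangements still grows quickly with the number of items, so it is worth keeping the input list small.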


MarkItDown: Convert PDFs to Clean Markdown in 3 Lines

Problem

Have you ever wanted to convert PDFs to text for analysis and search but found it hard to do?

Many tools convert PDFs to text, but they often lose the document’s structure and readability.

Solution

Microsoft MarkItDown preserves document structure while converting PDFs to clean markdown format.

The library handles multiple file types and maintains formatting hierarchy:

  • Clean markdown output with preserved headers and structure
  • Support for PDFs, Word docs, PowerPoint, and Excel files
  • Simple three-line implementation for any document type
  • Seamless integration with existing RAG pipelines
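
A minimal sketch of the conversion, assuming the markitdown package is installed with PDF support (for example via its pdf extra). The function name pdf_to_markdown and the file name report.pdf are hypothetical. The core of it is the three lines inside the function: create a converter, convert the file, and read the Markdown text.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert a document to Markdown text using MarkItDown."""
    # Imported inside the function so this file loads even without markitdown installed
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(pdf_path)
    return result.text_content


if __name__ == "__main__":
    sample = Path("report.pdf")  # hypothetical file name
    if sample.exists():
        print(pdf_to_markdown(str(sample)))
```

The returned string can then be split into chunks and indexed like any other text in a RAG pipeline.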

☕️ Weekly Finds

scalene [Performance & Profiling] – A high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals

bandit [Security & Code Quality] – A tool designed to find common security issues in Python code through static code analysis

river [Machine Learning] – Online machine learning in Python – enabling incremental learning algorithms for streaming data

Looking for a specific tool? Explore 70+ Python tools →
