
From CSS Selectors to Natural Language: Web Scraping with ScrapeGraphAI

Table of Contents

Introduction
What is ScrapeGraphAI?
Setup
Installation
OpenAI Configuration
Local Models with Ollama

Natural Language Prompts
Structured Output with Pydantic
JavaScript Content
Multi-Page Scraping
Key Takeaways

Introduction
BeautifulSoup is the go-to library for web scraping thanks to its simple API and flexible parsing. The workflow is straightforward: fetch HTML, inspect elements in DevTools, and write selectors to extract data:
from pprint import pprint

from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

books = []
for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one("p.price_color").text
    books.append({"title": title, "price": price})

pprint(books[:3])

Output:
[{'price': '£51.77', 'title': 'A Light in the Attic'},
 {'price': '£53.74', 'title': 'Tipping the Velvet'},
 {'price': '£50.10', 'title': 'Soumission'}]

The output is correct, but selectors are tightly coupled to the HTML structure. This means when the site redesigns, everything breaks, so you spend more time maintaining selectors than extracting data:
# Before: <article class="product_pod">
# After: <div class="book-card">
soup.select("article.product_pod") # Now returns []

# Before: <p class="price_color">£51.77</p>
# After: <span class="price">£51.77</span>
soup.select_one("p.price_color") # Returns None, crashes on .text

What if you could just describe the data you want and let an LLM figure out the extraction? That’s where ScrapeGraphAI comes in.

💻 Get the Code: The complete source code for this tutorial is available on GitHub. Clone it to follow along!

What is ScrapeGraphAI?
ScrapeGraphAI is an open-source Python library for LLM-powered web scraping. Rather than writing CSS selectors, you describe the data you want in plain English.
Key benefits:

No selector maintenance: Describe what data you want, not where it lives in the HTML
Self-healing scrapers: The LLM adjusts automatically when websites redesign
Structured output: Define Pydantic schemas for type-safe extraction
JavaScript support: Built-in rendering for React, Vue, and Angular sites
Multi-provider: Use OpenAI, Anthropic, or local models via Ollama

Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.


Setup
Installation
Install ScrapeGraphAI and Playwright for browser automation:
pip install scrapegraphai playwright
playwright install

OpenAI Configuration
For cloud-based extraction, you’ll need an OpenAI API key. Store it in a .env file:
OPENAI_API_KEY=your-api-key-here

Then load it in your script:
from dotenv import load_dotenv
import os

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

Local Models with Ollama
For zero API costs, use local models via Ollama. ScrapeGraphAI requires two models:

LLM (llama3.2): Interprets your prompts and extracts data
Embedding model (nomic-embed-text): Converts page content into a format the LLM can search

📖 New to Ollama? See our complete guide to running local LLMs with Ollama.

Install Ollama and pull both:
# Install Ollama from https://ollama.ai
ollama pull llama3.2
ollama pull nomic-embed-text

Then configure ScrapeGraphAI to use local inference:
graph_config_local = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0,
        "format": "json",
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": False,
    "headless": True,
}

The same extraction code works with both configurations. Switch between cloud and local by changing the config.
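As a quick illustration of that switch (the `get_config` helper is my own, not part of the ScrapeGraphAI API), a small factory function can return either config shown above from a single flag:

```python
import os


def get_config(use_local: bool = False) -> dict:
    """Return a ScrapeGraphAI graph config for local (Ollama) or cloud (OpenAI) inference."""
    if use_local:
        return {
            "llm": {
                "model": "ollama/llama3.2",
                "temperature": 0,
                "format": "json",
                "base_url": "http://localhost:11434",
            },
            "embeddings": {
                "model": "ollama/nomic-embed-text",
                "base_url": "http://localhost:11434",
            },
            "verbose": False,
            "headless": True,
        }
    return {
        "llm": {
            "api_key": os.getenv("OPENAI_API_KEY"),
            "model": "openai/gpt-4o-mini",
        },
        "verbose": False,
        "headless": True,
    }
```

The rest of the tutorial's code then only needs `config=get_config(use_local=...)`, which keeps the cloud/local choice in one place.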
Natural Language Prompts
ScrapeGraphAI extraction works in three steps:

Prompt: Describe the data you want in plain English
Source: Provide the URL to scrape
Config: Set your LLM provider and credentials

Pass these to SmartScraperGraph and call run():
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
import os

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract the first 5 book titles and their prices",
    source="https://books.toscrape.com",
    config=graph_config,
)

result = smart_scraper.run()

Output:
{'content': [{'price': '£51.77', 'title': 'A Light in the Attic'},
             {'price': '£53.74', 'title': 'Tipping the Velvet'},
             {'price': '£50.10', 'title': 'Soumission'},
             {'price': '£47.82', 'title': 'Sharp Objects'},
             {'price': '£54.23', 'title': 'Sapiens: A Brief History of Humankind'}]}

The LLM understood “first 5 book titles and their prices” without any knowledge of the page’s HTML structure.
Structured Output with Pydantic
Raw scraped data often needs cleaning and validation. With ScrapeGraphAI, you can define a Pydantic schema to get type-safe, validated output directly from extraction.
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List
import os

load_dotenv()

class Book(BaseModel):
    title: str = Field(description="The title of the book")
    price: float = Field(description="Price in GBP as a number")
    rating: int = Field(description="Star rating from 1 to 5")


class BookCatalog(BaseModel):
    books: List[Book]


graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract the first 3 books with their titles, prices, and star ratings",
    source="https://books.toscrape.com",
    schema=BookCatalog,
    config=graph_config,
)

result = smart_scraper.run()

Output:
{'books': [{'price': 51.77, 'rating': 5, 'title': 'A Light in the Attic'},
           {'price': 53.74, 'rating': 5, 'title': 'Tipping the Velvet'},
           {'price': 50.1, 'rating': 5, 'title': 'Soumission'}]}

The output matches the Pydantic schema:

price: Converted from the string '£51.77' to the float 51.77, not left as text
rating: Extracted from the star icons as the integer 5
title: Captured as a string
Missing or invalid fields raise validation errors instead of slipping through silently

The data is analysis-ready, so you don’t need any post-processing in pandas.
For more advanced LLM output validation patterns, see our PydanticAI guide.
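To make "analysis-ready" concrete, here is a small sketch using the literal schema-validated result shown above as sample data: because prices are already floats, arithmetic works immediately with no string cleaning.

```python
# Sample data copied from the schema-validated output above
result = {
    "books": [
        {"price": 51.77, "rating": 5, "title": "A Light in the Attic"},
        {"price": 53.74, "rating": 5, "title": "Tipping the Velvet"},
        {"price": 50.1, "rating": 5, "title": "Soumission"},
    ]
}

# No '£' stripping or float() casting needed
avg_price = sum(b["price"] for b in result["books"]) / len(result["books"])
print(f"Average price: £{avg_price:.2f}")  # Average price: £51.87
```

With the selector-based version, the same calculation would first require stripping the currency symbol from every price string.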
JavaScript Content
Modern websites built with React, Vue, or Angular render content dynamically. BeautifulSoup only parses the initial HTML before JavaScript runs, so it misses the actual content.
To demonstrate this, let’s fetch a JavaScript-rendered page with BeautifulSoup:
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("https://quotes.toscrape.com/js/").content, "html.parser")
print(soup.select(".quote"))

Output:
[]

The result is an empty list because the content loads via JavaScript after the initial HTML is served.
Selenium can handle JavaScript, but it requires explicit waits and complex timing logic.
ScrapeGraphAI uses Playwright to handle JavaScript rendering automatically. The headless parameter controls whether the browser runs visibly or in the background:
from scrapegraphai.graphs import SmartScraperGraph
from dotenv import load_dotenv
import os

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,  # Browser runs in background
}

# quotes.toscrape.com/js loads content via JavaScript
smart_scraper = SmartScraperGraph(
    prompt="Extract the first 3 quotes with their text and authors",
    source="https://quotes.toscrape.com/js/",
    config=graph_config,
)

result = smart_scraper.run()

Output:
{'content': [{'author': 'Albert Einstein',
              'quote': 'The world as we have created it is a process of our '
                       'thinking. It cannot be changed without changing our '
                       'thinking.'},
             {'author': 'J.K. Rowling',
              'quote': 'It is our choices, Harry, that show what we truly are, '
                       'far more than our abilities.'},
             {'author': 'Albert Einstein',
              'quote': 'There are only two ways to live your life. One is as '
                       'though nothing is a miracle. The other is as though '
                       'everything is a miracle.'}]}

Unlike the empty BeautifulSoup result, ScrapeGraphAI successfully extracted all three quotes from the JavaScript-rendered page. The LLM chose sensible field names (author, quote) based solely on our natural language prompt.
Multi-Page Scraping
Research tasks often require data from multiple sources. Scraping multiple sites usually requires building individual scrapers for each layout, then manually combining the results into a unified format.
SearchGraph automates this workflow. It searches the web, scrapes relevant pages, and returns aggregated results:
from scrapegraphai.graphs import SearchGraph
import os
from dotenv import load_dotenv

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "max_results": 3,
    "verbose": False,
}

search_graph = SearchGraph(
    prompt="Find the top 3 Python web scraping libraries and their GitHub stars",
    config=graph_config,
)

result = search_graph.run()

Output:
{'sources': ['https://github.com/luminati-io/Python-scraping-libraries',
             'https://brightdata.com/blog/web-data/python-web-scraping-libraries',
             'https://www.geeksforgeeks.org/python/python-web-scraping-tutorial/',
             'https://www.projectpro.io/article/python-libraries-for-web-scraping/625'],
 'top_libraries': [{'github_stars': '~52.3k', 'name': 'Requests'},
                   {'github_stars': '~53.7k', 'name': 'Scrapy'},
                   {'github_stars': '~31.2k', 'name': 'Selenium'},
                   {'github_stars': 1800, 'name': 'BeautifulSoup'}]}

For scraping multiple known URLs with the same prompt, use SmartScraperMultiGraph:
from scrapegraphai.graphs import SmartScraperMultiGraph
import os
from dotenv import load_dotenv

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

multi_scraper = SmartScraperMultiGraph(
    prompt="Extract the page title and main heading",
    source=[
        "https://books.toscrape.com",
        "https://quotes.toscrape.com",
    ],
    config=graph_config,
)

result = multi_scraper.run()

Output:
{'main_headings': ['All products', 'Quotes to Scrape'],
 'page_titles': ['Books to Scrape', 'Quotes to Scrape'],
 'sources': ['https://books.toscrape.com', 'https://quotes.toscrape.com']}

Both approaches return consistent, structured output regardless of the underlying HTML differences between sites.
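That consistency makes downstream processing trivial. For example, the parallel lists in the SmartScraperMultiGraph result above (copied here as sample data) can be zipped into one record per site:

```python
# Sample data copied from the SmartScraperMultiGraph output above
result = {
    "main_headings": ["All products", "Quotes to Scrape"],
    "page_titles": ["Books to Scrape", "Quotes to Scrape"],
    "sources": ["https://books.toscrape.com", "https://quotes.toscrape.com"],
}

# Pair each source URL with its extracted title and heading
rows = [
    {"source": src, "title": title, "heading": heading}
    for src, title, heading in zip(
        result["sources"], result["page_titles"], result["main_headings"]
    )
]
print(rows[0])
```

Note that the exact keys in the result dict depend on how the LLM names the fields for your prompt, so check them before hard-coding a transformation like this.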
Key Takeaways
ScrapeGraphAI shifts web scraping from writing CSS selectors to describing the data you want:

Natural language prompts replace hard-coded CSS selectors and XPath expressions
Pydantic schemas provide type-safe, validated output ready for analysis
Built-in JavaScript rendering handles React, Vue, and Angular sites automatically
Multi-provider support lets you choose between cloud APIs and local models
SearchGraph automates multi-source research with a single prompt

The library is best suited for:

Exploratory data collection where site structures vary
Research tasks requiring data from multiple sources
Projects where scraper maintenance costs exceed development time
Extracting structured data from JavaScript-heavy applications

For high-volume production workloads on sites with stable HTML, Scrapy remains the faster choice. ScrapeGraphAI pays off when the time saved on selector updates outweighs the per-request LLM cost.
Related Tutorials

Turn Receipt Images into Spreadsheets with LlamaIndex: Extract structured data from images and PDFs instead of web pages
Transform Any PDF into Searchable AI Data with Docling: Convert PDF documents into RAG-ready structured data


📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


Python-Magic: Reliable File Type Detection Beyond Extensions

Table of Contents

Motivation
Introduction to Python-Magic
Data Setup
Accurate File Type Detection
Practical Applications
Conclusion

Motivation
File extensions can be misleading or missing entirely. Data processing workflows often receive files from various sources with incorrect extensions, renamed files, or files without extensions altogether.
Traditional approaches rely on file extensions, which can be easily manipulated or missing:
import os

# Traditional approach - unreliable
def get_file_type_by_extension(filename):
    _, ext = os.path.splitext(filename)
    return ext.lower()

# Examples of problematic files
files = ["document.txt", "data", "image.jpg.exe"]
for file in files:
    ext = get_file_type_by_extension(file)
    print(f"{file}: {ext if ext else 'No extension'}")

document.txt: .txt
data: No extension
image.jpg.exe: .exe

This approach fails when:

Files lack extensions
Extensions are incorrect or misleading
Malicious files masquerade as safe file types

Introduction to Python-Magic
Python-magic provides reliable file type detection by analyzing file headers rather than relying on extensions. It interfaces with libmagic, the same library used by the Unix file command.
Install python-magic and the required system library:
pip install python-magic

# Install libmagic system library
# macOS: brew install libmagic
# Ubuntu/Debian: sudo apt-get install libmagic1

Data Setup
To demonstrate python-magic’s capabilities, we’ll create test files with misleading extensions that highlight the limitations of extension-based file type detection:
import os

import magic

# Ensure the output directory exists
os.makedirs("data", exist_ok=True)

# Create standard test files
with open("data/sample.txt", "w") as f:
    f.write("This is a sample text file for demonstration purposes.")

with open("data/sample.py", "w") as f:
    f.write("import pandas as pd; df = pd.DataFrame({'a': [1, 2, 3]}); print(df)")

# Create test files with misleading extensions
with open("data/fake_image.txt", "wb") as f:
    # Write the 8-byte PNG signature
    f.write(b"\x89PNG\r\n\x1a\n")

with open("data/real_text.jpg", "w") as f:
    f.write("This is actually a text file")

These test files will demonstrate how python-magic detects actual file types regardless of their extensions.
Accurate File Type Detection
Python-magic examines file headers to determine actual file types:
# Detect actual file types
files = [
    "data/sample.txt",
    "data/sample.py",
    "data/fake_image.txt",
    "data/real_text.jpg",
]

for file in files:
    file_type = magic.from_file(file)
    mime_type = magic.from_file(file, mime=True)
    print(f"{file}:")
    print(f"  Type: {file_type}")
    print(f"  MIME: {mime_type}")
    print()

data/sample.txt:
  Type: ASCII text, with no line terminators
  MIME: text/plain

data/sample.py:
  Type: ASCII text, with no line terminators
  MIME: text/plain

data/fake_image.txt:
  Type: data
  MIME: application/octet-stream

data/real_text.jpg:
  Type: ASCII text, with no line terminators
  MIME: text/plain

Python-magic correctly flags both mislabeled files:

fake_image.txt is reported as binary data (application/octet-stream) rather than text, despite the .txt extension; the bare 8-byte signature alone isn't enough for libmagic to report a complete PNG image
real_text.jpg is reported as plain text despite the .jpg extension
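To see what "analyzing file headers" means in miniature, here is a stdlib-only sketch that compares a file's leading bytes against a few well-known signatures. The `sniff` helper is purely illustrative and not part of python-magic; libmagic does the same thing with a database of thousands of patterns plus deeper content checks.

```python
# A few standard magic-byte signatures (the first bytes of a file)
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}


def sniff(data: bytes) -> str:
    """Guess a MIME type from leading bytes; fall back to octet-stream."""
    for signature, mime in SIGNATURES.items():
        if data.startswith(signature):
            return mime
    return "application/octet-stream"


print(sniff(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # image/png
print(sniff(b"just some text"))                    # application/octet-stream
```

This is why renaming a file cannot fool content-based detection: the signature travels with the bytes, not with the filename.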

📚 For comprehensive production data validation practices, check out Production-Ready Data Science.

Practical Applications
With python-magic, we can build a validation function that verifies uploaded files match expected formats and rejects files with misleading extensions.
Before creating a validation function, let’s set up a sample CSV file for our validation example:
# Create sample CSV file for validation example
import pandas as pd

sample_data = pd.DataFrame({
    'ticket_id': [1, 2, 3, 4, 5],
    'customer_type': ['premium', 'basic', 'premium', 'basic', 'premium'],
    'issue_category': ['billing', 'technical', 'account', 'billing', 'technical'],
    'resolution_time': [24, 48, 12, 36, 18]
})

sample_data.to_csv('data/customer_support_eval.csv', index=False)

Next, create a file validation function for data processing pipelines:
def validate_uploaded_file(filepath, expected_types):
    """Validate that a file's detected MIME type matches expectations."""
    try:
        actual_mime = magic.from_file(filepath, mime=True)
        if actual_mime in expected_types:
            return True, f"Valid {actual_mime} file"
        return False, f"Expected {expected_types}, got {actual_mime}"
    except Exception as e:
        return False, f"Error reading file: {e}"

# Example usage for a data analysis workflow
csv_file = "data/customer_support_eval.csv"
result, message = validate_uploaded_file(csv_file, ["text/csv", "text/plain"])
print(f"CSV validation: {message}")

# Check for potentially dangerous files
suspicious_file = "data/fake_image.txt"
result, message = validate_uploaded_file(suspicious_file, ["text/plain"])
print(f"Text validation: {message}")

CSV validation: Valid text/csv file
Text validation: Expected ['text/plain'], got application/octet-stream

The validation correctly accepts the CSV file while rejecting the fake image file despite its .txt extension.

For robust logging in production file validation workflows, see our guide Loguru: Simple as Print, Powerful as Logging.

Conclusion
Python-magic provides reliable file type detection by examining file content rather than trusting extensions. This approach prevents security vulnerabilities and ensures data processing workflows handle files correctly.

For managing file validation configurations across different environments, see our guide Hydra for Python Configuration: Build Modular and Maintainable Pipelines.

