From CSS Selectors to Natural Language: Web Scraping with ScrapeGraphAI
Table of Contents
Introduction
What is ScrapeGraphAI?
Setup
Installation
OpenAI Configuration
Local Models with Ollama
Natural Language Prompts
Structured Output with Pydantic
JavaScript Content
Multi-Page Scraping
Key Takeaways
Introduction
BeautifulSoup is the go-to library for web scraping thanks to its simple API and flexible parsing. The workflow is straightforward: fetch HTML, inspect elements in DevTools, and write selectors to extract data:
from pprint import pprint
from bs4 import BeautifulSoup
import requests
url = "https://books.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
books = []
for article in soup.select("article.product_pod"):
title = article.select_one("h3 a")["title"]
price = article.select_one("p.price_color").text
books.append({"title": title, "price": price})
pprint(books[:3])
Output:
[{'price': '£51.77', 'title': 'A Light in the Attic'},
{'price': '£53.74', 'title': 'Tipping the Velvet'},
{'price': '£50.10', 'title': 'Soumission'}]
The output is correct, but selectors are tightly coupled to the HTML structure. This means when the site redesigns, everything breaks, so you spend more time maintaining selectors than extracting data:
# Before: <article class="product_pod">
# After: <div class="book-card">
soup.select("article.product_pod") # Now returns []
# Before: <p class="price_color">£51.77</p>
# After: <span class="price">£51.77</span>
soup.select_one("p.price_color") # Returns None, crashes on .text
What if you could just describe the data you want and let an LLM figure out the extraction? That’s where ScrapeGraphAI comes in.
💻 Get the Code: The complete source code for this tutorial are available on GitHub. Clone it to follow along!
What is ScrapeGraphAI?
ScrapeGraphAI is an open-source Python library for LLM-powered web scraping. Rather than writing CSS selectors, you describe the data you want in plain English.
Key benefits:
No selector maintenance: Describe what data you want, not where it lives in the HTML
Self-healing scrapers: The LLM adjusts automatically when websites redesign
Structured output: Define Pydantic schemas for type-safe extraction
JavaScript support: Built-in rendering for React, Vue, and Angular sites
Multi-provider: Use OpenAI, Anthropic, or local models via Ollama
Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.
.codecut-subscribe-form .codecut-input {
background: #2F2D2E !important;
border: 1px solid #72BEFA !important;
color: #FFFFFF !important;
}
.codecut-subscribe-form .codecut-input::placeholder {
color: #999999 !important;
}
.codecut-subscribe-form .codecut-subscribe-btn {
background: #72BEFA !important;
color: #2F2D2E !important;
}
.codecut-subscribe-form .codecut-subscribe-btn:hover {
background: #5aa8e8 !important;
}
.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}
.codecut-input {
-webkit-appearance: none;
-moz-appearance: none;
appearance: none;
background: #FFFFFF;
border-radius: 8px !important;
padding: 8px 12px;
font-family: ‘Comfortaa’, sans-serif !important;
font-size: 14px !important;
color: #333333;
border: none !important;
outline: none;
width: 100%;
box-sizing: border-box;
}
input[type=”email”].codecut-input {
border-radius: 8px !important;
}
.codecut-input::placeholder {
color: #666666;
}
.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}
.codecut-email-row .codecut-input {
flex: 1;
}
.codecut-subscribe-btn {
background: #72BEFA;
color: #2F2D2E;
border: none;
border-radius: 8px;
padding: 8px 14px;
font-family: ‘Comfortaa’, sans-serif;
font-size: 14px;
font-weight: 500;
cursor: pointer;
text-decoration: none;
display: flex;
align-items: center;
justify-content: center;
transition: background 0.3s ease;
}
.codecut-subscribe-btn:hover {
background: #5aa8e8;
}
.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}
.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}
.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}
/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}
.codecut-input {
border-radius: 8px;
height: 36px;
}
.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}
Subscribe
Setup
Installation
Install ScrapeGraphAI and Playwright for browser automation:
pip install scrapegraphai playwright
playwright install
OpenAI Configuration
For cloud-based extraction, you’ll need an OpenAI API key. Store it in a .env file:
OPENAI_API_KEY=your-api-key-here
Then load it in your script:
from dotenv import load_dotenv
import os
load_dotenv()
graph_config = {
"llm": {
"api_key": os.getenv("OPENAI_API_KEY"),
"model": "openai/gpt-4o-mini",
},
"verbose": False,
"headless": True,
}
Local Models with Ollama
For zero API costs, use local models via Ollama. ScrapeGraphAI requires two models:
LLM (llama3.2): Interprets your prompts and extracts data
Embedding model (nomic-embed-text): Converts page content into a format the LLM can search
📖 New to Ollama? See our complete guide to running local LLMs with Ollama.
Install Ollama and pull both:
# Install Ollama from https://ollama.ai
ollama pull llama3.2
ollama pull nomic-embed-text
Then configure ScrapeGraphAI to use local inference:
graph_config_local = {
"llm": {
"model": "ollama/llama3.2",
"temperature": 0,
"format": "json",
"base_url": "http://localhost:11434",
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"base_url": "http://localhost:11434",
},
"verbose": False,
"headless": True,
}
The same extraction code works with both configurations. Switch between cloud and local by changing the config.
Natural Language Prompts
ScrapeGraphAI extraction works in three steps:
Prompt: Describe the data you want in plain English
Source: Provide the URL to scrape
Config: Set your LLM provider and credentials
Pass these to SmartScraperGraph and call run():
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
import os
load_dotenv()
graph_config = {
"llm": {
"api_key": os.getenv("OPENAI_API_KEY"),
"model": "openai/gpt-4o-mini",
},
"verbose": False,
"headless": True,
}
smart_scraper = SmartScraperGraph(
prompt="Extract the first 5 book titles and their prices",
source="https://books.toscrape.com",
config=graph_config,
)
result = smart_scraper.run()
Output:
{'content': [{'price': '£51.77', 'title': 'A Light in the Attic'},
{'price': '£53.74', 'title': 'Tipping the Velvet'},
{'price': '£50.10', 'title': 'Soumission'},
{'price': '£47.82', 'title': 'Sharp Objects'},
{'price': '£54.23', 'title': 'Sapiens: A Brief History of Humankind'}]}
The LLM understood “first 5 book titles and their prices” without any knowledge of the page’s HTML structure.
Structured Output with Pydantic
Raw scraped data often needs cleaning and validation. With ScrapeGraphAI, you can define a Pydantic schema to get type-safe, validated output directly from extraction.
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List
import os
load_dotenv()
class Book(BaseModel):
title: str = Field(description="The title of the book")
price: float = Field(description="Price in GBP as a number")
rating: int = Field(description="Star rating from 1 to 5")
class BookCatalog(BaseModel):
books: List[Book]
graph_config = {
"llm": {
"api_key": os.getenv("OPENAI_API_KEY"),
"model": "openai/gpt-4o-mini",
},
"verbose": False,
"headless": True,
}
smart_scraper = SmartScraperGraph(
prompt="Extract the first 3 books with their titles, prices, and star ratings",
source="https://books.toscrape.com",
schema=BookCatalog,
config=graph_config,
)
result = smart_scraper.run()
Output:
{'books': [{'price': 51.77, 'rating': 5, 'title': 'A Light in the Attic'},
{'price': 53.74, 'rating': 5, 'title': 'Tipping the Velvet'},
{'price': 50.1, 'rating': 5, 'title': 'Soumission'}]}
The output matches the Pydantic schema:
price: Converted from '£51.77' string to 51.77 float
rating: Extracted from star icons as integer 5
title: Captured as string
The schema ensures:
price is extracted as a float, not a string like "£51.77"
rating is converted to an int from the star display
Missing or invalid fields raise validation errors
The data is analysis-ready, so you don’t need any post-processing in pandas.
For more advanced LLM output validation patterns, see our PydanticAI guide.
JavaScript Content
Modern websites built with React, Vue, or Angular render content dynamically. BeautifulSoup only parses the initial HTML before JavaScript runs, so it misses the actual content.
To demonstrate this, let’s fetch a JavaScript-rendered page with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("https://quotes.toscrape.com/js/").content, "html.parser")
print(soup.select(".quote"))
Output:
[]
The result is an empty list because the content loads via JavaScript after the initial HTML is served.
Selenium can handle JavaScript, but requires explicit waits and complex timing logic.
ScrapeGraphAI uses Playwright to handle JavaScript rendering automatically. The headless parameter controls whether the browser runs visibly or in the background:
from scrapegraphai.graphs import SmartScraperGraph
from dotenv import load_dotenv
import os
load_dotenv()
graph_config = {
"llm": {
"api_key": os.getenv("OPENAI_API_KEY"),
"model": "openai/gpt-4o-mini",
},
"verbose": False,
"headless": True, # Browser runs in background
}
# quotes.toscrape.com/js loads content via JavaScript
smart_scraper = SmartScraperGraph(
prompt="Extract the first 3 quotes with their text and authors",
source="https://quotes.toscrape.com/js/",
config=graph_config,
)
result = smart_scraper.run()
Output:
{'content': [{'author': 'Albert Einstein',
'quote': 'The world as we have created it is a process of our '
'thinking. It cannot be changed without changing our '
'thinking.'},
{'author': 'J.K. Rowling',
'quote': 'It is our choices, Harry, that show what we truly are, '
'far more than our abilities.'},
{'author': 'Albert Einstein',
'quote': 'There are only two ways to live your life. One is as '
'though nothing is a miracle. The other is as though '
'everything is a miracle.'}]}
Unlike the empty BeautifulSoup result, ScrapeGraphAI successfully extracted all three quotes from the JavaScript-rendered page. The LLM chose sensible field names (author, quote) based solely on our natural language prompt.
Multi-Page Scraping
Research tasks often require data from multiple sources. Scraping multiple sites usually requires building individual scrapers for each layout, then manually combining the results into a unified format.
SearchGraph automates this workflow. It searches the web, scrapes relevant pages, and returns aggregated results:
from scrapegraphai.graphs import SearchGraph
import os
from dotenv import load_dotenv
load_dotenv()
graph_config = {
"llm": {
"api_key": os.getenv("OPENAI_API_KEY"),
"model": "openai/gpt-4o-mini",
},
"max_results": 3,
"verbose": False,
}
search_graph = SearchGraph(
prompt="Find the top 3 Python web scraping libraries and their GitHub stars",
config=graph_config,
)
result = search_graph.run()
Output:
{'sources': ['https://github.com/luminati-io/Python-scraping-libraries',
'https://brightdata.com/blog/web-data/python-web-scraping-libraries',
'https://www.geeksforgeeks.org/python/python-web-scraping-tutorial/',
'https://www.projectpro.io/article/python-libraries-for-web-scraping/625'],
'top_libraries': [{'github_stars': '~52.3k', 'name': 'Requests'},
{'github_stars': '~53.7k', 'name': 'Scrapy'},
{'github_stars': '~31.2k', 'name': 'Selenium'},
{'github_stars': 1800, 'name': 'BeautifulSoup'}]}
For scraping multiple known URLs with the same prompt, use SmartScraperMultiGraph:
from scrapegraphai.graphs import SmartScraperMultiGraph
import os
from dotenv import load_dotenv
load_dotenv()
graph_config = {
"llm": {
"api_key": os.getenv("OPENAI_API_KEY"),
"model": "openai/gpt-4o-mini",
},
"verbose": False,
"headless": True,
}
multi_scraper = SmartScraperMultiGraph(
prompt="Extract the page title and main heading",
source=[
"https://books.toscrape.com",
"https://quotes.toscrape.com",
],
config=graph_config,
)
result = multi_scraper.run()
Output:
{'main_headings': ['All products', 'Quotes to Scrape'],
'page_titles': ['Books to Scrape', 'Quotes to Scrape'],
'sources': ['https://books.toscrape.com', 'https://quotes.toscrape.com']}
Both approaches return consistent, structured output regardless of the underlying HTML differences between sites.
Key Takeaways
ScrapeGraphAI shifts web scraping from writing CSS selectors to describing the data you want:
Natural language prompts replace hard-coded CSS selectors and XPath expressions
Pydantic schemas provide type-safe, validated output ready for analysis
Built-in JavaScript rendering handles React, Vue, and Angular sites automatically
Multi-provider support lets you choose between cloud APIs and local models
SearchGraph automates multi-source research with a single prompt
The library is best suited for:
Exploratory data collection where site structures vary
Research tasks requiring data from multiple sources
Projects where scraper maintenance costs exceed development time
Extracting structured data from JavaScript-heavy applications
For high-volume production workloads on sites with stable HTML, Scrapy remains the faster choice. ScrapeGraphAI pays off when the time saved on selector updates outweighs the per-request LLM cost.
Related Tutorials
Turn Receipt Images into Spreadsheets with LlamaIndex: Extract structured data from images and PDFs instead of web pages
Transform Any PDF into Searchable AI Data with Docling: Convert PDF documents into RAG-ready structured data
Stay Current with CodeCut
Actionable Python tips, curated for busy data pros. Skim in under 2 minutes, three times a week.
.codecut-subscribe-form .codecut-input {
background: #2F2D2E !important;
border: 1px solid #72BEFA !important;
color: #FFFFFF !important;
}
.codecut-subscribe-form .codecut-input::placeholder {
color: #999999 !important;
}
.codecut-subscribe-form .codecut-subscribe-btn {
background: #72BEFA !important;
color: #2F2D2E !important;
}
.codecut-subscribe-form .codecut-subscribe-btn:hover {
background: #5aa8e8 !important;
}
.codecut-subscribe-form {
max-width: 650px;
display: flex;
flex-direction: column;
gap: 8px;
}
.codecut-input {
-webkit-appearance: none;
-moz-appearance: none;
appearance: none;
background: #FFFFFF;
border-radius: 8px !important;
padding: 8px 12px;
font-family: ‘Comfortaa’, sans-serif !important;
font-size: 14px !important;
color: #333333;
border: none !important;
outline: none;
width: 100%;
box-sizing: border-box;
}
input[type=”email”].codecut-input {
border-radius: 8px !important;
}
.codecut-input::placeholder {
color: #666666;
}
.codecut-email-row {
display: flex;
align-items: stretch;
height: 36px;
gap: 8px;
}
.codecut-email-row .codecut-input {
flex: 1;
}
.codecut-subscribe-btn {
background: #72BEFA;
color: #2F2D2E;
border: none;
border-radius: 8px;
padding: 8px 14px;
font-family: ‘Comfortaa’, sans-serif;
font-size: 14px;
font-weight: 500;
cursor: pointer;
text-decoration: none;
display: flex;
align-items: center;
justify-content: center;
transition: background 0.3s ease;
}
.codecut-subscribe-btn:hover {
background: #5aa8e8;
}
.codecut-subscribe-btn:disabled {
background: #999;
cursor: not-allowed;
}
.codecut-message {
font-family: ‘Comfortaa’, sans-serif;
font-size: 12px;
padding: 8px;
border-radius: 6px;
display: none;
}
.codecut-message.success {
background: #d4edda;
color: #155724;
display: block;
}
/* Mobile responsive */
@media (max-width: 480px) {
.codecut-email-row {
flex-direction: column;
height: auto;
gap: 8px;
}
.codecut-input {
border-radius: 8px;
height: 36px;
}
.codecut-subscribe-btn {
width: 100%;
text-align: center;
border-radius: 8px;
height: 36px;
}
}
Subscribe
📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →
From CSS Selectors to Natural Language: Web Scraping with ScrapeGraphAI Read More »

