Parsera: Natural Language Web Scraping with LLMs

Traditional web scraping code depends on complex selectors tied to a page's HTML structure, so it breaks whenever that structure changes and needs constant maintenance.

With Parsera, you can scrape websites by simply describing what data you want to extract in plain language, letting LLMs handle the complexity of finding the right elements.

Here’s an example that scrapes GitHub’s trending Python repositories page to collect:

  • Fork counts
  • Repository names
  • Repository owners
  • Star counts
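
import os
from pprint import pprint

# Set the OpenAI API key before importing Parsera (replace the placeholder with your own key).
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY_HERE"

from parsera import Parsera

# Page to scrape: today's trending Python repositories on GitHub.
url = "https://github.com/trending/python?since=daily"

# Map each output field name to a plain-language description of the data to extract.
elements = {
    "Repository": "Name of the repository",
    "Owner": "Owner of the repository",
    "Stars": "Number of stars",
    "Forks": "Number of forks",
}

# Run the scraper; the result is a list with one dict per repository.
scraper = Parsera()
result = scraper.run(url=url, elements=elements)
pprint(result)

Output:
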
[{'Forks': '272', 'Owner': 'DS4SD', 'Repository': 'docling', 'Stars': '5,264'},
 {'Forks': '2,502',
  'Owner': 'mingrammer',
  'Repository': 'diagrams',
  'Stars': '38,714'},
 {'Forks': '3,883',
  'Owner': 'All-Hands-AI',
  'Repository': 'OpenHands',
  'Stars': '34,095'},
 {'Forks': '7,288',
  'Owner': 'frappe',
  'Repository': 'erpnext',
  'Stars': '21,616'},
 {'Forks': '7,249',
  'Owner': 'abi',
  'Repository': 'screenshot-to-code',
  'Stars': '58,574'},
 {'Forks': '46,154',
  'Owner': 'donnemartin',
  'Repository': 'system-design-primer',
  'Stars': '274,420'},
 ...
]
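
The counts in the output above are strings with thousands separators ('5,264'), so they need converting before you can sort or sum them. Here is a minimal sketch of that cleanup step. The helper names `parse_count` and `normalize_rows` are hypothetical and not part of Parsera, and the sample rows are copied from the output above. It assumes every count is plain digits with optional commas; since the values come from an LLM, other formats (such as '5.2k') may appear and would need extra handling.

```python
def parse_count(text: str) -> int:
    """Convert a count string such as '5,264' into an int.

    Assumes plain digits with optional comma separators.
    """
    return int(text.replace(",", "").strip())


def normalize_rows(rows: list[dict]) -> list[dict]:
    """Return copies of the scraped rows with Stars and Forks as integers."""
    return [
        {
            **row,
            "Stars": parse_count(row["Stars"]),
            "Forks": parse_count(row["Forks"]),
        }
        for row in rows
    ]


# Two rows copied from the example output above.
sample = [
    {"Forks": "272", "Owner": "DS4SD", "Repository": "docling", "Stars": "5,264"},
    {"Forks": "2,502", "Owner": "mingrammer", "Repository": "diagrams", "Stars": "38,714"},
]

cleaned = normalize_rows(sample)

# With integer counts, numeric comparisons work as expected.
top = max(cleaned, key=lambda row: row["Stars"])
print(top["Repository"], top["Stars"])
```

To use this on a real run, pass `result` from the scraper in place of `sample`.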

Link to Parsera.


Work with Khuyen Tran