
browser-use: Turn Plain English Prompts into Working Browser Automation

Introduction

Traditional browser automation tools like Playwright require you to write CSS selectors for every element you want to extract:

# Extract a laptop's price from a product card
price = await card.locator("h4.price").text_content()

This approach works, but it tightly couples your scraper to the site’s HTML structure. If a class name changes, your scraper breaks. You then have to inspect the updated HTML and rewrite your selectors from scratch.

What if you could just describe what you want in plain English?

# Tell an AI agent what to extract
agent = Agent(
    task="Find gaming laptops under $1500 and extract the name, price, and GPU",
    llm=ChatOpenAI(model="gpt-4o"),
)

That is what browser-use does. Instead of describing the steps, you describe the goal, and an LLM works out the steps for you.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


What is browser-use?

browser-use is a Python library that gives an LLM a working browser. Under the hood it uses Playwright to drive the browser, but the LLM reads each page and decides what to click, type, and extract. You write the task in plain English, and the agent figures out the rest.

To install browser-use:

pip install browser-use
playwright install chromium

This article uses browser-use v0.12.5.

Since this tutorial uses OpenAI’s GPT-4o as the agent’s model, you will need an OpenAI API key. Store it in a .env file:

OPENAI_API_KEY=your-key-here

Then load it with python-dotenv. If you are working in a Jupyter notebook, apply nest_asyncio first so that asyncio.run works inside the notebook's already-running event loop:

import nest_asyncio
from dotenv import load_dotenv

nest_asyncio.apply()  # allow nested event loops in Jupyter

load_dotenv()  # read OPENAI_API_KEY from .env
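
As an optional extra (our own helper, not part of browser-use), a small fail-fast check can surface a missing key immediately instead of as a confusing error deep inside the agent run:

```python
import os


def require_env(name: str = "OPENAI_API_KEY") -> str:
    """Return the named environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set - check your .env file")
    return value
```

Call `require_env()` right after `load_dotenv()` so a typo in the `.env` file fails loudly rather than halfway through a paid agent run.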

Synthesizing Hacker News Themes

We will point browser-use at Hacker News and ask it to:

Find the top AI-related stories on the front page and identify the common themes.

This task is a good fit for browser-use because it mixes extraction with judgment: the agent has to classify each story and then reason across all of them to pull out themes.

Let’s set it up. First, define the output schema using Pydantic. This tells browser-use what structure to return:

import asyncio
from pydantic import BaseModel
from browser_use import Agent, ChatOpenAI


class HNResults(BaseModel):
    titles: list[str]
    points: list[int]
    comments: list[int]
    urls: list[str]
    themes: list[str]
    summary: str
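
Before wiring the schema into an agent, you can check that it behaves as expected by validating a hand-written JSON payload against it (a plain pydantic call, no browser involved; the sample values below are made up):

```python
from pydantic import BaseModel


class HNResults(BaseModel):
    titles: list[str]
    points: list[int]
    comments: list[int]
    urls: list[str]
    themes: list[str]
    summary: str


# A made-up payload in the shape the agent should return
sample = (
    '{"titles": ["Example AI story"], "points": [100], "comments": [12], '
    '"urls": ["https://example.com"], "themes": ["testing"], "summary": "ok"}'
)
parsed = HNResults.model_validate_json(sample)
print(parsed.points[0])  # 100
```

If the agent returns JSON that does not match the schema, `model_validate_json` raises a `ValidationError`, which is far easier to debug than silently malformed data.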

Now define the agent. We configure it with four parameters:

  • task: the natural language instructions for what you want the agent to do.
  • llm: the model that drives the agent. Here we use GPT-4o.
  • output_model_schema: the Pydantic schema that tells the agent how to structure its final result.
  • calculate_cost: when set to True, browser-use tracks token usage and dollar cost so you can inspect them later.

async def find_ai_stories():
    # Configure the agent with task, model, schema, and cost tracking
    agent = Agent(
        task=(
            "Go to https://news.ycombinator.com/ and find "
            "all stories on the front page that are "
            "about AI, LLMs, or AI agents. "
            "For each story, extract the title, points, "
            "comment count, and URL. "
            "Then identify 2-3 common themes across these "
            "stories and write a short summary of what the "
            "Hacker News community is currently excited or "
            "concerned about regarding AI."
        ),
        llm=ChatOpenAI(model="gpt-4o"),
        output_model_schema=HNResults,
        calculate_cost=True,
    )

    # Run the agent and get the structured result
    history = await agent.run()
    result = history.final_result()

    # Parse into HNResults, falling back to an empty result if nothing came back
    parsed = HNResults.model_validate_json(result) if result else HNResults(
        titles=[], points=[], comments=[], urls=[], themes=[], summary="",
    )
    return parsed, history

Run the agent:

results, history = asyncio.run(find_ai_stories())
print(f"Found {len(results.titles)} AI-related stories")
📍 Step 1:
  👍 Eval: Successfully navigated to the Hacker News front page
         and identified several stories that may relate to AI.
  🧠 Memory: On the Hacker News front page. Identified potential
            AI-related stories by their titles. Need to extract
            details for analysis.
  🎯 Next goal: Extract details from the identified AI-related
              stories on the front page.
  ▶️  extract: query: AI|LLM|AI agent, extract_links: True

📍 Step 2:
  👍 Eval: Successfully extracted details of AI-related stories
         from Hacker News.
  🧠 Memory: Extracted details of AI-related stories. Ready to
            analyze for common themes and summarize.
  🎯 Next goal: Analyze the extracted stories to identify 2-3
              common themes and write a summary.
  ▶️  done: 8 stories extracted, 3 themes identified

✅ Task completed successfully

⚠️  Agent reported success but judge thinks task failed
⚖️  Judge Verdict: ❌ FAIL
   Failure Reason: The agent included a non-AI related story
   ('Solod – A subset of Go that translates to C') in its results.

Found 8 AI-related stories

Each step in the log shows the agent’s internal reasoning loop. There are four fields, and they work together to drive the next action:

  • Eval checks whether the last action worked. This makes the agent self-correcting, so failures get retried instead of silently propagating.
  • Memory tracks what has been done so far. This stops the agent from repeating expensive actions and is what makes multi-step tasks possible.
  • Next goal plans the next action based on eval and memory. The LLM decides what to do next, so you do not have to write a state machine.
  • Action executes the plan from a built-in toolkit of navigate, click, extract, and scroll. The agent picks the right tool on its own, so the same code works on a new site.
%%{init: {"theme": "dark"}}%%
flowchart TD
    Task([Task prompt]) --> Eval
    Eval[Eval: check last action] --> Memory[Memory: track progress]
    Memory --> NextGoal[Next goal: plan next step]
    NextGoal --> Action[Action: execute from toolkit]
    Action -->|Not done| Eval
    Action -->|Done| Result([Worker result])
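
The loop in the diagram can be sketched as plain Python (a toy illustration with hardcoded stubs; in browser-use the eval, plan, and action choices are all made by the LLM, not by rules):

```python
def run_worker_loop(task, plan_next, execute, max_steps=10):
    """Toy eval -> memory -> next goal -> action loop."""
    memory = []  # what has been done so far
    last = None
    for _ in range(max_steps):
        ok = last is None or last["ok"]       # Eval: did the last action work?
        action = plan_next(task, memory, ok)  # Next goal: plan from eval + memory
        if action == "done":
            return memory
        last = execute(action)                # Action: run a tool
        memory.append((action, last))         # Memory: record progress
    return memory


# Stub planner and executor that navigate, extract, then declare done
plan = lambda task, mem, ok: ["navigate", "extract", "done"][len(mem)]
run = lambda action: {"ok": True, "action": action}

history = run_worker_loop("find AI stories", plan, run)
print([a for a, _ in history])  # ['navigate', 'extract']
```

The point of the sketch is the data flow: each iteration feeds the evaluation of the last action and the accumulated memory back into the next decision.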

Once the worker declares success, a second LLM steps in as a judge.

This is a clever setup. Workers rarely catch their own mistakes without an independent perspective. Since the judge only sees the prompt and final result, it can objectively evaluate the output.

Here, it correctly identified that Solod is not an AI story.

%%{init: {"theme": "dark"}}%%
flowchart TD
    Result([Worker result]) --> Review[Judge: review against original task]
    Review --> Verdict{Pass or fail?}
    Verdict -->|Pass| Output([Trusted output])
    Verdict -->|Fail| Retry([Retry or alert])
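
You also do not have to rely on the judge alone. A cheap, deterministic sanity check (our own addition, not part of browser-use) can flag titles that mention none of the expected keywords for manual review:

```python
AI_KEYWORDS = ("ai", "llm", "agent", "gpt", "claude", "speech-to-text")


def flag_suspect_titles(titles: list[str]) -> list[int]:
    """Return indices of titles that contain none of the AI keywords."""
    return [
        i for i, title in enumerate(titles)
        if not any(kw in title.lower() for kw in AI_KEYWORDS)
    ]


titles = [
    "Show HN: Ghost Pepper – Local hold-to-talk speech-to-text for macOS",
    "Solod – A subset of Go that translates to C",
]
print(flag_suspect_titles(titles))  # [1] - the Solod story gets flagged
```

A substring check like this will also match words such as "maintain", so treat it as a rough filter for review, not a classifier.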

Let’s look at the results:

for title, points, comments, url in zip(
    results.titles, results.points, results.comments, results.urls
):
    print(f"  {title}")
    print(f"  {points} points | {comments} comments")
    print(f"  {url}")
    print()

print("Themes:")
for theme in results.themes:
    print(f"  - {theme}")

print(f"\nSummary: {results.summary}")
  Show HN: Ghost Pepper – Local hold-to-talk speech-to-text for macOS
  363 points | 170 comments
  https://github.com/matthartman/ghost-pepper

  Solod – A subset of Go that translates to C
  122 points | 30 comments
  https://github.com/solod-dev/solod

  Issue: Claude Code is unusable for complex engineering tasks with Feb updates
  1047 points | 587 comments
  https://github.com/anthropics/claude-code/issues/42796

  Sam Altman may control our future – can he be trusted?
  1340 points | 542 comments
  https://www.newyorker.com/magazine/2026/04/13/sam-altman-may-control-our-future-can-he-be-trusted

  Launch HN: Freestyle – Sandboxes for Coding Agents
  265 points | 145 comments
  https://www.freestyle.sh/

  A cryptography engineer's perspective on quantum computing timelines
  455 points | 186 comments
  https://words.filippo.io/crqc-timeline/

  AI singer now occupies eleven spots on iTunes singles chart
  166 points | 257 comments
  https://www.showbiz411.com/2026/04/05/itunes-takeover-by-fake-ai-singer...

  Show HN: Hippo, biologically inspired memory for AI agents
  90 points | 17 comments
  https://github.com/kitfunso/hippo-memory

Themes:
  - AI in media and entertainment
  - Ethical concerns about AI leadership
  - Technical challenges in AI development

Summary: The Hacker News community is currently excited about advancements in AI tools like speech-to-text applications and biologically inspired memory systems. There are also significant discussions around ethical concerns regarding influential figures in AI like Sam Altman. Additionally, there are technical challenges being highlighted in the development of complex AI systems.

Most of the output is solid:

  • Structured fields are accurate. Titles, points, comments, and URLs are correct for every row.
  • Themes map to real stories. “AI in media and entertainment” lines up with the AI singer post, “Ethical concerns about AI leadership” with the Sam Altman piece, and “Technical challenges in AI development” with the Claude Code issue.
  • The summary reads like a take, not a list. Instead of restating each story, it picks out what the community is excited about (new AI tools), worried about (Sam Altman), and struggling with (complex AI systems).

However, the classifications are wrong in two places. Solod (a Go-to-C transpiler) and the quantum computing post both made the list even though neither is about AI.

Working with the Output

Once you have the structured output, you can do something with it. The simplest first step is to load it into a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({
    "title": results.titles,
    "points": results.points,
    "comments": results.comments,
    "url": results.urls,
})
df
   title                                            points  comments  url
0  Show HN: Ghost Pepper – Local hold-to-talk spe…     363       170  https://github.com/matthartman/ghost-pepper
1  Solod – A subset of Go that translates to C         122        30  https://github.com/solod-dev/solod
2  Issue: Claude Code is unusable for complex eng…    1047       587  https://github.com/anthropics/claude-code/issu…
3  Sam Altman may control our future – can he be …    1340       542  https://www.newyorker.com/magazine/2026/04/13/…
4  Launch HN: Freestyle – Sandboxes for Coding Ag…     265       145  https://www.freestyle.sh/
5  A cryptography engineer’s perspective on quant…     455       186  https://words.filippo.io/crqc-timeline/
6  AI singer now occupies eleven spots on iTunes …     166       257  https://www.showbiz411.com/2026/04/05/itunes-t…
7  Show HN: Hippo, biologically inspired memory f…      90        17  https://github.com/kitfunso/hippo-memory

Now you can drop the two misclassified rows (the Solod transpiler and the quantum computing post) and keep only the true AI stories:

misclassified_rows = [1, 5]
ai_only = df.drop(misclassified_rows).reset_index(drop=True)
ai_only
   title                                            points  comments  url
0  Show HN: Ghost Pepper – Local hold-to-talk spe…     363       170  https://github.com/matthartman/ghost-pepper
1  Issue: Claude Code is unusable for complex eng…    1047       587  https://github.com/anthropics/claude-code/issu…
2  Sam Altman may control our future – can he be …    1340       542  https://www.newyorker.com/magazine/2026/04/13/…
3  Launch HN: Freestyle – Sandboxes for Coding Ag…     265       145  https://www.freestyle.sh/
4  AI singer now occupies eleven spots on iTunes …     166       257  https://www.showbiz411.com/2026/04/05/itunes-t…
5  Show HN: Hippo, biologically inspired memory f…      90        17  https://github.com/kitfunso/hippo-memory
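
With a clean frame, you can rank the remaining stories by engagement. A minimal sketch with sort_values (the column names match ai_only above; the sample values here are made up for a self-contained example):

```python
import pandas as pd

# Stand-in for the cleaned ai_only frame, same columns, made-up rows
ai_only = pd.DataFrame({
    "title": ["Ghost Pepper", "Claude Code issue", "Hippo memory"],
    "points": [363, 1047, 90],
    "comments": [170, 587, 17],
})

# Sort by points, highest first, and renumber the rows
top = ai_only.sort_values("points", ascending=False).reset_index(drop=True)
print(top.loc[0, "title"])  # Claude Code issue
```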

What This Run Cost

Now let’s check what this synthesis cost in tokens, dollars, and time.

usage = history.usage
print(f"Total tokens: {usage.total_tokens:,}")
print(f"Total cost: ${usage.total_cost:.4f}")
print(f"Steps: {len(history.history)}")
print(f"Duration: {history.total_duration_seconds():.1f}s")
Total tokens: 42,339
Total cost: $0.1157
Steps: 3
Duration: 53.2s
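
A quick back-of-the-envelope check on the per-token rate, using the numbers from the run above:

```python
total_tokens = 42_339
total_cost = 0.1157

# Normalize to the per-1M-token pricing that model providers quote
cost_per_million = total_cost / total_tokens * 1_000_000
print(f"${cost_per_million:.2f} per 1M tokens")  # $2.73 per 1M tokens
```

That blended rate sits close to GPT-4o's input price, which is consistent with most of the tokens being page context fed into the model rather than generated output.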

So this run cost 12 cents and took 53 seconds. Here is what the same money buys you elsewhere:

  • GPT-4o. A single long-prompt API call. Same dollar cost, but no browser, no scraping, no real page involved.
  • Your own time. Opening Hacker News, skimming 30 stories, copying the AI ones into a doc, and writing a summary by hand. Probably 10 to 20 minutes of focused work.
  • Playwright + LLM. A scraper to grab the page, one LLM call to classify, another to synthesize, and the code to glue them together. More code, more failure points, and likely more total tokens.

If 12 cents per run is too much for your use case, there are three ways to bring it down:

  • Use a cheaper model. Swap gpt-4o for gpt-4o-mini or claude-haiku. The same task often runs for under 2 cents, with some loss of reasoning quality.
  • Run a local model. browser-use works with Ollama and LM Studio, so the dollar cost drops to zero. The trade-off is that local models need to be 30B+ to handle structured output reliably, and each step is slower.
  • Tighten the prompt. Shorter tasks mean fewer steps, and each step carries the full conversation history forward, so cutting one step can save thousands of tokens.

For a full walkthrough on setting up local LLMs for workflows like this, see our LangChain and Ollama guide.

A Second Experiment: Scraping Newegg

browser-use doesn’t always work this well. To show where it breaks, I ran it on a harder task: scraping Newegg for gaming laptops. Here is the prompt:

task=(
    "Go to https://www.newegg.com/Gaming-Laptops"
    "/SubCategory/ID-3365 and find gaming laptops "
    "matching these criteria:\n"
    "- Price: $0-$1500\n"
    "- GPU: NVIDIA GeForce RTX 50 Series\n"
    "- RAM: 32GB\n"
    "For each laptop, extract the name, "
    "price, GPU, CPU, RAM, and storage. "
    "Then pick the best value and explain why."
)

This run didn’t go well. Here are the issues:

  • Price filter: The agent typed the values but skipped APPLY, so the listings never refreshed.
  • 32GB RAM constraint: Silently dropped. The final results all had 16GB.
  • Pagination: Stopped at the first page instead of collecting results from all pages.

Trade-offs

browser-use is powerful, but it comes with real trade-offs:

  • Speed: The agent took 30–60 seconds for the Hacker News task. Each step requires LLM reasoning, while a Playwright script would finish in ~5 seconds.
  • Cost: A single run is cheap (~$0.12), but costs grow quickly. Since each step carries full context, doubling steps can cost more than 2x.
  • Non-determinism: Results vary between runs. The agent may take different actions, and the judge may reach different conclusions.
  • Task fit: Results depend heavily on the page and what you ask for. The same tool can work well on one site and fail on another.

When to Use Each Tool

Knowing the trade-offs, here is how to decide which tool to reach for:

Metric                         Playwright            browser-use
Speed                          Fast                  Slower
Cost per run                   Free                  Paid per LLM call
Deterministic                  Yes                   No
Works on a new site            Needs new selectors   Change the URL
Handles reasoning tasks        No (hardcoded rules)  Yes (LLM reasons)
Exact constraint satisfaction  Yes                   No (silently relaxes hard constraints)

Choose Playwright when:

  • You need identical results every run
  • Speed matters (high-volume or frequent runs)
  • You need exact constraint satisfaction (e.g., “every result must have 32GB RAM”)

Choose browser-use when:

  • The task requires judgment or classification (e.g., “which of these stories are about AI”) rather than pattern matching
  • The task requires synthesis across multiple items (e.g., “what themes connect these”)
  • The page or task may change over time and you do not want to maintain selectors
  • You want scraping, classification, and reasoning in a single prompt instead of three separate pipelines

Conclusion

The best way to understand browser-use is to try it on a small project you actually care about. Here are a few ideas:

  • Summarize a Reddit thread. Provide a URL and ask for the top 3 arguments. Then schedule it to run daily across selected subreddits and post summaries to Slack.
  • Pull today’s top stories from a news site. Extract titles, sources, and short summaries, then schedule a daily digest sent to your inbox.
  • Watch the price of a single product. Monitor a product page and get notified when the price drops below a set threshold.

Each of these starts with a single prompt, costs just a few cents to test, and can quickly become something you use every day.
