Stealth Browser Agent

Design notes for stealth-browser-agent, a hosted stealth browser agent on the Apify Store. The tool gives AI agents a browser they can drive with natural language — it handles anti-detection fingerprinting, residential proxy routing, and page interaction, then returns structured JSON.

The Apify Store page covers the schema, current pricing, and how to try it. This page is for the design questions — why the Actor has the shape it does, what alternatives I considered, and what it deliberately refuses.

What this is

The problem class: an AI agent needs to interact with a web page — click a button, fill a form, navigate pagination, extract structured data — but the page blocks automation. The agent doesn’t have its own browser, and even if it did, standard headless browsers are fingerprinted and blocked.

This Actor is hosted browser infrastructure for agents. Send a URL and a task description in plain language, and an LLM copilot drives an anti-detection browser through the flow. The copilot takes screenshots to observe the page, decides which browser tools to call (click, type, navigate, scroll), executes them, and repeats until the task is done. The final result is structured JSON, optionally shaped to a caller-provided schema.

The key decision the Actor makes on each step: what to do next. The LLM receives a screenshot of the current page state and the task description, then picks from a curated set of browser tools. This is the same loop a human follows when using a browser — look at the page, decide what to click, observe the result. The LLM does it programmatically.
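The loop described above can be sketched in a few lines. This is a minimal illustration, not the Actor's actual code: `Action`, `Browser`, `Planner`, `run_task`, and the `"max_steps_reached"` status are hypothetical names, and the planner is a scripted fake standing in for the LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class Action:
    tool: str                                  # e.g. "click", "type", "navigate", "scroll", or "done"
    args: dict = field(default_factory=dict)


class Browser(Protocol):
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> bool: ...


# Hypothetical planner signature: (task, screenshot, history) -> next Action.
Planner = Callable[[str, bytes, list], Action]


def run_task(browser: Browser, plan: Planner, task: str, max_steps: int = 10) -> dict:
    """Screenshot, decide, act; repeat until the planner says "done" or the budget runs out."""
    log = []
    for step in range(1, max_steps + 1):
        shot = browser.screenshot()            # observe
        action = plan(task, shot, log)         # decide
        if action.tool == "done":
            return {"status": "completed", "result": action.args, "steps": log}
        ok = browser.execute(action)           # act
        log.append({"step": step, "tool": action.tool, "args": action.args, "success": ok})
    # Assumed status name for an exhausted step budget.
    return {"status": "max_steps_reached", "result": None, "steps": log}


class FakeBrowser:
    """Stand-in browser that records actions instead of driving a real page."""

    def __init__(self) -> None:
        self.executed = []

    def screenshot(self) -> bytes:
        return b"fake-png-bytes"

    def execute(self, action: Action) -> bool:
        self.executed.append(action.tool)
        return True


def scripted_planner(task: str, screenshot: bytes, history: list) -> Action:
    """Stand-in for the LLM: click once, then finish with a result."""
    if not history:
        return Action("click", {"selector": "article h3 a"})
    return Action("done", {"title": "A Light in the Attic"})


outcome = run_task(FakeBrowser(), scripted_planner, "Click the first book")
print(outcome["status"], len(outcome["steps"]))
```

The `max_steps` guard plays the same role as the `maxSteps` input field: the loop always terminates, even if the planner never decides the task is done.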

Output shape: one dataset record per run containing the final URL, task echo, status, structured result, step-by-step action log, and a screenshot URL. The action log is the debugging surface — if the agent took an unexpected path, the log shows exactly which tools it called and what happened.

Why I built it this way

The browser gap in agent toolkits

AI agents have good tools for fetching page content — HTTP clients, stealth fetchers, headless renderers. But fetching is passive. When the agent needs to interact with a page — click through a product catalog, fill a search form, navigate a multi-step checkout flow — there’s a gap. The agent either needs to shell out to a browser automation script (which someone has to write and maintain per site) or give up.

Browser-as-MCP tools are starting to fill this gap. But they share a problem: no stealth. They launch a standard browser, and any page with bot detection flags it immediately. For cooperative sites this is fine; for the growing share of the web behind bot detection, the browser gets blocked before the agent can do anything useful.

Screenshot-primary observation

The LLM needs to understand the page to decide what to do. Two options: parse the DOM or accessibility tree into text, or look at a screenshot.

I chose screenshots as the primary observation method. A screenshot is fixed-cost — roughly 500 tokens regardless of page complexity. A DOM dump scales with page size; a complex e-commerce page can produce 50K+ tokens of HTML, blowing through context windows and making per-step costs unpredictable.
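To make the cost argument concrete, here is a back-of-the-envelope comparison using the two figures above. The 10-step task length is an illustrative assumption, and the sketch ignores prompt, history, and output tokens.

```python
SCREENSHOT_TOKENS = 500    # approximate fixed cost per screenshot (figure from the text)
DOM_TOKENS = 50_000        # complex e-commerce page dumped as HTML (figure from the text)


def observation_tokens(tokens_per_observation: int, steps: int) -> int:
    """Total observation tokens if every step sends one fresh observation."""
    return tokens_per_observation * steps


steps = 10  # assumed task length, for illustration only
screenshot_total = observation_tokens(SCREENSHOT_TOKENS, steps)
dom_total = observation_tokens(DOM_TOKENS, steps)

print(screenshot_total)                # 5000
print(dom_total)                       # 500000
print(dom_total // screenshot_total)   # 100
```

Under these assumptions the DOM approach spends about 100 times more tokens on observation alone, and the gap varies with page size rather than staying fixed.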

The tradeoff: screenshots lose precise text content (OCR-equivalent parsing by the LLM is imperfect). So text-extraction and accessibility-tree tools are available as secondary observation — the LLM can call get_text(selector) when it needs exact text from a specific element. But the primary loop is screenshot, decide, act, screenshot.

Per-step pricing, not per-task

Task complexity varies enormously. A simple extraction is 1-2 steps. A form fill with validation is 5-10. A multi-page navigation flow can hit 15-20. Flat per-task pricing either overcharges simple tasks or undercharges complex ones.

Per-step pricing passes cost through honestly. Each step has a real cost — an LLM planning call plus browser action execution — and the customer pays for the steps actually taken. The step log makes this transparent: you can see exactly what the agent did and how many steps it took.

Two tiers exist because proxy bandwidth is a real incremental cost. Not every target site needs residential routing. By splitting the tiers, customers who target cooperative sites pay only for the browser and LLM, while customers targeting defended sites pay the proxy premium. The proxyGeo input field is the tier switch — present means proxy tier, absent means standard.
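The tier switch can be sketched as a one-line rule. The rates below are placeholders, not the Actor's real prices, and the function names are hypothetical. Treating an empty or null `proxyGeo` as absent is also an assumption, as is the `"US"` value used in the example.

```python
# Placeholder rates in USD per step; NOT the Actor's real prices.
HYPOTHETICAL_RATES = {"standard": 0.01, "proxy": 0.02}


def select_tier(run_input: dict) -> str:
    """Proxy tier when proxyGeo is set, standard tier otherwise."""
    # Assumption: an empty string or None counts as "not set".
    return "proxy" if run_input.get("proxyGeo") else "standard"


def estimate_cost(run_input: dict, steps: int) -> float:
    """Estimated run cost: per-step rate for the selected tier times steps taken."""
    return round(HYPOTHETICAL_RATES[select_tier(run_input)] * steps, 4)


standard_input = {"url": "https://books.toscrape.com", "task": "Extract the first title."}
proxy_input = {**standard_input, "proxyGeo": "US"}  # "US" is an assumed value format

print(select_tier(standard_input), estimate_cost(standard_input, 5))
print(select_tier(proxy_input), estimate_cost(proxy_input, 5))
```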

Why hosted, not self-hosted

You could build this yourself: install a stealth browser, configure a proxy, write the LLM loop, manage screenshots, handle errors. The value is that you don’t have to assemble and maintain the stack. For an AI agent that needs to interact with a defended page once in a workflow, spinning up dedicated infrastructure is overkill. This Actor is the “just call it” option.

The MCP integration makes this concrete: an agent discovers the tool on Apify’s MCP server, reads the input schema, constructs a call, and gets structured JSON back. No browser installation, no proxy configuration, no prompt engineering for the browser copilot. For AI agents that need to scrape bot-defended pages or interact with protected sites programmatically, the entire complexity is hidden behind a single API call.

How to use it

A realistic example: extracting product details from a book catalog where you need to click into a product page first.

Input:

{
  "url": "https://books.toscrape.com",
  "task": "Click the first book in the catalog, then extract its title, price, and availability status. Return JSON: {\"title\": \"...\", \"price\": \"...\", \"in_stock\": true}",
  "maxSteps": 10
}

Python SDK:

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("shelvick/stealth-browser-agent").call(
    run_input={
        "url": "https://books.toscrape.com",
        "task": "Click the first book, extract its title, price, and availability.",
        "maxSteps": 10,
    }
)

# Each run writes one record to its default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["status"])   # e.g. completed
    print(item["result"])   # e.g. {'title': 'A Light in the Attic', 'price': '£51.77', 'in_stock': True}
    print(f"{len(item['steps'])} steps taken")

Output record:

{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "task": "Click the first book, extract its title, price, and availability.",
  "status": "completed",
  "result": {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "in_stock": true
  },
  "steps": [
    {"step": 1, "tool": "click_element", "args": {"selector": "article h3 a"}, "success": true},
    {"step": 2, "tool": "get_text", "args": {"selector": ".product_main"}, "success": true}
  ],
  "screenshotUrl": "https://api.apify.com/v2/key-value-stores/abc/records/screenshot"
}
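Since the action log is the debugging surface, a small helper for reading it can be handy. The sketch below uses only the field names shown in the output record above; `summarize_steps` and `first_failure` are hypothetical helper names, not part of any SDK.

```python
from typing import Optional


def summarize_steps(record: dict) -> list[str]:
    """Render each action-log entry as one readable line."""
    lines = []
    for entry in record.get("steps", []):
        outcome = "ok" if entry.get("success") else "FAILED"
        lines.append(f"{entry['step']:>2}. {entry['tool']}({entry.get('args', {})}) -> {outcome}")
    return lines


def first_failure(record: dict) -> Optional[dict]:
    """Return the first step that did not succeed, or None if all succeeded."""
    for entry in record.get("steps", []):
        if not entry.get("success"):
            return entry
    return None


record = {
    "status": "completed",
    "steps": [
        {"step": 1, "tool": "click_element", "args": {"selector": "article h3 a"}, "success": True},
        {"step": 2, "tool": "get_text", "args": {"selector": ".product_main"}, "success": True},
    ],
}

for line in summarize_steps(record):
    print(line)
print(first_failure(record))
```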

If you’re calling from an MCP-enabled agent, the Actor appears as a tool on mcp.apify.com. The agent can construct calls from the tool description and input schema — no SDK installation needed. This is the typical path for hosted browser automation from agent code.

How it compares to alternatives

Approach	Stealth	Interaction	Structured output	Cost model
Self-hosted headless browser	None — detected immediately	Full (write scripts yourself)	Manual extraction	Your infrastructure
Stealth fetch service	Anti-detection	None — page content only	Raw HTML/markdown	Per-page
Browser-as-a-service (no stealth)	None	Full (LLM-driven)	LLM-extracted	Per-step or flat
Stealth Browser Agent	Anti-detection + residential proxy	Full (LLM-driven)	Structured JSON	Per-step

The stealth-only and interaction-only approaches each solve half the problem. Stealth fetch services return rendered content from defended pages but can’t interact — if you need to click a button before the data appears, you’re stuck. Browser-as-a-service tools provide LLM-driven interaction but use standard browsers that get fingerprinted on defended sites.

This Actor is the intersection. You pay per step for the complexity you actually use, and the two pricing tiers mean you’re not paying proxy costs on cooperative sites.

Pricing model

Per-step billing with two tiers. Each successful browser action step (click, type, navigate, extract) is one charge event:

Standard — no proxy. Covers the LLM planning call and browser action execution.
Proxy — residential proxy routing included. Higher per-step cost covers residential bandwidth. Activates when proxyGeo is set.

Failed steps are never charged. Only successful tool executions are billed. The extraction step (agent’s final JSON response) counts as one step.
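Applied to a run's step log, the billing rule might look like the sketch below. Whether the final JSON response appears as its own entry in `steps` is not stated here, so the helper takes that as a parameter and assumes by default that it does not. `count_billable_steps` is a hypothetical name, and the Store's billing is the authority on actual charges.

```python
def count_billable_steps(record: dict, final_response_logged: bool = False) -> int:
    """Count successful steps, adding one for the final extraction if it is not in the log."""
    billable = sum(1 for step in record.get("steps", []) if step.get("success"))
    # Assumption: a completed run produced one final JSON response that counts as a step.
    if not final_response_logged and record.get("status") == "completed":
        billable += 1
    return billable


example = {
    "status": "completed",
    "steps": [
        {"step": 1, "tool": "click_element", "success": False},  # failed: not charged
        {"step": 2, "tool": "click_element", "success": True},
        {"step": 3, "tool": "get_text", "success": True},
    ],
}

print(count_billable_steps(example))                              # 3
print(count_billable_steps(example, final_response_logged=True))  # 2
```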

Current per-event rates are on the Apify Store Pricing tab.

Open questions / future work

Context caching. The LLM receives the full conversation history each step. Caching the static prefix (system prompt + tool definitions + prior turns) would cut per-step LLM cost significantly for longer sessions. The economics are modeled but the implementation isn’t live yet.
Multi-action batching. Currently the LLM plans one action per turn. Allowing 2-3 actions per turn would cut LLM calls by roughly half to two-thirds for predictable sequences (e.g. filling 5 form fields). Reliability risk: if an early action in a batch fails, the actions queued after it may be invalid.
Batch URL mode. The Actor handles one URL per run. A batch mode processing multiple URLs in parallel — sharing browser session setup cost — would improve economics for bulk extraction workflows.
History summarization. For long sessions (30+ steps), the growing conversation history increases per-step cost. Summarizing older turns while keeping recent ones verbatim would cap this growth curve.
Output schema enforcement. Currently the LLM is instructed to match the schema. A strict validation + retry layer would guarantee conformance for critical agent workflows.
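As an illustration of the last item, a validation and retry layer could look roughly like this. It is a sketch under stated assumptions, not the planned implementation: the schema is a simplified mapping of key to Python type rather than JSON Schema, and `ask_model` is a hypothetical callable standing in for the LLM call.

```python
import json
from typing import Callable


def validate(result: dict, schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the result conforms."""
    problems = []
    for key, expected in schema.items():
        if key not in result:
            problems.append(f"missing key: {key}")
        elif not isinstance(result[key], expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(result[key]).__name__}"
            )
    return problems


def extract_with_retry(ask_model: Callable[[str], str], prompt: str,
                       schema: dict[str, type], max_attempts: int = 3) -> dict:
    """Ask for JSON, validate it, and re-ask with feedback until it conforms."""
    feedback = ""
    for _ in range(max_attempts):
        raw = ask_model(prompt + feedback)
        try:
            result = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"\nPrevious reply was not valid JSON: {exc.msg}"
            continue
        if isinstance(result, dict):
            problems = validate(result, schema)
        else:
            problems = ["top-level value is not an object"]
        if not problems:
            return result
        feedback = "\nPrevious reply had problems: " + "; ".join(problems)
    raise ValueError(f"no conforming result after {max_attempts} attempts")


# Fake model: the first reply has a wrong type, the second conforms.
replies = iter([
    '{"title": "A Light in the Attic", "price": "£51.77", "in_stock": "yes"}',
    '{"title": "A Light in the Attic", "price": "£51.77", "in_stock": true}',
])
book_schema = {"title": str, "price": str, "in_stock": bool}
result = extract_with_retry(lambda prompt: next(replies), "Extract the book.", book_schema)
print(result)
```

Each retry is another LLM call, so under per-step pricing a layer like this would also need a decision about who pays for the retries.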