Structured Extractor
Design notes for a hosted batch page-to-JSON Actor that returns schema-validated structured data to AI agents.
Design notes for structured-extractor, a hosted batch page-to-JSON extraction Actor on the Apify Store. Send a list of URLs and one JSON Schema describing the fields you want; it fetches them all in a single pass — escalating to a stealth browser and residential proxy when a page is defended — runs a language model per page to extract data matching your schema, validates it, and returns one record per URL.
The Apify Store page covers the input schema, current pricing, and how to try it. This page is for the design questions — why the Actor is shaped the way it is, what alternatives I considered, and what it deliberately refuses.
What this is
The problem class: an AI agent needs typed data from web pages — a product’s {title, price, in_stock}, a set of listings, records it can write straight to a database — but what it can get is the pages. So it fetches each one, receives tens of thousands of tokens of HTML or markdown, and parses that itself: more tokens, context-window pressure, hallucinated fields when data is sparse.
Structured Extractor is the layer that closes that gap, and it works on a batch. The decision it makes is narrow: given a list of pages and one target shape, produce data in that shape for each page, and report per page whether it actually conformed. It is not a scraper and not a crawler — it doesn’t discover links, and fetching is delegated to a stealth fetch path rather than reimplemented. What it owns is the extraction: turning fetched content into validated, typed JSON, one record per URL.
It takes a list rather than a single URL for a concrete reason covered below — the expensive part of fetching defended pages is the stealth-browser setup, and you want to pay that once for the whole batch, not once per page.
The output is one dataset record per input URL: the URL, a status (completed / failed / deferred), the extracted result, a schemaValid boolean, the token count, and an error string. schemaValid is the part I care most about — the difference between “here is some JSON, good luck” and “here is JSON that provably matches the contract you gave me, or an honest signal that it doesn’t.”
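As a rough illustration, the record shape described above could be typed on the caller's side like this. The field names follow the sample record in the usage section; the `ExtractionRecord` type and the `is_trustworthy` helper are hypothetical names for this sketch, not part of the Actor.

```python
from typing import Any, Literal, Optional, TypedDict

# The three statuses named in the text.
Status = Literal["completed", "failed", "deferred"]


class ExtractionRecord(TypedDict):
    """Caller-side view of one dataset record (a sketch, not an official type)."""
    url: str
    status: Status
    result: Optional[Any]      # extracted JSON, if any
    schemaValid: bool          # did the result conform to outputSchema?
    tokensUsed: int
    error: Optional[str]


def is_trustworthy(record: ExtractionRecord) -> bool:
    """True only when the page finished and its result matched the schema."""
    return record["status"] == "completed" and record["schemaValid"] is True
```

A consumer that writes straight to a database would gate on something like `is_trustworthy` and route everything else to a retry or review queue.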
Why I built it this way
Separate extraction from fetching
I maintain an adaptive page-fetching Actor that returns clean content in a dozen-plus deterministic formats. The temptation was to bolt an LLM extraction mode onto it; I didn’t. Fetching is deterministic, cacheable, and model-free, with one cost profile. Extraction is probabilistic and model-driven, with its own cost profile and its own failure modes. Fusing them would make every fetch carry the option of an LLM bill and force every extraction to re-implement stealth. Keeping them as two Actors keeps each honest. Structured Extractor calls the fetch path internally, so the customer still makes one call, but the concerns stay separate.
Batch, because the cost that matters is the fetch setup
This is the design decision that shaped the input. The first version took a single URL per run. That’s the wrong unit for this tool: the pages people want structured are usually a set of similar pages, and many of them are defended, which means a stealth browser and a residential proxy. Spinning that up is the expensive, slow part — and a one-URL-per-run design pays it once per page.
So the Actor takes a list and fetches the whole batch in a single call to the fetch path, which launches one stealth browser and shares it across the URLs. The amortization is specifically on the fetch and on the Actor’s own cold-start — not on the extraction. The model still reads each page individually, so token cost scales with the number of URLs no matter what. What batching buys is fewer browser launches, fewer cold-starts, and one fetch round-trip instead of N. The latency win is bigger than the dollar win, and both move the right way.
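The amortization argument can be made concrete with a little arithmetic. This is a sketch only: the function name and the timings in the usage lines are illustrative assumptions, not measurements from the Actor.

```python
def fetch_overhead_secs(
    n_urls: int,
    setup_secs: float,      # assumed cost of launching the stealth browser + proxy
    per_page_secs: float,   # assumed cost of loading one page once the browser is up
    batched: bool,
) -> float:
    """Total fetch time when setup is paid once per batch vs. once per URL."""
    launches = 1 if batched else n_urls
    return launches * setup_secs + n_urls * per_page_secs


# Illustrative numbers only: 10 URLs, 20 s setup, 3 s per page.
one_batch = fetch_overhead_secs(10, 20.0, 3.0, batched=True)    # 20 + 30
ten_runs = fetch_overhead_secs(10, 20.0, 3.0, batched=False)    # 200 + 30
```

The per-page term is identical in both cases, which mirrors the point above: batching removes repeated setup, while the per-page work (and the model's token cost, not modeled here) still scales with the number of URLs.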
The constraint that falls out: one shared outputSchema for the whole batch. Per-URL schemas would be incoherent and would defeat the “same fields from a list of similar pages” use case that motivates batching. If pages are genuinely different shapes, that’s separate runs.
A wall-clock budget, because of the 300-second cap
Synchronous and x402 calls cap at 300 seconds. A batch of stealth pages plus a per-page model call can blow past that. Rather than fail the whole run, the Actor takes a maxRuntimeSecs budget (default 270, under the cap). It works through the batch in order; when the deadline approaches, the URLs it hasn’t reached come back deferred — uncharged, and retry-friendly. So a batch that’s too big for one synchronous call degrades into “here’s what I finished, here’s what to retry” instead of an error. Asynchronous callers raise the budget for bigger batches.
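A minimal sketch of the budget loop described above, assuming a hypothetical `process` callable that handles one URL and a hypothetical `reserve_secs` margin that stands in for "the deadline approaches". The Actor's real scheduling may differ.

```python
import time
from typing import Callable, Dict, List


def run_with_budget(
    urls: List[str],
    process: Callable[[str], Dict],
    max_runtime_secs: float = 270.0,   # default from the text, under the 300 s cap
    reserve_secs: float = 0.0,         # hypothetical margin: skip if this much time is not left
    clock: Callable[[], float] = time.monotonic,
) -> List[Dict]:
    """Process URLs in order; mark everything past the deadline as deferred."""
    deadline = clock() + max_runtime_secs
    records: List[Dict] = []
    for url in urls:
        if clock() + reserve_secs >= deadline:
            # Not reached in time: report it so the caller can retry it later.
            records.append({"url": url, "status": "deferred"})
            continue
        records.append(process(url))
    return records
```

Once the deadline check trips, every remaining URL is deferred without being attempted, so the output still has one record per input URL.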
Bounded input is the per-URL cost lever
The dominant cost is model input tokens — how much of each page the model reads. A long page can be 50,000+ tokens even after cleaning, making per-URL cost high and unpredictable. So the content bundle for each page is capped to a token budget, and each component of the bundle is capped independently — a pathological page with one gigantic structured-data blob can’t consume the whole window and starve the prose. The budget is a caller-visible input.
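One way to picture independent per-component caps is below. It is a sketch under stated assumptions: the component names (`prose`, `structured_data`) are hypothetical, and whitespace splitting is a crude stand-in for a real tokenizer.

```python
from typing import Dict


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated tokens (approximate tokenizer)."""
    return " ".join(text.split()[:max_tokens])


def cap_bundle(components: Dict[str, str], caps: Dict[str, int]) -> Dict[str, str]:
    """Cap every component on its own, so one oversized part cannot starve the others."""
    return {name: truncate_to_tokens(text, caps[name]) for name, text in components.items()}


# Hypothetical bundle: short prose plus a pathological structured-data blob.
bundle = {"prose": "A Light in the Attic costs 51.77", "structured_data": "x " * 100_000}
capped = cap_bundle(bundle, {"prose": 3000, "structured_data": 1000})
```

Because each cap applies separately, the blob is cut to its own limit and the prose passes through untouched, whatever the blob's original size.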
Validate and retry, rather than trust the model
The most common way these tools disappoint is silent non-conformance: you ask for {title, price} and get {name, cost}. So each page’s result is validated against your schema; if it doesn’t validate, that page is retried once with the specific validation errors fed back, and if it still doesn’t conform, schemaValid comes back false rather than pretending. I chose validate-and-retry over a provider’s native structured-output mode because callers send arbitrary JSON Schemas, and native enforcement across providers chokes on the long tail of schema features. Prompt the schema, check the result, retry on failure: that works for any schema.
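The validate-and-retry loop might look roughly like this. Everything here is an assumption for illustration: `call_model` is a hypothetical callable standing in for the model request, and `validation_errors` is a deliberately tiny stand-in that only checks required fields and simple property types, where a real implementation would use a full JSON Schema validator.

```python
from typing import Any, Callable, Dict, List

# Simplified JSON Schema type names mapped to Python types (stand-in only).
_TYPES = {"string": str, "number": (int, float), "boolean": bool, "object": dict, "array": list}


def validation_errors(data: Any, schema: Dict) -> List[str]:
    """Tiny stand-in validator: required fields and top-level property types."""
    if not isinstance(data, dict):
        return ["result is not an object"]
    errors = [f"missing required field '{key}'" for key in schema.get("required", []) if key not in data]
    for key, spec in schema.get("properties", {}).items():
        expected = _TYPES.get(spec.get("type"))
        if key in data and expected and not isinstance(data[key], expected):
            errors.append(f"field '{key}' is not of type {spec['type']}")
    return errors


def extract_with_retry(
    page: str,
    schema: Dict,
    call_model: Callable[[str, Dict, List[str]], Any],
    max_retries: int = 1,   # the text describes a single retry
) -> Dict:
    """Ask the model, validate, and retry with the specific errors fed back."""
    errors: List[str] = []
    result: Any = None
    for _ in range(1 + max_retries):
        result = call_model(page, schema, errors)   # errors is empty on the first attempt
        errors = validation_errors(result, schema)
        if not errors:
            return {"result": result, "schemaValid": True}
    # Still non-conforming: report it honestly rather than pretending.
    return {"result": result, "schemaValid": False}
```

The check runs outside the model, so the same loop works for any schema the validator understands, regardless of which provider serves the model call.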
Stealth, because degraded pages are the silent failure
Pages behind bot detection sometimes serve degraded content to suspected automation, such as wrong prices or missing stock. Routing the fetch through a stealth path with residential-proxy escalation means what’s extracted is the real page, not a bot-degraded copy. The escalation happens only when a page actually fights back.
How to use it
A realistic call: pull the same fields from a batch of product pages, validated against one schema.
{
  "urls": [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
  ],
  "outputSchema": {
    "type": "object",
    "properties": { "title": { "type": "string" }, "price": { "type": "string" } },
    "required": ["title", "price"]
  }
}
The same call through the Apify Python client:
from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("shelvick/structured-extractor").call(
    run_input={
        "urls": [
            "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        ],
        "outputSchema": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "price": {"type": "string"}},
            "required": ["title", "price"],
        },
    }
)

# The run's default dataset holds one record per input URL.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["status"], item["schemaValid"], item["result"])
One record comes back per URL:
{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "status": "completed",
  "result": { "title": "A Light in the Attic", "price": "£51.77" },
  "schemaValid": true,
  "tokensUsed": 4200,
  "error": null
}
Omit outputSchema and each page returns best-effort JSON. From an MCP-enabled agent, the Actor appears as a tool on mcp.apify.com with a self-documenting input schema, and each call can be paid for via x402 (USDC on Base) or Skyfire managed tokens. For synchronous and x402 calls, keep maxRuntimeSecs at its default and let an oversized batch defer its tail; for large batches, use the asynchronous path.
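When a batch defers its tail, the caller's follow-up is to resubmit only the deferred URLs with the same schema. A minimal sketch, assuming records shaped like the sample above; `build_retry_input` is a hypothetical helper name, not part of the Actor or the client library.

```python
from typing import Dict, List, Optional


def build_retry_input(records: List[Dict], original_input: Dict) -> Optional[Dict]:
    """Return a new run input containing only the deferred URLs, or None if none deferred."""
    deferred = [r["url"] for r in records if r["status"] == "deferred"]
    if not deferred:
        return None
    # Same outputSchema and other settings, narrower URL list.
    return {**original_input, "urls": deferred}
```

The result can be passed as `run_input` to another call; failed URLs are left out on purpose here, since whether they are worth retrying depends on the error string.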
How it compares to alternatives
| Approach | Stealth fetch | Structured to your schema | Conformance validated | Batch fetch amortization |
|---|---|---|---|---|
| Raw stealth fetcher | Yes | No — raw HTML/markdown | No | Depends |
| Model call on your own fetched HTML | No — you fetch | Yes | Usually not | No |
| Browser automation + hand-written selectors | Yes | Yes — you script it | Manual | You build it |
| Structured Extractor | Yes | Yes — JSON Schema in | Yes, per-page retry | Yes — one fetch per batch |
The raw-fetch and roll-your-own-model approaches each solve half the problem; hand-written selectors solve both but cost maintenance every time a site’s markup shifts. This Actor is the intersection, applied across a batch so the expensive fetch setup is paid once.
Pricing model
Pay-per-event, billed only on success — one charge per successfully extracted URL, after that URL’s record is pushed, covering the fetch, the extraction, validation, and any per-page retry. failed and deferred URLs are never charged, so a batch costs only the URLs it actually extracted. That matters with the runtime budget: an oversized batch that defers its tail only bills for the pages it finished.
Current per-event rates are on the Apify Store Pricing tab.
Open questions / future work
- Concurrent extraction. Pages are currently extracted in order under the runtime budget. Running the per-page model calls with bounded concurrency would fit more URLs under the 300-second cap before the tail defers — the obvious next throughput lever.
- Cheaper retries. A conformance retry re-sends the page; sending only the prior invalid JSON, the schema, and the errors would cut the retry’s cost sharply on large pages. The reason it re-sends today is correctness on “missing field” failures; measuring how often that helps comes first.
- Heterogeneous batches. One schema per run is a deliberate constraint. Per-URL schemas, or auto-grouping a mixed batch by page type, is conceivable but would muddy the cost story — not planned.
- Document extraction. The same validate-and-retry shape would work on PDFs and images. Different fetch path, multimodal model — a separate tool, not a mode here.
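For the concurrent-extraction item, bounded concurrency could be sketched with an asyncio semaphore. This is a hypothetical illustration of the idea, not planned code: `extract_one` stands in for an async per-page model call, and the limit of 4 is arbitrary.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def extract_all(
    pages: List[str],
    extract_one: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,   # arbitrary illustrative limit
) -> List[Any]:
    """Run per-page extraction with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(page: str) -> Any:
        async with semaphore:
            return await extract_one(page)

    # gather returns results in the same order as the input pages.
    return await asyncio.gather(*(guarded(page) for page in pages))
```

A version that also honors the runtime budget would need to stop admitting new pages once the deadline nears, which this sketch leaves out.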
Published on the Apify Store as structured-extractor.