Structured Data Extractor — URL to JSON

Design notes for a hosted batch page-to-JSON Actor that returns schema-validated structured data to AI agents.

Design notes for structured-extractor, a hosted batch page-to-JSON extraction Actor on the Apify Store. Send a list of URLs and one JSON Schema describing the fields you want; it fetches them all in a single pass — escalating to a stealth browser and residential proxy when a page is defended — runs a language model per page to extract data matching your schema, validates it, and returns one record per URL.

The Apify Store page covers the input schema, current pricing, and how to try it. This page is for the design questions — why the Actor is shaped the way it is, what alternatives I considered, and what it deliberately refuses.

What this is

The problem class: an AI agent needs typed data from web pages (a product’s {title, price, in_stock}, a set of listings, records it can write straight to a database), but all it can fetch is the pages themselves. So it fetches each one, receives tens of thousands of tokens of HTML or markdown, and parses that itself, at the cost of more tokens, context-window pressure, and hallucinated fields when data is sparse.

Structured Extractor is the layer that closes that gap, and it works on a batch. The decision it makes is narrow: given a list of pages and one target shape, produce data in that shape for each page, and report per page whether it actually conformed. It is not a scraper and not a crawler — it doesn’t discover links, and fetching is delegated to a stealth fetch path rather than reimplemented. What it owns is the extraction: turning fetched content into validated, typed JSON, one record per URL.

It takes a list rather than a single URL for a concrete reason covered below — the expensive part of fetching defended pages is the stealth-browser setup, and you want to pay that once for the whole batch, not once per page.

The output is one dataset record per input URL: the URL, a status (completed / failed / deferred), the extracted result, a schemaValid boolean, the token count, and an error string. schemaValid is the part I care most about — the difference between “here is some JSON, good luck” and “here is JSON that provably matches the contract you gave me, or an honest signal that it doesn’t.”

Why I built it this way

Separate extraction from fetching

I maintain an adaptive page-fetching Actor that returns clean content in a dozen-plus deterministic formats. The temptation was to bolt an LLM extraction mode onto it; I didn’t. Fetching is deterministic, cacheable, and model-free, with one cost profile. Extraction is probabilistic and model-driven, with a different cost profile and different failure modes. Fusing them would make every fetch carry the option of an LLM bill and force every extraction to re-implement stealth. Keeping them as two Actors keeps each honest. Structured Extractor calls the fetch path internally, so the customer still makes one call, but the concerns stay separate.

Batch, because the cost that matters is the fetch setup

This is the design decision that shaped the input. The first version took a single URL per run. That’s the wrong unit for this tool: the pages people want structured are usually a set of similar pages, and many of them are defended, which means a stealth browser and a residential proxy. Spinning that up is the expensive, slow part — and a one-URL-per-run design pays it once per page.

So the Actor takes a list and fetches the whole batch in a single call to the fetch path, which launches one stealth browser and shares it across the URLs. The amortization is specifically on the fetch and on the Actor’s own cold-start — not on the extraction. The model still reads each page individually, so token cost scales with the number of URLs no matter what. What batching buys is fewer browser launches, fewer cold-starts, and one fetch round-trip instead of N. The latency win is bigger than the dollar win, and both move the right way.

The constraint that falls out: one shared outputSchema for the whole batch. Per-URL schemas would be incoherent and would defeat the “same fields from a list of similar pages” use case that motivates batching. If pages are genuinely different shapes, that’s separate runs.

A wall-clock budget, because of the 300-second cap

Synchronous and x402 calls cap at 300 seconds. A batch of stealth pages plus a per-page model call can blow past that. Rather than fail the whole run, the Actor takes a maxRuntimeSecs budget (default 270, under the cap). It works through the batch in order; when the deadline approaches, the URLs it hasn’t reached come back deferred: uncharged and safe to retry. So a batch that’s too big for one synchronous call degrades into “here’s what I finished, here’s what to retry” instead of an error. Asynchronous callers can raise the budget for bigger batches.
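A minimal sketch of that deadline loop, to make the degrade-to-deferred behavior concrete. It is not the Actor’s actual code: the function name, the safety_margin_secs parameter, the injectable clock, and the exact record fields it writes are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable


def run_batch(
    urls: Iterable[str],
    extract: Callable[[str], dict],
    max_runtime_secs: float = 270.0,
    safety_margin_secs: float = 20.0,  # hypothetical headroom reserved for one page
    clock: Callable[[], float] = time.monotonic,
) -> list[dict]:
    """Process URLs in order; anything the deadline cuts off comes back deferred."""
    deadline = clock() + max_runtime_secs
    records = []
    for url in urls:
        # Do not start a new page once the remaining time is inside the margin.
        if clock() >= deadline - safety_margin_secs:
            records.append({"url": url, "status": "deferred", "result": None, "error": None})
            continue
        try:
            result = extract(url)
            records.append({"url": url, "status": "completed", "result": result, "error": None})
        except Exception as exc:
            # A per-page failure is recorded and the batch keeps going.
            records.append({"url": url, "status": "failed", "result": None, "error": str(exc)})
    return records
```

Every input URL gets exactly one record, so a caller can collect the deferred URLs and resubmit them as the next batch.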

Bounded input is the per-URL cost lever

The dominant cost is model input tokens: how much of each page the model reads. A long page can be 50,000+ tokens even after cleaning, making per-URL cost high and unpredictable. So the content bundle for each page is capped to a token budget, and each component of the bundle is capped independently, so a pathological page with one gigantic structured-data blob can’t consume the whole window and starve the prose. The budget is a caller-visible input, maxInputTokens.
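One way to picture independent per-component caps is the sketch below. The component names, the percentage split, and the whitespace-based token approximation are all assumptions; a real implementation would use the model’s tokenizer and whatever components the fetch path actually returns.

```python
def approx_tokens(text: str) -> list[str]:
    # Crude stand-in for a tokenizer: whitespace-separated words.
    return text.split()


def cap_bundle(
    bundle: dict[str, str],
    max_input_tokens: int,
    percent_by_component: dict[str, int],
) -> dict[str, str]:
    """Truncate each component to its own slice of the total token budget."""
    capped = {}
    for name, text in bundle.items():
        # Each component has its own limit, so one oversized component
        # cannot consume the budget reserved for the others.
        # A component with no listed share gets a limit of 0 and is dropped.
        limit = max_input_tokens * percent_by_component.get(name, 0) // 100
        capped[name] = " ".join(approx_tokens(text)[:limit])
    return capped


# Hypothetical page: a huge structured-data blob next to short prose.
page = {"structured_data": "x " * 1000, "prose": "word " * 50}
capped = cap_bundle(page, 100, {"structured_data": 30, "prose": 70})
```

Here the blob is cut to its 30-token slice while the 50-word prose survives intact, which is the starvation case the independent caps are meant to prevent.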

That same input-token count is also what the pricing meters (see Pricing model) — so the cost lever and the billed unit are the same number, reported back on every record. Tightening the budget visibly lowers the bill; there’s no hidden second cost axis.

Validate and retry, rather than trust the model

The most common way these tools disappoint is silent non-conformance: you ask for {title, price} and get {name, cost}. So each page’s result is validated against your schema; if it doesn’t validate, that page is retried once with the specific validation errors fed back, and if it still doesn’t conform, schemaValid comes back false rather than pretending. I chose validate-and-retry over a provider’s native structured-output mode because callers send arbitrary JSON Schemas, and native enforcement across providers chokes on the long tail of schema features. Prompt the schema, check the result, retry on failure: that works for any schema.
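The prompt, check, retry-once shape can be sketched as follows. This is an illustration under stated assumptions, not the Actor’s code: call_model is a hypothetical callable standing in for the model request, and validation_errors is a deliberately tiny stand-in that checks only top-level required fields and primitive types, where a real implementation would use a full JSON Schema validator.

```python
from typing import Any, Callable, Optional

TYPE_MAP = {
    "string": str, "number": (int, float), "integer": int,
    "boolean": bool, "object": dict, "array": list,
}


def _type_ok(value: Any, type_name: Optional[str]) -> bool:
    expected = TYPE_MAP.get(type_name)
    if expected is None:
        return True  # absent or unsupported type keyword: not checked in this sketch
    if isinstance(value, bool) and type_name != "boolean":
        return False  # bool is a subclass of int in Python, so reject it explicitly
    return isinstance(value, expected)


def validation_errors(data: Any, schema: dict) -> list[str]:
    """Stand-in validator: top-level `required` and `type` only."""
    if not isinstance(data, dict):
        return ["result is not an object"]
    errors = [
        f"missing required field '{key}'"
        for key in schema.get("required", [])
        if key not in data
    ]
    for key, spec in schema.get("properties", {}).items():
        if key in data and not _type_ok(data[key], spec.get("type")):
            errors.append(f"field '{key}' should be of type {spec.get('type')}")
    return errors


def extract_with_retry(
    page: str,
    schema: dict,
    call_model: Callable[[str, dict, Optional[list[str]]], Any],
) -> dict:
    """Extract once; on a validation failure, retry once with the errors fed back."""
    result = call_model(page, schema, None)
    errors = validation_errors(result, schema)
    if errors:
        result = call_model(page, schema, errors)
        errors = validation_errors(result, schema)
    # Report conformance honestly instead of pretending the retry worked.
    return {"result": result, "schemaValid": not errors}
```

Because the check runs outside the model, the same loop works for any schema the validator understands, regardless of which provider sits behind call_model.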

Stealth, because degraded pages are the silent failure

Pages behind bot detection sometimes serve degraded content to suspected automation: wrong prices, missing stock. Routing the fetch through a stealth path with residential-proxy escalation means what’s extracted is the real page, not a bot-degraded copy, and the escalation happens only when a page actually fights back.

How to use it

A realistic call: pull the same fields from a batch of product pages, validated against one schema.

{
  "urls": [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
  ],
  "outputSchema": {
    "type": "object",
    "properties": { "title": { "type": "string" }, "price": { "type": "string" } },
    "required": ["title", "price"]
  }
}

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_TOKEN")
# Start the Actor and wait for the run to finish.
run = client.actor("shelvick/structured-extractor").call(
    run_input={
        "urls": [
            "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        ],
        # One shared schema for every URL in the batch.
        "outputSchema": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "price": {"type": "string"}},
            "required": ["title", "price"],
        },
    }
)
# One dataset record per input URL; check status and schemaValid before trusting result.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["status"], item["schemaValid"], item["result"])

One record comes back per URL:

{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "status": "completed",
  "result": { "title": "A Light in the Attic", "price": "£51.77" },
  "schemaValid": true,
  "tokensUsed": 4200,
  "error": null
}

Omit outputSchema and each page returns best-effort JSON. From an MCP-enabled agent, the Actor appears as a tool on mcp.apify.com, and the input schema is self-documenting. Payment is per call, via x402 USDC on Base or Skyfire managed tokens. Keep maxRuntimeSecs at its default for synchronous/x402 calls and let an oversized batch defer its tail, or use the async path for large batches.
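On the caller side, handling a batch usually comes down to sorting records by status and schemaValid. The helper below is a hypothetical sketch (split_records is not part of any client library) that relies only on the record fields shown in the example record.

```python
def split_records(records: list[dict]) -> tuple[list[dict], list[dict], list[str]]:
    """Sort batch records into usable results, records to inspect, and URLs to retry."""
    usable, needs_review, retry_urls = [], [], []
    for rec in records:
        if rec["status"] == "deferred":
            # Never reached before the runtime budget ran out: resubmit as-is.
            retry_urls.append(rec["url"])
        elif rec["status"] == "completed" and rec.get("schemaValid"):
            usable.append(rec)
        else:
            # Failed pages, or completed pages whose JSON did not match the schema.
            needs_review.append(rec)
    return usable, needs_review, retry_urls
```

The retry_urls list can be passed back as the urls input of a follow-up run with the same outputSchema.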

How it compares to alternatives

Approach	Stealth fetch	Structured to your schema	Conformance validated	Batch fetch amortization
Raw stealth fetcher	Yes	No — raw HTML/markdown	No	Depends
Model call on your own fetched HTML	No — you fetch	Yes	Usually not	No
Browser automation + hand-written selectors	Yes	Yes — you script it	Manual	You build it
Structured Extractor	Yes	Yes — JSON Schema in	Yes, per-page retry	Yes — one fetch per batch

The raw-fetch and roll-your-own-model approaches each solve half the problem; hand-written selectors solve both but cost maintenance every time a site’s markup shifts. This Actor is the intersection, applied across a batch so the expensive fetch setup is paid once.

Pricing model

Pay-per-event, billed only on success, and metered to page size. A successfully extracted URL carries two charges, fired after that URL’s record is pushed: a flat per-URL base (the fetch, extraction setup, validation, and any per-page retry) and a per-1,000-input-token charge that scales with how much page content the model actually read. failed and deferred URLs trigger neither, so a batch costs only the URLs it actually extracted — and an oversized batch that defers its tail only bills for the pages it finished.
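The two-part charge is simple arithmetic, sketched below. The rates in the usage example are placeholders, not the real prices. The sketch also assumes that every record with status completed is billable and that the record’s tokensUsed field is the metered input-token count; neither detail is confirmed here.

```python
def estimate_charge(
    records: list[dict],
    base_per_url_usd: float,
    per_1k_input_tokens_usd: float,
) -> float:
    """Sum the flat base plus the metered token charge over completed records."""
    total = 0.0
    for rec in records:
        if rec["status"] != "completed":
            continue  # failed and deferred URLs trigger neither charge
        total += base_per_url_usd
        total += rec["tokensUsed"] / 1000 * per_1k_input_tokens_usd
    return total


# Placeholder rates for illustration only.
example = [
    {"url": "https://example.com/a", "status": "completed", "tokensUsed": 4200},
    {"url": "https://example.com/b", "status": "deferred", "tokensUsed": 0},
]
estimated = estimate_charge(example, base_per_url_usd=0.01, per_1k_input_tokens_usd=0.002)
```

With these placeholder rates, only the first record is charged: the flat base plus 4.2 thousand-token units.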

The two-part shape is deliberate, and it took a second pass to get right. The first cut was a single flat per-URL price. But per-URL cost is almost entirely model input tokens, and that ranges more than tenfold across the page sizes the input budget allows: a small page and a near-cap page are not remotely the same unit of work. A flat price has to be set high enough to cover the large pages, which means small pages overpay and the heaviest pages still flirt with negative margin. Metering the variable layer directly fixes both ends at once: small pages get cheaper, large pages pay their real cost, and there’s no page size at which the economics invert. It also lines the bill up with the one lever a caller actually controls, maxInputTokens, so cost is something you tune, not something that surprises you. The tradeoff is that the exact price of a run isn’t known until the pages are fetched; maxTotalChargeUsd caps that uncertainty, and the per-record token count (tokensUsed in the example record above) makes every charge auditable after the fact.

Current per-event rates are on the Apify Store Pricing tab.

Open questions / future work

Concurrent extraction. Pages are currently extracted in order under the runtime budget. Running the per-page model calls with bounded concurrency would fit more URLs under the 300-second cap before the tail defers — the obvious next throughput lever.
Cheaper retries. A conformance retry re-sends the page; sending only the prior invalid JSON, the schema, and the errors would cut the retry’s cost sharply on large pages. The reason it re-sends today is correctness on “missing field” failures; measuring how often that helps comes first.
Heterogeneous batches. One schema per run is a deliberate constraint. Per-URL schemas, or auto-grouping a mixed batch by page type, is conceivable but would muddy the cost story — not planned.
Document extraction. The same validate-and-retry shape would work on PDFs and images. Different fetch path, multimodal model — a separate tool, not a mode here.
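For the concurrent-extraction item, bounded concurrency is commonly expressed with a semaphore. The sketch below shows that general pattern only; it is not a description of how the Actor works or will work. extract_one is a hypothetical coroutine standing in for the per-page model call, and max_concurrency is an assumed parameter name.

```python
import asyncio
from typing import Awaitable, Callable


async def extract_all(
    urls: list[str],
    extract_one: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 4,
) -> list[dict]:
    """Run per-page extractions with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> dict:
        # The semaphore admits only max_concurrency callers at a time.
        async with semaphore:
            return await extract_one(url)

    # gather returns results in input order, preserving one record per URL.
    return await asyncio.gather(*(guarded(url) for url in urls))
```

A real version would still need the runtime-budget check, so that pages not started before the deadline come back deferred rather than queued.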

Published on the Apify Store as structured-extractor.