Apify Discoverability Audit

Design notes for an Actor that grades Apify Actors against an evidence-grounded rubric for AI-agent discoverability.

Design notes for apify-discoverability-audit, an Apify Actor that scores another Apify Actor against an evidence-grounded rubric for AI-agent discoverability. The audit fetches the target Actor’s Store metadata, latest build (full README, input schema, dataset schema), and Store row (pricing model, agentic-payment whitelist flag), then runs eight deterministic check dimensions and returns a punch list of failed checks plus LLM-written suggestion text for everything that didn’t pass.

The Apify Store page covers the schema, current pricing, and how to try it. This page is for the design questions — why the rubric looks the way it does, why every status verdict is deterministic, and what the Actor deliberately refuses to grade.

What this is

Submit an Actor identifier (username/name or a 17-character Actor ID). Optionally pass a canonical-page URL (for off-platform cross-link checks). The audit returns one dataset record with:

  • An overall verdict (PASS / WARN / FAIL), derived from the worst-category status across eight rubric dimensions.
  • A per-dimension category summary — one status per dimension (agentic-payment eligibility, description quality, schema completeness, disambiguation, tool name, README structure, store hygiene, off-platform canonical signals).
  • The full punch list — every individual check with its status, a short evidence line, and the dimension it belongs to.
  • Suggestions — one copy-paste-ready improvement text per failed or warned check, written to the rubric’s concepts (six-component description framework, the five-hundred-character MCP truncation, the thirteen-section README template, the disambiguation block, PAY_PER_EVENT pricing).
  • Warnings — any data-collection issues (private Actor, build endpoint outage, unreachable canonical page) surfaced separately from rubric verdicts.

The same record also lands as a readable Markdown punch list in the run’s Key-Value Store, served as text/plain so the bytes return verbatim (the KVS serve layer would otherwise inject a cookie banner script into HTML responses).
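The worst-category rollup behind the overall verdict can be sketched in a few lines. This is a minimal illustration, not the Actor's actual code: the `Status` enum, the `overall_verdict` helper, and the dimension keys are hypothetical names chosen for the example.

```python
from enum import IntEnum


class Status(IntEnum):
    """Ordered so that a larger value means a worse outcome."""
    PASS = 0
    WARN = 1
    FAIL = 2


def overall_verdict(category_statuses: dict[str, Status]) -> Status:
    """Worst-category-wins: the overall verdict is the worst dimension status."""
    return max(category_statuses.values())


# Hypothetical per-dimension summary for one audited Actor.
categories = {
    "description_quality": Status.PASS,
    "schema_completeness": Status.WARN,
    "disambiguation": Status.FAIL,
}
print(overall_verdict(categories).name)  # FAIL
```

Because the statuses are ordered, a single FAIL in any dimension is enough to set the overall verdict to FAIL.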

Why this exists

AI agents pick tools by reading their descriptions and schemas, not by browsing to a documentation site. The Apify MCP server truncates input-schema property descriptions at five hundred characters before appending enum values and examples — tools whose first five hundred characters don’t pre-load purpose, parameters, and constraints fail at the routing layer. Tools that lack explicit disambiguation guidance (“use this other tool instead for that other need”) lose to similar-sounding alternatives by default. Tools that aren’t on the PAY_PER_EVENT pricing model with isWhiteListedForAgenticPayments=true are invisible to agents using x402 or Skyfire payment rails.

There’s a body of recent research on this. Wang et al. 2026 (“From Docs to Descriptions”, arXiv 2602.18914) measured a two-hundred-sixty-percent lift in tool selection from standards-compliant descriptions. “Tool Preferences in Agentic LLMs” (arXiv 2505.18135) found usage swings of more than ten times from description edits alone, across multiple model families. BiasBusters (arXiv 2510.00307) identified semantic alignment as the dominant tool-selection driver. Anthropic’s engineering guidance points at description quality and explicit disambiguation as the highest-impact surfaces for tool authors.

The rubric this Actor scores is the executable form of that evidence. Eight dimensions, weighted by what the research says matters; deterministic checks for every status verdict; LLM only for writing the suggestion prose.

Why every status verdict is deterministic

Tool-selection auditing is easy to do badly. The naive shape is “hand the Actor’s metadata to an LLM and ask if it’s good.” That works for a single pass but it’s unreproducible (the LLM picks different verdicts each run), unauditable (you can’t trace why a given dimension got a given status), and trivially fooled (a slick top-level description can mask a half-empty input schema).

This Actor runs every check as a pure function against the fetched metadata. The agentic-payment eligibility check reads pricingInfos[-1].pricingModel. The MCP-truncation check measures len(property.description) > 500. The disambiguation check applies a regex against the “What this doesn’t do” section heading and its body. The build-validator-gotcha check looks for items.enum on array properties (which Apify’s build validator hard-rejects).
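Two of those checks, sketched as pure functions over a parsed input-schema property. This is an illustration under stated assumptions: the `CheckResult` shape, the check IDs, and the choice of FAIL (rather than WARN) for each condition are hypothetical, and only the 500-character limit and the `items.enum` rule come from the notes above.

```python
from dataclasses import dataclass

MCP_DESCRIPTION_LIMIT = 500  # truncation point described above


@dataclass(frozen=True)
class CheckResult:
    check_id: str   # hypothetical ID format
    status: str     # "PASS", "WARN", or "FAIL"
    evidence: str   # short, human-readable reason


def check_mcp_truncation(name: str, prop: dict) -> CheckResult:
    """Flag property descriptions longer than the MCP truncation limit."""
    length = len(prop.get("description", ""))
    status = "FAIL" if length > MCP_DESCRIPTION_LIMIT else "PASS"
    return CheckResult(f"mcp_truncation:{name}", status,
                       f"description is {length} chars")


def check_items_enum(name: str, prop: dict) -> CheckResult:
    """Flag items.enum on array properties (the build-validator gotcha)."""
    items = prop.get("items")
    bad = (prop.get("type") == "array"
           and isinstance(items, dict) and "enum" in items)
    evidence = "items.enum present" if bad else "no items.enum"
    return CheckResult(f"items_enum:{name}", "FAIL" if bad else "PASS", evidence)


# Hypothetical property that trips both checks.
prop = {"type": "array", "description": "x" * 501,
        "items": {"type": "string", "enum": ["a", "b"]}}
for check in (check_mcp_truncation, check_items_enum):
    print(check("urls", prop))
```

Neither function touches the network or any shared state, so the same metadata always yields the same result and each verdict carries its own evidence line.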

The LLM gets invoked once per audit, and only to synthesize copy-paste-ready suggestion prose for already-failed checks. It receives the list of failed check IDs, the target Actor’s description and README excerpt (wrapped in NO_EXECUTE injection-protection tags), and the suggestion-seed hints from the checks. It returns a Pydantic-validated SuggestionsResponse structured-output object, which the pipeline then filters to drop any suggestion whose check_id isn’t in the actually-failed set. The model can’t invent dimensions, change verdicts, or score subjective categories — and the audit’s output is reproducible run to run.
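The post-LLM filter is the piece that keeps the model from inventing findings. A minimal sketch follows, using a standard-library dataclass as a stand-in for the Pydantic model; the `Suggestion` class, its `text` field, and the check IDs are hypothetical, and only `check_id` is named in the notes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    check_id: str
    text: str  # hypothetical field name for the suggestion prose


def filter_suggestions(suggestions: list[Suggestion],
                       failed_check_ids: set[str]) -> list[Suggestion]:
    """Keep only suggestions that map to a check that actually failed."""
    return [s for s in suggestions if s.check_id in failed_check_ids]


failed = {"mcp_truncation:urls"}
llm_output = [
    Suggestion("mcp_truncation:urls", "Trim the description to 500 chars."),
    Suggestion("made_up_dimension", "Add more emojis."),  # not a real check
]
kept = filter_suggestions(llm_output, failed)
print([s.check_id for s in kept])  # ['mcp_truncation:urls']
```

The set of failed check IDs is computed before the LLM call, so the filter depends only on deterministic output.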

How the rubric is weighted

Not all eight dimensions matter equally. The research is clear on which ones move the needle:

| Dimension | Weight | What it measures |
| --- | --- | --- |
| Agentic-payment eligibility | Hard qualifier | PAY_PER_EVENT pricing model and isWhiteListedForAgenticPayments |
| Description quality | Highest | Six-component framework on the top-level description, every input property, every dataset field. Length within 500-char MCP truncation. No implementation-name leakage. |
| Schema completeness | High | Input schema present, every property has a type, examples or defaults, no items.enum on arrays |
| Disambiguation | High | README contains “What this doesn’t do” with explicit “use [X] instead for [Y]” prose |
| Tool name | High | Lowercase, hyphens not underscores, descriptive |
| README structure | Medium | Thirteen-section template, H1 present, “Calling from an AI agent” section, pricing-section generality |
| Store hygiene | Medium | Categories declared, seoTitle and seoDescription set in their bands |
| Off-platform canonical signals | Low (hygiene only) | Canonical page exists and links back to the Apify Store |

The “low weight, hygiene only” framing on canonical signals is deliberate. JSON-LD and llms.txt cost nothing to add, but Search Atlas’s December 2025 three-hundred-thousand-domain LLM-citation study found a null effect on citations, and Limy’s 2025 ninety-day bot-traffic log study found llms.txt fetched in less than one percent of bot visits. The audit emits informational checks for those signals without scoring them; the work-to-payoff ratio doesn’t justify treating them as anything more.

Why the agentic-payment check has tri-state semantics

When an audit can’t fetch the target’s pricing information — most commonly because the Actor is unpublished or its Store row hasn’t indexed yet — the agentic-payment dimension reports WARN, not FAIL, and the agentic_payment_eligible field comes back as null rather than false. “Unconfirmed” is genuinely different from “confirmed non-PAY_PER_EVENT”, and propagating an unknown as a hard fail is the worst kind of misleading audit output.
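The tri-state logic maps naturally onto `Optional[bool]`. The sketch below is illustrative only: the function names are hypothetical, and treating a known non-PAY_PER_EVENT model as a definite `False` even when the whitelist flag is unknown is an assumption of this example, not something the notes above specify.

```python
from typing import Optional


def agentic_payment_eligible(pricing_model: Optional[str],
                             whitelisted: Optional[bool]) -> Optional[bool]:
    """Return True/False when confirmed, None when the answer is unknown."""
    if pricing_model is None:
        return None                  # pricing could not be fetched
    if pricing_model != "PAY_PER_EVENT":
        return False                 # confirmed ineligible (assumption: flag irrelevant)
    return whitelisted               # may still be None if the flag is unknown


def agentic_payment_status(eligible: Optional[bool]) -> str:
    """Unknown becomes WARN; only a confirmed negative becomes FAIL."""
    if eligible is None:
        return "WARN"
    return "PASS" if eligible else "FAIL"


for model, flag in [(None, None), ("FLAT_PRICE_PER_MONTH", None),
                    ("PAY_PER_EVENT", True)]:
    eligible = agentic_payment_eligible(model, flag)
    print(model, eligible, agentic_payment_status(eligible))
```

Checking `eligible is None` before the truthiness test matters: a plain `if not eligible` would collapse the unknown case into FAIL.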

Two paths feed the pricing lookup. The first is the authenticated Actor-object endpoint (/v2/acts/<slug>?token=...), which returns the full pricingInfos[] history whenever you own the target Actor — works for private Actors pre-publish, works for newly-published Actors whose Store row hasn’t indexed yet. The fallback is the public Store search (/v2/store?search=<name>), which works for any other published Actor but is fuzzy and popularity-ranked enough that recently-published Actors sometimes return zero hits. The audit tries both before reporting WARN; the warning makes the cause explicit.
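The try-both-then-warn flow can be sketched without committing to any HTTP details. In this illustration the fetchers are hypothetical callables injected by the caller (one would wrap the authenticated Actor-object endpoint, the other the public Store search); each returns the pricing model string or `None`. The function name and the warning wording are also assumptions.

```python
from typing import Callable, Optional

# Hypothetical fetcher type: Actor identifier in, pricing model (or None) out.
PricingFetcher = Callable[[str], Optional[str]]


def lookup_pricing_model(
    actor: str,
    sources: list[tuple[str, PricingFetcher]],
) -> tuple[Optional[str], list[str]]:
    """Try each source in order; collect a warning for every miss."""
    warnings: list[str] = []
    for label, fetch in sources:
        try:
            model = fetch(actor)
        except Exception as exc:  # a failed source should not abort the audit
            warnings.append(f"{label} failed for {actor}: {exc}")
            continue
        if model is not None:
            return model, warnings
        warnings.append(f"{label} returned no pricing info for {actor}")
    return None, warnings  # caller reports WARN, not FAIL


# Stub fetchers standing in for the two real lookups.
sources = [
    ("actor endpoint", lambda actor: None),            # e.g. not the owner
    ("store search", lambda actor: "PAY_PER_EVENT"),   # public fallback hit
]
print(lookup_pricing_model("someuser/some-actor", sources))
```

Returning the accumulated warnings alongside the result lets the final record say which lookup missed and why.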

How it compares to alternatives

| Approach | Description-quality rubric | Schema-completeness checks | Apify build-validator gotchas | MCP-truncation awareness | Disambiguation check | Suggestion text |
| --- | --- | --- | --- | --- | --- | --- |
| Eyeballing your Actor against the docs | no | partial | no | no | no | no |
| Generic JSON-Schema linter | no | partial | no | no | no | no |
| Asking an LLM “is my Actor discoverable” with the README pasted | partial | no | no | no | partial | partial |
| Apify Discoverability Audit | yes | yes | yes | yes | yes | yes |

The differentiating axis is deterministic-first. Every status verdict comes from a pure function against the fetched metadata, so the audit reads what an MCP-mediated agent would actually read, and grades it against the same rubric a writing tool uses to author it.

For one-shot improvement of an existing Actor’s customer-facing copy, a writing skill that writes against the same standards is the better tool. For ongoing tracking of a portfolio’s discoverability over time, run this audit on a cron and diff the dataset records. For competitive landscape analysis against other published Actors in the same niche, use a dedicated Store-search workflow. For deep canonical-page rewrites, the off-platform canonical-signals dimension here only checks cross-link hygiene — actually rewriting a canonical page is a different job.
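For the cron-and-diff workflow, the comparison step can be as small as the sketch below. It assumes each audit record has already been reduced to a hypothetical `check_id -> status` mapping; the real dataset record has more fields, and the helper name is illustrative.

```python
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def find_regressions(previous: dict[str, str],
                     current: dict[str, str]) -> list[str]:
    """List checks whose status got worse between two audit runs."""
    regressions = []
    for check_id, status in sorted(current.items()):
        before = previous.get(check_id)
        # Checks that are new in the current run are not counted as regressions.
        if before is not None and SEVERITY[status] > SEVERITY[before]:
            regressions.append(f"{check_id}: {before} -> {status}")
    return regressions


last_week = {"tool_name": "PASS", "readme_h1": "WARN", "seo_title": "FAIL"}
today = {"tool_name": "WARN", "readme_h1": "PASS", "seo_title": "FAIL"}
print(find_regressions(last_week, today))  # ['tool_name: PASS -> WARN']
```

A scheduled job could exit non-zero whenever the returned list is non-empty.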

Pricing model

Pay-per-event, billed only on success. A single flat charge per successful audit, after the dataset record is pushed. Setting include_suggestions=false skips the LLM call but the audit charge is the same — the deterministic punch list is the costed work.

The platform-managed apify-actor-start event fires once per run at the platform minimum, so per-run overhead is effectively zero. Failed runs (invalid input, target Actor unreachable) don’t trigger the audit charge.

For the current per-event rate and any active subscriber discounts, see the Pricing tab on the Apify Store page.

Open questions / future work

A few things worth watching:

  • Batch input. Right now the Actor takes one target per run. A batch mode (actors: [...]) would make portfolio sweeps cheaper, but introduces a different cost model and the need for per-target failure isolation — non-trivial enough to wait until there’s real demand.
  • Auto-fixer mode. The audit already writes suggestion prose; a future variant could optionally push the suggested rewrites back into the target Actor’s source files via apify push. Tempting but risky — agents auto-rewriting their own descriptions is one bad merge away from a degenerate fixed point.
  • Per-dimension weighting in the verdict. The current verdict is a worst-category-wins rollup. Some users may want a weighted-sum score where a single weak dimension doesn’t drag the whole verdict to FAIL. The Wang et al. 2026 paper has weights I could lift directly; worth doing once the rubric stabilizes against real usage.
  • Heuristic improvements for the disambiguation regex. “use [X] instead for [Y]” is detected by three regex variants today. False positives are easy to imagine (use the input field…); a small classifier trained on real examples would be more robust without losing transparency.
  • Audit the audit. A self-recursive run on the audit Actor itself is already part of the dogfood path, but a periodic CI job that runs the audit against the latest published build and fails on regressions would catch description-quality drift across edits — exactly the use case the Actor was built for, applied to itself.