Apify Discoverability Audit
Design notes for apify-discoverability-audit, an Apify Actor that scores another Apify Actor against an evidence-grounded rubric for AI-agent discoverability. The audit fetches the target Actor’s Store metadata, latest build (full README, input schema, dataset schema), and Store row (pricing model, agentic-payment whitelist flag), then runs eight deterministic check dimensions and returns a punch list of failed checks plus LLM-written suggestion text for everything that didn’t pass.
The Apify Store page covers the schema, current pricing, and how to try it. This page is for the design questions — why the rubric looks the way it does, why every status verdict is deterministic, and what the Actor deliberately refuses to grade.
What this is
Submit an Actor identifier (username/name or a 17-character Actor ID). Optionally pass a canonical-page URL (for off-platform cross-link checks). The audit returns one dataset record with:
- An overall verdict (PASS/WARN/FAIL), derived from the worst-category status across eight rubric dimensions.
- A per-dimension category summary: one status per dimension (agentic-payment eligibility, description quality, schema completeness, disambiguation, tool name, README structure, store hygiene, off-platform canonical signals).
- The full punch list: every individual check with its status, a short evidence line, and the dimension it belongs to.
- Suggestions: one copy-paste-ready improvement text per failed or warned check, written to the rubric’s concepts (six-component description framework, the five-hundred-character MCP truncation, the thirteen-section README template, the disambiguation block, PAY_PER_EVENT pricing).
- Warnings: any data-collection issues (private Actor, build endpoint outage, unreachable canonical page), surfaced separately from rubric verdicts.
The same record also lands as a readable Markdown punch list in the run’s Key-Value Store, served as text/plain so the bytes return verbatim (the KVS serve layer would otherwise inject a cookie banner script into HTML responses).
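The worst-category rollup behind the overall verdict can be sketched in a few lines. This is a minimal illustration, not the Actor's source: the Status enum, the overall_verdict helper, and the dimension keys are hypothetical names chosen for the example.

```python
from enum import IntEnum


class Status(IntEnum):
    """Ordered so that a larger value is a worse outcome."""
    PASS = 0
    WARN = 1
    FAIL = 2


def overall_verdict(categories: dict[str, Status]) -> Status:
    """Return the worst status found across all dimensions."""
    if not categories:
        raise ValueError("at least one dimension is required")
    return max(categories.values())


# Hypothetical per-dimension summary (only four of the eight shown).
categories = {
    "agentic_payment": Status.PASS,
    "description_quality": Status.WARN,
    "schema_completeness": Status.PASS,
    "disambiguation": Status.FAIL,
}
print(overall_verdict(categories).name)  # FAIL
```

Because the statuses are ordered, a single FAIL in any dimension is enough to make the overall verdict FAIL, which is the behaviour the record describes.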
Why this exists
AI agents pick tools by reading their descriptions and schemas, not by browsing to a documentation site. The Apify MCP server truncates input-schema property descriptions at five hundred characters before appending enum values and examples — tools whose first five hundred characters don’t pre-load purpose, parameters, and constraints fail at the routing layer. Tools that lack explicit disambiguation guidance (“use this other tool instead for that other need”) lose to similar-sounding alternatives by default. Tools that aren’t on the PAY_PER_EVENT pricing model with isWhiteListedForAgenticPayments=true are invisible to agents using x402 or Skyfire payment rails.
There’s a body of recent research on this. Wang et al. 2026 (“From Docs to Descriptions”, arXiv 2602.18914) measured a two-hundred-sixty-percent lift in tool selection from standards-compliant descriptions. “Tool Preferences in Agentic LLMs” (arXiv 2505.18135) found usage swings of more than ten times from description edits alone, across multiple model families. BiasBusters (arXiv 2510.00307) identified semantic alignment as the dominant tool-selection driver. Anthropic’s engineering guidance points at description quality and explicit disambiguation as the highest-impact surfaces for tool authors.
The rubric this Actor scores is the executable form of that evidence. Eight dimensions, weighted by what the research says matters; deterministic checks for every status verdict; LLM only for writing the suggestion prose.
Why every status verdict is deterministic
Tool-selection auditing is easy to do badly. The naive shape is “hand the Actor’s metadata to an LLM and ask if it’s good.” That works for a single pass but it’s unreproducible (the LLM picks different verdicts each run), unauditable (you can’t trace why a given dimension got a given status), and trivially fooled (a slick top-level description can mask a half-empty input schema).
This Actor runs every check as a pure function against the fetched metadata. The agentic-payment eligibility check reads pricingInfos[-1].pricingModel. The MCP-truncation check measures len(property.description) > 500. The disambiguation check applies a regex against the “What this doesn’t do” section heading and its body. The build-validator-gotcha check looks for items.enum on array properties (which Apify’s build validator hard-rejects).
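The four checks named above might look roughly like this as pure functions over already-fetched metadata. It is a sketch under assumptions: the function names are hypothetical, the README is assumed to use Markdown `#` headings, and the two regular expressions are stand-ins, since the Actor's actual patterns are not given here.

```python
import re

MCP_DESCRIPTION_LIMIT = 500  # truncation point cited in the text


def check_pricing_model(actor: dict) -> bool:
    """True when the most recent pricing entry is PAY_PER_EVENT."""
    infos = actor.get("pricingInfos") or []
    return bool(infos) and infos[-1].get("pricingModel") == "PAY_PER_EVENT"


def overlong_descriptions(input_schema: dict) -> list[str]:
    """Names of properties whose description exceeds the MCP limit."""
    props = input_schema.get("properties", {})
    return [
        name
        for name, prop in props.items()
        if len(prop.get("description", "")) > MCP_DESCRIPTION_LIMIT
    ]


def arrays_with_items_enum(input_schema: dict) -> list[str]:
    """Names of array properties that declare items.enum."""
    props = input_schema.get("properties", {})
    return [
        name
        for name, prop in props.items()
        if prop.get("type") == "array" and "enum" in (prop.get("items") or {})
    ]


# Hypothetical patterns, used only for illustration.
HEADING = re.compile(
    r"^#{1,6}[ \t]*What this doesn[’']t do[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
USE_INSTEAD = re.compile(r"\buse\b.+?\binstead\b", re.IGNORECASE)


def has_disambiguation_block(readme: str) -> bool:
    """True when the section exists and its body says 'use X instead'."""
    match = HEADING.search(readme)
    if match is None:
        return False
    body = readme[match.end():]
    next_heading = re.search(r"^#{1,6}\s", body, re.MULTILINE)
    if next_heading:
        body = body[: next_heading.start()]  # stop at the next section
    return USE_INSTEAD.search(body) is not None
```

Each function takes plain dictionaries or strings and returns a value that depends only on its input, which is what makes a verdict reproducible and traceable to a specific piece of evidence.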
The LLM gets invoked once per audit, and only to synthesize copy-paste-ready suggestion prose for already-failed checks. It receives the list of failed check IDs, the target Actor’s description and README excerpt (wrapped in NO_EXECUTE injection-protection tags), and the suggestion-seed hints from the checks. It returns a Pydantic-validated SuggestionsResponse structured-output object, which the pipeline then filters to drop any suggestion whose check_id isn’t in the actually-failed set. The model can’t invent dimensions, change verdicts, or score subjective categories — and the audit’s output is reproducible run to run.
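The post-filter step is small enough to sketch. The example below uses a plain dataclass as a stand-in for the Pydantic model, and the check IDs, the wrap_untrusted helper, and the exact spelling of the NO_EXECUTE tags are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    check_id: str
    text: str


def wrap_untrusted(text: str) -> str:
    """Mark target-authored text as data, not instructions (tag form assumed)."""
    return f"<NO_EXECUTE>\n{text}\n</NO_EXECUTE>"


def keep_grounded(suggestions: list[Suggestion], failed_ids: set[str]) -> list[Suggestion]:
    """Drop any suggestion that does not map to a check that actually failed."""
    return [s for s in suggestions if s.check_id in failed_ids]


# Hypothetical check IDs: the second one was never in the failed set.
failed = {"description.length", "readme.disambiguation"}
returned = [
    Suggestion("description.length", "Trim the description to fit the limit."),
    Suggestion("branding.logo", "Add a nicer logo."),
]
print([s.check_id for s in keep_grounded(returned, failed)])  # ['description.length']
```

The filter runs after the model responds, so even a model that ignores its instructions cannot add a check the deterministic pass did not flag.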
How the rubric is weighted
Not all eight dimensions matter equally. The research is clear on which ones move the needle:
| Dimension | Weight | What it measures |
|---|---|---|
| Agentic-payment eligibility | Hard qualifier | PAY_PER_EVENT pricing model and isWhiteListedForAgenticPayments |
| Description quality | Highest | Six-component framework on the top-level description, every input property, every dataset field. Length within 500-char MCP truncation. No implementation-name leakage. |
| Schema completeness | High | Input schema present, every property has a type, examples or defaults, no items.enum on arrays |
| Disambiguation | High | README contains “What this doesn’t do” with explicit “use [X] instead for [Y]” prose |
| Tool name | High | Lowercase, hyphens not underscores, descriptive |
| README structure | Medium | Thirteen-section template, H1 present, “Calling from an AI agent” section, pricing-section generality |
| Store hygiene | Medium | Categories declared, seoTitle and seoDescription set in their bands |
| Off-platform canonical signals | Low (hygiene only) | Canonical page exists and links back to the Apify Store |
The “low weight, hygiene only” framing on canonical signals is deliberate. JSON-LD and llms.txt cost nothing to add, but Search Atlas’s December 2025 three-hundred-thousand-domain LLM-citation study found a null effect on citations, and Limy’s 2025 ninety-day bot-traffic log study found that llms.txt was fetched in less than one percent of bot visits. The audit emits informational checks for those signals without scoring them, because the work-to-payoff ratio doesn’t justify treating them as anything more.
Why the agentic-payment check has tri-state semantics
When an audit can’t fetch the target’s pricing information — most commonly because the Actor is unpublished or its Store row hasn’t indexed yet — the agentic-payment dimension reports WARN, not FAIL, and the agentic_payment_eligible field comes back as null rather than false. “Unconfirmed” is genuinely different from “confirmed non-PAY_PER_EVENT”, and propagating an unknown as a hard fail is the worst kind of misleading audit output.
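The tri-state can be modelled with an optional boolean, where None means “could not confirm”. This is a minimal sketch: the helper names are hypothetical, and placing pricingModel and isWhiteListedForAgenticPayments in the same dictionary is an assumption about the fetched data rather than a documented response shape.

```python
from typing import Optional


def eligibility(pricing: Optional[dict]) -> Optional[bool]:
    """None when pricing could not be fetched; otherwise a confirmed bool."""
    if pricing is None:
        return None
    return (
        pricing.get("pricingModel") == "PAY_PER_EVENT"
        and pricing.get("isWhiteListedForAgenticPayments") is True
    )


def agentic_payment_status(eligible: Optional[bool]) -> str:
    """Map the tri-state onto a rubric status."""
    if eligible is None:
        return "WARN"  # unconfirmed is not the same as confirmed ineligible
    return "PASS" if eligible else "FAIL"


print(agentic_payment_status(eligibility(None)))  # WARN
print(agentic_payment_status(eligibility({"pricingModel": "FLAT_PRICE_PER_MONTH"})))  # FAIL
```

Keeping None distinct from False all the way to the output record is what lets a consumer tell a data-collection gap from a real rubric failure.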
Two paths feed the pricing lookup. The first is the authenticated Actor-object endpoint (/v2/acts/<slug>?token=...), which returns the full pricingInfos[] history whenever you own the target Actor — works for private Actors pre-publish, works for newly-published Actors whose Store row hasn’t indexed yet. The fallback is the public Store search (/v2/store?search=<name>), which works for any other published Actor but is fuzzy and popularity-ranked enough that recently-published Actors sometimes return zero hits. The audit tries both before reporting WARN; the warning makes the cause explicit.
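The two-path lookup could be structured as below. Treat it as a sketch under assumptions: the fetch callable is injected (it returns parsed JSON or None on any failure) so the logic stays testable, and the response shapes used here (a top-level data object, Store items carrying username, name, and currentPricingInfo) are assumed rather than taken from this page. The exact-match step also assumes a username/name slug; a 17-character Actor ID would need a different comparison.

```python
from typing import Callable, Optional
from urllib.parse import quote, urlencode

API = "https://api.apify.com"
Fetch = Callable[[str], Optional[dict]]  # parsed JSON, or None on failure


def actor_url(slug: str, token: str) -> str:
    # The API path takes "username~name" in place of "username/name".
    path_id = quote(slug.replace("/", "~"), safe="")
    return f"{API}/v2/acts/{path_id}?{urlencode({'token': token})}"


def store_url(name: str) -> str:
    return f"{API}/v2/store?{urlencode({'search': name})}"


def lookup_pricing(slug: str, token: str, fetch: Fetch) -> tuple[Optional[dict], list[str]]:
    """Try the authenticated Actor object first, then the public Store search."""
    warnings: list[str] = []

    actor = fetch(actor_url(slug, token))
    infos = ((actor or {}).get("data") or {}).get("pricingInfos") or []
    if infos:
        return infos[-1], warnings
    warnings.append("Actor endpoint returned no pricingInfos; trying Store search")

    store = fetch(store_url(slug.split("/")[-1]))
    items = ((store or {}).get("data") or {}).get("items") or []
    for item in items:
        # Search is fuzzy, so accept only an exact username/name match.
        if f"{item.get('username')}/{item.get('name')}" == slug:
            return item.get("currentPricingInfo"), warnings

    warnings.append("Store search returned no exact match; pricing unconfirmed")
    return None, warnings
```

Returning the warnings alongside the result is one way to make the cause of a later WARN explicit instead of leaving the reader to guess which path failed.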
How it compares to alternatives
| Approach | Description-quality rubric | Schema-completeness checks | Apify build-validator gotchas | MCP-truncation awareness | Disambiguation check | Suggestion text |
|---|---|---|---|---|---|---|
| Eyeballing your Actor against the docs | no | partial | no | no | no | no |
| Generic JSON-Schema linter | no | partial | no | no | no | no |
| Asking an LLM “is my Actor discoverable” with the README pasted | partial | no | no | no | partial | partial |
| Apify Discoverability Audit | yes | yes | yes | yes | yes | yes |
The differentiating axis is deterministic-first. Every status verdict comes from a pure function against the fetched metadata, so the audit reads what an MCP-mediated agent would actually read, and grades it against the same rubric a writing tool uses to author it.
For one-shot improvement of an existing Actor’s customer-facing copy, a writing skill that writes against the same standards is the better tool. For ongoing tracking of a portfolio’s discoverability over time, run this audit on a cron and diff the dataset records. For competitive landscape analysis against other published Actors in the same niche, use a dedicated Store-search workflow. For deep canonical-page rewrites, the off-platform canonical-signals dimension here only checks cross-link hygiene — actually rewriting a canonical page is a different job.
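For the cron-and-diff workflow, comparing two per-dimension summaries is enough to spot regressions. The sketch below assumes each record's category summary has been extracted into a plain mapping of dimension name to status string; the regressions helper and the dimension keys are hypothetical.

```python
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def regressions(previous: dict[str, str], current: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return dimensions whose status got worse, as {dimension: (old, new)}."""
    worse = {}
    for dimension, new in current.items():
        old = previous.get(dimension)
        if old is not None and SEVERITY[new] > SEVERITY[old]:
            worse[dimension] = (old, new)
    return worse


last_week = {"description_quality": "PASS", "disambiguation": "WARN", "tool_name": "PASS"}
today = {"description_quality": "WARN", "disambiguation": "PASS", "tool_name": "PASS"}
print(regressions(last_week, today))  # {'description_quality': ('PASS', 'WARN')}
```

Improvements and unchanged dimensions are left out on purpose, so an empty result means the scheduled run found nothing worse than the previous one.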
Pricing model
Pay-per-event, billed only on success. A single flat charge per successful audit, after the dataset record is pushed. Setting include_suggestions=false skips the LLM call but the audit charge is the same — the deterministic punch list is the costed work.
The platform-managed apify-actor-start event fires once per run at the platform minimum, so per-run overhead is effectively zero. Failed runs (invalid input, target Actor unreachable) don’t trigger the audit charge.
For the current per-event rate and any active subscriber discounts, see the Pricing tab on the Apify Store page.
Open questions / future work
A few things worth watching:
- Batch input. Right now the Actor takes one target per run. A batch mode (actors: [...]) would make portfolio sweeps cheaper, but it introduces a different cost model and the need for per-target failure isolation, which is non-trivial enough to wait until there’s real demand.
- Auto-fixer mode. The audit already writes suggestion prose; a future variant could optionally push the suggested rewrites back into the target Actor’s source files via apify push. Tempting but risky: agents auto-rewriting their own descriptions is one bad merge away from a degenerate fixed point.
- Per-dimension weighting in the verdict. The current verdict is a worst-category-wins rollup. Some users may want a weighted-sum score where a single weak dimension doesn’t drag the whole verdict to FAIL. The Wang et al. 2026 paper has weights I could lift directly; worth doing once the rubric stabilizes against real usage.
- Heuristic improvements for the disambiguation regex. “use [X] instead for [Y]” is detected by three regex variants today. False positives are easy to imagine (use the input field…); a small classifier trained on real examples would be more robust without losing transparency.
- Audit the audit. A self-recursive run on the audit Actor itself is already part of the dogfood path, but a periodic CI job that runs the audit against the latest published build and fails on regressions would catch description-quality drift across edits. That is exactly the use case the Actor was built for, applied to itself.