Website Tech Stack Detector — Technographics by Domain

Design notes for a deterministic Actor that turns a list of company domains into normalized, evidence-backed technology profiles for AI agents.

Design notes for website-tech-stack-detector, a deterministic technographics Actor on the Apify Store. Give it a list of company domains; it fetches each site’s homepage, matches the markup against a signature database, and returns one clean JSON record per domain listing the technologies that site runs — the CMS, the ecommerce platform, the analytics and tag managers, the JavaScript framework, the CDN, the payment and marketing tools — each with the concrete evidence that identified it.

The Apify Store page covers the input schema, the current price, and how to try it. This page is for the design questions: why the Actor is shaped this way, what I deliberately left out, and where the real tradeoffs are.

What this is

“Technographics” is the unglamorous but valuable practice of knowing what software a company’s website is built on. It is one of the most reliable B2B sales signals there is: a vendor who sells a Shopify app wants the list of sites running Shopify; an agency wants prospects already paying for a marketing-automation tool they can displace; a competitor-research analyst wants to know whether three rivals all quietly migrated to the same CDN last quarter. The data is sitting in plain sight, since every site declares most of its stack in the markup it serves to any browser, but turning “view source” into a clean, queryable dataset across a list of domains is the work.

This Actor does exactly that work, per domain, and nothing more: fetch the homepage, walk every script and link URL, the generator meta tag, the framework markers, and (when the fetch exposes them) the response headers and cookies, and match them against a curated set of around 95 technology signatures spanning roughly twenty categories. Each match comes back with its category, a confidence level, and the exact string that triggered it. The output is one record per domain: the detected technologies, a rollup grouped by category, and a count for quick scanning.
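As a rough illustration of the fetch-then-match idea, the sketch below collects script and link URLs plus the generator meta tag from raw HTML and checks them against a tiny signature list. Everything here is an assumption made for illustration: the three signatures, the `detect` function, and the "Generator: ..." evidence format are hypothetical and are not the Actor's actual database or code.

```python
from html.parser import HTMLParser

# Hypothetical signatures: (name, category, kind, needle).
# Illustrative only; not the Actor's real signature database.
SIGNATURES = [
    ("Shopify", "Ecommerce", "url", "cdn.shopify.com"),
    ("Google Analytics", "Analytics", "url", "googletagmanager.com/gtag/js"),
    ("WordPress", "CMS", "generator", "wordpress"),
]


class AssetCollector(HTMLParser):
    """Collects script/link URLs and the generator meta tag from raw HTML."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.generator = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("src"):
            self.urls.append(a["src"])
        elif tag == "link" and a.get("href"):
            self.urls.append(a["href"])
        elif tag == "meta" and (a.get("name") or "").lower() == "generator":
            self.generator = a.get("content")


def detect(html):
    """Return one result per matched signature, with the string that matched."""
    collector = AssetCollector()
    collector.feed(html)
    results = []
    for name, category, kind, needle in SIGNATURES:
        evidence = None
        if kind == "url":
            hit = next((u for u in collector.urls if needle in u.lower()), None)
            if hit:
                evidence = f"URL: {hit}"
        else:
            gen = collector.generator or ""
            if needle in gen.lower():
                evidence = f"Generator: {gen}"
        if evidence:
            results.append({"name": name, "category": category,
                            "confidence": "high", "evidence": evidence})
    return results
```

Because every result carries the literal URL or tag content that matched, a reader can check any entry against the page source by hand.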

It is not a crawler — it reads the homepage, not the whole site. It is not an analyst — it reports what it finds, with no firmographic overlay, no scoring, no “companies like this also use.” And it is not a guesser: there is no language model anywhere in the detection path. What it owns is the fetch and the match.

Why I built it this way

Deterministic, with no model in the detection path

Technographic data gets acted on — someone builds an outreach list from it, or decides where to spend ad budget. A fabricated entry is actively harmful, so the detection is pure signature matching: every reported technology is a verbatim match against the page’s own bytes, and the matching evidence is attached to each result so a skeptical user can audit it. No model sits in the path deciding what a site “probably” runs. This is the same call I make on the SEC tools and for the same reason — when the number or the fact matters, the extraction has to be exact, not plausible.

A pleasant consequence is the cost structure. With no inference bill and no residential-proxy bytes in the default path, the marginal cost of a profile is a fraction of a cent, so the price reflects the lookup itself rather than a model call. That is what lets the per-domain price sit comfortably below the subscription technology databases.

One homepage fetch per domain

I considered crawling several pages per site to catch technologies that only appear on, say, a checkout or a careers page. I didn’t, for two reasons. First, the overwhelming majority of a site’s stack is site-wide: analytics, tag managers, the CDN, the framework, and the consent banner all load on every page, including the homepage. Second, one fetch per domain keeps the cost legible and the billing honest: one domain, one charge, and your domain-list length is your spend cap. A multi-page crawl would multiply cost for a long tail of extra detections that mostly aren’t there. If demand proves otherwise, a deeper crawl is an additive option, not a rewrite.

Charge per successful fetch — even when nothing matches

A domain that is fetched cleanly but matches no signature is a valid answer: “this site runs none of the technologies I know how to detect.” That is useful information, and I paid for the fetch to produce it, so it is charged like any other success. What is never charged is a failure — a domain that can’t be reached or returns no usable page. The line is “did we successfully analyze the page,” not “did we find something,” because the latter would quietly punish the Actor for being honest about a clean site.
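The billing line described above can be written down as a one-line predicate. This is a minimal sketch, assuming that a successfully analyzed record carries `"status": "completed"` (as in the sample record later in these notes); the `"failed"` status value and the `is_billable` helper are hypothetical names used only for illustration.

```python
def is_billable(record):
    """Charge when the page was fetched and analyzed, whatever the match count."""
    return record.get("status") == "completed"


records = [
    {"identifier": "a.example", "status": "completed", "technologyCount": 3},
    {"identifier": "b.example", "status": "completed", "technologyCount": 0},  # clean site
    {"identifier": "c.example", "status": "failed"},  # hypothetical failure status
]

# The clean site with zero matches is charged; the unreachable one is not.
billable = [r["identifier"] for r in records if is_billable(r)]
```

Note that `technologyCount` never enters the decision, which is the point of the rule.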

Basic-first fetching, and where I stopped

The fetch goes through a shared adaptive layer: it tries the cheap raw-HTML path first and escalates to a full JavaScript render only when a site blocks or thins the plain request. That ordering matters because raw HTML already contains the loader tags for almost every third-party technology — the analytics snippet, the consent script, the payment SDK are all referenced by a <script src> in the initial markup even on a heavy single-page app. So the common case stays nearly free, and the render is reserved for the sites that genuinely need it.

The interesting decision was where to stop. The fetch layer can go further — all the way to a residential, stealth-cleared session that defeats the hardest bot walls. I deliberately don’t use that tier here. At a per-domain price, a residential fetch costs more than the profile is worth, so paying for it would mean either losing money on every hard site or raising the price for everyone to subsidize a minority. Instead, a site that’s bot-walled beyond the rendered fetch simply fails — and a failure is free. That’s the honest tradeoff: this Actor reaches the large majority of public company sites cheaply, and the genuinely fortified ones return a clean “couldn’t reach it” rather than a silent, expensive, or fabricated answer.
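The basic-first ordering and the decision to stop before a residential tier could be sketched as a short loop over two fetchers. This is an illustration under stated assumptions, not the shared fetch layer itself: the fetchers are passed in as plain callables, the `"rendered"` tier name and the `MIN_USEFUL_BYTES` threshold for a "thinned" response are hypothetical, and only `"basic"` appears as a `realizedTier` value in these notes.

```python
MIN_USEFUL_BYTES = 2048  # assumed cutoff below which a response counts as thinned


def looks_usable(html):
    """Treat empty, missing, or very short responses as blocked or thinned."""
    return bool(html) and len(html) >= MIN_USEFUL_BYTES


def fetch_with_escalation(url, fetch_basic, fetch_rendered):
    """Try the cheap path first, then a full render. Returns (html, tier).

    There is deliberately no third tier: if both attempts fail, the result
    is (None, None), which the caller reports as a free failure.
    """
    for tier, fetcher in (("basic", fetch_basic), ("rendered", fetch_rendered)):
        try:
            html = fetcher(url)
        except Exception:
            html = None  # a network or block error counts as an unusable response
        if looks_usable(html):
            return html, tier
    return None, None
```

Passing the fetchers in as arguments keeps the escalation policy separate from the HTTP and browser machinery, so the ordering can be tested without touching the network.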

A curated signature set, not an exhaustive one

The open technographic fingerprint databases track thousands of technologies, many of them long-tail. I started with a curated ~95 that cover the categories a sales or research user actually filters on — the major CMSes and ecommerce platforms, the analytics and tag managers everyone deploys, the mainstream frameworks, CDNs, payment SDKs, marketing-automation and chat tools, ad pixels, and consent managers. Each signature is kept specific on purpose: for sales-targeting data, a false positive (telling someone a prospect runs a tool they don’t) is worse than a miss, so I prefer a precise script-URL or generator-tag match over a broad keyword that might collide. The set is designed to grow; breadth is a backlog, precision is a principle.
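To make the precision argument concrete, here is a small comparison of a broad keyword against a script-URL match. Both patterns are illustrative assumptions and are not taken from the Actor's signature set; `js.stripe.com` is the host Stripe serves its JavaScript SDK from.

```python
import re

# Broad keyword: fires on any occurrence of the word, including CSS class names.
BROAD = re.compile(r"stripe", re.IGNORECASE)

# Precise signature: fires only on a script tag loading from Stripe's SDK host.
PRECISE = re.compile(r"""<script[^>]+src=["']https://js\.stripe\.com/""", re.IGNORECASE)

page_without_stripe = '<table class="stripe-rows"><tr><td>Pricing</td></tr></table>'
page_with_stripe = '<script src="https://js.stripe.com/v3/"></script>'


def matches(pattern, html):
    return pattern.search(html) is not None


# The broad keyword collides with an unrelated class name (a false positive);
# the precise pattern ignores it and still catches the real loader tag.
false_positive = matches(BROAD, page_without_stripe)
precise_on_clean_page = matches(PRECISE, page_without_stripe)
precise_on_real_page = matches(PRECISE, page_with_stripe)
```

The precise pattern will miss a site that self-hosts or proxies the SDK, which is the miss-over-false-positive tradeoff described above.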

How to use it

A realistic call: profile a watchlist of competitors’ sites.

{
  "domains": ["shopify.com", "stripe.com", "wordpress.org", "webflow.com"]
}

The same call from the Python client:

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("shelvick/website-tech-stack-detector").call(
    run_input={"domains": ["shopify.com", "stripe.com", "wordpress.org", "webflow.com"]}
)
# Print a one-line summary per domain from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    cats = item.get("categories") or {}  # guard against records without a rollup
    print(item["identifier"], "→", item.get("technologyCount", 0), "techs;",
          "Ecommerce:", cats.get("Ecommerce"), "Analytics:", cats.get("Analytics"))

A single record comes back like this:

{
  "identifier": "shopify.com",
  "status": "completed",
  "url": "https://www.shopify.com/",
  "domain": "shopify.com",
  "technologies": [
    { "name": "Shopify", "category": "Ecommerce", "confidence": "high", "evidence": "URL: https://cdn.shopify.com/..." },
    { "name": "Google Analytics", "category": "Analytics", "confidence": "high", "evidence": "URL: https://www.googletagmanager.com/gtag/js?id=G-..." }
  ],
  "categories": { "Ecommerce": ["Shopify"], "Analytics": ["Google Analytics"] },
  "technologyCount": 2,
  "realizedTier": "basic"
}
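Judging from the sample record, the `categories` rollup and `technologyCount` appear to be derived from the `technologies` list. The sketch below recomputes them on the consumer side, which can serve as a consistency check in a pipeline. The `summarize` helper is hypothetical, and it assumes the `technologies` list is already de-duplicated by name.

```python
def summarize(technologies):
    """Rebuild the category rollup and the count from a technologies list."""
    categories = {}
    for tech in technologies:
        names = categories.setdefault(tech["category"], [])
        if tech["name"] not in names:
            names.append(tech["name"])
    return {"categories": categories, "technologyCount": len(technologies)}


sample = [
    {"name": "Shopify", "category": "Ecommerce", "confidence": "high",
     "evidence": "URL: https://cdn.shopify.com/..."},
    {"name": "Google Analytics", "category": "Analytics", "confidence": "high",
     "evidence": "URL: https://www.googletagmanager.com/gtag/js?id=G-..."},
]
summary = summarize(sample)
```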

If a technology you expect is injected only after client-side scripts run, set deepRender: true to force a full render on every domain. Most runs don’t need it — the raw HTML carries the loader tags — but it’s there for SPA-heavy targets.
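For longer watchlists, one way to prepare inputs is to split the list into batches and set `deepRender` only when needed. This is a sketch under assumptions: the `batch_inputs` helper is hypothetical, the 100-domain batch size follows the 1–100 per run figure in the comparison table below, and lowercasing and de-duplicating domains is a convenience of this example rather than documented Actor behavior.

```python
def batch_inputs(domains, deep_render=False, batch_size=100):
    """Split a domain list into run inputs of at most batch_size domains each."""
    # Normalize and de-duplicate while preserving the original order.
    cleaned = list(dict.fromkeys(d.strip().lower() for d in domains if d.strip()))
    batches = []
    for start in range(0, len(cleaned), batch_size):
        run_input = {"domains": cleaned[start:start + batch_size]}
        if deep_render:
            run_input["deepRender"] = True  # force a full render on every domain
        batches.append(run_input)
    return batches
```

Each returned dict can be passed as `run_input` to a separate Actor call.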

If you’re calling from an MCP-enabled agent, the same call is available as a tool on mcp.apify.com; the input schema is self-documenting, so the model can build the call from the tool description, and you can pay per call over x402 (USDC on Base) or Skyfire managed tokens. That’s the path I most expect for a sales-research agent enriching a prospect list mid-conversation.

How it compares to the alternatives

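| Approach                                | Normalized categories | Evidence per match | Bulk by domain  | Agent-callable     |
|-----------------------------------------|-----------------------|--------------------|-----------------|--------------------|
| Subscription technology-lookup services | yes                   | rarely             | yes             | via their own plan |
| Roll-your-own page parsing              | you build it          | you build it       | you build it    | you build it       |
| Website Tech Stack Detector             | yes                   | yes                | yes (1–100/run) | yes                |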

The honest framing: you can absolutely parse pages and maintain a signature set yourself — this Actor is for when you’d rather not own the parser, the signature database, and the fetch-escalation logic, and you want a stable JSON contract you can point an agent or a pipeline at. A subscription technographic database is the better fit when you need multi-year adoption history or a firmographic overlay and don’t mind a per-seat plan. Going domain-by-domain on demand, paying only for what you look up, with the evidence attached, is where this one fits.

Pricing model

Pay-per-event, billed only on success: one charge per domain fetched and analyzed, after the record is pushed to the dataset. Domains that fail to fetch — or that are too heavily bot-walled to reach — are free. Because billing is per domain, your domain-list length is your spend cap, and a clean site that matches nothing is still a paid, valid answer.

Current per-domain rates are on the Apify Store Pricing tab.
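The spend-cap arithmetic is simple enough to write out. In this sketch the per-domain price is a placeholder argument, not the Actor's actual rate (which lives on the Store Pricing tab), and the helper names are hypothetical.

```python
from decimal import Decimal


def spend_cap(num_domains, price_per_domain):
    """Worst case: every domain is fetched and analyzed successfully."""
    return Decimal(num_domains) * Decimal(str(price_per_domain))


def realized_spend(num_domains, num_failed, price_per_domain):
    """Failures are free, so only the successful domains are charged."""
    return Decimal(num_domains - num_failed) * Decimal(str(price_per_domain))


# With a placeholder price of 0.005 per domain, a 100-domain run can never
# cost more than 0.5, and seven unreachable domains bring it down to 0.465.
cap = spend_cap(100, "0.005")
actual = realized_spend(100, 7, "0.005")
```

Using `Decimal` rather than floats keeps small per-domain prices from accumulating rounding error across a long list.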

Open questions / future work

  • Deeper detection. The signature set is curated, not exhaustive. The obvious next step is broadening coverage of the long tail — more niche CMSes, regional ecommerce platforms, and B2B SaaS widgets — without loosening the precision that keeps false positives out.
  • Multi-page profiling. A homepage covers the site-wide stack, but checkout, blog, and careers pages can reveal payment, comment, and applicant-tracking tools the homepage doesn’t. Whether that belongs as an opt-in depth setting on this Actor or a separate crawl-then-profile flow is still open.
  • Version and change detection. Several signatures expose a version string; surfacing it, and diffing a domain’s stack across runs to flag migrations, would turn this from a snapshot into a monitor.
  • Header and cookie coverage. Header- and cookie-based signatures fire only when the fetch tier exposes them. Making that material reliably available would unlock cleaner detection of server-side technologies that leave no HTML trace.
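The change-detection idea in the list above could start as a plain diff between two records for the same domain taken on different runs. This is a sketch of one possible approach, not a planned implementation: the `diff_stack` helper is hypothetical and relies only on the `technologies`, `name`, and `category` fields shown in the sample record.

```python
def stack_by_category(record):
    """Map each category to the set of technology names detected in it."""
    out = {}
    for tech in record.get("technologies", []):
        out.setdefault(tech["category"], set()).add(tech["name"])
    return out


def diff_stack(previous, current):
    """Report, per category, which technologies disappeared and which appeared."""
    before, after = stack_by_category(previous), stack_by_category(current)
    changes = {}
    for category in sorted(set(before) | set(after)):
        removed = sorted(before.get(category, set()) - after.get(category, set()))
        added = sorted(after.get(category, set()) - before.get(category, set()))
        if removed or added:
            changes[category] = {"removed": removed, "added": added}
    return changes
```

A category with both a removal and an addition is a candidate migration, though a single missed detection on one run would look the same, so a monitor would likely want to confirm across more than two snapshots.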