Enforcement Record Profiler

Design notes for an Actor that compiles a cited, multi-agency enforcement and accountability record for any company from official U.S. public records.

Design notes for enforcement-record-profiler, an Apify Actor that compiles a company’s enforcement and accountability record from official U.S. public records. Give it an organization name; it resolves the entity, searches federal courts and several enforcement agencies, and returns a single structured report in which every finding links to the official document it came from.

The Apify Store page is authoritative for the schema, current pricing, and the input form. This page is for the design questions — why every claim has to carry a citation, why the same tool is aimed at both investigative journalists and corporate-risk analysts, and what it deliberately refuses to do.

What this is

You submit one organization name (optionally with a disambiguating hint or ticker). The Actor:

  1. Resolves the entity to a canonical identity — official name, known aliases, and SEC CIK + stock ticker when the organization is a public registrant — so findings attach to the right company rather than a same-named one.
  2. Searches official public records across seven sources: federal court dockets and opinions (via CourtListener/RECAP), the U.S. Department of Justice, the Securities and Exchange Commission, the Federal Trade Commission, OSHA, the NLRB, and the EPA’s federal environmental-enforcement docket.
  3. Extracts normalized findings from those official pages — each with the source agency, a title, a date, a penalty or settlement amount where one is shown, and the exact source URL.
  4. Synthesizes a neutral summary of the record across sources, describing what the documents state.

Output is one dataset record per run, a human-readable Markdown report in the run’s Key-Value Store, and the raw findings as JSON. A per-source coverage block records which sources were queried, how many findings each produced, and any that were unavailable.

The phrasing that captures it for a search: compile a company’s regulatory and legal history, find the federal lawsuits and agency actions against a company, a cited corporate-accountability report — produced in one call instead of a multi-site afternoon.

Why I built it this way

Citations are the product, not a feature

The tempting version of this tool is a one-line prompt: “what has Company X done wrong?” pointed at a general model. It produces fluent prose and, often, a fabricated settlement, a wrong penalty figure, or a citation URL that resolves to nothing. For accountability work that output is worse than useless — it’s a correction waiting to happen, or a libel exposure.

So the design inverts it. The Actor fetches the official agency and court pages first, and the language model’s only job is to extract what is literally present on those pages and attach the real link to each finding. The summary is written from the cited findings, never from model memory. When a source has nothing, the report says so. The model is wrapped so that fetched page content is treated as data, not instructions — a results page can’t talk the extractor into inventing a record. The result is a document where every line traces to a government source you can open and check.

One capability, framed for two buyers

The same underlying capability — “compile an organization’s misconduct record from public data” — has two audiences that usually get sold separate products. Investigative journalists and watchdog researchers want the accountability narrative; ESG, vendor-risk, litigation-support, and diligence analysts want the same facts as a risk input. The expensive incumbents serve the second group through seat-licensed compliance suites and ignore the first.

I deliberately kept the tool buyer-neutral: it compiles the record and cites it, and it’s equally a reporting aid and a risk input. The output doesn’t editorialize, which is exactly what both audiences need — the journalist adds the narrative, the analyst adds the risk judgment.

Official sources only, and a careful line on courts

Scope is restricted to public records on organizations and public figures acting in an official capacity — never private-individual investigation. That’s an ethics line, not a technical one.

Court records are where it’s easy to drift across a terms-of-service line. Direct PACER scraping and state-court portals are off the table; federal dockets and opinions come through CourtListener and RECAP, which redistribute that data under their own terms. It’s a narrower court footprint than “everything,” but it’s a defensible one.

Why deterministic parsing, with the model only at the end

Most of these sources have no clean public API — they’re HTML search pages and RSS feeds, each with its own shape. There were two ways to turn them into normalized findings: have a language model read each page and extract the fields, or write a small deterministic parser per source. I went with deterministic parsers, and it follows directly from the citations-are-the-product rule above.

A finding has to be exactly what the page says. A parser that pulls the case number, date, and link out of a known table structure cannot invent a settlement or mangle a URL; a model asked to extract from free text can. Deterministic extraction also costs nothing per source — the only language-model call in an entire run is the final neutral summary, written from findings that are already cited and verified. So the model is kept to the one job it’s genuinely good at and safe for (readable synthesis of established facts) and away from the one job where a hallucination would become a fabricated record (extraction).

The cost of that choice is a parser per source, and a site redesign can break one. That’s real maintenance, but it’s contained: a broken parser degrades that one source — the run records it in the coverage block and proceeds on the rest — rather than silently emitting wrong findings, which is the failure mode I refuse to ship for accountability work.

Covering the two hard sources

Five of the seven sources resolve cleanly enough: federal courts (CourtListener/RECAP’s JSON API), DOJ (a press-release feed), FTC and OSHA (HTML search pages), and NLRB (an HTML case search behind an anti-bot wall the fetch layer clears). The other two — securities and environmental — took real work, and they’re why the record is more complete than a quick build would be.

The SEC has no per-company enforcement-search endpoint; its litigation releases and administrative proceedings aren’t in the filings full-text index. So that source reads the SEC’s enforcement RSS feeds — litigation releases, administrative proceedings, and accounting-and-auditing enforcement — and matches the entity name client-side. It covers the agency’s current enforcement window rather than full history, which the coverage block states plainly.

The EPA’s enforcement data sits behind an API that throttles every request per IP, which makes it unreliable as a live per-call dependency. Rather than drop environmental enforcement, the Actor queries a purpose-built index rebuilt weekly from EPA’s bulk enforcement dataset — so an EPA lookup is a fast, dependable call instead of a rate-limited gamble. The trade is freshness: environmental findings are current to the last weekly rebuild, not the last minute.

How to use it

Minimal input:

{ "entity": "Wells Fargo & Company", "sinceYear": 2015 }

Via the REST API:

curl -X POST "https://api.apify.com/v2/acts/shelvick~enforcement-record-profiler/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"entity": "Wells Fargo & Company", "sinceYear": 2015}'

From a Python agent via the Apify SDK:

from apify_client import ApifyClient

client = ApifyClient(token=API_TOKEN)
run = client.actor("shelvick/enforcement-record-profiler").call(
    run_input={"entity": "Acme Corporation", "sources": ["courts", "doj", "ftc"]}
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for f in item["findings"]:
        print(f["source"], f["date"], f["title"], "->", f["url"])

It also surfaces on mcp.apify.com as a callable tool — one required field (entity), structured cited output — billable per call via x402 (USDC on Base) or Skyfire managed tokens, so an autonomous agent can run a profile and pay for it without a human in the loop. A full seven-source profile of a large entity can approach the five-minute sync-API cap — the NLRB and EPA lookups in particular trade latency for reliability — so use the async runs endpoint for those.

How it compares

Approach Multi-agency in one pass Per-claim citations Grounded in official records Agent-callable
Manual search across agency sites ✗ (one at a time) ✓ (you copy them)
General web-search LLM partial ✗ (often fabricated) partial
Subscription compliance/diligence suites ✗ (seat-licensed)
Enforcement Record Profiler

The combination that’s hard to find elsewhere is unified, cited, official-source-grounded, and callable per profile by a script or an agent — no seat license, no multi-site slog, no fabricated findings.

Pricing model

Pay-per-event: one charge per successful run, fired after the cited profile is pushed. It covers entity resolution, every selected source query, and the synthesis. Failed runs are never charged, and the platform startup charge is effectively zero. Subscriber tiers receive a discount. The flat-per-profile shape keeps the cost predictable for an agent caller even though the work varies with how many sources you select.

Current rates live on the Pricing tab of the Apify Store page.

Open questions / future work

  • EPA freshness. Environmental findings come from a weekly index rebuild — a deliberate trade of recency for reliability against an API that won’t take live per-call traffic. A shorter rebuild cadence, or an on-demand refresh for a single entity, would tighten that window.
  • Court depth. Federal dockets and opinions come through CourtListener/RECAP; state courts stay out of scope on terms-of-service grounds. Broader state-level coverage would need per-state sources with defensible terms.
  • Cross-source de-duplication. A single matter can appear in both a DOJ release and a federal docket. Today they show as two findings; collapsing them into one with multiple citations is a refinement.
  • Latency on the defended sources. NLRB sits behind a JavaScript bot-wall and EPA behind a throttled API; clearing the first reliably and querying the second both add time, so a heavily-covered entity across all seven sources can run long. The async endpoint handles it, but a first-class “this will take a while” signal for agent callers is worth adding.
  • Caching the slow agencies. The free agency endpoints rate-limit; a short-lived cache would smooth repeated profiles of the same entity.