SEC Financials Normalizer — EDGAR XBRL to Clean JSON
Design notes for a deterministic Actor that turns SEC EDGAR XBRL into standardized, identity-validated financial statements for AI agents.
Design notes for sec-financials-normalizer, a deterministic SEC-financials Actor on the Apify Store. Give it a ticker or a CIK; it pulls the company’s structured facts from SEC’s XBRL API, resolves each filer’s idiosyncratic tags onto a standard set of line items, and returns a clean income statement, balance sheet, and cash-flow statement — each line citing the tag it came from, each statement checked against accounting identities.
The Apify Store page covers the input schema, current pricing, and how to try it. This page is for the design questions — why the Actor is shaped the way it is, what I deliberately left out, and why the hard part isn’t the one people expect.
What this is
The data is already free. SEC publishes every public company’s financials as XBRL, and the companyfacts API hands back every fact a company has tagged. The problem this Actor solves isn’t access — it’s that raw XBRL is thousands of issuer-chosen tags, and the same economic line is tagged differently across companies and across years, so the facts aren’t comparable without real normalization work.
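To make the shape of the raw input concrete, here is a minimal sketch of pulling annual values for one tag out of a companyfacts-style payload. The endpoint pattern and the fact fields (`start`, `end`, `val`, `form`, `filed`) follow SEC's published API. The helper names, the sample payload and its numbers, the 350 to 380 day window for "annual", and the rule that the most recently filed value wins are all assumptions for illustration, not the Actor's actual logic.

```python
from datetime import date
from typing import Any

# SEC's companyfacts endpoint; the CIK is zero-padded to 10 digits.
COMPANYFACTS_URL = "https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"


def companyfacts_url(cik: int) -> str:
    """Build the companyfacts URL for a numeric CIK."""
    return COMPANYFACTS_URL.format(cik=cik)


def annual_values(payload: dict[str, Any], tag: str, unit: str = "USD") -> dict[str, float]:
    """Map period-end date -> value for one us-gaap tag, using 10-K facts only.

    Assumption: when several filings report the same period end,
    the most recently filed value wins.
    """
    entries = (
        payload.get("facts", {}).get("us-gaap", {}).get(tag, {})
        .get("units", {}).get(unit, [])
    )
    latest: dict[str, dict[str, Any]] = {}
    for fact in entries:
        if fact.get("form") != "10-K":
            continue
        if "start" in fact:
            # Duration facts: keep roughly full-year periods, drop quarters.
            days = (date.fromisoformat(fact["end"]) - date.fromisoformat(fact["start"])).days
            if not 350 <= days <= 380:
                continue
        prev = latest.get(fact["end"])
        if prev is None or fact["filed"] > prev["filed"]:
            latest[fact["end"]] = fact
    return {end: fact["val"] for end, fact in latest.items()}


# Hypothetical payload with made-up numbers, shaped like a companyfacts response.
SAMPLE = {"facts": {"us-gaap": {"Revenues": {"units": {"USD": [
    {"start": "2022-01-01", "end": "2022-12-31", "val": 1000, "form": "10-K", "filed": "2023-02-15"},
    {"start": "2022-10-01", "end": "2022-12-31", "val": 300, "form": "10-K", "filed": "2023-02-15"},
    {"start": "2022-01-01", "end": "2022-12-31", "val": 990, "form": "10-K", "filed": "2024-02-15"},
    {"start": "2023-01-01", "end": "2023-03-31", "val": 250, "form": "10-Q", "filed": "2023-05-01"},
]}}}}}

print(annual_values(SAMPLE, "Revenues"))
```

Even this toy version has to decide what counts as an annual period and which filing to trust, and it still says nothing about which tag to read. That second question is the one the rest of this page is about.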
The decision the Actor makes on each company is narrow: for each standard line item — revenue, gross profit, net income, total assets, liabilities, equity, the three cash-flow subtotals — pick the right tag out of that company’s particular soup, take its value verbatim, and then prove the result against the company’s own reported subtotals. It is not a scraper (the data comes from an official API, not a page), and it is not an analyst (it returns statements, not ratios or opinions). What it owns is the mapping and the check.
The output is one record per company per fiscal year: the standardized statements, each line carrying the source tag it was drawn from and a flag for whether an accounting identity corroborated it, plus the identity residuals for the whole statement. A balanced balance sheet in the output isn’t me asserting the numbers are right — it’s the filer’s own Assets and Liabilities and Equity tags reconciling to the cent.
Why I built it this way
Deterministic, with no language model in the path
This is the first decision and the one I’d defend hardest. There’s an obvious temptation to throw a model at the tag soup and let it “figure out” the mapping. I didn’t, because the output is financial figures, and a model that occasionally emits a plausible-but-wrong number is worse than useless here — it’s a number someone might trade or report on. Every value this Actor returns is copied verbatim from an XBRL fact. The model-free path also means the cost is almost nothing (no tokens, no proxy — just SEC’s free API and a little compute), so the price reflects the value of the normalization, not a per-call inference bill.
The only computed value in the whole pipeline is total liabilities when a filer omits a standalone tag for it: then it’s assets minus equity, and it’s labeled derived rather than passed off as reported. One arithmetic fallback, flagged. Everything else is selection, not computation.
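The fallback can be sketched in a few lines. This is an illustration under assumptions: the `Line` type, its field names, and the status strings are hypothetical, and the real output schema may differ. The tag names are standard us-gaap concepts.

```python
from dataclasses import dataclass
from typing import Optional

EQUITY_INCL_NCI = "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest"


@dataclass(frozen=True)
class Line:
    value: Optional[float]
    source_tag: Optional[str]  # XBRL tag the value was copied from, if any
    status: str                # "reported", "derived", or "missing" (hypothetical labels)


def total_liabilities(facts: dict[str, float]) -> Line:
    """Prefer the reported tag; derive only when it is absent; never guess."""
    if "Liabilities" in facts:
        return Line(facts["Liabilities"], "Liabilities", "reported")
    assets = facts.get("Assets")
    equity = facts.get(EQUITY_INCL_NCI)
    if assets is not None and equity is not None:
        # The one arithmetic fallback: assets minus equity, flagged as derived.
        return Line(assets - equity, None, "derived")
    return Line(None, None, "missing")


print(total_liabilities({"Assets": 1000, EQUITY_INCL_NCI: 400}))
```

One consequence worth noting: a derived liabilities figure satisfies the balance-sheet identity by construction, so it cannot also count as corroboration of that identity.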
The concept-map is the product
What an LLM would have done at runtime, I did once, up front: encode the tag priority for each line item as a maintained map. Revenue tries the modern RevenueFromContractWithCustomerExcludingAssessedTax first, then falls back through Revenues and the older SalesRevenueNet. Equity prefers the figure including noncontrolling interests, because that’s the one that makes the balance sheet balance for a holding company with large minority stakes — pick the parent-only figure and a conglomerate’s sheet silently misses by the size of those interests.
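As a rough sketch of what a priority map looks like in code: the tag names below are real us-gaap concepts taken from the paragraph above, while the `CONCEPT_MAP` structure, the `resolve` helper, and the sample values are assumptions for illustration. The maintained map covers more line items and more fallbacks than this.

```python
from typing import Optional

# Ordered by preference: the first tag present in the filer's facts wins.
CONCEPT_MAP: dict[str, list[str]] = {
    "revenue": [
        "RevenueFromContractWithCustomerExcludingAssessedTax",
        "Revenues",
        "SalesRevenueNet",
    ],
    "equity": [
        # Including noncontrolling interests first, so the balance sheet balances.
        "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest",
        "StockholdersEquity",
    ],
}


def resolve(item: str, facts: dict[str, float]) -> Optional[tuple[str, float]]:
    """Return (source_tag, value) for a line item, or None if no tag matched."""
    for tag in CONCEPT_MAP[item]:
        if tag in facts:
            return tag, facts[tag]  # value is copied verbatim, never adjusted
    return None


# An older filing that only used the legacy revenue tag (made-up numbers).
facts = {"SalesRevenueNet": 500, "StockholdersEquity": 200}
print(resolve("revenue", facts))
print(resolve("equity", facts))
```

The lookup itself is a loop. The ordering of each list is where the judgment lives, which is why the map, not the code, is the thing that needs maintaining.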
That map is the thing that has to be maintained as US-GAAP taxonomies drift year to year, and maintaining it is exactly what makes the Actor worth renting instead of rebuilding. The code that fetches facts is trivial; anyone can write it. The judgment about which of a dozen revenue tags is the right one for this filer in this year is the work, and it’s ongoing work, which is the only kind of difficulty that doesn’t get competed away.
Accounting identities are self-contained ground truth
The clever part of normalizing financials is that you don’t need an external answer key. Every filer reports its own subtotals, so the books have to balance: assets equal liabilities plus equity, and for a normal company gross profit equals revenue minus cost of revenue. If I resolve the components and those identities hold, the filer’s own numbers have corroborated my mapping. If they don’t hold, something is wrong and the Actor says so in the residual rather than shipping a confident error.
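The two identities reduce to simple residuals. A minimal sketch, assuming hypothetical line-item keys (`assets`, `liabilities`, and so on) and made-up numbers; a residual of zero means the filer's own subtotals corroborate the mapping, and `None` means a component was missing so nothing could be checked.

```python
from typing import Optional


def _residual(total: Optional[float], *parts: Optional[float]) -> Optional[float]:
    """total minus the sum of parts, or None if any component is missing."""
    if total is None or any(p is None for p in parts):
        return None
    return total - sum(parts)


def identity_residuals(
    lines: dict[str, Optional[float]], check_gross_profit: bool = True
) -> dict[str, Optional[float]]:
    out = {
        # assets = liabilities + equity
        "balance_sheet": _residual(
            lines.get("assets"), lines.get("liabilities"), lines.get("equity")
        ),
    }
    if check_gross_profit:
        # gross_profit = revenue - cost_of_revenue,
        # rearranged as revenue - (cost_of_revenue + gross_profit) = 0
        out["gross_profit"] = _residual(
            lines.get("revenue"), lines.get("cost_of_revenue"), lines.get("gross_profit")
        )
    return out


lines = {
    "assets": 1000, "liabilities": 600, "equity": 380,
    "revenue": 500, "cost_of_revenue": 300, "gross_profit": 200,
}
print(identity_residuals(lines))
```

In this made-up example the balance sheet misses by 20 while gross profit reconciles. A nonzero residual like that is reported as-is rather than absorbed into one of the components.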
A feasibility pass across a deliberately awkward set — a clean large-cap, a bank, an insurance holding company — is what convinced me this was buildable. The balance sheet reconciled exactly once equity used the incl-NCI tag; the one residual that showed up was a real thing (a redeemable, mezzanine-equity bucket that sits between liabilities and equity), and the check caught it automatically rather than hiding it. That’s the behavior I wanted: the validation surfaces the edge cases instead of burying them.
Sector-awareness, because financials break the template
The honest hard part isn’t the balance sheet — the identities nail that. It’s the lines that have no identity to check them against, and revenue is the worst offender. A bank or insurer has no gross-profit structure, and its total revenue isn’t the contract-with-customer line — for an insurer that tag is only a slice of revenue, and a normalizer that grabs it returns a confidently wrong total. There’s no accounting identity that catches this, so the only defense is knowing the sector. The Actor reads the filer’s SIC code, classifies it standard / financial / insurance, and applies the matching income-statement concept set — and skips the gross-profit check for financials rather than forcing it. Where a number can’t be self-checked, getting it right is a matter of encoded domain rules, which is precisely where the maintained map earns its keep.
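A sketch of the classification step, with loud caveats: the three sector labels come from the paragraph above, but the SIC ranges here are my simplification based on the standard SIC major groups (60 to 62 for banking, credit, and brokerage; 63 for insurance carriers). The function names are hypothetical and the Actor's real boundaries may differ.

```python
def classify_sector(sic: int) -> str:
    """Bucket a filer by SIC code into standard / financial / insurance.

    Assumed ranges, for illustration only:
      6000-6299  depository, nondepository credit, brokers -> "financial"
      6300-6399  insurance carriers                        -> "insurance"
    """
    if 6300 <= sic <= 6399:
        return "insurance"
    if 6000 <= sic <= 6299:
        return "financial"
    return "standard"


def runs_gross_profit_check(sector: str) -> bool:
    """Only filers with a revenue / cost-of-revenue structure get this check."""
    return sector == "standard"


for sic in (3571, 6021, 6331):
    sector = classify_sector(sic)
    print(sic, sector, runs_gross_profit_check(sector))
```

The sector then selects which income-statement concept set applies. Since no identity can confirm that choice, a misclassified filer fails quietly, which is why this step deserves its own regression coverage.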
Verbatim values, one flagged derivation, honest gaps
A line the Actor couldn’t resolve comes back missing, not zero and not guessed. The single derived value is labeled. Every reported value names its tag. The point throughout is that a caller — human or agent — can audit any figure back to its origin, and can tell at a glance which figures are corroborated by an identity and which are judgment calls the data couldn’t confirm. That transparency is more useful, for financial data, than a tidier-looking but opaque table.
How to use it
A realistic call: a watchlist of a few tickers, two years each, all three statements.
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("shelvick/sec-financials-normalizer").call(
    run_input={"identifiers": ["AAPL", "JPM", "BRK.B"], "years": 2}
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["identifier"], item["fiscalYear"], item["sector"], item["status"])
You get one record per company-period. Each carries the standardized statements, the source tag on every line, and the identity residuals — so a downstream agent can both use the numbers and check them. From an MCP-enabled agent the Actor appears as a tool on mcp.apify.com, the input schema is self-documenting, and you pay per call via x402 USDC on Base or Skyfire managed tokens. It’s the kind of tool an agent reaches for when it needs comparable fundamentals for a handful of companies and shouldn’t be parsing raw filings to get them.
How it compares to alternatives
| Approach | Standardized line items | Sector-aware | Identity-validated | Per-line citation | Numbers |
|---|---|---|---|---|---|
| Raw EDGAR / XBRL scraper | No — raw tags | No | No | tag dump | verbatim |
| Roll-your-own XBRL parser | You build + maintain it | You build it | You build it | You build it | verbatim |
| LLM over raw filings | Sometimes | Sometimes | No | No | hallucination risk |
| SEC Financials Normalizer | Yes | Yes | Yes | Yes | verbatim |
Raw scrapers return the soup; rolling your own means owning the concept-map’s drift forever; an LLM over filings risks inventing figures. This Actor is the maintained-normalization layer with a self-contained correctness check, returning verbatim numbers — the difference between “here are the facts, good luck” and “here are standardized statements that reconcile to the filer’s own subtotals.”
Pricing model
Pay-per-event, billed only on success: one charge per company-period record pushed. A company that doesn’t resolve, or has no usable annual facts, triggers no charge, so a run only ever costs the company-periods it actually normalized. One companyfacts fetch covers every requested year, so multi-year requests don’t multiply the fetch cost. Because there’s no model and no proxy in the path, the underlying cost is small and has no dominant component; the price reflects the normalization and the validation, not infrastructure.
Current per-event rates are on the Apify Store Pricing tab.
Open questions / future work
- An eval set for the un-checkable lines. The identities cover the balance sheet and gross profit for free, but sector revenue and the income-statement structure for financials have no self-check. The next real investment is a regression set that ground-truths those lines against the actual filings and runs in CI, so taxonomy drift can’t silently regress them.
- Quarterly periods. Annual 10-K only today. 10-Q periods are the obvious extension; the period-selection logic generalizes, the concept-map mostly carries over.
- More line items. EPS, shares outstanding, and a few more sub-totals are natural additions once the core set is hardened — added deliberately, each with its own identity check where one exists.
- The long tail of custom extension tags. Rare issuer-namespace tags with no us-gaap equivalent currently resolve to missing. A careful mapper could pick some of these up, but only where it can be done without guessing.
Data and disclaimer
The figures come from companies’ own public filings on SEC EDGAR (XBRL), which is the authoritative source — every line cites its source tag and links to the filing. The normalization is best-effort and provided as-is: it can contain errors or omissions, particularly for unusual filers, restatements, or custom extension tags, and the lines without an accounting identity to self-check (notably sector revenue) are the ones to treat with most care. It’s data, not investment, financial, legal, or tax advice, and not a recommendation — verify any figure against the primary filing before relying on it. Every output record carries a short notice to the same effect, so an agent passing the data downstream carries the caveat with it.
Published on the Apify Store as sec-financials-normalizer.