
I Said My Bank-Statement Parser Was 100% Accurate. I Was Grading It Against Itself.

Last month I published that a 51-line parser read bank statements perfectly. Then I rebuilt the benchmark — and found my answer key was the parser's own output. Fixing that, I found two more broken answer keys. Three times the ground truth lost to the model. Here's what that taught me about trusting anything with money.

  • ai
  • ocr
  • finance
  • document-extraction
  • local-llm
  • evaluation

Last month I published a piece with a clean, satisfying number in it: a boring 51-line text parser read our bank statements with 100% accuracy — every transaction, to the cent — while the AI models I threw at the same job landed far behind. The lesson wrote itself: accuracy is a property of your pipeline, not your model.

The lesson still holds. The 100% did not. And the way it fell apart turned out to be the most useful thing I’ve learned all year.

When I rebuilt the benchmark to run a proper rematch against a stronger model, I discovered my parser’s “100%” was measured against an answer key that was the parser’s own output. I had graded the parser against itself. It scored 100% the way you ace a test when you also wrote the answer key.

So I went and got a real, independent answer key. And then a second one. And a third. Every single time my ground truth disagreed with a model, I opened the actual bank statement to settle it, and every single time the statement sided with the model. Three different answer keys, four different defects, and in every disagreement the machine I was trying to grade was right and the answer key was wrong.

This is that story. It’s much less tidy than “boring parser wins,” and it’s the real lesson: in financial extraction, your ground truth is the weakest link in the system, and you almost never check it.

This is part one of two. Part one is the lesson — how I found my answer key was wrong four different ways, and what that means for trusting any extractor with money. Part two is the scoreboard: the full, clean, exact-cent comparison across every model I tested, once the corrected ground truth is rebuilt.

The two ways I was fooling myself

There were two soft validations propping up that original 100%, and they fail differently. Both are worth naming, because the second is a trap nearly everyone building financial extraction walks into.

Sin one — circular scoring (this produced the false 100%). To measure “exact-cent recall” you need a ground-truth list of the real transactions on each statement. I built that list by running the parser and trusting its output. So when I scored the parser against it, of course it got 100% — it agreed with itself. That’s not a measurement; it’s a mirror.
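To make the mirror concrete, here is a minimal sketch of exact-cent recall. The function name `exact_cent_recall` and the sample amounts are illustrative assumptions, not the benchmark's actual scorer or data.

```python
from collections import Counter
from decimal import Decimal as D

def exact_cent_recall(extracted, truth):
    """Share of ground-truth amounts the extractor reproduced, to the cent."""
    matched = Counter(extracted) & Counter(truth)  # multiset intersection
    return sum(matched.values()) / len(truth)

# Hypothetical statement with three real transactions.
statement = [D("1340.51"), D("25.00"), D("80000.00")]
# Hypothetical parser output that silently dropped one of them.
parser_output = [D("1340.51"), D("25.00")]

circular = exact_cent_recall(parser_output, parser_output)  # key = own output
independent = exact_cent_recall(parser_output, statement)   # key = the statement
print(circular, round(independent, 3))  # 1.0 0.667
```

Scored against its own output, any extractor gets 1.0 no matter what it dropped. Only the independent key exposes the miss.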

Sin two — reconciling against the statement’s own totals. My parser does something genuinely good: after it reads a statement it checks the arithmetic — opening + credits − debits = closing, every running balance consistent — and refuses if it doesn’t tie. That gate is real and I’d never ship without it. But I had quietly assumed something it doesn’t prove: that if the totals reconcile, the line items must be right. They don’t have to be. A totals check is necessary but not sufficient, and the gap between those two is exactly where silent errors live.
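A totals gate of this kind can be sketched in a few lines. This is an assumption-laden illustration, not the 51-line parser's actual code: the names `reconcile_or_refuse` and `ReconcileError`, the row shape, and the sample figures are all hypothetical.

```python
from decimal import Decimal as D

class ReconcileError(ValueError):
    """Raised when a statement's arithmetic does not tie."""

def reconcile_or_refuse(opening, closing, rows):
    """rows: (signed_amount, printed_running_balance) in statement order.
    Credits are positive, debits negative."""
    balance = opening
    for i, (amount, printed_balance) in enumerate(rows, start=1):
        balance += amount
        if balance != printed_balance:  # every running balance must be consistent
            raise ReconcileError(f"row {i}: computed {balance}, printed {printed_balance}")
    if balance != closing:  # opening + credits - debits must equal closing
        raise ReconcileError(f"closing: computed {balance}, printed {closing}")
    return True

rows = [(D("500.00"), D("1500.00")), (D("-200.25"), D("1299.75"))]
print(reconcile_or_refuse(D("1000.00"), D("1299.75"), rows))  # True
```

`Decimal` is used instead of `float` so that comparisons are exact to the cent. Passing this gate says the sums tie; it says nothing about whether each individual line is right.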

The fix for both is the same: an answer key that doesn’t come from the system you’re grading. Getting one — really getting one — was the whole project.

The errors a totals check can’t see

A reconciliation catches any error that changes the sum: a dropped transaction, an invented one, a wrong amount. Those break the tie and your gate fires. Good.

It is blind to errors that preserve the sum:

  • Splits — one $1,200 line read as two, $700 and $500. Sum unchanged.
  • Merges — two lines read as one. Sum unchanged.
  • Mis-dates — right amount, wrong day. Sum unchanged.
  • Mis-categories — right amount, wrong bucket. Sum unchanged.

Each leaves the totals tying to the penny while the ledger underneath is wrong. A statement can reconcile perfectly and still disagree with reality — and that’s before we even get to whether your answer key is right.
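The split case is easy to reproduce. In this sketch the $1,200 / $700 / $500 figures come from the example above; the extra $45.10 row is a made-up filler.

```python
from collections import Counter
from decimal import Decimal as D

truth = [D("1200.00"), D("45.10")]                  # what the statement says
extracted = [D("700.00"), D("500.00"), D("45.10")]  # one line read as two

totals_tie = sum(truth) == sum(extracted)                # the totals check
line_items_match = Counter(truth) == Counter(extracted)  # the line-item check

print(totals_tie, line_items_match)  # True False
```

The totals check passes while the line-item comparison fails, which is the gap between "necessary" and "sufficient".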

Three answer keys, four defects

Here is where the project stopped being a model benchmark and became something more uncomfortable.

Defect #1 — the parser graded itself. Already covered: the original 100% was circular. So I rebuilt the key from an external source.

Defect #2 — the bank’s own export is incomplete. The cleanest “independent” truth you can get is the bank’s own line-item export — a CSV of every transaction, straight from the source, no parser in the loop. I pulled all eight. Scored against that, the parser came back at 96.8%, not 100% — and all eight statements still reconciled on their printed totals, so roughly 3% of lines were wrong underneath a perfect-looking tie. Honest number, good story, I started writing it up.

Then I cross-checked one bank’s CSV against the statement PDFs and found 34 transactions that are on the PDF but missing from the CSV — including an $80,000 ACH deposit, an $11,525.75 loan payment, and a $9,831.46 withdrawal. The PDF reconciles to the printed control totals. The CSV doesn’t. The bank’s own “system of record” export had silently dropped real transactions. My independent oracle wasn’t ground truth either — it was just a different thing being wrong.
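A cross-check like this can be approximated with a multiset difference. The sketch below is a simplification under stated assumptions: it matches on amount only (a real comparison would likely also key on date and description), `missing_from` is a hypothetical helper name, and the two $120.00 rows are invented filler around the three amounts quoted above.

```python
from collections import Counter
from decimal import Decimal as D

def missing_from(export_amounts, statement_amounts):
    """Amounts on the statement with no counterpart in the export.
    Counter subtraction keeps duplicates, so two identical debits count twice."""
    gap = Counter(statement_amounts) - Counter(export_amounts)
    return sorted(gap.elements())

pdf_rows = [D("80000.00"), D("11525.75"), D("9831.46"), D("120.00"), D("120.00")]
csv_rows = [D("120.00")]

missing = missing_from(csv_rows, pdf_rows)
print(missing)
print(sum(missing))  # 101477.21
```

Using a multiset rather than a plain set matters for statements with recurring identical charges: a set would treat the second $120.00 as already present.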

Defect #3 — the human labels are doubled. For one bank we had a hand-built due-diligence workbook — human-verified “golden” labels, the gold standard. On that held-out set, every model “miss” had a strange signature: the model was emitting exactly half the labeled amount. So I opened the statement pages to adjudicate, line by line:

  • A check dated 12/24: the statement says $1,340.51, the model read $1,340.51 — correct — and the label says $2,681.02. Exactly double. The label’s description literally reads 00000003088 00000003088 CHECK PAID — the reference number is duplicated; a row-merge had summed the row into itself.
  • A small ACH debit: statement $25.00, model $25.00, label $50.00. Doubled again.

All told, 122 rows carried exactly double the true amount, roughly $854,000 overstated. Scanning for that duplicated-reference fingerprint alone missed some of them, because the small recurring debits had no such fingerprint. That is precisely why reconciling to the bank’s control totals, not pattern-matching, is the detector you trust. The model was more accurate than our human ground truth.
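Both signals can be sketched side by side: the "exactly half" signature and the control-total check. The helper name `doubled_suspects` and the $310.00 row are hypothetical; the $1,340.51 and $25.00 pairs are the two examples above.

```python
from decimal import Decimal as D

def doubled_suspects(label_amounts, model_amounts):
    """Label rows the model did not emit, but whose exact half it did emit."""
    model = set(model_amounts)
    return [a for a in label_amounts if a not in model and a / 2 in model]

labels = [D("2681.02"), D("50.00"), D("310.00")]  # "gold" workbook rows
model = [D("1340.51"), D("25.00"), D("310.00")]   # model output
control_total = D("1675.51")                      # printed on the statement

suspects = doubled_suspects(labels, model)
overstatement = sum(labels) - control_total       # non-zero means the key is off
print(suspects, overstatement)
```

The half-amount scan only suggests which rows to open; the control-total difference is the part that proves the key does not tie, with or without a visible fingerprint.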

Defect #4 — and the same workbook was incomplete on top of that. Doubling was only its first failure. The workbook also covered just two of the bank’s four accounts. The other two accounts — about 690 transactions, over a million dollars of volume each — were absent from the “gold” entirely. Not wrong amounts this time: missing accounts. And the way I caught it is the moral of the whole piece. The bank’s full four-account export self-reconciles — every running balance ties — and a gold set rebuilt from it ties to the printed totals to the cent across all four accounts. The human workbook had two accounts and doubled rows. The reconcile surfaced the gap; nothing short of it would have.
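A per-account reconcile of a gold set against printed totals might look like the following. It is a sketch under assumptions: the account ids, amounts, and the `coverage_gaps` helper are all hypothetical, and real control totals would usually be split into credits and debits.

```python
from decimal import Decimal as D

def coverage_gaps(printed_totals, gold_rows):
    """printed_totals: {account: printed total}; gold_rows: [(account, amount)].
    Returns (accounts absent from the gold, {account: gold minus printed})."""
    gold_totals = {}
    for account, amount in gold_rows:
        gold_totals[account] = gold_totals.get(account, D("0")) + amount
    missing = sorted(a for a in printed_totals if a not in gold_totals)
    off = {a: gold_totals[a] - t for a, t in printed_totals.items()
           if a in gold_totals and gold_totals[a] != t}
    return missing, off

printed = {"acct-1": D("100.00"), "acct-2": D("250.00"),
           "acct-3": D("75.00"), "acct-4": D("40.00")}
gold = [("acct-1", D("60.00")), ("acct-1", D("40.00")),
        ("acct-2", D("500.00"))]  # a doubled row, and two accounts never listed

missing, off = coverage_gaps(printed, gold)
print(missing, off)
```

One pass surfaces both failure modes at once: accounts with no gold rows at all, and accounts whose gold rows do not sum to the printed total.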

Three answer keys, four defects. A parser grading itself; a bank export with holes; and a human workbook wrong two ways at once — doubling the rows it listed and missing whole accounts it never listed. Every time one of them disagreed with the model, the statement sided with the model. That’s not a fluke about one model — it’s the structural fact of this domain: the ground truth is the part nobody validates, and it’s usually the part that’s wrong.

Why a near-right number is worse than a wrong one

Here’s why this isn’t academic. Our accounting software — the supposed system of record — had silently dropped a $28,138 ACH payment sitting right there on the statement. The software is ~93–99% “complete.” That one gap was a $28,000 hole in the monthly P&L, and nobody would have caught it, because the books still balanced against themselves. A system 99% right was 100% wrong about twenty-eight thousand dollars, and confident about it.

That is the whole problem with money, in one number. A 99%-accurate extractor isn’t “almost perfect”: it’s wrong about a few thousand dollars a month and confident about it. A contract read 99% right is a great summary; a ledger read 99% right is a silent error every hundred rows, and you don’t know which row.

So how did the models actually do?

Now the rematch — the reason I rebuilt all this. I scored a specialized financial OCR model and several strong general models against the same statements, same scorer, on exact-cent matching. One caveat up front, and it is the point of this whole piece: because two of my three answer keys turned out to be defective, I’m holding the absolute recall numbers until the bank re-exports cleanly and the human labels are reconcile-corrected — any absolute I published today would be scored against a contaminated key and would understate the models. What’s solid right now is the relative ranking on a given statement — every model scored against the same key with the same scorer. But here’s the twist I’ll come back to: that ranking flips depending on the statement’s layout, and that turns out to be the most important result of all. (The clean, full cross-layout scoreboard is part two of this series.)

Here’s an unseen regional bank (call it Bank A) that nothing was coded or trained for. The discriminator is precision — it punishes a model for emitting page noise and running balances instead of just the transactions:

| Engine | Recall | Precision | F1 | What it does |
|---|---|---|---|---|
| RolmOCR (a finance-specialized fine-tune of Qwen2.5-VL-7B) | 99.6% | 80% | 89.0% | extracts transactions only |
| Qwen2.5-VL (general) | 99.9% | 51% | 67.1% | dumps every number |
| MinerU 3.4 (pipeline / PP-OCRv6) | 98.4% | 50% | 66.3% | dumps every number |
| naive “grab every dddd.dd” floor | ~100% | 50% | ~66% | zero intelligence |
| the 51-line deterministic parser | 0 | | | has no code for this bank |

(Bank A here is the full four-account, ~4,223-transaction bank — the same numbers as the part-two scoreboard. The specialized model’s precision settles to ~80% across that volume; on one clean statement it had looked like 92%, which is exactly why you don’t quote a one-statement number.)

Read that table slowly, because three things in it matter more than any single number:

The general models collapse to the dumb floor. A script that blindly grabs every dollar-shaped number scores ~100% recall and 50% precision. The big general VLMs — and MinerU — score the same, because they emit the transactions and every running balance and every stray figure on the page. High recall, no judgment. Only the task-specialized model (RolmOCR) extracts the transactions and nothing else, which is why — on this layout — it’s the only one with precision worth anything.
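The floor is simple enough to write down. This sketch assumes a statement layout with a running balance on every row; the regex, the two sample rows, and the name `grab_every_amount` are illustrative, not the benchmark's actual baseline script.

```python
import re
from collections import Counter
from decimal import Decimal as D

# Dollar-shaped token: optional thousands separators, exactly two decimals.
AMOUNT = re.compile(r"(?<![\d,.])(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}(?!\d)")

def grab_every_amount(text):
    return [D(m.group().replace(",", "")) for m in AMOUNT.finditer(text)]

page = """\
12/24 CHECK PAID 3088      1,340.51   8,659.49
12/26 ACH DEBIT               25.00   8,634.49
"""
truth = [D("1340.51"), D("25.00")]  # the transactions only

grabbed = grab_every_amount(page)   # transactions plus running balances
hits = sum((Counter(grabbed) & Counter(truth)).values())
recall = hits / len(truth)
precision = hits / len(grabbed)
print(recall, precision)  # 1.0 0.5
```

With one balance printed beside every transaction, half of everything grabbed is noise, so precision lands at 50% by construction while recall stays perfect.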

But — and this is the part I almost got wrong — that is one layout. It would be easy to read this table as “specialized beats general, newer isn’t better, just ship the specialized model.” That is exactly the extrapolate-from-one-bank mistake this whole piece is about, and I nearly published it. Bank A’s statements carry a running balance on every single line — which is precisely what the general models choke on, because they faithfully dump every balance. On a cleanly sectioned statement, where there’s no inline balance to dump, the picture changes: the general model holds its own, and on some layouts it beats the specialized one (which starts over-transcribing balance grids of its own). The honest result is that no single model wins across layouts — the best extractor depends on the statement in front of it. The full cross-layout matrix is part two. The durable lesson from part one is that you cannot trust any single model’s raw output — which is the entire reason for the gate. (And ignore any “MinerU scored 99%” headline you’ve seen, including in my own earlier draft — that figure came from a saturated recall metric where a dumb “grab every number” floor also scores ~99%. On the honest exact-cent metric MinerU is unremarkable, and its text-dump output can’t pass the arithmetic gate at all.)

The hand-coded parser scores zero on an unfamiliar bank — and that’s the case for having a model at all. My beloved 51-line parser has no code for Bank A, so it returns nothing — zero. A model at least reads the unfamiliar statement cold, with no per-bank code. That’s the real argument against a brittle pile of per-bank parsers: a parser is perfect exactly where you’ve already done the work and useless everywhere else, while a model degrades gracefully onto layouts it has never seen. Which model to reach for is — again — layout-dependent, and the entire point of the gate is that you don’t have to bet the ledger on getting that choice right.

The fine-tune: an honest negative

I tried to push RolmOCR to 99% by fine-tuning it on the bank’s own pages (a LoRA adapter, three epochs, ~190 page-examples). It did not beat the base model: 96.3% F1 tuned versus 96.7% base. The adapter is real — the outputs genuinely differ — so this is a true negative, not a no-op.

Why flat? The base model was already reading the bank near-perfectly before training started; there was nothing to learn at the format level, so the adapter mostly overfit. But the better reason is the punchline of this whole piece: part of the “error” I was training against was the doubled human labels. Fine-tuning harder wouldn’t have made the model more accurate — it would have taught the model to double the amounts, to reproduce my defect. The fine-tune staying flat is the model refusing to learn my bad ground truth. You cannot out-train a broken answer key. You have to fix the key first.

The corrected rule: a ladder, and a gate that outranks all of it

Last month I called the rule “reconcile-or-refuse.” Right, but underspecified. Here’s the honest version as a ladder:

  1. Floor — totals reconcile to the cent, or refuse. Always. Catches every sum-changing error, costs nothing.
  2. Bar — line items reconcile to an independent ledger, or refuse — when you can get one, and after you’ve checked that the ledger itself is complete (mine wasn’t).
  3. No trustworthy independent ledger? Report the totals-bound and flag the line items as unverified. Never quote a number you graded against yourself — or against a key you haven’t audited.
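The ladder above can be read as a small decision function. The sketch below is one possible encoding; the `Verdict` names and the boolean inputs are assumptions made for illustration.

```python
from enum import Enum

class Verdict(Enum):
    REFUSE = "refuse"
    VERIFIED = "line items verified"
    TOTALS_ONLY = "totals-bound only; line items unverified"

def ladder(totals_tie, ledger_is_audited, line_items_match=False):
    """totals_tie: statement arithmetic reconciles to the cent.
    ledger_is_audited: an independent ledger exists AND was checked for completeness.
    line_items_match: extracted rows equal that ledger's rows."""
    if not totals_tie:
        return Verdict.REFUSE        # rung 1: the floor, always
    if not ledger_is_audited:
        return Verdict.TOTALS_ONLY   # rung 3: report the bound, flag the rest
    return Verdict.VERIFIED if line_items_match else Verdict.REFUSE  # rung 2

print(ladder(True, False).value)  # totals-bound only; line items unverified
```

Note that an unaudited ledger is treated the same as no ledger: it never upgrades a result to verified.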

But the deepest version isn’t a number at all. The hero of this story is not the parser and not the model — it’s the arithmetic reconcile gate. It’s the only thing in the entire pipeline that caught both the model’s stray running-balance and the human’s doubled check, because it doesn’t trust anybody’s output — it checks every number against the statement’s own internal math. The model is the part you swap next quarter. The independent ledger is a thing you still have to audit. The gate is the part that doesn’t care who’s lying to it.

Why this matters more for models, not less

A deterministic parser fails loudly — no code for a bank, it scores zero, obviously useless, you find out in one second. A model fails quietly — it emits a plausible, confident, wrong number, and you find out three weeks later when the books don’t match the bank. That asymmetry is exactly why the independent-oracle gate matters more for models. The only thing that lets you put a 99% model near money is a check that tells you which 1% to refuse. Without it, “99% accurate” just means “silently wrong, somewhere, and you’ll learn where the expensive way.”

None of this is “models are bad.” A specialized model read an unseen bank cold and beat every general one and my hand-coded parser’s coverage. Use one — behind the gate, never in front of the money. As the sole source of truth for a ledger: no.

The detail: exact setup, the four ground-truth defects, and what's held vs. publishable

What changed since the first version. The first benchmark scored every extractor against a key derived from the deterministic parser’s own output — circular, so the parser’s “100%” was self-graded. The rebuild scores every extractor (the parser, RolmOCR, Qwen2.5-VL, MinerU 3.4) against external keys with one identical scorer: exact-cent match of every dddd.dd amount, on absolute value so signed/unsigned conventions can’t fool it. Eight real production statements, two-plus U.S. banks, thousands of transactions.
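For readers who want the scoring rule spelled out, here is a minimal sketch of an exact-cent scorer on absolute values. It follows the description above but is not the actual scorer; the `score` name and the sample rows are assumptions.

```python
from collections import Counter
from decimal import Decimal as D

def score(extracted, truth):
    """Exact-cent recall, precision, and F1 on absolute amounts."""
    e = Counter(abs(a) for a in extracted)  # abs() ignores sign conventions
    t = Counter(abs(a) for a in truth)
    tp = sum((e & t).values())              # multiset intersection
    recall = tp / sum(t.values()) if t else 0.0
    precision = tp / sum(e.values()) if e else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return recall, precision, f1

truth = [D("-1340.51"), D("-25.00"), D("80000.00")]  # key stores debits as negative
extracted = [D("1340.51"), D("25.00"), D("80000.00"), D("8659.49")]  # unsigned, plus one stray balance

recall, precision, f1 = score(extracted, truth)
print(recall, precision, round(f1, 3))  # 1.0 0.75 0.857
```

Because F1 is the harmonic mean of the two, an engine at near-perfect recall and 50% precision cannot score much above 66%, which is where the "grab every number" floor sits.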

Why absolutes are held. Two of the external keys are provably defective: (a) one bank’s own CSV export is missing 34 real transactions that appear on the statement PDF and reconcile to its control totals (including an $80,000 ACH deposit), and a full re-export is pending; (b) a human due-diligence (DD) label set has two defects: a row-merge doubling (122 rows carrying exactly 2× the true amount, ~$854k overstated; model “misses” land on exactly half the label) and an account-coverage gap (it included only 2 of the bank’s 4 accounts; ~690 transactions across the two missing accounts were absent from the “gold”). Both label defects were caught by reconciling a rebuilt gold to the printed control totals, which ties to the cent across all four accounts, not by pattern-matching, which missed the doublings that carried no duplicated-reference fingerprint. Until the CSV re-exports and the gold is reconcile-cleaned, any absolute recall/precision would be scored against a contaminated key and would understate the models, so only the relative, same-key ranking is reported here. That ranking does not change when the clean absolutes land.

The relative ranking is layout-dependent — that’s the key result. On an unseen bank whose statements carry an inline running balance on every row (Bank A), precision is the discriminator: the specialized RolmOCR (Reducto’s finance fine-tune of Qwen2.5-VL-7B) holds ~80% precision across four accounts while the general VLMs (Qwen-family) and MinerU 3.4 (pipeline / PP-OCRv6) collapse to ~50% — they dump every running balance. But on cleanly sectioned statements (no inline balance to dump) that gap closes and reverses — a general VLM matches or beats the specialized model, which there over-transcribes the balance grids. So no single model wins across layouts; the full, validated cross-layout matrix (with a gate-pass column) is part two. The deterministic parser scores 0 on any bank it has no code for. Absolutes held pending final reconcile-validation of the matrix.

The fine-tune. RolmOCR + LoRA, 3 epochs, ~190 page-examples, 19 held out: 96.3% F1 tuned vs 96.7% base — a real negative (adapter outputs differ). Base was already at ~0.987 token accuracy at step 1; nothing to learn at the format level, and part of the residual “error” was the doubled labels, so harder training would have taught the doubling defect. Path to 99% is the reconcile gate + a cleaned multi-bank training set, not narrow fine-tuning.

Incumbent. pdftotext -layout → 51-line bank-aware parser → arithmetic reconciliation. ~96.8% line-item vs the (incomplete) bank CSV; 0 on banks it has no code for. (All documents are real financial records; every published figure is an aggregate or an entity-free illustrative amount — no real names, accounts, or entity-tied figures appear.)

What I actually learned

The first time, I learned a real thing — pipeline beats model — from a benchmark quietly grading a system against itself. Fixing that, I learned the bigger thing: I had no idea how wrong my ground truth was. A parser that graded itself, a bank export with holes, and a human workbook wrong two ways at once — doubling rows and missing whole accounts — and a model that was right every time I bothered to check.

If you build anything that touches money, the takeaway is not “trust the model” and it’s not “trust the parser.” It’s: be most suspicious of your cleanest number, because there’s a real chance you wrote its answer key. Audit the ground truth before you grade anything against it. Spend your effort on the reconcile gate — the one component that trusts no one and checks every number against the statement’s own math. The model is the part you replace next quarter. The gate is the part that lets you sleep.

Part two is the scoreboard this piece deliberately withholds: the full, clean, exact-cent comparison across every model — the specialized OCR fine-tune, the general vision-language models, and the document-AI pipelines — once the ground truth is rebuilt to something I’d actually stake a number on. Coming next.

Related: A 51-Line Parser Beat a 3-Billion-Parameter Model · Field-Level Ensemble OCR. Full methodology and per-statement detail available on request; the test documents are real financial records and aren’t published.