I Benchmarked Three OCR Models on Real Bank Statements. The Best One Flipped With the Layout.

Part two of the bank-statement series: the clean scoreboard. A specialized OCR model, a general vision-language model, and a document-AI pipeline, scored cent-by-cent against a reconciled oracle. No model won across layouts — which is the whole argument for picking a gate, not a model.

  • ai
  • ocr
  • finance
  • document-extraction
  • local-llm
  • evaluation

This is part two of two. Part one was the lesson: how I found that my answer key was wrong in four different ways, while a model turned out to be right every time I checked. It deliberately withheld one thing, the actual scoreboard, because the numbers weren’t trustworthy while two of my answer keys were broken. Now they’re fixed, the oracle reconciles, and I can show you the clean head-to-head.

The result is more interesting than “model X wins.” No model won across all the statement layouts. The best one flipped depending on what the statement looked like — and that flip is the entire case for building a reconcile gate instead of betting on a model.

The setup, honestly

Three extractors, one job: pull every transaction, to the cent, off real bank statements that span three very different layouts.

  • RolmOCR — a small vision-language model specialized for financial documents (a fine-tune of Qwen2.5-VL-7B).
  • Qwen2.5-VL — a strong general vision-language model, no finance specialization.
  • MinerU — a document-AI pipeline (layout detection + OCR), the kind of tool that tops document benchmarks.

The answer key this time is an oracle built from the statements’ own PDF text layer, reconciled to each statement’s printed control totals — it ties 8 out of 8 to the penny. Every model is scored against that same oracle, exact-cent, with the same scorer. (My deterministic 51-line parser reconciles all eight and is complete — but it reads the same PDF text the oracle is built from, so scoring it against that oracle is near-circular. It’s a baseline here, not an independent contestant. Part one already gave its honest independent number: 96.8% against the banks’ own exports.)

The banks are anonymized, but the layout is the load-bearing detail, so I’ll keep that:

  • Bank A — an inline running-balance layout: every transaction row also prints the account’s running balance.
  • Bank B — a cleanly sectioned layout (separate deposits/withdrawals sections) that also includes a Daily-Balances grid.
  • Bank C — a cleanly sectioned layout, no inline balances.

One honest note on weight before the numbers: every layout here is a multi-statement aggregate — Bank A spans four accounts and ~4,223 transactions, Banks B and C eight statements and ~1,863. Nothing below rests on a single statement, which is the point of a piece about not over-extrapolating from one.

The scoreboard

Recall / Precision / F1 in percent, exact-cent, against the reconciled oracle. The winner on each layout (by F1) is listed first.

Bank A — inline running-balance (four accounts, ~4,223 transactions):

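Model                 | Recall (%) | Precision (%) | F1 (%)
----------------------|------------|---------------|-------
RolmOCR (specialized) | 99.6       | 80.4          | 89.0
Qwen2.5-VL (general)  | 99.9       | 50.5          | 67.1
MinerU (pipeline)     | 98.4       | 50.0          | 66.3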

MinerU’s inline figure covers three of the four accounts (~3,866 transactions), because one account’s raw output was unavailable. It scores a flat ~50% precision on every inline account it did process, so the fourth is unlikely to move the aggregate.

Bank B — sectioned + Daily-Balances grid:

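Model                 | Recall (%) | Precision (%) | F1 (%)
----------------------|------------|---------------|-------
Qwen2.5-VL (general)  | 100        | 94.6          | 97.3
RolmOCR (specialized) | 99.8       | 78.4          | 87.8
MinerU (pipeline)     | 91.3       | 80.3          | 85.5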

Bank C — sectioned:

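Model                 | Recall (%) | Precision (%) | F1 (%)
----------------------|------------|---------------|-------
Qwen2.5-VL (general)  | 99.5       | 96.8          | 98.1
RolmOCR (specialized) | 99.5       | 95.9          | 97.7
MinerU (pipeline)     | 78.0       | 90.0          | 83.6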

Look at who’s on top. On the inline-balance layout the specialized model crushes the field. On both sectioned layouts the general model wins. The winner flips with the layout.

Why it flips

It’s not random, and the mechanism is the useful part.

On the inline-balance layout (Bank A), the general models drown in balances. Every row prints a running balance, and a general VLM faithfully transcribes everything it sees — transactions and balances alike. So it emits roughly twice as many numbers as there are transactions: ~100% recall, but ~50% precision, indistinguishable from a script that grabs every dollar-shaped number on the page. The specialized model was trained to emit transactions and skip balances, so it holds ~80% precision even across four accounts and four thousand transactions, where the others collapse to the ~50% floor.
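The ~50% floor is just counting, and a few lines make that concrete. This is a sketch of the arithmetic only, assuming one stray balance per transaction row; the function names are illustrative and not part of the scorer behind the tables.

```python
def precision_with_extras(n_transactions: int, n_extras: int, recall: float = 1.0) -> float:
    """Precision when a model emits the real rows it found plus n_extras stray numbers."""
    matched = recall * n_transactions
    return matched / (matched + n_extras)


def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * precision * recall / (precision + recall)


# One running balance per row means as many extras as transactions.
p = precision_with_extras(n_transactions=1000, n_extras=1000)
print(p, f1(1.0, p))

# Plugging in the Bank A rows from the scoreboard (as fractions):
print(round(100 * f1(0.999, 0.505), 1))  # Qwen2.5-VL
print(round(100 * f1(0.996, 0.804), 1))  # RolmOCR
```

With perfect recall and one balance per row, precision is exactly one half and F1 is two thirds, which is where the general models sit on Bank A.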

On the sectioned layouts (Banks B and C), the advantage reverses. There’s no inline balance to drown in, so the general model’s discipline (“read the transaction sections”) wins, and it climbs into the mid-90s on precision. The specialized model, meanwhile, transcribes Bank B’s Daily-Balances grid as if those were transactions, and its precision falls to the high 70s. The skip-the-balances behavior that saved it on Bank A evidently does not extend to a standalone balance grid, and that gap sinks it on Bank B.

So: specialization buys you robustness on the ugly layout and costs you on the clean one. Generality is the opposite. Neither is safe by itself, and which one you’d have “picked” depends entirely on which bank you happened to test first — the same extrapolate-from-one-statement trap that ran through all of part one.

The model that’s out regardless: MinerU

One model doesn’t get to play the layout game. MinerU under-extracts: it drops real transactions on the sectioned layouts (recall 78.0% on Bank C and 91.3% on Bank B, so roughly 9 to 22% of the ledger is missing). And under-extraction is the one failure a downstream check cannot repair, which brings us to the metric that actually matters.

The only metric that survives the gate

A precision/recall table isn’t the product. The product is a pipeline that reconciles or refuses — it only ships a statement if the extracted transactions sum to the bank’s printed totals, and routes the rest to a human. So the question isn’t “what’s the F1,” it’s “how many statements come out the far end of the gate, tied, with no human touch?”

Statements that hit 100% exact-cent recall (so the gate can filter any extras and tie to the total):

Model      | Statements that pass the gate clean
-----------|------------------------------------
Qwen2.5-VL | 4 of 8
RolmOCR    | 4 of 8
MinerU     | 1 of 8

This reframes everything, because recall is the metric that survives the gate, and precision mostly isn’t:

  • Over-extraction is recoverable. When a model emits extra rows (balances, page noise), those extras are exactly the numbers that break the reconciliation — so a reconcile-driven filter removes precisely them, and the ledger ties. A messy, over-extracting model with high recall is gate-viable.
  • Under-extraction is fatal. A transaction the model never emitted can’t be added back by any downstream check. The sum can never reach the printed total. The statement refuses, forever, until a human keys it in.
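The asymmetry between the two bullets can be sketched in code. This is a minimal illustration, not the production gate: it assumes one statement section at a time (so every amount is positive and stored in cents), a printed section total, and a hypothetical `max_drop` cap on how many extra rows it will try to remove. `GateResult` and `reconcile_or_refuse` are names invented for the example.

```python
from itertools import combinations
from typing import NamedTuple


class GateResult(NamedTuple):
    shipped: bool
    rows: list[int]  # amounts (in cents) that survive the gate
    reason: str


def reconcile_or_refuse(rows_cents: list[int], printed_total_cents: int,
                        max_drop: int = 3) -> GateResult:
    """Ship only if the rows tie to the printed total, dropping a few extras if needed."""
    overshoot = sum(rows_cents) - printed_total_cents
    if overshoot == 0:
        return GateResult(True, list(rows_cents), "ties as extracted")
    if overshoot < 0:
        # All amounts are positive, so removing rows can only lower the sum:
        # a short ledger can never be filtered up to the printed total.
        return GateResult(False, [], "short of printed total: refuse")
    for k in range(1, max_drop + 1):
        # Distinct sets of k amounts whose sum explains the overshoot exactly.
        candidates = {
            tuple(sorted(combo))
            for combo in combinations(rows_cents, k)
            if sum(combo) == overshoot
        }
        if len(candidates) > 1:
            return GateResult(False, [], "ambiguous extras: refuse")
        if len(candidates) == 1:
            kept = list(rows_cents)
            for amount in candidates.pop():
                kept.remove(amount)
            return GateResult(True, kept, f"dropped {k} extra row(s)")
    return GateResult(False, [], "cannot tie within max_drop: refuse")


print(reconcile_or_refuse([1000, 2500, 99900], 3500))  # one stray balance
print(reconcile_or_refuse([1000], 3500))               # one missing row
```

The brute-force search is only workable for a handful of extras. A layout that doubles the row count with balances would need a structural filter first (for example, dropping the balance column), and a ledger that is both missing a row and carrying an extra can still fool a sum-only check, which is why the sketch refuses whenever the answer is ambiguous.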

That’s why both VLMs — despite their precision swinging from 51% to 97% across layouts — are gate-viable: their recall stays high, and the statements they miss, they miss by a transaction or two that the gate flags for a few seconds of human review. MinerU isn’t viable: it drops too much to ever tie.

So don’t pick a model. Pick a gate.

Here’s the synthesis of both pieces. Part one said: don’t trust your ground truth — it was wrong four different ways, and the only thing that caught all four was reconciling to the statement’s own totals. Part two says: don’t trust any single model either — the best one flips with the layout, and no fixed choice is safe.

Both arrows point at the same component. The reconcile gate is the unifier. It makes any high-recall extractor viable regardless of layout — it filters the over-extractor’s extras, refuses the under-extractor’s gaps, and it does it without caring which model produced the numbers or whether the “answer key” was right. You stop shopping for the model that’s accurate enough to trust, and you build the gate that makes accuracy checkable — at which point you can swap the model freely as better ones ship.

Do not pick a model. Pick a gate.

The detail: the oracle, the full matrix, the gate-pass metric, and how to read it

The oracle. Built from each statement’s PDF text layer, reconciled to the printed control totals — ties 8/8 to the penny. This replaced two earlier keys that proved defective (a bank CSV that was incomplete, and a human workbook that was both doubled and missing accounts — see part one). It is the same source the deterministic parser reads, so the parser’s score against it is near-circular and reported only as “reconciles 8/8 + complete,” never as an independent number.

Models. RolmOCR (Reducto’s finance fine-tune of Qwen2.5-VL-7B); Qwen2.5-VL (general); MinerU 3.4 (pipeline / PP-OCRv6 backend). Metric: exact-cent match of every transaction amount, on absolute value, against the oracle — recall, precision, F1. This is an amount-multiset match (did you capture the right amounts), which is exactly what the reconcile gate checks. A stricter date-exact variant lowers RolmOCR by ~3 points on the highest-volume inline account but changes no ranking — read the F1s as amount-exact, not date-exact.
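For readers who want the metric spelled out, here is a rough sketch of an amount-multiset scorer of the kind described above. It is not the scorer used for the tables: `to_abs_cents` and `score` are hypothetical names, and the amount-string formats handled are assumptions.

```python
from collections import Counter
from decimal import Decimal


def to_abs_cents(amount: str) -> int:
    """Parse strings like '1,234.56', '-12.30' or '(12.30)' into absolute cents."""
    cleaned = amount.strip().replace(",", "").replace("$", "").strip("()")
    return abs(int((Decimal(cleaned) * 100).to_integral_value()))


def score(extracted: list[int], oracle: list[int]) -> tuple[float, float, float]:
    """Exact-cent multiset match: returns (recall, precision, F1)."""
    got, want = Counter(extracted), Counter(oracle)
    # Multiset intersection: an amount matches at most as often as the oracle has it.
    matched = sum((got & want).values())
    recall = matched / sum(want.values()) if want else 1.0
    precision = matched / sum(got.values()) if got else 1.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return recall, precision, f1


oracle = [to_abs_cents(a) for a in ["10.00", "-25.00", "(25.00)"]]
# A model that finds every transaction but also emits three running balances.
extracted = oracle + [to_abs_cents(a) for a in ["999.00", "888.00", "777.00"]]
print(score(extracted, oracle))
```

Using a multiset rather than a set matters when the same amount appears more than once on a statement: two identical charges have to be found twice to count twice.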

Full matrix (R / P / F1 in percent, micro-averaged / transaction-weighted).

  • Bank A, inline running-balance, four accounts, n≈4,223 (MinerU over three accounts, n≈3,866): RolmOCR 99.6/80.4/89.0 · Qwen2.5-VL 99.9/50.5/67.1 · MinerU 98.4/50.0/66.3.
  • Bank B, sectioned + Daily-Balances grid: Qwen2.5-VL 100/94.6/97.3 · RolmOCR 99.8/78.4/87.8 · MinerU 91.3/80.3/85.5.
  • Bank C, sectioned: Qwen2.5-VL 99.5/96.8/98.1 · RolmOCR 99.5/95.9/97.7 · MinerU 78.0/90.0/83.6.

Banks B and C span eight statements (n≈1,863). All are multi-statement aggregates, and the rebuilt gold reconciles all four of Bank A’s accounts to printed totals to the cent.

Gate-pass (statements reaching 100% exact-cent recall, of 8): Qwen2.5-VL 4 · RolmOCR 4 · MinerU 1. The gate filters over-extraction and refuses under-extraction, so recall is the survivor metric; over-extraction (extra balances) is strictly safer than under-extraction (dropped transactions).
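To make “micro-averaged” and “gate-pass” concrete, here is a small sketch of how per-statement counts could roll up into both numbers. The counts in the example are made up for illustration, and `StatementCounts`, `micro_average`, and `gate_pass_count` are hypothetical names rather than anything from my pipeline.

```python
from typing import NamedTuple


class StatementCounts(NamedTuple):
    matched: int   # emitted amounts that matched the oracle
    emitted: int   # everything the model emitted
    expected: int  # transactions in the oracle


def micro_average(statements: list[StatementCounts]) -> tuple[float, float, float]:
    """Pool counts across statements, so big statements weigh more than small ones."""
    matched = sum(s.matched for s in statements)
    emitted = sum(s.emitted for s in statements)
    expected = sum(s.expected for s in statements)
    if matched == 0:
        return 0.0, 0.0, 0.0
    recall = matched / expected
    precision = matched / emitted
    f1 = 2 * matched / (emitted + expected)
    return recall, precision, f1


def gate_pass_count(statements: list[StatementCounts]) -> int:
    """Statements with 100% exact-cent recall, i.e. nothing missing for the gate."""
    return sum(1 for s in statements if s.matched == s.expected)


# Hypothetical counts: one over-extracted statement, one under-extracted one.
demo = [StatementCounts(10, 20, 10), StatementCounts(8, 8, 10)]
print(micro_average(demo), gate_pass_count(demo))
```

The two statements in the example fail in opposite directions: the first has half its rows as noise but passes the gate, while the second is perfectly precise and still refuses.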

A note on the “independent” key. Even the bank’s own CSV export — the thing you’d reach for as ground truth — was wrong in both directions: on one bank it dropped 34 real transactions, on another it invented ~7 it didn’t have. An export that both omits and fabricates is not a source of truth; only reconciling to the statement’s printed totals is trustworthy in either direction.

(All figures are aggregates from real financial records; banks are anonymized to layout descriptors, and no entity-tied amounts appear.)

Related: Part one — I Said My Parser Was 100% Accurate. I Was Grading It Against Itself. · A 51-Line Parser Beat a 3-Billion-Parameter Model. Full methodology available on request; the test documents are real financial records and aren’t published.