A 51-Line Parser Beat a 3-Billion-Parameter Model at Reading Bank Statements

On Tuesday I pointed a brand-new 3-billion-parameter AI model at one of our bank statements and watched it type the digit zero for eighty-four minutes.

The model was Baidu’s Unlimited-OCR — OCR is optical character recognition, software that reads text out of an image or PDF — and it had been public for exactly one day. It is built to swallow a whole multi-page document in one pass and hand back clean, structured text. We run a local-first pipeline that pulls transactions out of bank statements, so a new open-source model that promised to read any statement, any layout, scanned or not, was worth an afternoon.

The afternoon turned into a verdict. The boring incumbent — a 51-line text parser — beat the 3-billion-parameter model on every measure that mattered: it got 100% of the transactions to the cent, reconciled the totals, and did it in about a tenth of a second per statement on no GPU at all. The big model, even after I fixed my setup and gave it its best configuration, landed between 0% and 66% accuracy, took 10 to 40 seconds per page, and on one statement returned nothing but blank pages.

If you take only one thing from this: for anything involving money, accuracy is a property of your pipeline, not your model. I’ll show the numbers, the three setup traps I fell into (so you can skip them), and the rule I now build everything around: reconcile-or-refuse.

Why a model is so tempting here

The pitch writes itself. One model, every bank, every format. Hand it a PDF — crisp download or crooked phone photo of a scan — and get transactions back. No per-bank parsing code, no templates, scans handled for free.

The catch is that financial data has a property most documents don’t: one wrong digit is a failure, not a deduction. “$1,234.56” misread as “$123456” doesn’t lower a grade — it breaks a reconciliation and quietly poisons a ledger. The bar isn’t “looks right.” It’s every cent, every row, or flag it for a human.

The setup

Two extractors, same machine, same answer key.

The incumbent. For digital PDFs — the kind you download from online banking, which carry a real text layer underneath — we don’t OCR at all. We pull the embedded text with a standard tool (pdftotext) and run it through a small, bank-aware parser (the 51 lines). Then comes the part that actually matters: an arithmetic check — opening balance + credits − debits = closing balance, with every running balance consistent. If it doesn’t add up, it refuses to hand over the data. It is deterministic. It physically cannot invent a digit.
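A minimal sketch of that arithmetic check, assuming amounts are stored signed (credits positive, debits negative) and that each row carries the running balance printed on the statement. The `Txn` type and `reconciles` function are illustrative names, not the production code.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Txn:
    amount: Decimal           # signed: credits positive, debits negative (assumption)
    running_balance: Decimal  # balance printed on the statement after this row


def reconciles(opening: Decimal, closing: Decimal, txns: list[Txn]) -> bool:
    """Return True only if every running balance and the closing balance add up."""
    balance = opening
    for txn in txns:
        balance += txn.amount
        if balance != txn.running_balance:
            return False  # a row disagrees with the statement's own arithmetic
    return balance == closing


rows = [
    Txn(Decimal("-25.00"), Decimal("975.00")),
    Txn(Decimal("100.50"), Decimal("1075.50")),
]
print(reconciles(Decimal("1000.00"), Decimal("1075.50"), rows))
```

`Decimal` is used instead of `float` so that cents compare exactly; binary floating point cannot represent most two-decimal amounts precisely.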

The challenger. Unlimited-OCR. We rendered each page to an image — which also stands in for the scanned-document case the parser can’t touch — ran the model, and pulled every number it produced.

What we tested on, honestly. Our ground-truth set is eight real production statements from two U.S. commercial banks, 8 to 32 pages each, 1,863 transactions, with a human-verified, transaction-by-transaction answer key. The deterministic parser was checked against all eight (and reconciled all 1,863 to the cent). Because the OCR model runs 10–40 seconds a page, I ran it on three representative statements — two from the first bank, one from the second, spanning 8 to 29 pages. Everything below for the model is those three; the head-to-head is on the statements both extractors actually saw. (The documents are real financial records, so every figure here is an aggregate — no real amounts, names, or accounts appear.)

The metric: exact-cent recall. Of the N transaction amounts in a statement, how many did the extractor recover to the cent? I picked the strictest possible metric on purpose, because it mirrors the job — a reconciliation needs every amount exactly right. This is much harsher than the “table structure” scores OCR papers usually report, which can hand out full marks for a perfectly-shaped table full of wrong numbers.
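One way to compute this metric is a multiset intersection, so that two identical amounts in the answer key need two identical amounts in the output. This is a sketch under that assumption; `exact_cent_recall` is an illustrative name.

```python
from collections import Counter
from decimal import Decimal


def exact_cent_recall(answer_key: list[Decimal], extracted: list[Decimal]) -> float:
    """Fraction of answer-key amounts recovered exactly, counted as a multiset."""
    if not answer_key:
        raise ValueError("answer key is empty")
    # Counter & Counter keeps the minimum count of each amount on both sides.
    matched = sum((Counter(answer_key) & Counter(extracted)).values())
    return matched / len(answer_key)


key = [Decimal("10.00"), Decimal("10.00"), Decimal("1234.56")]
found = [Decimal("10.00"), Decimal("123456"), Decimal("99.99")]  # right shape, wrong numbers
print(exact_cent_recall(key, found))
```

In the example the output has three entries, the same count as the key, yet only one of them matches an answer-key amount to the cent.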

The three traps (the part worth keeping)

A one-day-old model is easy to hold wrong, and my first two attempts were wrong. Finding the right setup was most of the work.

Trap 1: the repetition loop. That 84-minute run of zeros? I’d left out the anti-repetition setting the model’s own documentation specifies. Models in this family degenerate into a loop without it. Copy the generation parameters from the model card exactly before you conclude anything about quality.
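A cheap watchdog can stop a looping run long before the 84-minute mark. The heuristic below is an assumption for illustration, not something from the model card, and it does not replace the documented anti-repetition setting; the function name and thresholds are arbitrary.

```python
def tail_is_looping(text: str, unit_len: int = 8, min_repeats: int = 20) -> bool:
    """Heuristic: True if the output ends in one short chunk repeated many times."""
    for size in range(1, unit_len + 1):
        window = size * min_repeats
        if len(text) < window:
            continue  # not enough text yet to judge a chunk of this size
        tail = text[-window:]
        if tail == tail[-size:] * min_repeats:
            return True
    return False


print(tail_is_looping("Balance 12.34 " + "0" * 500))
print(tail_is_looping("Opening balance 1,000.00 closing 1,075.50"))
```

Checking the decoded text every few hundred tokens and aborting when the function returns True turns a lost afternoon into a fast, visible failure.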

Trap 2: the metric lied to me. Once it ran properly, the output clearly contained the right numbers, but my score came back near zero. The bug was mine: my answer key stored debits as negative numbers, the model emitted them unsigned, and I was comparing the two directly, so every debit counted as a miss. A wrong measurement is worse than no measurement; it’s a confident wrong answer. Check your evaluation harness before you trust its verdict.
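The sign mismatch is easy to reproduce on made-up numbers. The amounts below are invented for illustration; the point is only that the same output scores very differently before and after both sides are put on one convention.

```python
from collections import Counter
from decimal import Decimal

answer_key = [Decimal("-45.10"), Decimal("-12.00"), Decimal("250.00")]  # debits stored negative
model_output = [Decimal("45.10"), Decimal("12.00"), Decimal("250.00")]  # everything unsigned


def matched(key: list[Decimal], found: list[Decimal]) -> int:
    """Count multiset matches between two lists of amounts."""
    return sum((Counter(key) & Counter(found)).values())


naive = matched(answer_key, model_output)
normalized = matched([abs(a) for a in answer_key], [abs(a) for a in model_output])
print(naive, normalized)
```

Comparing on absolute value has a cost worth knowing: a debit and a credit of the same size become indistinguishable, so it measures whether the amount was read, not whether its direction was.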

Trap 3: whole-document vs. page-by-page (the real lever). The model’s headline feature is reading an entire PDF in one shot. On dense statements, that hurt — too much crammed into one pass, and the small numbers blur. Feeding it one page at a time was dramatically better. This single change, not any exotic flag, was the biggest accuracy gain. Note what that means: its best results came from not using its marquee feature.

The results

Exact-cent recall, with the transaction count (N) behind each percentage:

Statement	Pages	N (txns)	Incumbent parser	OCR, whole-document	OCR, page-by-page (its best)
Bank A, #1	8	113	100% ✓ reconciles	58%	66%
Bank A, #2	8	118	100% ✓	9% (quit after page 1)	35%
Bank B, #1	29	329	100% ✓	42%	0% — blank pages

Speed and footprint:

Measure	Incumbent parser	Unlimited-OCR
Time per statement	~0.1 second	minutes (10–40 s/page)
Hardware	CPU, no GPU	7–14 GB of GPU memory
Exact-cent recall	100%, reconciles	0–66%, never reconciles

Three things in that table matter more than the headline gap.

It quit mid-document. On Bank A’s second statement — same bank, same format, same settings as the first — the whole-document mode stopped after page one and emitted layout-marker garbage. Same kind of input, wildly different behavior. For money, unpredictability is disqualifying before you even get to accuracy.

No single setup worked across both banks. The page-by-page configuration that scored 66% on Bank A returned an empty string for all 29 pages of the Bank B statement — I reproduced it twice, with no errors in the log — while the whole-document mode read that same statement fine at 42%. So you can’t pick one configuration and ship it. The right setting is bank-specific and brittle, which defeats the entire reason you’d reach for one universal model.

Even its best was 66% — and that was on a pristine 300-DPI render of a clean digital PDF. A real scan is noisier, so treat 66% as an optimistic ceiling, not a floor.

This isn’t “new model bad” — its own paper predicts it

I want to be fair, because Unlimited-OCR is a genuinely good model for what it’s for. Its paper reports about 94% on a standard document benchmark and about 90% on table structure. On a contract or a research paper, it’s impressive. I also played its strongest hand, not its marketing one: its documented generation settings and its better-scoring page-by-page mode.

Two facts explain the result exactly:

The authors flag it themselves. The paper states that numeric accuracy on long documents degrades because the multi-page mode runs at a resolution that “degrades small-text visibility.” Bank statements are wall-to-wall small numbers. The designers wrote down where it would struggle, and that’s precisely where I measured it struggling.
The wider literature agrees. Independent write-ups of this class of model put real-world financial-document accuracy around 75–80%, with documented digit hallucination — the “$1,234.56 → $123456” failure — and table misalignment driving roughly 30% of production breakages. My stricter every-cent metric just surfaces the same weakness more sharply.

A 90% structure score and a 100% reconciliation are not the same product. One gets the table’s shape right. The other gets the money right.

Reconcile-or-refuse

Here’s the rule the afternoon left me with, and the thing worth more than the benchmark.

You will not make a single OCR model trustworthy with money by finding a better model. Even a 97%-accurate one gets about one row in 33 wrong, which is unacceptable for a ledger. The trust comes from a pipeline that catches its own mistakes instead of a model that promises not to make them:

Use the cheap, deterministic path whenever you can. Most statements are digital PDFs with a real text layer; reading that text never hallucinates and costs nothing. Most teams skip straight to a GPU model they didn’t need.
Reserve OCR for genuinely scanned documents — and even then, never trust it alone.
Gate every number on arithmetic. Opening plus credits minus debits equals closing; running balances consistent; page subtotals match. If it reconciles, ship it. If it doesn’t, refuse — route it to a human. That’s reconcile-or-refuse: it converts any 80%-accurate extractor into “100% of what we ship is arithmetically self-consistent, with a known exception queue.”
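The gate in the last item can be sketched as a small router that either ships an extraction or parks it for review. This is a sketch with hypothetical names (`Extraction`, `Gate`, `submit`) and it assumes debits are stored as positive magnitudes; it checks totals only.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Extraction:
    statement_id: str
    opening: Decimal
    closing: Decimal
    credit_amounts: list[Decimal]
    debit_amounts: list[Decimal]  # positive magnitudes (assumption)


@dataclass
class Gate:
    shipped: list[Extraction] = field(default_factory=list)
    exceptions: list[Extraction] = field(default_factory=list)  # human review queue

    def submit(self, ex: Extraction) -> bool:
        """Ship the extraction if it reconciles; otherwise queue it for a human."""
        total = ex.opening + sum(ex.credit_amounts, Decimal(0)) - sum(ex.debit_amounts, Decimal(0))
        ok = total == ex.closing
        (self.shipped if ok else self.exceptions).append(ex)
        return ok


gate = Gate()
good = Extraction("stmt-1", Decimal("1000.00"), Decimal("1075.50"),
                  [Decimal("100.50")], [Decimal("25.00")])
misread = Extraction("stmt-2", Decimal("1000.00"), Decimal("1075.50"),
                     [Decimal("100.50")], [Decimal("2500")])  # "25.00" read as "2500"
print(gate.submit(good), gate.submit(misread))
```

A totals-only check can miss errors that happen to cancel out, which is why the running-balance and page-subtotal checks listed above are worth adding on top of it.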

The gate is the product. The model is a part you can swap next quarter. That’s why a 51-line parser beat a 3-billion-parameter model — not because it’s smarter, but because it’s checkable.
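To put a rough number on why a better model alone does not get there: if each row is read correctly with probability 0.97 and errors are independent (a simplifying assumption, not a measurement), the chance that an entire statement comes out right shrinks quickly with its length.

```python
def p_statement_fully_correct(per_row_accuracy: float, rows: int) -> float:
    """Chance that every row is right, assuming independent per-row errors."""
    return per_row_accuracy ** rows


for rows in (113, 118, 329):  # transaction counts from the results table
    print(rows, f"{p_statement_fully_correct(0.97, rows):.4%}")
```

For the 113-transaction statement this works out to roughly 3%, so nearly every statement would still need the gate to catch something.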

To be precise about the scope of that win: I did not pit the parser against truly scanned statements. There it scores zero and OCR is the only option — and that’s exactly the lane I’d put OCR in, behind the reconcile gate. This benchmark is about structured digital PDFs, where reaching for a model is tempting and unnecessary.

When you SHOULD reach for a model like this

Prose-heavy documents — contracts, papers, reports — where it’s strong and a wrong digit isn’t a catastrophe.
Genuinely scanned documents with no text layer, where deterministic parsing gets you zero — but downstream of a reconcile gate, never in front of the money.
As one voice in an ensemble, cross-checked against a second extractor — the same field-routing idea I used to get OCR working on insurance cards. Never as the sole source of truth.

What it is not, today, is a drop-in replacement for a deterministic parser on structured financial PDFs.

Exact configuration, environment, and how to reproduce the benchmark

Model under test. baidu/Unlimited-OCR — 3.3B parameters, MIT license, DeepSeek-OCR lineage plus “R-SWA” (Reference Sliding Window Attention, the long-document mechanism), arXiv 2606.23050, released 2026-06-23. Pinned to the Hugging Face revision tested on 2026-06-24 (a one-day-old checkpoint can shift under you — date and pin it). Runtime: transformers==4.57.1, torch==2.10, one 24 GB GPU.

Harness (copy this shape):

Render each PDF page to an image with PyMuPDF at 300 DPI.
Run the model per page with the model card’s exact generation parameters — including the anti-repetition n-gram setting, or it loops forever (Trap 1).
Extract every dddd.dd-shaped amount with a regex.
Multiset-compare the extracted amounts to a hand-verified answer key, on absolute value so signed/unsigned conventions don’t fool you (Trap 2).
Report exact-cent recall, plus wall-clock per page and peak GPU memory.
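The extraction and timing steps above can be sketched as follows. The regex, the function names, and the `ocr_page` callable are illustrative assumptions; `ocr_page` stands in for whatever call actually runs the model on one rendered page, which is not shown here, and rendering and scoring are left out.

```python
import re
import time
from decimal import Decimal
from typing import Callable

# Amounts with exactly two decimals, with or without thousands separators.
AMOUNT_RE = re.compile(r"(?<![\d.,])(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}(?!\d)")


def extract_amounts(text: str) -> list[Decimal]:
    """Pull every cents-precision amount out of raw OCR text, ignoring sign."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(text)]


def run_pages(
    page_images: list[bytes], ocr_page: Callable[[bytes], str]
) -> tuple[list[Decimal], list[float]]:
    """Run a caller-supplied OCR function page by page; return amounts and seconds per page."""
    amounts: list[Decimal] = []
    seconds: list[float] = []
    for image in page_images:
        start = time.perf_counter()
        text = ocr_page(image)
        seconds.append(time.perf_counter() - start)
        amounts.extend(extract_amounts(text))
    return amounts, seconds


print(extract_amounts("Deposit 1,234.56 Fee -45.10 ref 123456 on 06/23"))
```

The pattern drops the sign and skips digit runs without a decimal point, so a misread like "123456" is never counted as an amount.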

Configs measured (exact-cent recall, absolute value):

Config	Bank A #1 (N=113)	Bank A #2 (N=118)	Bank B #1 (N=329)
Whole-document (`infer_multi`, base mode, 1024px)	58.4%	8.5%	41.9%
Page-by-page, base mode + plain-OCR prompt	66.4%	34.7%	0% (empty)
Page-by-page, gundam (crop-tiling) mode	—	23.7%	—

The dominant lever was per-page vs whole-document, not the crop mode. On these full-page registers a coherent full-page read (base) beat the tiled “gundam” mode the model recommends for dense text. The empty-output result on Bank B was reproduced twice with no errors logged.

Incumbent (Track A). pdftotext -layout → a 51-line bank-aware parser → arithmetic reconciliation (opening + Σcredits − Σdebits = closing, per-row running balance, page-subtotal checks). Verified on all 8 statements: 8/8 reconcile, exact transaction counts, ~0.03–0.12 s/statement, no GPU.
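The first two stages of that pipeline might look like the sketch below. The `pdftotext -layout` call matches the command named above, with `-` sending the text to stdout; the row pattern is a hypothetical layout (date, description, amount, running balance) chosen for illustration, since the real parser is bank-specific and not published.

```python
import re
import subprocess
from decimal import Decimal


def embedded_text(pdf_path: str) -> str:
    """Read the PDF's text layer with pdftotext, keeping the column layout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes the text to stdout
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Hypothetical row shape: MM/DD  description  amount  running balance
ROW_RE = re.compile(
    r"^\s*(\d{2}/\d{2})\s+(.+?)\s+(-?[\d,]+\.\d{2})\s+(-?[\d,]+\.\d{2})\s*$"
)


def parse_rows(text: str) -> list[tuple[str, str, Decimal, Decimal]]:
    """Return (date, description, amount, balance) for each line that matches."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            date, desc, amount, balance = m.groups()
            rows.append((date, desc,
                         Decimal(amount.replace(",", "")),
                         Decimal(balance.replace(",", ""))))
    return rows


sample = (
    "06/01   COFFEE SHOP #12        -4.50      995.50\n"
    "Page 1 of 8\n"
    "06/02   PAYROLL DEPOSIT      1,200.00    2,195.50\n"
)
print(parse_rows(sample))
```

Lines that do not match, such as page footers, are skipped silently here; that is acceptable only because the reconciliation step downstream would fail if a real transaction row were skipped.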

What I learned

A 3-billion-parameter model lost to 51 lines of text-parsing — not because the model is bad, but because I asked it to do the one thing its own paper says it’s weak at, in the one domain where 80% right is the same as wrong.

The two bugs that cost me hours are the cheap lessons: copy the model card’s generation settings exactly (or it loops), and make sure both sides of your evaluation use the same conventions (or it lies to you).

The expensive lesson is the rule. If you’re building anything that touches money, spend your effort on reconcile-or-refuse, not on the model leaderboard. The model is the part you replace next quarter. The gate is the part that lets you sleep.

Related: Field-Level Ensemble OCR · Self-Hosted Insurance-Card OCR. Full methodology and per-statement detail available on request; the test documents are real financial records and aren’t published.