Why we built a document sorter without AI: The Matrix by ChunkLand

A 92% accurate filing system is a 100% broken filing system. If eight out of every hundred bank statements get silently misfiled, you don't have a filing system; you have a search problem dressed in a confidence score. The numbers look great in the demo. The numbers stop being numbers the day you can't find a tax statement at 9pm on the night before the deadline.

The Matrix files documents on the machine that scanned them. There is no cloud model reading your paperwork to decide what it is. No tokens, no per-page billing, no "we use your documents to improve our service" clause buried on page nine of the terms. This is a deliberate choice. Below is the case for it, and the cases where it's the wrong tool.

The crowded field, and the line I drew through it

Document automation is a packed category. There are excellent products that parse line items off coffee receipts. There are heavy-duty cloud services that extract structured fields out of messy invoices at scale. They work. For someone whose job is reconciling thousands of expense reports a month, they are absolutely the right tool.

What all of those products have in common is that they read your documents to do their job. They have to. The whole proposition is "give us your scan, our model will tell you what it is." That requires uploading the document, running OCR, running a classifier, storing the inference output somewhere. Pricing is per-page or per-API-call because every page costs them an inference. Accuracy is reported as a percentage because the underlying classification is probabilistic.

Probabilistic is fine when the cost of being wrong is low. A misclassified Uber receipt is a five-second fix. Probabilistic is not fine when the cost of being wrong is a tax statement filed under "school forms," a Medicare letter dropped into "utilities," or a supplier invoice that quietly never shows up when you go to do your BAS.

"The 92% accurate filing system is a 100% broken filing system."
That is the framing this whole product exists to answer.

Deterministic, not probabilistic

Determinism is the angle a cloud model can't match without rebuilding. Every page has exactly two outcomes: it matches a recognizer I built from your real samples, or it parks in an UNKNOWN folder for you to look at. There is no third bucket called "I'm 87% sure this is an invoice." Your child's school excursion form cannot end up filed under "tax statements" because filing isn't decided by a model guessing at the page; it's decided by a pattern I confirmed against documents you sent me.

That sounds simple. The trick is making it simple. The way you get a deterministic runtime is to do the configuration work once, by hand, on a small batch of your actual paperwork, not to ask a model to figure it out fresh every time a new page lands.

How it actually works

You email me three to five of your everyday documents at setup@chunkland.com. Tax statements, bank statements, super statements, an invoice from your plumber, an electricity bill, your kid's school enrolment form, a Medicare letter, a timesheet you submit, a PO from a supplier: whatever your actual paperwork looks like. Don't curate. The messier the better.

I open each one and read it. Not a model: me, at a desk. I'm looking for the stable patterns: where the date lives, where the sender block is, what the account number format looks like, whether there's a barcode, what string is on the cover page that you'd want in the filename. Each pattern becomes a recognizer entry in your provision.json. Things like:

regex_text: a tight regex against the text near the top of a page. For an Australian tax statement, the recognizer locks onto the TFN block: TFN:\s*(\d{3}\s\d{3}\s\d{3}) with a co-occurring "Tax File Number" header, so a payslip with a TFN on it doesn't get misclassified as the statement itself.
layout_anchor: a positional anchor for documents whose key data sits in the same place every time. Bank statements are a good example: the BSB + account row reads BSB\s+(\d{3}-\d{3})\s+Account\s+(\d{6,10}) in the top-right header block, and the recognizer pins on that position, not on the bank's logo or marketing copy that changes every redesign.
regex_text against a supplier invoice header: for example, Invoice\s+No\.?\s+(INV-\d{4,8})\s pulls the invoice number directly into the filename, so when you go looking for "INV-04821 from the plumber" it's literally that string on disk.
barcode_capture: Code 128 capture for documents that already carry a barcode (most super-fund statements, plenty of utility bills, a lot of work-order PDFs from clients). The recognizer reads what's already on your paper. It doesn't add anything to it.
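
To make the patterns above concrete, here is a minimal Python sketch of how a text recognizer with a co-occurrence check might behave. The three regexes are the ones quoted above; the entry layout (name, pattern, requires) is a hypothetical stand-in rather than the actual provision.json schema, the positional check for layout_anchor is left out, and the sample text and TFN are made up.

```python
import re

# Patterns are quoted from the recognizer descriptions above. The entry layout
# ("name", "pattern", "requires") is a hypothetical stand-in, not the real
# provision.json schema, and position checks for layout_anchor are omitted.
ENTRIES = [
    {"name": "tax_statement",
     "pattern": r"TFN:\s*(\d{3}\s\d{3}\s\d{3})",
     "requires": "Tax File Number"},  # co-occurring header must also be present
    {"name": "bank_statement",
     "pattern": r"BSB\s+(\d{3}-\d{3})\s+Account\s+(\d{6,10})",
     "requires": None},
    {"name": "supplier_invoice",
     "pattern": r"Invoice\s+No\.?\s+(INV-\d{4,8})\s",
     "requires": None},
]


def match_entry(entry, page_text):
    """Return the captured groups as a tuple if the entry matches, else None."""
    required = entry["requires"]
    if required is not None and required not in page_text:
        return None
    found = re.search(entry["pattern"], page_text)
    return found.groups() if found else None


# Made-up sample pages: both carry a TFN, only one carries the header.
statement = "Tax File Number\nTFN: 123 456 789\nNotice of assessment"
payslip = "Payslip for March\nTFN: 123 456 789\nGross pay"

print(match_entry(ENTRIES[0], statement))  # the TFN digits are captured
print(match_entry(ENTRIES[0], payslip))    # None, because the header is missing
```

The payslip case is the point of the co-occurrence rule: the TFN regex alone would fire on both pages, and the required header is what keeps the payslip out of the tax folder.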

I run each recognizer against your samples until the matches are clean and false positives are zero. Then I bundle the kit and email it back to you. Usually inside a day.
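
The "clean matches, zero false positives" check can be pictured as a small audit loop. This is a sketch under assumptions, not the tooling I actually use: the function name audit, the label-equals-recognizer-name convention, and the sample strings are all invented for illustration.

```python
import re


def audit(patterns, samples):
    """Check every recognizer against every labelled sample.

    patterns: dict mapping recognizer name -> regex string
    samples:  list of (label, text) pairs; a label equal to a recognizer
              name means that recognizer is expected to match the text
    Returns (false_positives, misses) as lists of (recognizer, label) pairs.
    """
    false_positives, misses = [], []
    for name, pattern in patterns.items():
        for label, text in samples:
            hit = re.search(pattern, text) is not None
            if hit and label != name:
                false_positives.append((name, label))
            elif not hit and label == name:
                misses.append((name, label))
    return false_positives, misses


# Hypothetical samples standing in for a customer's real documents.
samples = [
    ("bank", "BSB 062-000 Account 12345678"),
    ("invoice", "Invoice No. INV-04821 from the plumber"),
    ("school", "Excursion permission form"),
]

tight = {"bank": r"BSB\s+(\d{3}-\d{3})", "invoice": r"Invoice\s+No\.?\s+(INV-\d{4,8})\s"}
loose = {"bank": r"\d{3}"}  # too greedy: any three digits will do

print(audit(tight, samples))
print(audit(loose, samples))
```

A recognizer is only worth shipping when both lists come back empty for the whole sample set; the loose pattern fails because it also fires on the invoice number.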

Once installed, your Matrix loads provision.json on startup and watches the folder you pointed it at. When a scan lands, each page is OCR'd locally. OCR is used to read text, never to decide what kind of document the page is. Each recognizer is tried in order. The first match wins; that decides the filename and the destination folder. If nothing matches across the whole set, the page parks in UNKNOWN/. Same input, same output, every time, with no probability attached.
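
As a rough illustration of "first match wins, otherwise UNKNOWN/", here is a minimal Python sketch. It is not the shipped engine: the tuple layout, the route function, the Invoices folder, and the filename templates are assumptions made for the example, and the page text is assumed to have already come out of local OCR.

```python
import re
from pathlib import PurePosixPath

# Ordered list of (pattern, destination folder, filename template).
# Order matters: the first recognizer that matches decides the outcome.
# Folder names and templates here are illustrative assumptions.
RECOGNIZERS = [
    (r"Invoice\s+No\.?\s+(INV-\d{4,8})\s", "Invoices", "{0}.pdf"),
    (r"BSB\s+(\d{3}-\d{3})\s+Account\s+(\d{6,10})", "Banking", "statement-{1}.pdf"),
]


def route(page_text, scan_name, recognizers=RECOGNIZERS):
    """Return the destination path for a page, or park it under UNKNOWN/."""
    for pattern, folder, template in recognizers:
        found = re.search(pattern, page_text)
        if found:
            # Captured groups fill the filename template.
            return str(PurePosixPath(folder) / template.format(*found.groups()))
    # No recognizer matched: no guess is made, the page keeps its scan name.
    return str(PurePosixPath("UNKNOWN") / scan_name)


print(route("Invoice No. INV-04821 from the plumber", "scan001.pdf"))
print(route("BSB 062-000 Account 12345678", "scan002.pdf"))
print(route("Excursion permission form", "scan003.pdf"))
```

There is no score and no threshold anywhere in the function, which is why the same page text always lands in the same place.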

That is the whole pipeline. It runs on your computer. It does not call my servers to classify a page. The only network call the app makes is a licence check against license.chunkland.com, which sends a licence key and a machine ID, not a single byte of your paperwork.

What you don't have to take on trust

The reason this matters, even if you don't think of yourself as a "privacy" person, is that personal paperwork is the most personal stuff most of us have. A bank statement lists every place you ate dinner last month. A super statement has your TFN on it. A medical letter has a diagnosis. The lease on your house has your signature. The right question to ask of any filing tool isn't "is the vendor trustworthy?" It's "what could the vendor leak about me, even by accident?"

When the answer is structurally "nothing, because the vendor never received it in the first place," the conversation is short. Your scans live on your laptop. They go from your scanner to a folder on your disk to the right named folder on your disk. They don't traverse the internet. They don't appear on someone else's server. The samples you email me at setup are a bounded, one-time exception: a handful of documents, kept for thirty days while I tune your recognizers, then deleted. They are not used to train anything, because there is nothing to train: the runtime is deterministic.

If you ever want to verify any of this, the recognizer engine is a single file you can read. You don't have to take my word for it.

When The Matrix is the wrong tool

This is the section product sites usually skip. I'm going to put it in the article you're reading.

If you want fields extracted, not just filing. The Matrix files the document. It doesn't read your supplier invoice and tell you the GST, the line items, or what to put in your BAS. It doesn't reconcile against your accounting software. If you want a tool that fills out spreadsheets for you, that's a different category of product, and a field-extraction product is what you need. The Matrix gets the right PDF into the right folder with a useful filename, and stops there.

The reason that matters: the moment the product starts trying to extract structured data from arbitrary documents, the only way to do it well across thousands of formats is to put a model in the runtime path. That's the trade I won't make.

If your paperwork is a one-off shoebox where every document is unique. The Matrix is built around the assumption that you receive similar-shaped documents repeatedly: the same telco bills every month, the same bank statements every quarter, the same handful of suppliers month after month. That covers almost everyone's personal admin. It doesn't cover "I scanned 4,000 pages of my late father's correspondence and none of it looks alike." For a one-time pile of genuinely unique documents, a model-based classifier is the right tool for the job, even with the privacy trade-off, because there are no repeating patterns for hand-tuned recognizers to lock onto.

If you don't want to send any samples at all. Tuning recognizers against your real paperwork is the whole reason this approach works. If sending three to five sample documents is a non-starter, the generic installer still runs; it just won't know your specific paperwork on day one, and you'll spend longer training it page by page.

What The Matrix does do is the boring, deterministic part most people actually need: the bank statement ends up in Banking/, the tax statement ends up in Tax/2026/, the school form ends up in Kids/School/, and they all have filenames you can read. Every time. With no model in the way. If that's the layer of your paperwork you're trying to fix, you're in the right place.

Try it on your own folder this week

The Matrix is A$29 a month, one tier, one person, one computer, unlimited scans. 7-day free trial. Cancel any month.

Email three to five of your real documents to setup@chunkland.com. I'll read them, tune the recognizers around what you actually file, and send back an installer that already knows your paperwork. You'll know inside a week whether it fits your life.

A$29 a month. Decide on your own paperwork.

Email three to five of your everyday documents and I'll mail you back an installer pre-configured around them. 7-day free trial. Cancel any month from the billing page.
