A 92% accurate filing system is a 100% broken filing system. If 8 out of every 100 contracts get silently misfiled, you don't have a filing system — you have a search problem dressed in a confidence score. The numbers look great in the demo. The numbers stop being numbers when the document you can't find is a signed deed of release.
The Matrix files documents on the machine that scanned them, with no machine learning, no OCR, and no cloud round-trip. This is a deliberate choice. Below is the case for it — and the cases where it's the wrong choice.
The crowded field, and the line we drew through it
Document automation is a packed category. Dext, Hubdoc, AutoEntry, and the long tail of receipt-capture apps are excellent at parsing line items off a coffee receipt. Rossum, Sensible, and AWS Textract are the heavy lifters when you need to extract structured fields out of unstructured invoices at scale. They work. We use some of them ourselves for tasks where they're the right tool.
What every one of those products has in common is that they read your documents to do their job. They have to. The whole proposition is "give us your scan, our model will tell you what it is." That requires uploading the document, running OCR, running a classifier, and storing the inference output somewhere. Pricing is per-page or per-API-call because every page costs them an inference. Accuracy is reported as a percentage because the underlying classification is probabilistic.
Probabilistic is fine when the cost of being wrong is low. A misclassified Uber receipt is a five-second fix. Probabilistic is not fine when the cost of being wrong is a privilege waiver, a rejected discovery production, a HIPAA breach notification, or a client wondering why their statutory declaration ended up in a folder labelled "delivery dockets."
Deterministic, not probabilistic
Determinism is the angle the cloud-AI sorters cannot match without rebuilding. Our pipeline has exactly two outcomes per page: a cryptographically valid stamp that names the document and the page number, or no valid stamp at all. There is no third bucket called "the model is 87% confident this is an invoice." A page either has a verifying signature or it's flagged as UNKNOWN and routed to human review. A $10 million contract cannot be misclassified as a delivery docket because we don't classify by content — we read a signed pointer that the document carries with it.
That sounds simple. The trick is making it simple. To make classification deterministic on the read side, you need control of the print side. Once you have the print side, you stop guessing.
How it actually works
The Matrix installs a virtual printer on your machine. When you print a document — from Word, Xero, your case management system, anything — each page passes through The Matrix on its way to paper, and a small QR code is stamped in the top-right corner. The QR carrier is a standard QR. The payload inside the QR is ours.
That payload is 36 bytes of wire format, defined in src/codec.py:
- 1-byte magic + 1-byte version,
- 4-byte license ID derived from your license key,
- 8-byte hash of the document name,
- 2-byte page number (1-based),
- 4-byte minute-resolution timestamp,
- followed by 16 bytes of HMAC-SHA256(license_key, scrambled_payload).
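The field widths above can be checked with a few lines of Python. This is a minimal sketch of the 20-byte body only, not the contents of src/codec.py: the magic and version values, the big-endian byte order, and the names pack_body and unpack_body are assumptions made for illustration.

```python
import struct

# Hypothetical values: the article does not state the real magic or version.
MAGIC = 0x4D
VERSION = 1

# Assumed big-endian with no padding:
# magic(1) version(1) license_id(4) doc_hash(8) page(2) minutes(4)
BODY = struct.Struct(">BBI8sHI")
TAG_LEN = 16  # bytes of HMAC-SHA256 appended after the body

assert BODY.size == 20
assert BODY.size + TAG_LEN == 36


def pack_body(license_id: int, doc_hash: bytes, page: int, minutes: int) -> bytes:
    """Pack the 20 body bytes, before any scrambling or signing."""
    if len(doc_hash) != 8:
        raise ValueError("doc_hash must be exactly 8 bytes")
    if page < 1:
        raise ValueError("page numbers are 1-based")
    return BODY.pack(MAGIC, VERSION, license_id, doc_hash, page, minutes)


def unpack_body(body: bytes) -> dict:
    """Split 20 body bytes back into named fields."""
    magic, version, license_id, doc_hash, page, minutes = BODY.unpack(body)
    return {
        "magic": magic,
        "version": version,
        "license_id": license_id,
        "doc_hash": doc_hash,
        "page": page,
        "minutes": minutes,
    }
```

A 2-byte page field caps a single document at 65,535 pages, and a 4-byte minute counter is what makes the timestamp minute-resolution rather than second-resolution.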
The first 20 bytes are XOR-scrambled with a SHA-256 keystream derived from the license key, so the payload is meaningless to anyone without your key. The HMAC is what actually prevents forgery. On the read side, in decode() in the same file, the signature is verified in constant time using hmac.compare_digest. A wrong key, a flipped bit, or a fabricated stamp from a competitor's printer all fail verification with a single CodecError. They never silently pass.
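The scramble-then-sign scheme can be sketched with the standard library alone. This is an illustration under stated assumptions, not the real codec: the function names seal and unseal are hypothetical, the keystream derivation (one SHA-256 over a prefixed key) is a guess, and keeping the first 16 bytes of the HMAC digest is assumed from the 36-byte total.

```python
import hashlib
import hmac

BODY_LEN = 20
TAG_LEN = 16


class CodecError(Exception):
    """Raised when a payload is malformed or fails verification."""


def _keystream(license_key: bytes, n: int) -> bytes:
    # Assumption: the real derivation may differ; any key-dependent stream works here.
    return hashlib.sha256(b"scramble:" + license_key).digest()[:n]


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(license_key: bytes, body: bytes) -> bytes:
    """Scramble the 20-byte body, then append a truncated HMAC over the scrambled bytes."""
    if len(body) != BODY_LEN:
        raise ValueError("body must be exactly 20 bytes")
    scrambled = _xor(body, _keystream(license_key, BODY_LEN))
    tag = hmac.new(license_key, scrambled, hashlib.sha256).digest()[:TAG_LEN]
    return scrambled + tag


def unseal(license_key: bytes, payload: bytes) -> bytes:
    """Verify the tag first; only unscramble if verification succeeds."""
    if len(payload) != BODY_LEN + TAG_LEN:
        raise CodecError("wrong payload length")
    scrambled, tag = payload[:BODY_LEN], payload[BODY_LEN:]
    expected = hmac.new(license_key, scrambled, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        raise CodecError("signature mismatch")
    return _xor(scrambled, _keystream(license_key, BODY_LEN))
```

Verifying before unscrambling means a forged or corrupted payload is rejected without any of its contents being interpreted, which is what keeps the outcome binary.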
On scan, the reader (src/reader.py) renders each page at 200 DPI, looks in the top-right hint region first, then the whole page if needed, and runs OpenCV's QR detector. If it finds a payload that decodes and verifies, we know — not infer — the document ID, the page number, and the license that produced it. The grouping function group_into_sections() walks pages in order, starts a new section whenever the document hash changes or a page-1 is seen, and writes split PDFs accordingly. Pages without a verifying payload don't get classified by content. They get attached to the current section as UNKNOWN for a human to look at.
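The grouping rule is small enough to sketch in full. What follows is an illustrative reconstruction from the description above, not the code in src/reader.py: the Stamp and Section types and the name group_pages are hypothetical, and the handling of unverified pages at the very start of a scan (they get their own section with no document hash) is an assumption the article does not cover.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Stamp:
    """A verified stamp: only the fields the grouping rule needs."""
    doc_hash: bytes
    page: int


@dataclass
class Section:
    doc_hash: Optional[bytes]  # None when a scan begins with unverified pages
    pages: List[Tuple[int, Union[int, str]]] = field(default_factory=list)


def group_pages(stamps: List[Optional[Stamp]]) -> List[Section]:
    """Walk scanned pages in order; None means no verifying stamp was found."""
    sections: List[Section] = []
    current: Optional[Section] = None
    for scan_index, stamp in enumerate(stamps):
        if stamp is None:
            if current is None:  # assumption: leading unknowns get their own section
                current = Section(doc_hash=None)
                sections.append(current)
            current.pages.append((scan_index, "UNKNOWN"))
            continue
        # New section when the document hash changes or a page 1 is seen.
        if current is None or current.doc_hash != stamp.doc_hash or stamp.page == 1:
            current = Section(doc_hash=stamp.doc_hash)
            sections.append(current)
        current.pages.append((scan_index, stamp.page))
    return sections
```

Note that the page-1 rule is what separates two consecutive printouts of the same document: the hash does not change, but the second copy's first page still opens a new section.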
That is the whole pipeline. It runs on the user's machine. It does not call our servers to classify a page. The only network call the desktop app makes is a license validation POST to license.chunkland.com. The schema for that request (ValidateRequest in services/license/main.py) accepts exactly four fields: license key, machine ID, machine label, and app version. Document contents, file names, and document IDs are not in that schema, so the endpoint has no field to receive them even if a future bug tried to send them.
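A closed four-field schema looks roughly like this. This is a standard-library sketch, not the ValidateRequest class in services/license/main.py: the snake_case field names and the parse_validate_request helper are assumptions, and whether the real endpoint rejects or merely ignores extra keys is not stated here. The sketch takes the stricter option and rejects them.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ValidateRequestSketch:
    # Hypothetical identifiers for the four fields the article names.
    license_key: str
    machine_id: str
    machine_label: str
    app_version: str


ALLOWED = frozenset(f.name for f in fields(ValidateRequestSketch))


def parse_validate_request(body: dict) -> ValidateRequestSketch:
    """Accept exactly the four known fields; refuse anything else."""
    extra = set(body) - ALLOWED
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    missing = ALLOWED - set(body)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return ValidateRequestSketch(**body)
```

With a schema like this, a request carrying a document name fails at the parsing step instead of reaching any storage code.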
The compliance picture
Most of what compliance teams want to know about a vendor is "what could you leak about us?" If the answer is structurally "nothing, because we never received it," the rest of the conversation gets shorter. Here is the picture, framework by framework, with the provisions compliance reviewers actually ask about.
Privacy Act 1988 (Australia) and the APPs
The Australian Privacy Principles govern how regulated entities collect, use, disclose, secure, and destroy personal information. APP 8 covers cross-border disclosure: you generally remain accountable for personal information you let leave Australia. APP 11 covers security of personal information. The Matrix doesn't trigger APP 8 because no cross-border disclosure occurs; no document content leaves the workstation. APP 11 reduces to "secure your own machine," which is a control your IT team already owns. ChunkLand never becomes a recipient of your clients' personal information, so we never become a link in your APP 8 chain.
GDPR (European Union)
Two articles do most of the work in vendor reviews. Article 25 — data protection by design and by default — asks whether the system is built so that, in the default configuration, personal data is handled in the most privacy-preserving way available. A pipeline that processes documents locally and never sends them anywhere is the strongest possible answer to that question. Article 32 — security of processing — asks for technical measures appropriate to the risk. We replace "appropriate measures for processing your data" with "we don't process your data," which collapses most of the threat model into the workstation itself. You remain the controller. We are not a processor for the document content because the content never reaches us.
HIPAA (United States)
HIPAA's Privacy and Security Rules apply to covered entities and their business associates. A vendor that touches Protected Health Information signs a Business Associate Agreement. We do not offer a BAA and we don't need to, because The Matrix does not transmit, store, or process PHI on our infrastructure. PHI lives on your workstation, under your existing HIPAA Security Rule controls. There is no business-associate relationship to formalise because there is no business-associate path.
Attorney–client privilege
Privilege is fragile. Courts have repeatedly examined whether transmitting privileged material through a third-party cloud constitutes voluntary disclosure to that third party, and the answer is jurisdiction-specific and uncomfortable. The cleanest defence is the absence of a third party. The Matrix keeps privileged matter inside the firm's own machines, and the HMAC stamp on each page provides a forensic chain of custody if the authenticity of a produced document is ever challenged. The signature isn't legal advice — it's evidence.
When you should not use The Matrix
This is the section vendor websites usually skip. We're going to put it in the article you're reading.
If you don't print, you can't stamp. The whole approach starts at print time. If your workflow is "supplier emails me a PDF, I save it to a folder," nothing about that document has a Matrix stamp on it. The Matrix can still file pages you've previously printed through it, but it has nothing to verify on a born-digital file from a third party. Receipt-capture apps like Dext or AutoEntry are doing real work there; we are not.
If your problem is a legacy archive of already-scanned PDFs, we can't help. Those pages were never stamped. Re-printing a 200,000-page archive just to stamp it is not the right answer. For one-time digitisation of an existing archive you need OCR, classification, and ideally a human review queue. Rossum, Textract, and similar are the right tool. Use them, get your archive into a known state, then use The Matrix going forward for everything that re-enters paper from this point onward.
If you need data extracted from invoices, not just filing, we're the wrong tool. The Matrix doesn't read invoice line items. It doesn't tell you the GST. It doesn't reconcile against your ledger. If you want fields, use a fields product; we don't compete in that lane.
What we do is the boring, deterministic, audit-defensible part: the right document ends up in the right folder, in the right page order, every time, with a signature you can verify years later. If that's the layer of your stack you're trying to fix, you're in the right place.
Try it on a folder for two weeks
The Matrix is $29/month for one seat, $99 for five, $299 for twenty. There's a 14-day free trial and we don't ask for a card to start it. Install once, point the app at a watched folder, print something, scan it back, and you'll know in ten minutes whether the model fits your work.
If you want the longer security write-up to send to your reviewer, the Security & Compliance page maps each claim above to the exact lines of code in the public repo. If you want to read the codec spec yourself, it's in src/codec.py — the docstring is the spec.
14 days. No credit card. Decide on real documents.
Stamp a real day's worth of work, scan it back, and watch the folder sort itself. Cancel any time from the billing portal.