RESEARCH
Document Classification Pipeline - 94% Accuracy on Imbalanced Data
30 Nov 2025
Executive Summary
We investigated a practical way to classify high-volume business documents under severe class imbalance (a long tail of rare types) without paying LLM-level costs on every item.
The resulting pipeline uses a three-stage routing approach: rules for the obvious cases, a small neural classifier for most of the remaining volume, and an LLM only when ambiguity remains. The system reaches 94.2% overall accuracy and raises rare-class F1 to 0.76, from below 0.10 for the fine-tuned BERT baseline.
Abstract
Document classification is a foundational step for downstream automation (extraction, routing, compliance checks). In real operations, distributions are skewed: a handful of document types dominate volume, while rare categories carry disproportionate business risk.
We present a staged architecture designed to (1) keep throughput high and costs low, (2) make rare classes tractable through targeted augmentation and evaluation, and (3) preserve a safe fallback for ambiguous items.
Problem Statement
We need to classify incoming documents (invoices, receipts, statements, etc.) across 40+ categories under these constraints:
- Long-tail imbalance: 15 types account for <2% of volume.
- Cost and latency budgets: sub-second per document, low marginal cost.
- Operational safety: unclear cases must be routed for higher-precision handling.
Dataset
- 127,000 labeled documents across 43 categories
- Class distribution: 3 types account for 68% of volume; 15 types account for <0.5% each
- Languages: English (71%), German (24%), Other (5%)
Baselines
- Fine-tuned BERT: 81% accuracy, rare-class F1 < 0.10
- GPT-4 zero-shot: 87% accuracy, $0.08/document, ~2.4s latency
Approach
Stage 1: Rules + Layout Signals (Fast Path)
Regular expressions, layout-derived features, and vendor-specific header matches resolve the most common categories.
Stage 2: Lightweight CNN Classifier
A small convolutional model (3.2M parameters) classifies most of the remaining volume.
Stage 3: LLM Fallback for Ambiguity
Only cases that remain ambiguous after the first two stages are sent to an LLM, which reserves the expensive path for the documents that need it.
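The three stages can be sketched as a single routing function. This is a minimal illustration, not the licensed framework: the source does not state how ambiguity is detected, so the sketch assumes a CNN confidence threshold (the 0.90 value is hypothetical), and the Decision type, the classify function, and the toy stage functions are all invented names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical threshold; the source does not state the value it uses.
CNN_CONFIDENCE_THRESHOLD = 0.90


@dataclass(frozen=True)
class Decision:
    label: str
    stage: str   # "rules", "cnn", or "llm"
    reason: str  # reason code that can be shown to operators


def classify(
    text: str,
    rule_stage: Callable[[str], Optional[str]],
    cnn_stage: Callable[[str], Tuple[str, float]],
    llm_stage: Callable[[str], str],
    threshold: float = CNN_CONFIDENCE_THRESHOLD,
) -> Decision:
    """Route one document through rules, then the CNN, then the LLM."""
    # Stage 1: a rule either returns a label or declines with None.
    label = rule_stage(text)
    if label is not None:
        return Decision(label, "rules", "rule match")

    # Stage 2: accept the CNN prediction only when it is confident enough.
    label, confidence = cnn_stage(text)
    if confidence >= threshold:
        return Decision(label, "cnn", f"CNN confidence {confidence:.2f}")

    # Stage 3: everything still ambiguous goes to the LLM.
    return Decision(llm_stage(text), "llm", "LLM decision")


# Toy stand-ins for the three stages, for illustration only.
def toy_rules(text: str) -> Optional[str]:
    return "invoice" if "invoice no" in text.lower() else None


def toy_cnn(text: str) -> Tuple[str, float]:
    if "total paid" in text.lower():
        return ("receipt", 0.97)
    return ("statement", 0.55)


def toy_llm(text: str) -> str:
    return "statement"
```

Passing the stages in as callables keeps the routing logic independent of any particular model, so each stage can be swapped or versioned on its own.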
Rare-Class Strategy
- Synthetic minority oversampling for rare types
- Adversarial examples to reduce brittle failures (e.g., template drift)
- Evaluation emphasis on per-class metrics, not only overall accuracy
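The source does not say whether oversampling happens in feature space or through generated documents. As one possible reading, the sketch below shows SMOTE-style interpolation on feature vectors: each synthetic point lies on the segment between a rare-class sample and one of its nearest rare-class neighbours. The function name smote_like and its parameters are assumptions made for this illustration.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def smote_like(
    minority: Sequence[Vector], n_new: int, k: int = 3, seed: int = 0
) -> List[Vector]:
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[Vector] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Four rare-class feature vectors (toy data) expanded with ten synthetic ones.
rare = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = rare + smote_like(rare, n_new=10)
```

Synthetic samples belong in the training split only; evaluating on them would inflate rare-class scores.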
Evaluation Design
We measure:
- Overall accuracy + per-class F1
- Rare-class F1 (aggregate across the long tail)
- Throughput (latency and queue behavior)
- Cost per document under routing proportions
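The first two metrics can be computed as follows. This is a sketch under one assumption: the source does not say how rare-class F1 is aggregated, so the example uses an unweighted mean over the rare labels (macro averaging). The function names and the toy labels are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 per label, computed from true/false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def rare_class_f1(scores: Dict[str, float], rare_labels: Iterable[str]) -> float:
    """Unweighted mean F1 over the rare labels (assumed aggregation)."""
    rare = [scores.get(label, 0.0) for label in rare_labels]
    return sum(rare) / len(rare) if rare else 0.0


# Toy example: a classifier that always predicts the dominant class.
y_true = ["invoice"] * 8 + ["customs_form"] * 2
y_pred = ["invoice"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_f1(y_true, y_pred)
rare_f1 = rare_class_f1(scores, ["customs_form"])
```

In the toy example accuracy is 0.8 while rare-class F1 is 0.0, which is why overall accuracy alone is not a sufficient metric on a long-tailed distribution.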
Results
Performance
- Overall accuracy: 94.2%
- Rare-class F1: 0.76 (vs. <0.10 for the fine-tuned BERT baseline)
- Average processing time: 180 ms (vs. ~2.4 s for GPT-4 zero-shot)
- Cost per document: $0.0009 (vs. $0.08 for GPT-4 zero-shot)
Cost-Quality Tradeoff
- 62% of documents processed by rules: <1 ms, $0.0001 per document
- 33% of documents processed by CNN: 140 ms, $0.0003 per document
- 5% of documents processed by LLM: 2.1 s, $0.06 per document
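A blended per-document figure follows from weighting each stage by its share of volume. The sketch below assumes each listed figure is the full latency and cost for a document resolved at that stage (not an increment on top of earlier stages) and takes 1 ms as an upper bound for the rules stage. Under that reading the listed figures blend to about 152 ms and about $0.0032 per document, with the LLM share dominating both. That does not reproduce the headline 180 ms and $0.0009, so those figures presumably rest on accounting not shown here.

```python
from typing import Dict, Tuple

# (share of volume, latency in seconds, cost in USD) per stage, as listed above.
# Rules latency is reported as "<1ms"; 0.001 s is used here as an upper bound.
STAGES: Dict[str, Tuple[float, float, float]] = {
    "rules": (0.62, 0.001, 0.0001),
    "cnn": (0.33, 0.140, 0.0003),
    "llm": (0.05, 2.100, 0.06),
}


def blended(stages: Dict[str, Tuple[float, float, float]]) -> Tuple[float, float]:
    """Return (expected latency in seconds, expected cost in USD) per document."""
    total_share = sum(share for share, _, _ in stages.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("routing shares must sum to 1")
    latency = sum(share * seconds for share, seconds, _ in stages.values())
    cost = sum(share * usd for share, _, usd in stages.values())
    return latency, cost


blended_latency_s, blended_cost_usd = blended(STAGES)
```

Because the LLM stage is roughly 200 times more expensive per document than the CNN stage, small changes in the share routed to it move the blended cost far more than any other parameter.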
Error Analysis
- 72% of errors occurred on poor-quality scans (<150 DPI)
- Adding a preprocessing quality gate reduced the error rate by 19%
- Template drift was mitigated with adversarial augmentation and periodic rule refresh
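A preprocessing quality gate of the kind described above can be as simple as a resolution check. The 150 DPI threshold comes from the error analysis; what the gate does with a rejected scan is not stated in the source, so sending it to a review queue is an assumption, as are the names MIN_DPI and quality_gate.

```python
from typing import Optional

MIN_DPI = 150  # threshold taken from the error analysis


def quality_gate(dpi: Optional[float]) -> str:
    """Return "classify" for usable scans and "review" otherwise.

    Scans with unknown resolution are treated as low quality (assumed policy).
    """
    if dpi is None or dpi < MIN_DPI:
        return "review"
    return "classify"


# Toy batch: scan identifier mapped to reported resolution.
batch = {"scan-a": 300, "scan-b": 120, "scan-c": None}
routes = {name: quality_gate(dpi) for name, dpi in batch.items()}
```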
Deployment Notes
- Treat routing thresholds as policy: version them and monitor drift.
- Maintain a “golden set” of rare-class examples and review them weekly.
- Give operators a reason code (rule match, CNN confidence, LLM decision) with each prediction so they can judge how far to trust it.
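One way to treat thresholds as versioned policy is to keep them in an immutable record and compare live routing shares against the shares the policy was tuned for. This is a sketch: the source does not describe its monitoring, so watching routing shares is an assumed drift signal. The expected shares are the 62/33/5 split reported in the results; the version string, threshold, and tolerance are hypothetical values.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class RoutingPolicy:
    version: str                      # bump on every threshold change
    cnn_confidence_threshold: float   # hypothetical value
    expected_share: Dict[str, float]  # share of volume per stage
    tolerance: float                  # allowed absolute deviation per stage


POLICY = RoutingPolicy(
    version="2025-11-30.1",
    cnn_confidence_threshold=0.90,
    expected_share={"rules": 0.62, "cnn": 0.33, "llm": 0.05},
    tolerance=0.03,
)


def routing_drift(policy: RoutingPolicy, stages_seen: Iterable[str]) -> Dict[str, float]:
    """Return the observed share for every stage outside the policy tolerance."""
    counts = Counter(stages_seen)
    total = sum(counts.values())
    if total == 0:
        return {}
    drifted = {}
    for stage, expected in policy.expected_share.items():
        observed = counts[stage] / total
        if abs(observed - expected) > policy.tolerance:
            drifted[stage] = observed
    return drifted


# A window in which the LLM share has tripled, for example after template drift.
window = ["rules"] * 50 + ["cnn"] * 35 + ["llm"] * 15
alerts = routing_drift(POLICY, window)
```

Logging the policy version next to each reason code makes it possible to tell whether a shift in routing came from the data or from a threshold change.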
Commercial Application
This work enables:
- low-cost ingestion for invoice and accounting pipelines
- faster exception handling (ambiguous cases surfaced with context)
- measurable governance via per-class metrics and routing audit logs
Licensable Outcomes
- Three-stage classification framework (Python, MIT): modular pipeline with routing logic
- Synthetic data generator: realistic document variation engine for rare classes
- Model optimization toolkit: quantization + pruning scripts reducing inference cost by 64% with <1% accuracy loss
- Evaluation suite: 5,200-document holdout set with cost + per-class metrics
Limitations and Next Work
- Rare classes remain sensitive to unseen vendor templates; continual evaluation is required.
- Future work: active-learning loop using operator corrections to refresh the rare-class set.
Evaluation Date: November 2025
Status: Core framework licensed to 2 fintech companies; optimization toolkit open-sourced