
Document Classification Pipeline - 94% Accuracy on Imbalanced Data

30 Nov 2025

Executive Summary

We researched a practical way to classify high-volume business documents under severe class imbalance (a long tail of rare types) without paying LLM-level costs on every item.

The resulting pipeline routes each document through three stages: rules for the obvious cases, a small neural classifier for most of the remaining volume, and an LLM only when ambiguity remains. The system reaches 94.2% overall accuracy and raises rare-class F1 from below 0.10 (fine-tuned BERT baseline) to 0.76.

Abstract

Document classification is a foundational step for downstream automation (extraction, routing, compliance checks). In real operations, distributions are skewed: a handful of document types dominate volume, while rare categories carry disproportionate business risk.

We present a staged architecture designed to (1) keep throughput high and costs low, (2) make rare classes tractable through targeted augmentation and evaluation, and (3) preserve a safe fallback for ambiguous items.

Problem Statement

We need to classify incoming documents (invoices, receipts, statements, etc.) across 40+ categories under these constraints:

Long-tail imbalance: 15 types account for <2% of volume.
Cost and latency budgets: sub-second per document, low marginal cost.
Operational safety: unclear cases must be routed for higher-precision handling.

Dataset

127,000 labeled documents across 43 categories
Class distribution: 3 types account for 68% of volume; 15 types are each below 0.5%
Languages: English (71%), German (24%), Other (5%)

Baselines

Fine-tuned BERT: 81% accuracy, rare-class F1 < 0.10
GPT-4 zero-shot: 87% accuracy, $0.08/document, ~2.4s latency

Approach

Stage 1: Rules + Layout Signals (Fast Path)

Regex, layout-derived features, and vendor-specific headers handle the most common categories.

Stage 2: Lightweight CNN Classifier

A small model (3.2M params) classifies the remaining volume efficiently.

Stage 3: LLM Fallback for Ambiguity

Only ambiguous cases are sent to an LLM to preserve accuracy where it matters.
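
The three stages above can be sketched as a single routing function. This is a minimal illustration, not the licensed framework: the rule patterns, the 0.85 confidence threshold, the Decision type, and the stub_cnn / stub_llm callables are all hypothetical stand-ins, since the source does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical Stage 1 rules: pattern -> label. Real rules would also use
# layout-derived features and vendor-specific headers.
RULES = [
    (re.compile(r"\binvoice\s+(?:no|number)\b", re.IGNORECASE), "invoice"),
    (re.compile(r"\baccount\s+statement\b", re.IGNORECASE), "statement"),
]

CNN_THRESHOLD = 0.85  # assumed value; the source does not state a threshold


@dataclass
class Decision:
    label: str
    stage: str  # "rules", "cnn", or "llm"
    confidence: Optional[float] = None


def classify(
    text: str,
    cnn_predict: Callable[[str], Tuple[str, float]],
    llm_predict: Callable[[str], str],
    threshold: float = CNN_THRESHOLD,
) -> Decision:
    # Stage 1: cheap deterministic rules handle the obvious cases.
    for pattern, label in RULES:
        if pattern.search(text):
            return Decision(label, "rules")
    # Stage 2: the small classifier decides when it is confident enough.
    label, confidence = cnn_predict(text)
    if confidence >= threshold:
        return Decision(label, "cnn", confidence)
    # Stage 3: anything still ambiguous goes to the LLM fallback.
    return Decision(llm_predict(text), "llm", confidence)


# Stub models so the sketch runs without any ML dependencies.
def stub_cnn(text: str) -> Tuple[str, float]:
    return ("receipt", 0.95) if "total paid" in text.lower() else ("receipt", 0.40)


def stub_llm(text: str) -> str:
    return "other"


if __name__ == "__main__":
    for doc in ["Invoice No. 4471", "Total paid: 12.50 EUR", "Re: your shipment"]:
        print(doc, "->", classify(doc, stub_cnn, stub_llm))
```

Passing the models in as callables keeps the routing logic testable on its own, and recording the stage alongside the label gives downstream consumers a record of which path produced each decision.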

Rare-Class Strategy

Synthetic minority oversampling for rare types
Adversarial examples to reduce brittle failures (e.g., template drift)
Evaluation emphasis on per-class metrics, not only overall accuracy
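
As a rough illustration of the oversampling idea, the sketch below brings every class up to a minimum count by duplicating randomly chosen examples. This is plain random oversampling, swapped in for simplicity: the source describes synthetic generation of new document variants, which this code does not attempt. The function name and the min_per_class parameter are assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import List, Tuple

Sample = Tuple[str, str]  # (document text, label)


def oversample(samples: List[Sample], min_per_class: int, seed: int = 0) -> List[Sample]:
    """Duplicate random examples until each class has at least min_per_class items."""
    rng = random.Random(seed)  # seeded so the resampling is reproducible
    by_label = defaultdict(list)
    for doc, label in samples:
        by_label[label].append(doc)

    out = list(samples)
    for label, docs in by_label.items():
        deficit = min_per_class - len(docs)
        if deficit > 0:
            # Sample with replacement from the existing examples of this class.
            out.extend((rng.choice(docs), label) for _ in range(deficit))
    rng.shuffle(out)
    return out


if __name__ == "__main__":
    data = [("inv text", "invoice")] * 5 + [("customs text", "customs_form")]
    balanced = oversample(data, min_per_class=4)
    print(Counter(label for _, label in balanced))
```

Resampling of this kind should be applied to the training split only; oversampling before the train/test split leaks duplicated documents into the evaluation data.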

Evaluation Design

We measure:

Overall accuracy + per-class F1
Rare-class F1 (aggregate across the long tail)
Throughput (latency and queue behavior)
Cost per document under routing proportions
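
The first two metrics can be computed as follows. This is a dependency-free sketch; it assumes the rare-class aggregate is a macro average (an unweighted mean of per-class F1 over the rare labels), which the source does not confirm. The label names in the example are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 per label, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def rare_class_f1(scores: Dict[str, float], rare_labels: Iterable[str]) -> float:
    """Macro average over the rare labels (assumed aggregation method)."""
    rare = list(rare_labels)
    # A rare label that never appears counts as 0.0 rather than being skipped.
    return sum(scores.get(label, 0.0) for label in rare) / len(rare)


if __name__ == "__main__":
    y_true = ["invoice", "invoice", "invoice", "customs_form", "receipt"]
    y_pred = ["invoice", "invoice", "receipt", "customs_form", "receipt"]
    scores = per_class_f1(y_true, y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy, scores, rare_class_f1(scores, ["customs_form"]))
```

Reporting the rare-class aggregate separately matters because overall accuracy is dominated by the three high-volume types and can stay high even when every rare class is misclassified.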

Results

Performance

Overall accuracy: 94.2%
Rare-class F1: 0.76 (vs. <0.10 for the fine-tuned BERT baseline)
Average processing time: 180ms (vs. ~2.4s for GPT-4 zero-shot)
Cost per document: $0.0009 (vs. $0.08 for GPT-4 zero-shot)

Cost-Quality Tradeoff

62% processed by rules: <1ms, $0.0001
33% processed by CNN: 140ms, $0.0003
5% processed by LLM: 2.1s, $0.06
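
To see how the routing proportions drive the averages, the sketch below takes a share-weighted mean of the three rows above. It assumes the listed latency and cost are per-document figures for each stage and uses 1ms as an upper bound for the rules path. Under those assumptions the blend comes to roughly $0.0032 and 152ms per document, which differs from the reported $0.0009 and 180ms averages, so the reported figures were presumably derived with a different accounting that the source does not describe.

```python
# Routing shares and per-stage figures as listed above.
# latency_ms for rules is an upper bound ("<1ms" in the source).
ROUTES = {
    "rules": {"share": 0.62, "latency_ms": 1.0, "cost_usd": 0.0001},
    "cnn": {"share": 0.33, "latency_ms": 140.0, "cost_usd": 0.0003},
    "llm": {"share": 0.05, "latency_ms": 2100.0, "cost_usd": 0.06},
}


def blended(routes: dict, key: str) -> float:
    """Share-weighted mean of one per-stage quantity."""
    return sum(stage["share"] * stage[key] for stage in routes.values())


def cost_share(routes: dict, name: str) -> float:
    """Fraction of the blended cost contributed by one stage."""
    stage = routes[name]
    return stage["share"] * stage["cost_usd"] / blended(routes, "cost_usd")


if __name__ == "__main__":
    print(f"blended cost:    ${blended(ROUTES, 'cost_usd'):.6f}")
    print(f"blended latency: {blended(ROUTES, 'latency_ms'):.2f} ms")
    print(f"LLM share of cost: {cost_share(ROUTES, 'llm'):.1%}")
```

In this weighting the LLM path handles 5% of documents but accounts for about 95% of the cost, which is why the Stage 2 confidence threshold is the main cost lever.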

Error Analysis

72% of errors occurred on low-quality scans (<150 DPI)
Adding a preprocessing quality gate reduced the error rate by 19%
Template drift was mitigated with adversarial augmentation and periodic rule refresh
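
A preprocessing quality gate of the kind described above could look like the sketch below. Only the 150 DPI cut-off comes from the error analysis; estimating DPI from pixel width and an assumed A4 page width, and sending failing scans to manual review, are illustrative assumptions rather than the pipeline's documented behavior.

```python
from typing import Tuple

MIN_DPI = 150.0           # threshold taken from the error analysis above
A4_WIDTH_INCHES = 8.27    # assumed physical page width (210 mm)


def effective_dpi(width_px: int, page_width_in: float = A4_WIDTH_INCHES) -> float:
    """Estimate scan resolution from pixel width and physical page width."""
    return width_px / page_width_in


def quality_gate(
    width_px: int,
    page_width_in: float = A4_WIDTH_INCHES,
    min_dpi: float = MIN_DPI,
) -> Tuple[str, float]:
    """Return the next step for a scan and its estimated DPI."""
    dpi = effective_dpi(width_px, page_width_in)
    # Hypothetical policy: low-resolution scans skip classification entirely.
    action = "classify" if dpi >= min_dpi else "manual_review"
    return action, dpi


if __name__ == "__main__":
    for width in (2480, 827):  # roughly 300 DPI and 100 DPI on A4
        print(width, quality_gate(width))
```

Where a scan carries reliable resolution metadata, reading it directly is preferable to estimating from pixel dimensions, since the page size is not always known.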

Deployment Notes

Treat routing thresholds as policy: version them and monitor drift.
Maintain a “golden set” of rare-class examples and review them weekly.
Give operators a reason code (rule match, CNN confidence, LLM decision) with each decision so they can see why it was made.
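
One way to make the first and third notes concrete is to attach a policy version and a reason code to every decision that is logged. The sketch below is hypothetical: the field names, the reason-code strings, and the version format are assumptions, not part of the published framework.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

REASON_CODES = ("RULE_MATCH", "CNN_CONFIDENCE", "LLM_DECISION")


@dataclass(frozen=True)
class RoutingPolicy:
    version: str          # bumped whenever a threshold changes
    cnn_threshold: float  # assumed single threshold for the CNN stage


@dataclass(frozen=True)
class AuditRecord:
    doc_id: str
    label: str
    reason_code: str
    policy_version: str
    confidence: Optional[float] = None

    def __post_init__(self) -> None:
        # Reject unknown codes early so the audit log stays queryable.
        if self.reason_code not in REASON_CODES:
            raise ValueError(f"unknown reason code: {self.reason_code}")


def to_log_line(record: AuditRecord) -> str:
    """Serialize one decision as a JSON line for the routing audit log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    policy = RoutingPolicy(version="2025-11-01", cnn_threshold=0.85)
    record = AuditRecord("doc-001", "invoice", "CNN_CONFIDENCE", policy.version, 0.93)
    print(to_log_line(record))
```

Storing the policy version on each record makes it possible to separate model drift from the effect of a threshold change when reviewing per-class metrics over time.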

Commercial Application

This work enables:

low-cost ingestion for invoice and accounting pipelines
faster exception handling (ambiguous cases surfaced with context)
measurable governance via per-class metrics and routing audit logs

Licensable Outcomes

Three-stage classification framework (Python, MIT): modular pipeline with routing logic
Synthetic data generator: realistic document variation engine for rare classes
Model optimization toolkit: quantization + pruning scripts reducing inference cost by 64% with <1% accuracy loss
Evaluation suite: 5,200-document holdout set with cost + per-class metrics

Limitations and Next Work

Rare classes remain sensitive to unseen vendor templates; continual evaluation is required.
Future work: active-learning loop using operator corrections to refresh the rare-class set.

Evaluation Date: November 2025
Status: Core framework licensed to 2 fintech companies; optimization toolkit open-sourced
