LoopSmart

RESEARCH

Document Classification Pipeline - 94% Accuracy on Imbalanced Data

30 Nov 2025

Executive Summary

We researched a practical way to classify high-volume business documents under severe class imbalance (a long tail of rare types) without paying LLM-level costs on every item.

The resulting pipeline uses a three-stage routing approach: rules for the obvious cases, a small neural classifier for most remaining volume, and an LLM only when ambiguity remains. The system reaches 94.2% overall accuracy while raising rare-class F1 from below 0.10 to 0.76.

Abstract

Document classification is a foundational step for downstream automation (extraction, routing, compliance checks). In real operations, distributions are skewed: a handful of document types dominate volume, while rare categories carry disproportionate business risk.

We present a staged architecture designed to (1) keep throughput high and costs low, (2) make rare classes tractable through targeted augmentation and evaluation, and (3) preserve a safe fallback for ambiguous items.

Problem Statement

We need to classify incoming documents (invoices, receipts, statements, etc.) across 40+ categories under these constraints:

  • Long-tail imbalance: 15 types account for <2% of volume.
  • Cost and latency budgets: sub-second per document, low marginal cost.
  • Operational safety: unclear cases must be routed for higher-precision handling.

Dataset

  • 127,000 labeled documents across 43 categories
  • Class distribution: 3 types = 68% of volume, 15 types < 0.5% each
  • Languages: English (71%), German (24%), Other (5%)

Baselines

  • Fine-tuned BERT: 81% accuracy, rare-class F1 < 0.10
  • GPT-4 zero-shot: 87% accuracy, $0.08/document, ~2.4s latency

Approach

Stage 1: Rules + Layout Signals (Fast Path)

Regex, layout-derived features, and vendor-specific headers handle the most common categories.
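
To make the fast path concrete, here is a minimal sketch of a regex rule stage. The patterns and category names are illustrative only; the production rules also draw on layout-derived features and vendor-specific headers, which are omitted here.

```python
import re
from typing import Optional

# Illustrative rules only; real rules also use layout features and vendor headers.
FAST_PATH_RULES = [
    (re.compile(r"\binvoice\s*(no\.?|number)\b", re.I), "invoice"),
    (re.compile(r"\baccount\s+statement\b", re.I), "statement"),
    (re.compile(r"\breceipt\b", re.I), "receipt"),
]

def fast_path_classify(text: str) -> Optional[str]:
    """Return a category if a high-precision rule fires, else None."""
    for pattern, label in FAST_PATH_RULES:
        if pattern.search(text):
            return label
    return None  # fall through to the CNN stage
```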

Stage 2: Lightweight CNN Classifier

A small model (3.2M params) classifies the remaining volume efficiently.
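
The report does not specify the CNN's inputs, so the sketch below assumes a compact text-CNN over token embeddings; the layer sizes are illustrative and not tuned to match the 3.2M-parameter figure.

```python
import torch
import torch.nn as nn

class DocCNN(nn.Module):
    """Compact text-CNN over token embeddings (illustrative architecture;
    the deployed model's exact inputs and sizes are not specified here)."""
    def __init__(self, vocab_size: int, num_classes: int = 43,
                 embed_dim: int = 128, num_filters: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in (3, 4, 5)]
        )
        self.classifier = nn.Linear(num_filters * 3, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).transpose(1, 2)              # (B, E, T)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))       # raw logits
```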

Stage 3: LLM Fallback for Ambiguity

Only ambiguous cases are sent to an LLM to preserve accuracy where it matters.
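
A sketch of the routing logic that ties the three stages together, reusing `fast_path_classify` and `DocCNN` from the sketches above. The confidence threshold and the `CATEGORY_NAMES` lookup are assumptions for illustration, not values from the deployed system.

```python
import torch

CONFIDENCE_THRESHOLD = 0.85  # assumed value; the real threshold is tuned and versioned
# CATEGORY_NAMES: list of the 43 category labels, assumed to be defined elsewhere.

def classify(document_text: str, token_ids: torch.Tensor,
             cnn: DocCNN, llm_fallback) -> tuple[str, str]:
    """Route a document through the three stages; returns (label, reason_code)."""
    # Stage 1: high-precision rules
    label = fast_path_classify(document_text)
    if label is not None:
        return label, "rule_match"

    # Stage 2: lightweight CNN, accepted only when confident
    with torch.no_grad():
        probs = torch.softmax(cnn(token_ids.unsqueeze(0)), dim=1).squeeze(0)
    conf, idx = probs.max(dim=0)
    if conf.item() >= CONFIDENCE_THRESHOLD:
        return CATEGORY_NAMES[idx.item()], "cnn_confident"

    # Stage 3: LLM fallback for ambiguous cases
    return llm_fallback(document_text), "llm_decision"
```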

Rare-Class Strategy

  • Synthetic minority oversampling for rare types (a sampling sketch follows this list)
  • Adversarial examples to reduce brittle failures (e.g., template drift)
  • Evaluation emphasis on per-class metrics, not only overall accuracy
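
One simple way to implement the oversampling step in PyTorch is inverse-frequency weighted sampling, as sketched below. This is a generic technique, not the project's synthetic-variation generator, which is a separate component.

```python
from collections import Counter
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_oversampling_loader(dataset, labels: list[int], batch_size: int = 64) -> DataLoader:
    """Draw training batches so rare classes are sampled roughly as often as common ones."""
    counts = Counter(labels)
    # Inverse-frequency weight per example: rare-class items get larger weights.
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```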

Evaluation Design

We measure:

  • Overall accuracy + per-class F1
  • Rare-class F1 (aggregate across the long tail; see the metric sketch after this list)
  • Throughput (latency and queue behavior)
  • Cost per document under routing proportions
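
The accuracy and F1 metrics can be computed with scikit-learn; the sketch below assumes a `rare_labels` list identifying the long-tail category ids, which is not defined in the original write-up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, rare_labels: list[int]) -> dict:
    """Overall accuracy, per-class F1, and a macro F1 restricted to the long tail."""
    per_class_f1 = f1_score(y_true, y_pred, average=None)
    rare_f1 = f1_score(y_true, y_pred, labels=rare_labels, average="macro")
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "per_class_f1": per_class_f1,
        "rare_class_f1": rare_f1,
    }
```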

Results

Performance

  • Overall accuracy: 94.2%
  • Rare-class F1: 0.76 (vs. <0.10 for the fine-tuned BERT baseline)
  • Average processing time: 180ms (vs. ~2.4s for GPT-4 zero-shot)
  • Cost per document: $0.0009 (vs. $0.08 for GPT-4 zero-shot)

Cost-Quality Tradeoff

  • 62% processed by rules: <1ms, $0.0001
  • 33% processed by CNN: 140ms, $0.0003
  • 5% processed by LLM: 2.1s, $0.06

Error Analysis

  • 72% of errors occurred on poorly scanned documents (<150 DPI)
  • Adding a preprocessing quality gate reduced the error rate by 19% (see the sketch after this list)
  • Template drift was mitigated with adversarial augmentation and periodic rule refresh
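
A minimal version of the DPI quality gate, assuming scans arrive as image files with DPI metadata; the production gate may use additional signals (blur, skew, contrast) that are not shown here.

```python
from PIL import Image

MIN_DPI = 150  # threshold taken from the error analysis above

def passes_quality_gate(path: str) -> bool:
    """Reject scans below the DPI threshold before they reach the classifier."""
    with Image.open(path) as img:
        dpi = img.info.get("dpi", (0, 0))[0]  # DPI metadata may be missing
    return dpi >= MIN_DPI
```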

Deployment Notes

  • Treat routing thresholds as policy: version them and monitor drift.
  • Maintain a “golden set” of rare-class examples and review them weekly.
  • Give operators a reason code (rule match, CNN confidence, LLM decision) to support trust and auditability.
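
One way to carry routing thresholds and reason codes as versioned, auditable data is sketched below; the field names and policy version are illustrative, not part of the licensed framework's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingPolicy:
    """Versioned routing thresholds, treated as configuration rather than code."""
    version: str
    cnn_confidence_threshold: float

@dataclass(frozen=True)
class ClassificationResult:
    label: str
    reason_code: str      # "rule_match" | "cnn_confident" | "llm_decision"
    confidence: float     # 1.0 for rule matches, model probability otherwise
    policy_version: str   # which threshold policy produced this decision

# Example: pin the active policy and record it with every decision for audit logs.
ACTIVE_POLICY = RoutingPolicy(version="2025-11-01", cnn_confidence_threshold=0.85)
```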

Commercial Application

This work enables:

  • Low-cost ingestion for invoice and accounting pipelines
  • Faster exception handling (ambiguous cases surfaced with context)
  • Measurable governance via per-class metrics and routing audit logs

Licensable Outcomes

  1. Three-stage classification framework (Python, MIT): modular pipeline with routing logic
  2. Synthetic data generator: realistic document variation engine for rare classes
  3. Model optimization toolkit: quantization + pruning scripts reducing inference cost by 64% with <1% accuracy loss (see the quantization sketch after this list)
  4. Evaluation suite: 5,200-document holdout set with cost + per-class metrics
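
As a generic illustration of the kind of optimization involved (not necessarily the toolkit's exact recipe), PyTorch's post-training dynamic quantization converts linear layers to int8:

```python
import torch
import torch.nn as nn

def quantize_model(model: nn.Module) -> nn.Module:
    """Post-training dynamic quantization of linear layers to int8 weights."""
    return torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```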

Limitations and Next Work

  • Rare classes remain sensitive to unseen vendor templates; continual evaluation is required.
  • Future work: active-learning loop using operator corrections to refresh the rare-class set.

Evaluation Date: November 2025
Status: Core framework licensed to two fintech companies; optimization toolkit open-sourced