Donut: an OCR-free Document Understanding Transformer

1 minute read

Published:

My notes on this paper: OCR-free Document Understanding Transformer (github repo).

Understanding document images (commercial invoices, receipts, and business cards, etc.) is an important but challenging task. Normally it consists of reading text and a holistic understanding of the document. Current visual document understanding (VDU) solutions rely heavily on Optical Character Recognition (OCR) engines and the main efforts are paied to the understanding task based on the OCR outputs. (OCR-based approaches)

Conventional (OCR-based) document information extraction (IE) pipeline: Goal: extract the structured information from a given semi-structured document image

  1. Text reading (text detection and text recognition): obtain text location (bbox) and comprehend characters from each bbox
    • deep-learning-based OCR
  2. Text understanding: Recognized texts and corresponding locations are passed to the downstream model to extract desired structure form of information

Proposed OCR-free VDU solution (Donut): Goal: modeling a direct mapping from a raw input image to the desired output without OCR.

Donut model

Architecture

Transformer-based visual encoder and textual decoder modules.

Encoder

  • The visual encoder converts the input document image into a set of embeddings.
  • CNN-based models or Transformer-based models can be used as the encoder network. Swin Transformer is used in this paper.

Decoder

  • Given embeddings from the Encoder, the decoder generates a token sequence.
  • BART is used as the decoder architecture.

Output Conversion

The output token sequence is converted to a desired structured format (JSON)

Training scheme: Pre-train-and-fine-tune scheme

Pre-training phase:

Learns how to read the texts by predicting the next words by conditioning jointly on the image and previous text contexts

Objective: minimize cross-entropy loss of next token prediction (by jointly conditioning on the image and previous contexts)

Fine-tuning stage:

Learns how to understand the whole document accord-ing to the downstream task