Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Published Feb 18, 2021 · Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz · ArXiv
157 citations · 30 influential citations
Abstract
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
TLDR: The TILT architecture jointly models layout, visual features, and textual semantics with a single encoder-decoder model, achieving state-of-the-art results on document understanding benchmarks (DocVQA, CORD, SROIE).
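The abstract states that layout is represented as an attention bias on top of a pretrained encoder-decoder Transformer, rather than as extra input tokens. As a minimal sketch of that idea (the function name, tensor shapes, and the random bias below are illustrative assumptions, not the authors' implementation), a pairwise layout term is simply added to the attention logits before the softmax:

```python
import torch
import torch.nn.functional as F

def attention_with_layout_bias(q, k, v, layout_bias):
    """Scaled dot-product attention with an additive layout bias.

    q, k, v:      (batch, heads, seq_len, head_dim)
    layout_bias:  (batch, heads, seq_len, seq_len) additive term derived from
                  token positions on the page (e.g. pairwise horizontal and
                  vertical offsets between bounding boxes).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, heads, seq, seq)
    scores = scores + layout_bias                 # layout enters as a bias on the logits
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Tiny usage example with random tensors (shapes only, no real document).
batch, heads, seq, dim = 1, 2, 8, 16
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
v = torch.randn(batch, heads, seq, dim)

# Hypothetical bias: in practice this would come from bucketed distances
# between token bounding boxes; random here to keep the sketch self-contained.
layout_bias = torch.randn(batch, heads, seq, seq)

out = attention_with_layout_bias(q, k, v, layout_bias)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```

The bias shown here is random purely for illustration; the paper's point is that spatial relationships influence attention weights directly, so no separate layout embedding stream is needed in the input sequence.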