A Review of Transformer-Based Approaches for Image Captioning
Published Oct 9, 2023 · Oscar Ondeng, Heywood Ouma, Peter O. Akuon
Applied Sciences
11 Citations · 0 Influential Citations
Abstract
Visual understanding is a research area that bridges the gap between computer vision and natural language processing. Image captioning is a visual understanding task in which natural language descriptions of images are automatically generated using vision-language models. The transformer architecture was initially developed in the context of natural language processing and quickly found application in the domain of computer vision. Its recent application to the task of image captioning has resulted in markedly improved performance. In this paper, we briefly look at the transformer architecture and its genesis in attention mechanisms. We more extensively review a number of transformer-based image captioning models, including those employing vision-language pre-training, which has resulted in several state-of-the-art models. We give a brief presentation of the commonly used datasets for image captioning and also carry out an analysis and comparison of the transformer-based captioning models. We conclude by giving some insights into challenges as well as future directions for research in this area.
Transformer-based image captioning models have significantly improved performance in visual understanding tasks, with vision-language pre-training leading to state-of-the-art models.