A Review of Transformer-Based Approaches for Image Captioning

doi:10.3390/app131911103

Published Oct 9, 2023 · Oscar Ondeng, Heywood Ouma, Peter O. Akuon

Applied Sciences

11 Citations · 0 Influential Citations

Abstract

Visual understanding is a research area that bridges the gap between computer vision and natural language processing. Image captioning is a visual understanding task in which natural language descriptions of images are automatically generated using vision-language models. The transformer architecture was initially developed in the context of natural language processing and quickly found application in the domain of computer vision. Its recent application to the task of image captioning has resulted in markedly improved performance. In this paper, we briefly look at the transformer architecture and its genesis in attention mechanisms. We more extensively review a number of transformer-based image captioning models, including those employing vision-language pre-training, which has resulted in several state-of-the-art models. We give a brief presentation of the commonly used datasets for image captioning and also carry out an analysis and comparison of the transformer-based captioning models. We conclude by giving some insights into challenges as well as future directions for research in this area.

Study Snapshot

Transformer-based image captioning models have significantly improved performance in visual understanding tasks, with vision-language pre-training leading to state-of-the-art models.

