Benchmarking Visually-Situated Translation
of Text in Natural Images

1Johns Hopkins University, 2Human Language Technology Center of Excellence, 3Microsoft

Abstract

We introduce Vistra, a benchmark for visually-situated translation of English text in natural images into four target languages. We describe the dataset's construction and composition. We benchmark open-source and commercial OCR and MT models on Vistra, and present both quantitative results and a taxonomy of common OCR error classes together with their effect on downstream MT. Finally, we assess direct image-to-text translation with a multimodal LLM, and show that it is able in some cases, but not yet consistently, to use visual context to disambiguate between possible translations. The task remains unsolved and challenging even for strong commercial models. We hope that the creation and release of this benchmark, the first of its kind for these language pairs, will encourage further research in this direction.

Motivation

Figure: Motivating example showing context-dependent translations of "EXIT" from English to German.

Visually-situated language concerns multimodal settings where text and vision are intermixed, and the meaning of words or phrases is directly influenced by what is observable or referenced visually. Settings where text is embedded in an image are ubiquitous, ranging from street signs and chyrons on news broadcasts to language embedded in figures, social media images, and non-digitized text sources.

Translating visually-situated text combines a series of traditionally separate steps: text detection, optical character recognition (OCR), semantic grouping, and finally machine translation (MT). Not only can errors propagate between steps, as mistakes in earlier stages create vocabulary and distribution mismatches with the text observed in training and reduce downstream task performance, but processing each step in isolation also separates the recognized text from visual context that may be necessary to produce a correct situational translation. For example, as shown above, the English word 'Exit' can be translated into German as either 'Ausfahrt' (for vehicles) or 'Ausgang' (for pedestrians); without appropriate context, which may not be present in the text alone, the generated translation is a statistical guess.
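To make the failure mode concrete, here is a minimal sketch of such a cascade in Python, assuming pytesseract for OCR and a MarianMT model from Hugging Face Transformers for MT; the naive line-based grouping and the model choice are illustrative stand-ins, not the pipeline evaluated in the paper. Note that the MT step never sees the image, so context-dependent words like 'Exit' are translated blind.

```python
# Minimal cascaded image -> translation sketch (illustrative only).
import pytesseract  # OCR: text detection + recognition
from PIL import Image
from transformers import MarianMTModel, MarianTokenizer

def translate_image_text(image_path: str,
                         mt_name: str = "Helsinki-NLP/opus-mt-en-de") -> str:
    # Steps 1-2: detect and recognize text in the image.
    text = pytesseract.image_to_string(Image.open(image_path))
    # Step 3: naive "semantic grouping" -- one segment per non-empty line.
    segments = [line.strip() for line in text.splitlines() if line.strip()]
    # Step 4: translate each segment. The image is no longer available here,
    # so contextual ambiguity (e.g. 'Exit' -> 'Ausfahrt' vs. 'Ausgang')
    # is resolved by the MT model's priors alone.
    tokenizer = MarianTokenizer.from_pretrained(mt_name)
    model = MarianMTModel.from_pretrained(mt_name)
    batch = tokenizer(segments, return_tensors="pt", padding=True)
    outputs = model.generate(**batch)
    return "\n".join(tokenizer.decode(o, skip_special_tokens=True)
                     for o in outputs)
```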

Few public benchmarks exist that pair images containing text with their translations, which are necessary both to study the impact of OCR errors on downstream MT and to develop and evaluate multimodal models that translate text in images directly. We introduce the Vistra benchmark to enable research on this task.
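For contrast with the cascade above, a direct image-to-translation query to an open multimodal LLM might look like the sketch below; the LLaVA checkpoint, prompt, and filename are illustrative assumptions, not the specific models or setup benchmarked in the paper.

```python
# Direct image -> translation with a multimodal LLM (illustrative only).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint choice
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# The model sees the full image, so visual context (e.g. a highway vs. a
# building doorway) is available when choosing between translations.
prompt = ("USER: <image>\nTranslate all English text in this image "
          "into German.\nASSISTANT:")
inputs = processor(images=Image.open("sign.jpg"), text=prompt,
                   return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```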

Vistra Benchmark Overview

Figure: A Vistra data sample.

Above: a Vistra data sample showing metadata, transcripts, and translations. Vistra comprises 772 natural images containing English text, with aligned translations into four target languages (German, Spanish, Russian, and Mandarin Chinese) exhibiting varying levels of dependence on visual context. Each image is annotated with its height and width, a categorical label, its semantically grouped English transcript, translations into the four target languages aligned at the level of the transcript's semantic groups, and word-level bounding boxes specified by their corners, with coordinates rescaled to the range 0-1 and matched to the aligned word in the transcript. On average, each image contains 11.2 words and 2.4 transcript groups, for a total of 1,840 parallel segments in the benchmark with an average length of 4.7 words.
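As an illustration of the annotation format described above, the sketch below renders one record as a Python dict; all field names and values are hypothetical, chosen to mirror the description rather than the released file format.

```python
# Hypothetical rendering of a single Vistra record (field names illustrative).
record = {
    "image_id": "vistra_0001",   # hypothetical identifier
    "height": 768,               # image height in pixels
    "width": 1024,               # image width in pixels
    "category": "street_sign",   # categorical image label
    # Semantically grouped English transcript with group-aligned translations.
    "groups": [
        {"en": "EXIT", "de": "Ausfahrt", "es": "Salida",
         "ru": "Выезд", "zh": "出口"},
    ],
    # Word-level bounding boxes: four corners, coordinates rescaled to 0-1,
    # each box matched to its aligned word in the transcript.
    "words": [
        {"text": "EXIT",
         "box": [[0.41, 0.22], [0.59, 0.22], [0.59, 0.31], [0.41, 0.31]]},
    ],
}
```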

Cite Us

@inproceedings{salesky-etal-2024-benchmarking,
    title = "Benchmarking Visually-Situated Translation of Text in Natural Images",
    author = "Salesky, Elizabeth  and
      Koehn, Philipp  and
      Post, Matt",
    editor = "Haddow, Barry  and
      Kocmi, Tom  and
      Koehn, Philipp  and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.115",
    pages = "1167--1182",
}