Visually-situated language concerns multimodal settings where text and vision are intermixed, and the meaning of words or phrases is directly influenced by what is observable or referenced visually. Settings where text is embedded in an image are ubiquitous, ranging from street signs and chyrons on news broadcasts to language embedded in figures, social media images, and non-digitized text sources.
Translating visually-situated text combines a series of traditionally separate steps: text detection, optical character recognition, semantic grouping, and finally machine translation. Errors can propagate between steps, as mistakes made early in the pipeline create mismatches in vocabulary and distribution relative to what downstream models observed in training, reducing task performance. Moreover, processing each step in isolation separates the recognized text from visual context that may be necessary to produce a correct, situation-appropriate translation. For example, as shown in the example above, the English word 'Exit' can be translated to German as either 'Ausfahrt' or 'Ausgang'; without appropriate context, which may not be present in the text alone, the generated translation is a statistical guess.
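As a rough illustration of that cascade, the Python sketch below chains the four stages; the injected callables (detect, recognize, group, translate) are placeholders standing in for whatever detection, OCR, grouping, and MT components one might plug in, not a specific library's API. The point is that each stage consumes only the previous stage's output, so recognition errors and lost visual context are carried forward into the final translation.

from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) corners, normalized to [0, 1]


def cascade_translate(
    image,
    detect: Callable[[object], Sequence[Box]],
    recognize: Callable[[object, Box], str],
    group: Callable[[Sequence[Tuple[Box, str]]], List[str]],
    translate: Callable[[str, str], str],
    target_lang: str = "de",
) -> List[str]:
    """Chain the traditionally separate stages of the cascade.

    Any OCR mistake or lost visual context is carried forward unchanged
    into the MT stage, which only ever sees plain text.
    """
    boxes = detect(image)                                    # 1. text detection
    words = [(box, recognize(image, box)) for box in boxes]  # 2. OCR each region
    segments = group(words)                                  # 3. semantic grouping
    # 4. text-only MT: the translator never sees the image, so an ambiguous
    #    word like 'Exit' ('Ausfahrt' vs. 'Ausgang') becomes a statistical guess.
    return [translate(segment, target_lang) for segment in segments]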
Few public benchmarks provide images containing text together with translations of that text, which are needed both to study the impact of OCR errors on downstream MT and to develop and evaluate multimodal models that translate text in images directly. We introduce the Vistra benchmark to enable research on this task.
Above: a Vistra data sample showing metadata, transcripts, and translations. Vistra comprises 772 natural images containing English text, with aligned translations into four target languages (German, Spanish, Russian, and Mandarin Chinese) exhibiting varying degrees of dependence on visual context. Each image is annotated with its height and width, a categorical label, its semantically grouped English transcript, translations into the four target languages aligned at the level of the transcript's semantic groups, and word-level bounding boxes specified by corner coordinates rescaled to 0-1 and matched to the aligned words in the transcript. On average, each image contains 11.2 words and 2.4 transcript groups, for a total of 1840 parallel segments in the benchmark with an average length of 4.7 words.
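To make the annotation format concrete, here is a minimal sketch of what a single Vistra sample could look like, based on the description above. The field names, filename, image dimensions, category value, and box coordinates are illustrative assumptions, not the benchmark's exact schema; the non-English strings are the standard translations of 'Exit' in the four target languages.

# Illustrative sketch of one Vistra sample; field names and values are assumed,
# not the benchmark's exact schema. Bounding boxes are word-level corner
# coordinates rescaled to 0-1 and matched to words in the grouped transcript.
sample = {
    "image": "exit_sign.jpg",          # hypothetical filename
    "height": 768,                     # pixels (illustrative values)
    "width": 1024,
    "category": "sign",                # categorical image label (assumed value)
    "transcript_groups": [
        {
            "text_en": "Exit",
            "translations": {          # aligned at the semantic-group level
                "de": "Ausgang",
                "es": "Salida",
                "ru": "Выход",
                "zh": "出口",
            },
            "words": [
                {
                    "word": "Exit",
                    # corner coordinates rescaled to 0-1, matched to this word
                    "box": [[0.41, 0.22], [0.59, 0.22], [0.59, 0.31], [0.41, 0.31]],
                },
            ],
        },
    ],
}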
@inproceedings{salesky-etal-2024-benchmarking,
title = "Benchmarking Visually-Situated Translation of Text in Natural Images",
author = "Salesky, Elizabeth and
Koehn, Philipp and
Post, Matt",
editor = "Haddow, Barry and
Kocmi, Tom and
Koehn, Philipp and
Monz, Christof",
booktitle = "Proceedings of the Ninth Conference on Machine Translation",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.wmt-1.115",
pages = "1167--1182",
}