NEW: challenge for WAT2021 is open!

Task Description

The goal of the task is to improve translation performance with the help of another modality (images) associated with the input sentences.

In the constrained setting (a), external resources such as additional data and pre-trained models/embeddings (trained with external data) may not be used, except for the following.

There are no restrictions in the unconstrained setting (b). In both settings (a) and (b), the resources used should be clearly described in the system description paper.


We use the Flickr30kEntities Japanese (F30kEnt-JP) dataset for this task. It extends the Flickr30k and Flickr30k Entities image caption datasets with newly added manual Japanese translations. Notably, it includes annotations of many-to-many phrase-to-region correspondences for both the English and Japanese captions, which are expected to provide strong supervision for multimodal grounding and to open new research directions. For details, please visit our project page.

Training and Validation Data

The Japanese sentences can be obtained from the F30kEnt-JP project page mentioned above. Note that you also have to visit the authors’ pages of Flickr30k and Flickr30k Entities to download the corresponding images, English sentences, and annotations. We use the same training and validation splits as designated in Flickr30k Entities.

Please do not use samples that are in neither the training set nor the validation set, because they overlap with the final test set.
For each image, the Japanese and English sets each contain a corresponding text file named (Image_ID).txt. While the original Flickr30k has five English sentences for each image, our Japanese set provides translations of only the first two. We therefore use two parallel sentences for each image.
In summary, there are 29,783 images (59,566 sentences) for training and 1,000 images (2,000 sentences) for validation.
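
As a rough illustration of how the two parallel sentences per image might be collected, the sketch below pairs the first two lines of each English file with the lines of the matching Japanese file. The directory arguments `en_dir` and `ja_dir`, the function name, and the assumption that each (Image_ID).txt file holds one sentence per line are all hypothetical. The actual files may contain annotation markup, which this sketch leaves untouched, so please check the downloaded data first.

```python
from pathlib import Path


def load_parallel_pairs(en_dir, ja_dir, image_ids, n_sentences=2):
    """Pair the first n_sentences lines of each image's English and Japanese files.

    Assumes (hypothetically) one sentence per line in <en_dir>/<Image_ID>.txt
    and <ja_dir>/<Image_ID>.txt. Lines are returned as-is apart from
    whitespace stripping; no annotation markup is parsed or removed.
    """
    pairs = []
    for image_id in image_ids:
        en_lines = Path(en_dir, f"{image_id}.txt").read_text(encoding="utf-8").splitlines()
        ja_lines = Path(ja_dir, f"{image_id}.txt").read_text(encoding="utf-8").splitlines()
        # Only the first two English sentences have Japanese translations,
        # so any remaining English lines are ignored.
        for en, ja in zip(en_lines[:n_sentences], ja_lines[:n_sentences]):
            pairs.append((image_id, en.strip(), ja.strip()))
    return pairs
```

With two sentences per image, 29,783 training images yield 29,783 × 2 = 59,566 sentence pairs, and 1,000 validation images yield 2,000, matching the counts above.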

Testing Data

Each test file contains 1,000 lines of input sentences, given in the same order as the corresponding images listed in test.txt. Note that phrase-to-region annotations are not available in the test data (i.e., only raw text and images are available).
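
Because test sentences and images are matched only by line order, it can help to make that alignment explicit before running a system. The following is a minimal sketch, assuming that test.txt lists one image ID per line and that the test sentences are in a separate text file. The function name and the example file paths are hypothetical.

```python
from pathlib import Path


def align_test_inputs(id_list_path, sentence_path):
    """Return (image_id, sentence) tuples matched by line position.

    Assumes (hypothetically) that id_list_path has one image ID per line and
    that sentence_path has one input sentence per line, in the same order.
    """
    ids = [line.strip() for line in Path(id_list_path).read_text(encoding="utf-8").splitlines()]
    sentences = Path(sentence_path).read_text(encoding="utf-8").splitlines()
    # A length mismatch means the positional alignment cannot be trusted.
    if len(ids) != len(sentences):
        raise ValueError(f"{len(ids)} image IDs but {len(sentences)} sentences")
    return list(zip(ids, sentences))
```

For example, `align_test_inputs("test.txt", "test_input.txt")` (where the second file name is a placeholder) should return 1,000 tuples for a test file as described above.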

Schedule, Submission, and Evaluation

Please follow the instructions on the WAT2020 webpage.