Task Description

The goal of this task is to improve translation performance with the help of another modality (images) associated with the input sentences.

In the constrained setting (a), external resources such as additional data and pre-trained models/embeddings (trained with external data) may not be used, except for the following.

There are no such restrictions in the unconstrained setting (b). In both settings (a) and (b), the resources employed should be clearly described in the system description paper.

Data

We use the Flickr30kEntities Japanese (F30kEnt-Jp) dataset for this task. It extends the Flickr30k and Flickr30k Entities image caption datasets with newly added manual Japanese translations. Notably, it provides annotations of many-to-many phrase-to-region correspondences in both the English and Japanese captions, which are expected to provide strong supervision for multimodal grounding and to open new research directions. For details, please visit our project page.

Training and Validation Data

Japanese sentences can be obtained from the F30kEnt-Jp project page above. Note that you also have to visit the authors’ pages of Flickr30k and Flickr30k Entities to download the corresponding images, English sentences, and annotations. We use the training and validation splits designated in Flickr30k Entities. This year, we have increased the number of training/validation sentences compared to last year.

Please do not use samples that are not included in either the training or the validation set; they overlap with the final test set.
For each image, you can find a corresponding text file named (Image_ID).txt in the Japanese and English sets, respectively. The original Flickr30k has five English sentences per image, and our Japanese set contains all of the corresponding translations, so five parallel sentences per image are used for training and validation (see the loading sketch after the note below).
In summary, we have 29,783 images (148,915 sentences) for training and 1,000 images (5,000 sentences) for validation.

(NOTE: several Japanese sentences in the training set are intentionally left blank where the corresponding original English captions were corrupted or badly annotated. See the file UNRELATED_CAPTIONS in Flickr30k Entities.)
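For illustration, the following is a minimal Python sketch of how the parallel training sentences might be collected, assuming one (Image_ID).txt file per image in separate English and Japanese directories (the directory layout and function name are our own placeholders, not part of the official distribution). Pairs whose Japanese side is blank are skipped, following the note above.

    import os

    def load_parallel_corpus(en_dir, ja_dir, image_ids):
        """Collect (image_id, English, Japanese) sentence triples."""
        pairs = []
        for image_id in image_ids:
            # Five captions per image in each language, one per line.
            en_path = os.path.join(en_dir, image_id + ".txt")
            ja_path = os.path.join(ja_dir, image_id + ".txt")
            with open(en_path, encoding="utf-8") as f_en, \
                 open(ja_path, encoding="utf-8") as f_ja:
                en_lines = [line.strip() for line in f_en]
                ja_lines = [line.strip() for line in f_ja]
            for en, ja in zip(en_lines, ja_lines):
                # Skip pairs whose Japanese side was intentionally left
                # blank (see UNRELATED_CAPTIONS in Flickr30k Entities).
                if en and ja:
                    pairs.append((image_id, en, ja))
        return pairs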

Testing Data

We use the same test data as last year.

Each test file contains 1,000 input sentences, one per line, corresponding to the images listed in test.txt in the same order. Note that phrase-to-region annotations are not available in the test data (i.e., only raw texts and images are available to use).
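As a minimal sketch, assuming test.txt lists one image ID per line (the source file name below is a placeholder for the actual test input file), the test sentences can be aligned with their images by reading the two files in parallel:

    def load_test_inputs(test_list_path, src_path):
        """Pair each source sentence with its image ID, preserving order."""
        with open(test_list_path, encoding="utf-8") as f:
            image_ids = [line.strip() for line in f]
        with open(src_path, encoding="utf-8") as f:
            sentences = [line.strip() for line in f]
        assert len(image_ids) == len(sentences) == 1000
        return list(zip(image_ids, sentences))

    # Hypothetical file names; adjust to the released data.
    test_pairs = load_test_inputs("test.txt", "test_src.txt")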

Schedule, Submission, and Evaluation

Please follow the instructions on the WAT2021 webpage.

Contact

wat21-mmtjp(at)nlab.ci.i.u-tokyo.ac.jp