Task Description
This shared task focuses on video-based multimodal machine translation. The goal is to improve English-to-Japanese translation performance with the help of audio-visual information associated with the input sentences.
Any external resources, such as pre-trained models/embeddings, LLM systems, and additional training data, can be used as long as they are clearly described in the system description paper. You may use the pre-computed video features provided in YouCook2 as well as the raw videos.

Data
We use the YouCook2-JP dataset for this task. It extends the YouCook2 captioned video dataset with manually added Japanese translations. The English captions in YouCook2 describe each step of cooking instruction videos and are closely tied to the visual content. To build YouCook2-JP, we carefully translated the English captions while referring to the corresponding videos to produce high-quality (less ambiguous) and visually grounded translations, making the dataset suitable for research on video-based multimodal machine translation. For more details, please visit our project page.
Training and Validation Data
Japanese sentences can be obtained at the project page of YouCook2-JP. Note that you also have to visit the authors’ page of the original YouCook2 to download the corresponding videos, English sentences, and annotations.
Note that, in this shared task, we use the training split of the original YouCook2 as our training+validation split and the validation split of YouCook2 as our test split. This point could be confusing, so please be careful. (This is because the captions of the YouCook2 test split are not publicly available.)
- English Sentences (with video information): Visit the page of YouCook2 and get youcookii_annotations_trainval.json. Videos with the “subset” entry set to “training” in this JSON file are used as the training+validation split for this shared task.
- Japanese Sentences (in the same format as English ones): Download
- Raw Videos: Visit the page of YouCook2.
We have 1,333 videos and 10,337 sentence pairs in total. There is no prescribed division into training and validation data; this is left up to the participants (a minimal loading and splitting sketch is shown below).
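As a starting point, the sketch below loads the English and Japanese annotations, aligns them by video and segment ID, and makes a simple random video-level train/validation split. It assumes the usual YouCook2 annotation layout (a top-level "database" dictionary keyed by video ID, with a "subset" field and an "annotations" list containing "id" and "sentence" entries); the Japanese file name is hypothetical, so adjust paths and field names to the files you actually download.

```python
import json
import random

# Hypothetical local paths; the Japanese file name is an assumption --
# the task only states it uses the same format as the English annotations.
EN_JSON = "youcookii_annotations_trainval.json"
JA_JSON = "youcookii_annotations_trainval_ja.json"

def load_sentences(path, subset="training"):
    """Return {(video_id, segment_id): sentence} for the given subset."""
    with open(path, encoding="utf-8") as f:
        db = json.load(f)["database"]
    sentences = {}
    for vid, entry in db.items():
        if entry.get("subset") != subset:
            continue
        for ann in entry["annotations"]:
            sentences[(vid, ann["id"])] = ann["sentence"]
    return sentences

en = load_sentences(EN_JSON)
ja = load_sentences(JA_JSON)

# Align English and Japanese sentences on (video_id, segment_id).
keys = sorted(set(en) & set(ja))
data = [(key, en[key], ja[key]) for key in keys]

# The train/validation division is left to participants; here, a simple
# random 90/10 split at the video level so segments of one video stay together.
videos = sorted({vid for vid, _ in keys})
random.seed(0)
random.shuffle(videos)
val_videos = set(videos[: len(videos) // 10])
train = [d for d in data if d[0][0] not in val_videos]
valid = [d for d in data if d[0][0] in val_videos]
print(len(train), len(valid))
```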
Testing Data
For ease of evaluation, we use a simple one-line-one-sentence text format for testing. There are 3,123 sentences associated with 457 videos in the testing data.
- Source English sentences with video/segment IDs: test_source_english.csv
You can access the same sentences and other annotations in youcookii_annotations_trainval.json using the video and segment IDs. Videos with the “subset” entry set to “validation” in this JSON file are used as the test split for this shared task. Note that we exclude from our test set some original English sentences that we found to be incorrect or of poor quality during the manual translation.
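To relate the test file back to the original annotations, you can look each segment up by video and segment ID. The sketch below is only an illustration: the column names (video_id, segment_id, sentence) are assumptions, so check the actual header of the released test_source_english.csv.

```python
import csv
import json

# Column names below are assumptions; verify them against the released CSV.
with open("test_source_english.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

with open("youcookii_annotations_trainval.json", encoding="utf-8") as f:
    db = json.load(f)["database"]

for row in rows:
    vid = row["video_id"]
    seg = int(row["segment_id"])
    entry = db[vid]
    assert entry["subset"] == "validation"  # test split = original validation split
    ann = next(a for a in entry["annotations"] if a["id"] == seg)
    start, end = ann["segment"]  # segment boundaries within the video
    # ... extract video features for [start, end] and translate row["sentence"]
```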
Evaluation
- Automatic evaluation (BLEU etc.; a self-check sketch follows this list)
- Human evaluation (for a part of the test samples, depending on the number of participants and submissions)
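The exact automatic-evaluation settings are decided by the organizers. For checking your own systems during development, one common choice for Japanese is sacrebleu with its MeCab-based tokenizer; the file names below are hypothetical, and the hypothesis and reference files are assumed to be one sentence per line in the same order.

```python
import sacrebleu

# Hypothetical file names: hyp.txt is your system output, ref.txt the references.
with open("hyp.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("ref.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

# "ja-mecab" tokenizes Japanese before computing BLEU; it requires the
# mecab-python3 and ipadic packages to be installed alongside sacrebleu.
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="ja-mecab")
print(bleu.score)
```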
Schedule
All deadlines are 11:59 PM, UTC-12.
Event | Time |
---|---|
Shared Task Submission | September 29 - October 6, 2025 |
System Description Paper Deadline | October 27, 2025 |
Review Feedback of System Description Papers | November 3, 2025 |
Camera-ready Deadline | November 11, 2025 |
Workshop Dates | December 24, 2025 |
Submission
A submission file for the task should be a plain UTF-8 text file containing only the translated Japanese sentences, written in the same order as the source file, with one sentence per line. (Therefore, the file should contain 3,123 lines.)
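A small sanity check before submitting can catch format problems early. The sketch below assumes your output is saved as submission.txt (a hypothetical name) and verifies the encoding, the line count, and that no line is empty.

```python
# Check that the submission matches the required format: valid UTF-8,
# one sentence per line, 3,123 lines in total, no empty lines.
EXPECTED_LINES = 3123

with open("submission.txt", "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8

lines = text.splitlines()
assert len(lines) == EXPECTED_LINES, f"expected {EXPECTED_LINES} lines, got {len(lines)}"
assert all(line.strip() for line in lines), "empty line found"
print("submission format looks OK")
```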
To submit task results and system description papers, please follow the instructions on the WAT2025 webpage.
Organizers
- Hideki Nakayama, Toshiaki Nakazawa (The University of Tokyo)
Contact
- wat25-vctjp(at)nlab.ci.i.u-tokyo.ac.jp
Acknowledgement
This shared task is supported by the commissioned research (No. 225) by the National Institute of Information and Communications Technology (NICT), Japan.