Task Description
This shared task focuses on video-based multimodal machine translation. The goal is to improve English-to-Japanese translation performance with the help of audio-visual information associated with the input sentences.
Any external resources, such as pre-trained models/embeddings, LLM systems, and additional training data, can be used as long as they are clearly described in the system description paper. You may use the pre-computed video features provided in YouCook2 as well as the raw videos.

Data
We use the YouCook2-JP dataset for this task. It extends the YouCook2 captioned video dataset with manually added Japanese translations. The English captions in YouCook2 describe each step of cooking instruction videos and are closely tied to the visual content. To build YouCook2-JP, we carefully translated the English captions while referring to the corresponding videos to produce high-quality (less ambiguous) and visually grounded translations, making the dataset suitable for research on video-based multimodal machine translation. For more details, please visit our project page.
Training and Validation Data
Japanese sentences can be obtained at the project page of YouCook2-JP. Note that you also have to visit the authors’ page of the original YouCook2 to download the corresponding videos, English sentences, and annotations.
Note that, in this shared task, we use the training split of the original YouCook2 as our training+validation split and the validation split of YouCook2 as our test split. This point could be confusing, so please be careful. (This is because the captions of the YouCook2 test split are not publicly available.)
- English Sentences (with video information): Visit the page of YouCook2 and get youcookii_annotations_trainval.json. Videos with the “subset” entry set to “training” in this JSON file are used as the training+validation split for this shared task.
- Japanese Sentences (in the same format as English ones): Download
- Raw Videos: Visit the page of YouCook2.
We have 1,333 videos and 10,337 sentence pairs in total. There is no prescribed division into training and validation data; this is left up to the participants (a minimal loading and splitting sketch is shown below).
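As a starting point, the sketch below loads the English and Japanese annotations, aligns them by video and segment ID, and makes a simple random video-level train/validation split. It assumes the usual YouCook2 annotation layout (a top-level "database" dictionary keyed by video ID, with a "subset" field and an "annotations" list containing "id" and "sentence" entries); the Japanese file name is hypothetical, so adjust paths and field names to the files you actually download.

```python
import json
import random

# Hypothetical local paths; the Japanese file name is an assumption --
# the task only states it uses the same format as the English annotations.
EN_JSON = "youcookii_annotations_trainval.json"
JA_JSON = "youcookii_annotations_trainval_ja.json"

def load_sentences(path, subset="training"):
    """Return {(video_id, segment_id): sentence} for the given subset."""
    with open(path, encoding="utf-8") as f:
        db = json.load(f)["database"]
    sentences = {}
    for vid, entry in db.items():
        if entry.get("subset") != subset:
            continue
        for ann in entry["annotations"]:
            sentences[(vid, ann["id"])] = ann["sentence"]
    return sentences

en = load_sentences(EN_JSON)
ja = load_sentences(JA_JSON)

# Align English and Japanese sentences on (video_id, segment_id).
keys = sorted(set(en) & set(ja))
data = [(key, en[key], ja[key]) for key in keys]

# The train/validation division is left to participants; here, a simple
# random 90/10 split at the video level so segments of one video stay together.
videos = sorted({vid for vid, _ in keys})
random.seed(0)
random.shuffle(videos)
val_videos = set(videos[: len(videos) // 10])
train = [d for d in data if d[0][0] not in val_videos]
valid = [d for d in data if d[0][0] in val_videos]
print(len(train), len(valid))
```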
Testing Data
For ease of evaluation, we use a simple one-line-one-sentence text format for testing. There are 3,123 sentences associated with 457 videos in the testing data.
- Source English sentences with video/segment IDs: test_source_english.csv
You can access the same sentences and other annotations in youcookii_annotations_trainval.json using the video and segment IDs. Videos with the “subset” entry set to “validation” in this JSON file are used as the test split for this shared task. Note that we exclude from our test set some original English sentences that we found to be incorrect or of poor quality during the manual translation.
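To relate the test file back to the original annotations, you can look each segment up by video and segment ID. The sketch below is only an illustration: the column names (video_id, segment_id, sentence) are assumptions, so check the actual header of the released test_source_english.csv.

```python
import csv
import json

# Column names below are assumptions; verify them against the released CSV.
with open("test_source_english.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

with open("youcookii_annotations_trainval.json", encoding="utf-8") as f:
    db = json.load(f)["database"]

for row in rows:
    vid = row["video_id"]
    seg = int(row["segment_id"])
    entry = db[vid]
    assert entry["subset"] == "validation"  # test split = original validation split
    ann = next(a for a in entry["annotations"] if a["id"] == seg)
    start, end = ann["segment"]  # segment boundaries within the video
    # ... extract video features for [start, end] and translate row["sentence"]
```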
Evaluation
- Automatic evaluation (BLEU etc.; a self-check sketch follows this list)
- Human evaluation (for a part of the test samples, depending on the number of participants and submissions)
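The exact automatic-evaluation settings are decided by the organizers. For checking your own systems during development, one common choice for Japanese is sacrebleu with its MeCab-based tokenizer; the file names below are hypothetical, and the hypothesis and reference files are assumed to be one sentence per line in the same order.

```python
import sacrebleu

# Hypothetical file names: hyp.txt is your system output, ref.txt the references.
with open("hyp.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("ref.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

# "ja-mecab" tokenizes Japanese before computing BLEU; it requires the
# mecab-python3 and ipadic packages to be installed alongside sacrebleu.
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="ja-mecab")
print(bleu.score)
```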
Schedule
All deadlines are 11:59 PM, UTC-12.
Event | Time |
---|---|
Shared Task Submission | September 29 - October 6, 2025 |
System Description Paper Deadline | October 27, 2025 |
Review Feedback of System Description Papers | November 3, 2025 |
Camera-ready Deadline | November 11, 2025 |
Workshop Dates | December 24, 2025 |
Submission
A submission file for the task should be a plain UTF-8 text file containing only the translated Japanese sentences, written in the same order as the source file, with one sentence per line. (Therefore, the file should contain 3,123 lines.)
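A small sanity check before submitting can catch format problems early. The sketch below assumes your output is saved as submission.txt (a hypothetical name) and verifies the encoding, the line count, and that no line is empty.

```python
# Check that the submission matches the required format: valid UTF-8,
# one sentence per line, 3,123 lines in total, no empty lines.
EXPECTED_LINES = 3123

with open("submission.txt", "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8

lines = text.splitlines()
assert len(lines) == EXPECTED_LINES, f"expected {EXPECTED_LINES} lines, got {len(lines)}"
assert all(line.strip() for line in lines), "empty line found"
print("submission format looks OK")
```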
To submit task results and system description papers, please follow the instructions on the WAT2025 webpage.
Organizers
- Hideki Nakayama, Toshiaki Nakazawa (The University of Tokyo)
Contact
- wat25-vctjp(at)nlab.ci.i.u-tokyo.ac.jp
Acknowledgement
This shared task is supported by the commissioned research (No. 225) by the National Institute of Information and Communications Technology (NICT), Japan.