[WACV'26] Multimodal Adversarial Training — Resources

This repository hosts model checkpoints and data resources for the paper:

Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

For the source code and training scripts, please refer to the GitHub repository: 👉 https://github.com/CyberAgentAILab/multimodal-adversarial-training

📘 Overview

This work proposes Multimodal Adversarial Training (MAT) for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, MAT+, additionally leverages one-to-many relationships in image-text pairs to improve robustness.

Highlights

  • Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
  • MAT+ leverages one-to-many relationships in image-text pairs.
  • Reproducible results on Flickr30k and COCO benchmarks.

📘 Directory structure

resources/
├── checkpoints/                          # MAT/MAT+ model checkpoints
│     ├── ALBEF_flickr_MAT_HumanCaps.pth
│     ├── BLIP_flickr_MAT_HumanCaps.pth
│     ├── CLIP_B_coco_MAT_HumanCaps.pth
│     ├── CLIP_B_coco_MAT_base.pth
│     └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/                        # Data augmentations for MAT+
      ├── dataset_json.zip                # Text augmentation annotations
      └── flickr_SD_I2I_0.5.zip           # Image augmentations (SD img2img)

📘 Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| File | Model | Dataset | Variant |
| --- | --- | --- | --- |
| ALBEF_flickr_MAT_HumanCaps.pth | ALBEF | Flickr30k | MAT + HumanCaps |
| BLIP_flickr_MAT_HumanCaps.pth | BLIP | Flickr30k | MAT + HumanCaps |
| CLIP_B_coco_MAT_HumanCaps.pth | CLIP ViT-B | COCO | MAT + HumanCaps |
| CLIP_B_coco_MAT_base.pth | CLIP ViT-B | COCO | MAT (base) |
| CLIP_B_flickr_MAT_HumanCaps.pth | CLIP ViT-B | Flickr30k | MAT + HumanCaps |
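The filenames encode the model, training dataset, and training variant. A minimal sketch of a parser for this naming scheme (the regular expression below is our reading of the `<model>_<dataset>_<variant>.pth` convention, not something defined by the code repository):

```python
import re

# Pattern inferred from the checkpoint names in the table above.
PATTERN = re.compile(
    r"^(?P<model>ALBEF|BLIP|CLIP_B)_"          # model identifier
    r"(?P<dataset>flickr|coco)_"               # training dataset
    r"(?P<variant>MAT(?:_HumanCaps|_base)?)"   # training variant
    r"\.pth$"
)

def parse_checkpoint_name(filename: str) -> dict:
    """Split a checkpoint filename into model / dataset / variant."""
    m = PATTERN.match(filename)
    if m is None:
        raise ValueError(f"Unrecognized checkpoint name: {filename}")
    return m.groupdict()

print(parse_checkpoint_name("CLIP_B_coco_MAT_HumanCaps.pth"))
# {'model': 'CLIP_B', 'dataset': 'coco', 'variant': 'MAT_HumanCaps'}
```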

The base models used for adversarial training are the original ALBEF, BLIP, and CLIP ViT-B checkpoints corresponding to the entries in the table above.

📘 Augmentations

Data augmentations used to reproduce MAT+ results:

| File | Description |
| --- | --- |
| dataset_json.zip | Text augmentation data: augmented captions and annotations in JSON format |
| flickr_SD_I2I_0.5.zip | Image augmentation data: Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |

📘 Usage

  1. Clone or download this repository:

    # Using the Hugging Face CLI
    hf download cyberagent/multimodal-adversarial-training --local-dir ./resources
    
    # Or using git with LFS
    git lfs install
    git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
    
  2. Clone the code repository and follow its setup instructions.

  3. Update the checkpoint and data paths in configs/ to point to the downloaded resources.

📘 Citation

If you find these resources useful, please cite:

@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}

📘 Acknowledgements

This work builds upon the official code repositories of the base models (ALBEF, BLIP, and CLIP).

📘 License

This repository is licensed under the GNU General Public License v3.0.
