Improve model card: Add Hear-Your-Click context and refined metadata
This PR updates the model card for `openai/clip-vit-base-patch32`. It clarifies that this CLIP model serves as a core component (the visual encoder) within the "Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation" framework.
The changes include:
- Retaining the detailed description of the `openai/clip-vit-base-patch32` model.
- Adding a new section that introduces "Hear-Your-Click", its abstract, a link to its paper ([2507.04959](https://huggingface.co/papers/2507.04959)), and its GitHub repository (https://github.com/SynapGrid/Hear-Your-Click-2024).
- Updating metadata with `license: mit`, `library_name: transformers`, and confirming `pipeline_tag: zero-shot-image-classification`.
- Adding additional tags like `clip` and `video-to-audio` for better discoverability and context.
- Including the BibTeX citation for the "Hear-Your-Click" paper.
This update provides useful context for users interested in downstream applications of this foundational CLIP model; a brief, illustrative sketch of the model's visual-encoder role follows below.
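For readers unfamiliar with what the "visual encoder" role involves, here is a minimal sketch of extracting CLIP image embeddings from a video frame with the `transformers` library. It is an illustration only, not code from the Hear-Your-Click repository: the file name and variable names are placeholders, and the actual framework relies on mask-guided, object-level features (its MVE module) rather than plain global embeddings.

```python
# Minimal sketch: openai/clip-vit-base-patch32 as a generic visual feature extractor.
# Placeholder inputs; NOT the Hear-Your-Click integration itself.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("example_frame.png")  # placeholder: one frame sampled from a video
inputs = processor(images=frame, return_tensors="pt")

with torch.no_grad():
    # Pooled, projected image embedding (512-dimensional for ViT-B/32)
    image_embedding = model.get_image_features(**inputs)

print(image_embedding.shape)  # torch.Size([1, 512])
```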
@@ -1,10 +1,15 @@
 ---
 tags:
 - vision
+- clip
+- video-to-audio
 widget:
 - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
   candidate_labels: playing music, playing sports
   example_title: Cat & Dog
+license: mit
+library_name: transformers
+pipeline_tag: zero-shot-image-classification
 ---
 
 # Model Card: CLIP
@@ -25,13 +30,11 @@ The model uses a ViT-B/32 Transformer architecture as an image encoder and uses
 
 The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.
 
-
 ### Documents
 
 - [Blog Post](https://openai.com/blog/clip/)
 - [CLIP Paper](https://arxiv.org/abs/2103.00020)
 
-
 ### Use with Transformers
 
 ```python3
@@ -53,7 +56,6 @@ logits_per_image = outputs.logits_per_image # this is the image-text similarity
 probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
 ```
 
-
 ## Model Use
 
 ### Intended Use
@@ -74,8 +76,6 @@ Certain use cases which would fall under the domain of surveillance and facial r
 
 Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.
 
-
-
 ## Data
 
 The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users.
@@ -84,8 +84,6 @@ The model was trained on publicly available image-caption data. This was done th
 
 Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset.
 
-
-
 ## Performance and Limitations
 
 ### Performance
@@ -136,7 +134,34 @@ We find that the performance of CLIP - and the specific biases it exhibits - can
 
 We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (We default to using race categories as they are constructed in the Fairface dataset.) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks and not to demonstrate an endorsement/enthusiasm for such tasks.
 
+## Application: Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation
+
+This `openai/clip-vit-base-patch32` model is utilized as a fundamental visual encoder component within the novel "Hear-Your-Click" framework.
 
+**Hear-Your-Click** is an interactive Video-to-Audio (V2A) framework enabling users to generate sounds for specific objects by clicking on the frame. This work was presented in the paper:
+
+[**Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation**](https://huggingface.co/papers/2507.04959)
+
+**Abstract of the Hear-Your-Click paper:**
+Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods relying on global video information struggle with complex scenes and generating audio tailored to specific objects. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework enabling users to generate sounds for specific objects by clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with audio. Furthermore, we tailor two data augmentation strategies, Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), to enhance the model's sensitivity to segmented objects. To measure audio-visual correspondence, we designed a new evaluation metric, the CAV score. Extensive experiments demonstrate that our framework offers more precise control and improves generation performance across various metrics.
+
+You can find the official code and more details on the [Hear-Your-Click GitHub repository](https://github.com/SynapGrid/Hear-Your-Click-2024.git). For usage of the full Hear-Your-Click system, please refer to their repository.
+
+## Citation
+
+If you find the **Hear-Your-Click** work useful for your research or applications, please cite our work:
+
+```bibtex
+@misc{liang2025hearyourclickinteractivevideotoaudiogeneration,
+      title={Hear-Your-Click: Interactive Video-to-Audio Generation via Object-aware Contrastive Audio-Visual Fine-tuning},
+      author={Yingshan Liang and Keyu Fan and Zhicheng Du and Yiran Wang and Qingyang Shi and Xinyu Zhang and Jiasheng Lu and Peiwu Qin},
+      year={2025},
+      eprint={2507.04959},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2507.04959},
+}
+```
 
 ## Feedback
 