	---
	base_model:
	- Qwen/Qwen2.5-1.5B-Instruct
	- google/siglip-so400m-patch14-384
	datasets:
	- weizhiwang/Open-Qwen2VL-Data
	- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
	language:
	- zho
	- eng
	- fra
	- spa
	- por
	- deu
	- ita
	- rus
	- jpn
	- kor
	- vie
	- tha
	- ara
	license: cc
	pipeline_tag: image-text-to-text
	---

	# Model Card for Open-Qwen2VL-base

	Open-Qwen2VL-base is a pre-trained base multimodal model that takes images and text as input and produces text as output. This model is described in the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595). The code is available at [https://github.com/Victorwz/Open-Qwen2VL](https://github.com/Victorwz/Open-Qwen2VL).

	## Updates
	- [4/1/2025] The codebase, model, data, and paper are released.

	<!-- ## Model Details -->

	## How to Use

	This base model is released as a starting point for further fine-tuning on public or custom SFT data. It is not suited to direct task completion without that fine-tuning step.
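
Since the base checkpoint is meant as a starting point for fine-tuning, a first step is usually to fetch the weights locally. The sketch below is a minimal example, assuming the `huggingface_hub` package is installed and that the checkpoint is hosted under the repository id `weizhiwang/Open-Qwen2VL-base`. The helper name `download_checkpoint` and the default `local_dir` are hypothetical, and this card does not document the file layout of the repository. For the actual fine-tuning scripts, refer to the GitHub repository linked above.

```python
from pathlib import Path

# Repository id of this model card on the Hugging Face Hub.
REPO_ID = "weizhiwang/Open-Qwen2VL-base"


def download_checkpoint(local_dir: str = "checkpoints/Open-Qwen2VL-base") -> Path:
    """Download every file in the model repository and return the local folder.

    Hypothetical helper for illustration; it is not part of the Open-Qwen2VL codebase.
    """
    # Imported lazily so this module loads even if huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)


if __name__ == "__main__":
    folder = download_checkpoint()
    # Print the downloaded files; the layout is not documented in this card.
    for item in sorted(folder.rglob("*")):
        if item.is_file():
            print(item.relative_to(folder))
```

The downloaded folder can then be passed to whichever fine-tuning entry point the codebase provides.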

	## Citation
	```bibtex
	@article{Open-Qwen2VL,
	title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
	author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
	journal={arXiv preprint arXiv:2504.00595},
	year={2025}
	}
	```