Tags: Audio-Text-to-Text · Transformers · Safetensors · English · llava_llama · text-generation
nielsr (HF Staff) committed · Commit fe48c05 · verified · 1 Parent(s): 4ef61b7

Add comprehensive model card for LLaSO-Base-3.8B-Instruct with pipeline tag, library name, and dataset links


This PR significantly enhances the model card for `LLaSO-Base-3.8B-Instruct`, a foundational model from the LLaSO framework for Large Language and Speech Models.

Key improvements include:
- Adding the `pipeline_tag: audio-text-to-text`, making the model discoverable in relevant searches on the Hugging Face Hub (e.g., https://huggingface.co/models?pipeline_tag=audio-text-to-text).
- Specifying `library_name: transformers`, which enables the automated "how to use" widget on the model page with a standard `transformers` code snippet.
- Including the `language: en` tag, as indicated in the model's configuration.
- Linking the model to its official paper: [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418).
- Providing explicit links to the associated datasets (`LLaSO-Align`, `LLaSO-Instruct`, `LLaSO-Eval`) in the metadata and content.
- Adding a comprehensive overview of the LLaSO framework and the LLaSO-Base model's key features, adapted from the original GitHub repository.
- Including a practical code example for inference using the `transformers` library, which is directly compatible with the automated widget.
- Linking to the official GitHub repository for further details and code.

Please review these additions for accuracy and completeness.

Files changed (1): README.md (+136 / -3)
README.md CHANGED
@@ -1,3 +1,136 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: audio-text-to-text
+ library_name: transformers
+ language: en
+ datasets:
+ - YirongSun/LLaSO-Align
+ - YirongSun/LLaSO-Instruct
+ - YirongSun/LLaSO-Eval
+ ---
+
+ # LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models
+
+ This repository contains **LLaSO-Base-3.8B-Instruct**, a 3.8B-parameter reference model from the **LLaSO** framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs).
+
+ LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks.
+
+ <p align="center">
+ <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Align"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Align-16a085.svg" alt="HF Align"></a>
+ <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Instruct"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Instruct-1abc9c.svg" alt="HF Ins"></a>
+ <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Eval"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Eval-27ae60.svg" alt="HF Eval"></a>
+ <br>
+ <a href="https://huggingface.co/papers/2508.15418"><img src="https://img.shields.io/badge/arXiv-2508.15418-B31B1B.svg" alt="arXiv"></a>
+ <a href="https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct"><img src="https://img.shields.io/badge/HuggingFace-Model-ffcc00.svg" alt="HF Model"></a>
+ <a href="https://github.com/EIT-NLP/LLaSO"><img src="https://img.shields.io/github/stars/EIT-NLP/LLaSO?style=social" alt="GitHub Stars"></a>
+ </p>
+
+ * **Paper:** [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418)
+ * **Code & Project Page:** [https://github.com/EIT-NLP/LLaSO](https://github.com/EIT-NLP/LLaSO)
+
+ ## 🔍 What is LLaSO?
+ **LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.**
+
+ The framework provides three essential data resources, together with a reference model (a sketch for loading the datasets from the Hub follows this list):
+ - **LLaSO-Align (12.0M):** An ASR-based alignment corpus for grounding speech in textual semantic space.
+ - **LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs):** A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives.
+ - **LLaSO-Eval (15,044):** A reproducible benchmark for standardized evaluation, particularly for instruction-following and cross-modality generalization.
+ - **LLaSO-Base (3.8B):** This model, a two-stage-trained reference model adapted from LLaVA-style architectures for robust compositional understanding.
+
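+ All three datasets are hosted on the Hugging Face Hub and can be pulled with the 🤗 `datasets` library. Below is a minimal sketch; the split names and record fields are assumptions, so check each dataset card for the actual schema:
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the corpora so nothing has to be fully downloaded up front.
+ # The "train" split name is an assumption; see the dataset cards.
+ align = load_dataset("YirongSun/LLaSO-Align", split="train", streaming=True)
+ instruct = load_dataset("YirongSun/LLaSO-Instruct", split="train", streaming=True)
+ eval_bench = load_dataset("YirongSun/LLaSO-Eval", split="train", streaming=True)
+
+ # Peek at one record to inspect the actual fields before building a dataloader.
+ print(next(iter(eval_bench)))
+ ```
+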
+ <p align="center">
+ <img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/radar.png" width="600" alt="LLaSO overall performance">
+ </p>
+ <p align="center"><i>
+ LLaSO-Base achieves a strong normalized overall score on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
+ </i></p>
+
+ ## ✨ Key Features
+ - **Fully Open, End-to-End Stack:** Unified release of corpus, benchmark, and model enabling open-source research and fair comparison in speech-language modeling.
+ - **25.5M Samples, 20 Tasks, 3 Modality Configurations:** Supports all major text ↔ audio combinations (text + audio, audio + text, pure audio), covering linguistic, semantic, and paralinguistic tasks.
+ - **Stratified Evaluation (15,044):** Cohesive design between training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
+ - **Robust Reference Model (3.8B):** Two-stage training (ASR alignment → instruction tuning), easily reproducible and extensible for further research.
+ - **Empirical Insights:** Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps.
+
+ <p align="center">
+ <img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/architecture_trim.png" width="350" alt="Architecture & Two-Stage Training (Figure 6)"><br>
+ <i>Architecture & Two-Stage Training</i>
+ </p>
+
+ ## 🚀 Usage
+
+ You can use this model with the `transformers` library. Here's a quick example for inference:
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, AutoModelForCausalLM
+ import librosa
+ import soundfile as sf
+ import os
+ import numpy as np
+
+ # Load the model and processor
+ model_path = "YirongSun/LLaSO-Base-3.8B-Instruct"
+ processor = AutoProcessor.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+ model.eval()
+
+ # Example audio input (replace with your own audio file).
+ # For demonstration, create a 5-second dummy audio file.
+ dummy_audio_path = "dummy_audio.wav"
+ sr = 16000
+ duration = 5  # seconds
+ dummy_audio_data = (np.random.rand(sr * duration) * 0.5).astype(np.float32)
+ sf.write(dummy_audio_path, dummy_audio_data, sr)
+
+ # Load the audio and process it
+ audio, rate = librosa.load(dummy_audio_path, sr=sr)
+ audio_inputs = processor(audio=audio, sampling_rate=rate, return_tensors="pt")
+
+ # Example text prompt.
+ # The LLaSO models are Llama-3-based, so the prompt spells out the Llama-3 chat template.
+ # Audio placeholder tokens ("<audio_start>" / "<audio_end>") are usually handled internally
+ # when `audio_values` are passed, so here we pass `audio_values` separately,
+ # as is common for multimodal models.
+ prompt = (
+     "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
+     "Transcribe the audio.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+ )
+ text_inputs = processor(text=prompt, return_tensors="pt")
+
+ # Combine text and audio inputs
+ inputs = {
+     "input_ids": text_inputs.input_ids.to(model.device),
+     "attention_mask": text_inputs.attention_mask.to(model.device),
+     "audio_values": audio_inputs.audio_values.to(model.device),
+ }
+
+ # Generate a response
+ with torch.inference_mode():
+     outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9)
+
+ # Decode and print
+ decoded_output = processor.decode(outputs[0], skip_special_tokens=True)
+ print(f"Generated Text: {decoded_output}")
+
+ # Clean up the dummy audio file
+ os.remove(dummy_audio_path)
+ ```
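+
+ The snippet above uses the text-instruction + audio-input configuration. LLaSO's other two modality configurations (audio instruction + text input, and pure audio) follow the same pattern by changing what goes into the text prompt versus the audio stream. Below is a hedged continuation of the example above; the exact prompt conventions are defined in the LLaSO GitHub repository, so treat these prompts as illustrative only:
+
+ ```python
+ # Audio instruction + text input (illustrative): the recording carries the spoken
+ # instruction, while the textual content to operate on goes into the user turn.
+ prompt_text_input = (
+     "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
+     "Text input: The quick brown fox jumps over the lazy dog."
+     "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+ )
+
+ # Pure audio (illustrative): instruction and content are both spoken, so the user
+ # turn carries no additional text (an assumption; see the repository for details).
+ prompt_pure_audio = (
+     "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
+     "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+ )
+
+ # Reuse the generation code from the example above, swapping in the chosen
+ # prompt and the corresponding audio clip.
+ ```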
+
+ For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the [LLaSO GitHub repository](https://github.com/EIT-NLP/LLaSO).
+
+ ## 📑 How to Cite
+ If you use LLaSO in your research or applications, please cite our paper:
+
+ ```bibtex
+ @misc{sun2025llaso,
+     title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model},
+     author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
+     year={2025},
+     eprint={2508.15418},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2508.15418},
+ }
+ ```