---
library_name: transformers
license: apache-2.0
base_model: t5-base
tags:
- text2text-generation
- music
- spotify
- audio-features
- t5
language:
- en
datasets:
- custom
metrics:
- mae
- mse
- correlation
---
# T5 Spotify Features Generator
A fine-tuned T5-base model that generates Spotify audio features from natural language music descriptions.
## Model Details
### Model Description
This model converts natural language descriptions of music preferences into Spotify audio feature values. For example, "energetic dance music for a party" becomes `"danceability": 0.9, "energy": 0.9, "valence": 0.9`.
- **Developed by:** afsagag
- **Model type:** Text-to-Text Generation (T5)
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from model:** [t5-base](https://huggingface.co/t5-base)
### Model Sources
- **Repository:** https://huggingface.co/afsagag/t5-spotify-features-generator
## Uses
### Direct Use
Generate Spotify audio features from music descriptions for:
- Music recommendation systems
- Playlist generation
- Music discovery applications
- Audio feature prediction research
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load the fine-tuned model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("afsagag/t5-spotify-features-generator")
tokenizer = T5Tokenizer.from_pretrained("afsagag/t5-spotify-features-generator")

def generate_spotify_features(prompt, model, tokenizer):
    # The model expects inputs in the form "prompt: {description}"
    input_text = f"prompt: {prompt}"
    input_ids = tokenizer(input_text, return_tensors="pt", max_length=256, truncation=True).input_ids

    # Deterministic beam-search decoding
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_length=256,
            num_beams=4,
            early_stopping=True,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result

# Example usage
prompt = "I need energetic dance music for a party"
features = generate_spotify_features(prompt, model, tokenizer)
print(features)  # Output: "danceability": 0.9, "energy": 0.9, "valence": 0.9
```
### Out-of-Scope Use
- Generating actual audio or music files
- Non-English music descriptions (the model was trained on English only)
- Precise music recommendation without human oversight
- Applications requiring guaranteed JSON format output
## Bias, Risks, and Limitations
- **Training Data Bias:** Reflects patterns in the training dataset and may not represent all musical styles or cultural contexts
- **JSON Format Issues:** May occasionally generate incomplete JSON objects
- **Subjective Features:** Audio features like "valence" and "energy" are subjective and may not align with all listeners' perceptions
- **Western Music Bias:** Training focused on Western musical concepts and terminology
### Recommendations
- Validate generated features against expected ranges
- Use as a starting point rather than definitive feature values
- Consider cultural and stylistic diversity when applying to diverse music catalogs
- Implement post-processing to ensure valid JSON output if required (see the sketch below)
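
A minimal sketch of that post-processing step, assuming the raw model output is a comma-separated list of `"feature": value` pairs as in the examples above; the `parse_features` helper is illustrative and not part of the released code:

```python
import json

def parse_features(raw_output: str) -> dict:
    """Turn raw output like '"danceability": 0.9, "energy": 0.9' into a dict."""
    try:
        # The model emits key/value pairs without surrounding braces, so wrap them first.
        return json.loads("{" + raw_output + "}")
    except json.JSONDecodeError:
        # The model may occasionally emit malformed output; fall back to an empty dict.
        return {}

print(parse_features('"danceability": 0.9, "energy": 0.9, "valence": 0.9'))
# {'danceability': 0.9, 'energy': 0.9, 'valence': 0.9}
```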
## Training Details
### Training Data
A custom dataset of 4,206 examples pairing natural language music descriptions with Spotify audio features:
- **Training set:** 3,364 examples
- **Validation set:** 421 examples
- **Test set:** 421 examples
### Training Procedure
#### Training Hyperparameters
- **Training epochs:** 5
- **Learning rate:** 2e-4
- **Batch size:** 32 (train), 16 (eval)
- **Gradient accumulation steps:** 2
- **LR scheduler:** Cosine with 5% warmup
- **Max sequence length:** 256 tokens
- **Training regime:** bf16 mixed precision
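
A hedged sketch of `Seq2SeqTrainingArguments` consistent with the hyperparameters listed above; the actual training script was not released, so the output directory and any omitted options are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-spotify-features-generator",  # assumed path, not from the original script
    num_train_epochs=5,
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,  # 5% warmup
    bf16=True,          # bf16 mixed precision
)
# Note: the 256-token max sequence length is applied at tokenization time, not here.
```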
#### Speeds, Sizes, Times
- **Training time:** ~58 minutes
- **Final training loss:** 0.5579
- **Model size:** ~892MB
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The held-out test split (421 examples), drawn from the same distribution as the training data: natural language music descriptions paired with Spotify audio features.
#### Metrics
- Mean Absolute Error (MAE) between predicted and actual feature values
- Mean Squared Error (MSE) for regression accuracy
- Pearson correlation coefficients for individual features
- Valid JSON ratio for output format correctness
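
As an illustration, the sketch below computes these metrics from lists of generated and reference output strings. It reuses the `parse_features` helper from the post-processing sketch above and is not the evaluation script behind the reported results:

```python
import numpy as np

def evaluate_outputs(generated, reference):
    """generated/reference: equally long lists of raw model and ground-truth output strings."""
    parsed_pairs, valid = [], 0
    for gen, ref in zip(generated, reference):
        gen_feats, ref_feats = parse_features(gen), parse_features(ref)
        if gen_feats:
            valid += 1  # counts outputs that parse into valid JSON
        parsed_pairs.append((gen_feats, ref_feats))

    errors = []
    per_feature = {}  # feature name -> (predicted values, actual values)
    for gen_feats, ref_feats in parsed_pairs:
        for name, actual in ref_feats.items():
            if name in gen_feats:
                predicted = gen_feats[name]
                errors.append(predicted - actual)
                per_feature.setdefault(name, ([], []))
                per_feature[name][0].append(predicted)
                per_feature[name][1].append(actual)

    errors = np.array(errors)
    correlations = {
        name: float(np.corrcoef(pred, act)[0, 1])
        for name, (pred, act) in per_feature.items()
        if len(pred) > 1
    }
    return {
        "mae": float(np.mean(np.abs(errors))),
        "mse": float(np.mean(errors ** 2)),
        "valid_json_ratio": valid / len(generated),
        "pearson": correlations,
    }
```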
### Results
The qualitative examples below illustrate how the model maps common musical concepts to feature values:
| Prompt | Generated Features |
|--------|-------------------|
| "I need energetic dance music for a party" | `"danceability": 0.9, "energy": 0.9, "valence": 0.9` |
| "Play calm acoustic songs for studying" | `"acousticness": 0.8, "energy": 0.2, "valence": 0.2` |
| "Upbeat music for working out" | `"danceability": 0.7, "energy": 0.8, "valence": 0.7` |
| "Relaxing instrumental background music" | `"acousticness": 0.3, "energy": 0.2, "instrumentalness": 0.8, "valence": 0.2` |
| "Happy pop music for driving" | `"danceability": 0.8, "energy": 0.8, "valence": 0.8` |
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** T5 (Text-To-Text Transfer Transformer)
- **Model Size:** t5-base (220M parameters)
- **Objective:** Sequence-to-sequence generation of audio features from text descriptions
- **Input Format:** `"prompt: {natural_language_description}"`
- **Output Format:** JSON-style audio feature values
### Compute Infrastructure
#### Hardware
- GPU with CUDA support
- Mixed precision training (bf16)
#### Software
- PyTorch with CUDA
- Transformers library
- Datasets library for data processing
## Spotify Audio Features Reference
The model generates these Spotify audio features:
- **danceability** (0.0-1.0): How suitable a track is for dancing
- **energy** (0.0-1.0): Perceptual measure of intensity and power
- **valence** (0.0-1.0): Musical positivity (happy vs sad)
- **acousticness** (0.0-1.0): Confidence measure of acoustic nature
- **instrumentalness** (0.0-1.0): Predicts absence of vocals
- **speechiness** (0.0-1.0): Presence of spoken words
- **liveness** (0.0-1.0): Presence of live audience
- **loudness** (dB): Overall loudness, typically -60 to 0 dB
- **tempo** (BPM): Estimated beats per minute
- **duration_ms**: Track duration in milliseconds
- **key** (0-11): Musical key (C=0, C♯/D♭=1, etc.)
- **mode** (0-1): Modality (0=minor, 1=major)
- **time_signature** (3-7): Time signature
- **popularity** (0-100): Spotify popularity score
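
The ranges above can also be collected into a small validation table, as recommended earlier. A minimal sketch; the `FEATURE_RANGES` table and `out_of_range` helper are illustrative, not part of the released code:

```python
# Expected ranges for generated features, per the reference list above.
# tempo and duration_ms are positive but have no fixed upper bound, so they are omitted.
FEATURE_RANGES = {
    "danceability": (0.0, 1.0), "energy": (0.0, 1.0), "valence": (0.0, 1.0),
    "acousticness": (0.0, 1.0), "instrumentalness": (0.0, 1.0),
    "speechiness": (0.0, 1.0), "liveness": (0.0, 1.0),
    "loudness": (-60.0, 0.0), "key": (0, 11), "mode": (0, 1),
    "time_signature": (3, 7), "popularity": (0, 100),
}

def out_of_range(features: dict) -> dict:
    """Return the subset of features whose values fall outside the expected range."""
    flagged = {}
    for name, value in features.items():
        if name in FEATURE_RANGES and isinstance(value, (int, float)):
            low, high = FEATURE_RANGES[name]
            if not low <= value <= high:
                flagged[name] = value
    return flagged

print(out_of_range({"danceability": 1.3, "energy": 0.8}))  # {'danceability': 1.3}
```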
## Citation
```bibtex
@misc{t5-spotify-features-generator,
author = {afsagag},
title = {T5 Spotify Features Generator: Fine-tuned T5 for Music Feature Prediction from Natural Language},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/afsagag/t5-spotify-features-generator}}
}
```
## Model Card Authors
afsagag
## Model Card Contact
Contact through Hugging Face profile: [@afsagag](https://huggingface.co/afsagag)