T5 Spotify Features Generator
A fine-tuned T5-base model that generates Spotify audio features from natural language music descriptions.
Model Details
Model Description
This model converts natural language descriptions of music preferences into Spotify audio feature values. For example, "energetic dance music for a party" becomes "danceability": 0.9, "energy": 0.9, "valence": 0.9.
- Developed by: afsagag
- Model type: Text-to-Text Generation (T5)
- Language(s): English
- License: Apache-2.0
- Finetuned from model: t5-base
Uses
Direct Use
Generate Spotify audio features from music descriptions for:
- Music recommendation systems
- Playlist generation
- Music discovery applications
- Audio feature prediction research
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("afsagag/t5-spotify-features-generator")
tokenizer = T5Tokenizer.from_pretrained("afsagag/t5-spotify-features-generator")

def generate_spotify_features(prompt, model, tokenizer):
    """Return the raw feature string the model generates for a description."""
    # The model expects the "prompt: " prefix used during fine-tuning
    input_text = f"prompt: {prompt}"
    input_ids = tokenizer(input_text, return_tensors="pt", max_length=256, truncation=True).input_ids

    # Deterministic beam search; no gradients are needed for inference
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_length=256,
            num_beams=4,
            early_stopping=True,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result

# Example usage
prompt = "I need energetic dance music for a party"
features = generate_spotify_features(prompt, model, tokenizer)
print(features)  # Output: "danceability": 0.9, "energy": 0.9, "valence": 0.9
Out-of-Scope Use
- Generating actual audio or music files
- Non-English music descriptions (model trained on English only)
- Precise music recommendation without human oversight
- Applications requiring guaranteed JSON format output
Bias, Risks, and Limitations
- Training Data Bias: The model reflects patterns in its training dataset and may not represent all musical styles or cultural contexts
- JSON Format Issues: May occasionally generate incomplete JSON objects
- Subjective Features: Audio features like "valence" and "energy" are subjective and may not align with all listeners' perceptions
- Western Music Bias: Training focused on Western musical concepts and terminology
Recommendations
- Validate generated features against expected ranges
- Use as a starting point rather than definitive feature values
- Consider cultural and stylistic diversity when applying to diverse music catalogs
- Implement post-processing to ensure valid JSON output if required
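The post-processing recommendation can be sketched as follows. This is a minimal example, assuming the model emits comma-separated "name": value pairs without enclosing braces, as in the sample outputs in this card. The helper name parse_features is hypothetical and is not part of the model repository.

```python
import json
import re

# Hypothetical helper, not part of the model repository.
# Matches a quoted feature name followed by a complete number that is
# terminated by a comma, a closing brace, or the end of the string.
_PAIR = re.compile(r'"([A-Za-z_]+)"\s*:\s*(-?\d+(?:\.\d+)?)(?=\s*(?:,|\}|$))')

def parse_features(raw: str) -> dict:
    """Parse raw model output into a dict mapping feature names to floats.

    Strict JSON parsing is tried first, after wrapping bare pairs in braces.
    If that fails (for example on truncated output), only the complete
    "name": number pairs are recovered with a regular expression.
    """
    text = raw.strip()
    if not text.startswith("{"):
        text = "{" + text
    if not text.endswith("}"):
        text = text + "}"
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return {
                key: float(value)
                for key, value in parsed.items()
                if isinstance(value, (int, float)) and not isinstance(value, bool)
            }
    except json.JSONDecodeError:
        pass  # fall through to the lenient regex-based recovery
    return {name: float(value) for name, value in _PAIR.findall(raw)}

# Well-formed output parses completely; truncated output keeps only whole pairs.
complete = parse_features('"danceability": 0.9, "energy": 0.9, "valence": 0.9')
partial = parse_features('"danceability": 0.9, "ener')
```

The regex fallback deliberately drops a trailing pair whose number was cut off mid-value, since a partial number would silently misreport the feature.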
Training Details
Training Data
Custom dataset of 4,206 examples pairing natural language music descriptions with Spotify audio features:
- Training set: 3,364 examples
- Validation set: 421 examples
- Test set: 421 examples
Training Procedure
Training Hyperparameters
- Training epochs: 5
- Learning rate: 2e-4
- Batch size: 32 (train), 16 (eval)
- Gradient accumulation steps: 2
- LR scheduler: Cosine with 5% warmup
- Max sequence length: 256 tokens
- Training regime: bf16 mixed precision
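To make the hyperparameters above concrete, the following sketch derives the step counts and a cosine schedule with linear warmup in plain Python. It assumes the batch size of 32 applies per optimizer micro-step on a single device, that the final partial batch is kept, and that warmup steps are rounded up. None of these details are confirmed by this card, and the actual trainer and scheduler implementation may differ.

```python
import math

# Values taken from this card
TRAIN_EXAMPLES = 3364
BATCH_SIZE = 32
GRAD_ACCUM = 2
EPOCHS = 5
PEAK_LR = 2e-4
WARMUP_RATIO = 0.05

# Derived quantities (assumptions: single device, last partial batch kept)
effective_batch = BATCH_SIZE * GRAD_ACCUM                    # 64
batches_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)   # 106
steps_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)  # 53
total_steps = steps_per_epoch * EPOCHS                       # 265
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)         # 14

def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the schedule at a few representative steps
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at 2e-4 after 14 optimizer steps and decays to zero by step 265.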
Speeds, Sizes, Times
- Training time: ~58 minutes
- Final training loss: 0.5579
- Model size: ~892MB
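As a rough sanity check on the reported model size, the sketch below estimates checkpoint size from the parameter count. It assumes weights are stored as 32-bit floats and uses the rounded 220M parameter figure given elsewhere in this card; neither the storage dtype nor the exact parameter count is stated here.

```python
PARAMS = 220_000_000   # rounded parameter count for t5-base, as given in this card
BYTES_PER_PARAM = 4    # assumption: float32 storage

size_mb = PARAMS * BYTES_PER_PARAM / 1e6
print(f"{size_mb:.0f} MB")  # prints "880 MB"
```

The estimate lands in the same range as the ~892MB reported above, which is consistent with 220M being a rounded count.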
Evaluation
Testing Data, Factors & Metrics
Testing Data
The held-out test set of 421 examples, drawn from the same distribution as the training data: natural language music descriptions paired with Spotify audio features.
Metrics
- Mean Absolute Error (MAE) between predicted and actual feature values
- Mean Squared Error (MSE) for regression accuracy
- Pearson correlation coefficients for individual features
- Valid JSON ratio for output format correctness
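The metrics listed above can be computed with the standard library alone. The sketch below is illustrative and is not the author's evaluation script. In particular, it assumes an output counts as "valid JSON" if it parses as a JSON object once bare pairs are wrapped in braces, and all input values shown are made up.

```python
import json
import math

def mean_absolute_error(pred, true):
    """Average absolute difference between predicted and reference values."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def mean_squared_error(pred, true):
    """Average squared difference between predicted and reference values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def pearson_r(x, y):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")
    return cov / (norm_x * norm_y)

def valid_json_ratio(outputs):
    """Fraction of raw outputs that parse as a JSON object (assumed definition)."""
    def is_valid(raw):
        text = raw.strip()
        if not text.startswith("{"):
            text = "{" + text + "}"
        try:
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False
    return sum(is_valid(o) for o in outputs) / len(outputs)

# Made-up values for one feature (e.g. energy) across four test prompts
predicted = [0.9, 0.2, 0.8, 0.3]
reference = [0.8, 0.4, 0.7, 0.2]
print(mean_absolute_error(predicted, reference))
print(mean_squared_error(predicted, reference))
print(pearson_r(predicted, reference))
print(valid_json_ratio(['"energy": 0.9', '"energy": 0.']))
```

Pearson correlation is computed per feature across examples, so a feature that appears in only a few outputs yields a less reliable coefficient.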
Results
Sample generations suggest the model maps common musical concepts to plausible feature values:
Prompt | Generated Features |
---|---|
"I need energetic dance music for a party" | "danceability": 0.9, "energy": 0.9, "valence": 0.9 |
"Play calm acoustic songs for studying" | "acousticness": 0.8, "energy": 0.2, "valence": 0.2 |
"Upbeat music for working out" | "danceability": 0.7, "energy": 0.8, "valence": 0.7 |
"Relaxing instrumental background music" | "acousticness": 0.3, "energy": 0.2, "instrumentalness": 0.8, "valence": 0.2 |
"Happy pop music for driving" | "danceability": 0.8, "energy": 0.8, "valence": 0.8 |
Technical Specifications
Model Architecture and Objective
- Base Architecture: T5 (Text-To-Text Transfer Transformer)
- Model Size: t5-base (220M parameters)
- Objective: Sequence-to-sequence generation of audio features from text descriptions
- Input Format: "prompt: {natural_language_description}"
- Output Format: JSON-style audio feature values
Compute Infrastructure
Hardware
- GPU with CUDA support
- Mixed precision training (bf16)
Software
- PyTorch with CUDA
- Transformers library
- Datasets library for data processing
Spotify Audio Features Reference
The model generates these Spotify audio features:
- danceability (0.0-1.0): How suitable a track is for dancing
- energy (0.0-1.0): Perceptual measure of intensity and power
- valence (0.0-1.0): Musical positivity (happy vs sad)
- acousticness (0.0-1.0): Confidence measure of acoustic nature
- instrumentalness (0.0-1.0): Predicts absence of vocals
- speechiness (0.0-1.0): Presence of spoken words
- liveness (0.0-1.0): Presence of live audience
- loudness (dB): Overall loudness, typically -60 to 0 dB
- tempo (BPM): Estimated beats per minute
- duration_ms: Track duration in milliseconds
- key (0-11): Musical key (C=0, C♯/D♭=1, etc.)
- mode (0-1): Modality (0=minor, 1=major)
- time_signature (3-7): Time signature
- popularity (0-100): Spotify popularity score
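The ranges in this reference can be turned into a simple validity check for generated features. The sketch below is illustrative only: the names FEATURE_RANGES and check_features are hypothetical, the loudness range treats the "typical" bounds as hard limits, and the non-negative lower bound for tempo and duration_ms is an assumption since this card gives no range for them.

```python
# Hypothetical lookup table built from the reference list above.
# None means no upper bound is stated in this card.
FEATURE_RANGES = {
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "valence": (0.0, 1.0),
    "acousticness": (0.0, 1.0),
    "instrumentalness": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "liveness": (0.0, 1.0),
    "loudness": (-60.0, 0.0),    # typical range, treated here as a hard limit
    "tempo": (0.0, None),        # assumption: non-negative BPM
    "duration_ms": (0.0, None),  # assumption: non-negative duration
    "key": (0, 11),
    "mode": (0, 1),
    "time_signature": (3, 7),
    "popularity": (0, 100),
}

def check_features(features: dict) -> list:
    """Return a list of problems found; an empty list means all values are in range."""
    problems = []
    for name, value in features.items():
        if name not in FEATURE_RANGES:
            problems.append(f"unknown feature: {name}")
            continue
        low, high = FEATURE_RANGES[name]
        if value < low or (high is not None and value > high):
            problems.append(f"{name}={value} outside [{low}, {high}]")
    return problems

# Example with one out-of-range value
print(check_features({"danceability": 0.9, "energy": 1.4}))
```

Returning a list of problems rather than raising lets a caller decide whether to clamp, drop, or regenerate an out-of-range value.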
Citation
@misc{t5-spotify-features-generator,
  author = {afsagag},
  title = {T5 Spotify Features Generator: Fine-tuned T5 for Music Feature Prediction from Natural Language},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/afsagag/t5-spotify-features-generator}}
}
Model Card Authors
afsagag
Model Card Contact
Contact through Hugging Face profile: @afsagag