T5 Spotify Features Generator

A fine-tuned T5-base model that generates Spotify audio features from natural language music descriptions.

Model Details

Model Description

This model converts natural language descriptions of music preferences into Spotify audio feature values. For example, "energetic dance music for a party" becomes "danceability": 0.9, "energy": 0.9, "valence": 0.9.

Developed by: afsagag
Model type: Text-to-Text Generation (T5)
Language(s): English
License: Apache-2.0
Finetuned from model: t5-base

Model Sources

Repository: https://huggingface.co/afsagag/t5-spotify-features-generator

Uses

Direct Use

Generate Spotify audio features from music descriptions for:

Music recommendation systems
Playlist generation
Music discovery applications
Audio feature prediction research

from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("afsagag/t5-spotify-features-generator")
tokenizer = T5Tokenizer.from_pretrained("afsagag/t5-spotify-features-generator")

def generate_spotify_features(prompt, model, tokenizer):
    """Generate a JSON-style feature string for a natural language music description."""
    # The model expects inputs prefixed with "prompt: "
    input_text = f"prompt: {prompt}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=256, truncation=True)

    # Deterministic beam search; gradients are not needed for inference
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=256,
            num_beams=4,
            early_stopping=True,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode the highest-scoring beam back to text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
prompt = "I need energetic dance music for a party"
features = generate_spotify_features(prompt, model, tokenizer)
print(features)  # Example output: "danceability": 0.9, "energy": 0.9, "valence": 0.9

Out-of-Scope Use

Generating actual audio or music files
Non-English music descriptions (model trained on English only)
Precise music recommendation without human oversight
Applications requiring guaranteed JSON format output

Bias, Risks, and Limitations

Training Data Bias: Reflects patterns in the training dataset and may not represent all musical styles or cultural contexts
JSON Format Issues: Output is a JSON-style fragment (the sample outputs are key-value pairs without enclosing braces) and may occasionally be incomplete or malformed
Subjective Features: Audio features like "valence" and "energy" are subjective and may not align with all listeners' perceptions
Western Music Bias: Training focused on Western musical concepts and terminology

Recommendations

Validate generated features against expected ranges
Use as a starting point rather than definitive feature values
Account for cultural and stylistic differences when applying the model to diverse music catalogs
Implement post-processing to ensure valid JSON output if required
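
The post-processing recommended above could look something like the following. This is a minimal sketch, not the author's method: `parse_features` is a hypothetical helper, and it assumes the output has the shape shown in the examples in this card (quoted feature names followed by numbers, usually without enclosing braces).

```python
import json
import re

def parse_features(raw: str) -> dict:
    """Parse output such as '"energy": 0.9, "valence": 0.8' into a dict of floats."""
    text = raw.strip()
    # Assumption: braces are often missing, so add them before parsing.
    if not text.startswith("{"):
        text = "{" + text
    if not text.endswith("}"):
        text = text + "}"
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return {k: float(v) for k, v in parsed.items()
                    if isinstance(v, (int, float))}
    except json.JSONDecodeError:
        pass
    # Fallback for malformed output: keep only pairs that are clearly complete,
    # i.e. a number followed by a comma or closing brace. A number cut off at
    # the end of the string is dropped because it may be truncated.
    pairs = re.findall(r'"(\w+)"\s*:\s*(-?\d+(?:\.\d+)?)(?=\s*[,}])', raw)
    return {name: float(value) for name, value in pairs}
```

Dropping a trailing, possibly truncated value is a conservative choice; an application that prefers partial values over missing ones could relax the lookahead.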

Training Details

Training Data

Custom dataset of 4,206 examples pairing natural language music descriptions with Spotify audio features, split roughly 80/10/10:

Training set: 3,364 examples
Validation set: 421 examples
Test set: 421 examples

Training Procedure

Training Hyperparameters

Training epochs: 5
Learning rate: 2e-4
Batch size: 32 (train), 16 (eval)
Gradient accumulation steps: 2
LR scheduler: Cosine with 5% warmup
Max sequence length: 256 tokens
Training regime: bf16 mixed precision
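
To make these settings concrete, the sketch below works out the effective batch size, the number of optimizer steps, and the shape of the learning-rate curve. It rests on assumptions the card does not state: a single device, a train batch size of 32 per forward pass, warmup steps rounded up, and a half-cosine decay to zero after linear warmup. All names are illustrative.

```python
import math

TRAIN_EXAMPLES = 3364   # training set size from this card
BATCH_SIZE = 32         # assumed to be per forward pass on one device
GRAD_ACCUM = 2
EPOCHS = 5
PEAK_LR = 2e-4
WARMUP_RATIO = 0.05

effective_batch = BATCH_SIZE * GRAD_ACCUM                       # 64
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)   # 53
total_steps = steps_per_epoch * EPOCHS                          # 265
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)            # 14

def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for step in (0, warmup_steps, total_steps // 2, total_steps):
        print(step, lr_at(step))
```

Under these assumptions the model sees about 265 optimizer updates in total, with the learning rate peaking at 2e-4 after 14 steps.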

Speeds, Sizes, Times

Training time: ~58 minutes
Final training loss: 0.5579
Model size: ~892MB (consistent with roughly 220M parameters stored as 32-bit floats)

Evaluation

Testing Data, Factors & Metrics

Testing Data

The held-out test set of 421 examples, drawn from the same distribution as the training data: natural language music descriptions paired with Spotify audio features.

Metrics

Mean Absolute Error (MAE) between predicted and actual feature values
Mean Squared Error (MSE) for regression accuracy
Pearson correlation coefficients for individual features
Valid JSON ratio for output format correctness
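
The metrics listed above can be computed with the standard library alone. The sketch below is illustrative rather than the evaluation code used for this model. In particular, how the valid JSON ratio treats outputs without enclosing braces is an assumption: here an output counts as valid if it parses as a JSON object once braces are added.

```python
import json
import math

def mae(pred, true):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def mse(pred, true):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def pearson(pred, true):
    """Pearson correlation; returns NaN if either sequence is constant."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    if sd_p == 0 or sd_t == 0:
        return float("nan")
    return cov / (sd_p * sd_t)

def valid_json_ratio(outputs):
    """Share of outputs that parse as a JSON object (braces added if missing)."""
    def is_valid(text):
        text = text.strip()
        if not text.startswith("{"):
            text = "{" + text
        if not text.endswith("}"):
            text = text + "}"
        try:
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False
    return sum(is_valid(o) for o in outputs) / len(outputs)
```

MAE, MSE, and Pearson correlation would be computed per feature, over the test examples where both the prediction and the reference contain that feature.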

Results

Sample generations suggest that the model maps common musical concepts to plausible feature values:

Prompt	Generated Features
"I need energetic dance music for a party"	`"danceability": 0.9, "energy": 0.9, "valence": 0.9`
"Play calm acoustic songs for studying"	`"acousticness": 0.8, "energy": 0.2, "valence": 0.2`
"Upbeat music for working out"	`"danceability": 0.7, "energy": 0.8, "valence": 0.7`
"Relaxing instrumental background music"	`"acousticness": 0.3, "energy": 0.2, "instrumentalness": 0.8, "valence": 0.2`
"Happy pop music for driving"	`"danceability": 0.8, "energy": 0.8, "valence": 0.8`

Technical Specifications

Model Architecture and Objective

Base Architecture: T5 (Text-To-Text Transfer Transformer)
Model Size: t5-base (220M parameters)
Objective: Sequence-to-sequence generation of audio features from text descriptions
Input Format: "prompt: {natural_language_description}"
Output Format: JSON-style audio feature values

Compute Infrastructure

Hardware

GPU with CUDA support
Mixed precision training (bf16)

Software

PyTorch with CUDA
Transformers library
Datasets library for data processing

Spotify Audio Features Reference

The model generates these Spotify audio features:

danceability (0.0-1.0): How suitable a track is for dancing
energy (0.0-1.0): Perceptual measure of intensity and power
valence (0.0-1.0): Musical positivity (happy vs sad)
acousticness (0.0-1.0): Confidence measure of acoustic nature
instrumentalness (0.0-1.0): Predicts absence of vocals
speechiness (0.0-1.0): Presence of spoken words
liveness (0.0-1.0): Presence of live audience
loudness (dB): Overall loudness, typically -60 to 0 dB
tempo (BPM): Estimated beats per minute
duration_ms: Track duration in milliseconds
key (0-11): Musical key (C=0, C♯/D♭=1, etc.)
mode (0-1): Modality (0=minor, 1=major)
time_signature (3-7): Time signature
popularity (0-100): Spotify popularity score
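
The ranges in this reference can be used to check generated values, as suggested under Recommendations. The following is a minimal sketch with hypothetical helper names. It treats the typical loudness range as a hard bound for simplicity and leaves tempo and duration_ms unchecked because no range is given for them here.

```python
# (low, high) bounds taken from the reference list above.
FEATURE_RANGES = {
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "valence": (0.0, 1.0),
    "acousticness": (0.0, 1.0),
    "instrumentalness": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "liveness": (0.0, 1.0),
    "loudness": (-60.0, 0.0),  # "typical" range, used as a bound here
    "key": (0, 11),
    "mode": (0, 1),
    "time_signature": (3, 7),
    "popularity": (0, 100),
}

def out_of_range(features: dict) -> dict:
    """Return the features whose values fall outside their documented range."""
    bad = {}
    for name, value in features.items():
        if name in FEATURE_RANGES:
            low, high = FEATURE_RANGES[name]
            if not low <= value <= high:
                bad[name] = value
    return bad

def clamp_features(features: dict) -> dict:
    """Clip known features into range; pass unknown or unbounded ones through."""
    clamped = {}
    for name, value in features.items():
        if name in FEATURE_RANGES:
            low, high = FEATURE_RANGES[name]
            value = min(max(value, low), high)
        clamped[name] = value
    return clamped
```

For example, `out_of_range({"energy": 1.2, "valence": 0.5})` flags only `energy`, and `clamp_features` would reduce it to 1.0.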

Citation

@misc{t5-spotify-features-generator,
  author = {afsagag},
  title = {T5 Spotify Features Generator: Fine-tuned T5 for Music Feature Prediction from Natural Language},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/afsagag/t5-spotify-features-generator}}
}

Model Card Authors

afsagag

Model Card Contact

Contact through Hugging Face profile: @afsagag
