T5 Spotify Features Generator

A fine-tuned T5-base model that generates Spotify audio features from natural language music descriptions.

Model Details

Model Description

This model converts natural language descriptions of music preferences into Spotify audio feature values. For example, "energetic dance music for a party" becomes "danceability": 0.9, "energy": 0.9, "valence": 0.9.

  • Developed by: afsagag
  • Model type: Text-to-Text Generation (T5)
  • Language(s): English
  • License: Apache-2.0
  • Finetuned from model: t5-base

Uses

Direct Use

Generate Spotify audio features from music descriptions for:

  • Music recommendation systems
  • Playlist generation
  • Music discovery applications
  • Audio feature prediction research

Example inference code:

from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("afsagag/t5-spotify-features-generator")
tokenizer = T5Tokenizer.from_pretrained("afsagag/t5-spotify-features-generator")

def generate_spotify_features(prompt, model, tokenizer):
    input_text = f"prompt: {prompt}"
    input_ids = tokenizer(input_text, return_tensors="pt", max_length=256, truncation=True).input_ids
    
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_length=256,
            num_beams=4,
            early_stopping=True,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result

# Example usage
prompt = "I need energetic dance music for a party"
features = generate_spotify_features(prompt, model, tokenizer)
print(features)  # Output: "danceability": 0.9, "energy": 0.9, "valence": 0.9

Out-of-Scope Use

  • Generating actual audio or music files
  • Non-English music descriptions (model trained on English only)
  • Precise music recommendation without human oversight
  • Applications requiring guaranteed JSON format output

Bias, Risks, and Limitations

  • Training Data Bias: Reflects patterns in the training dataset, may not represent all musical styles or cultural contexts
  • JSON Format Issues: May occasionally generate incomplete JSON objects
  • Subjective Features: Audio features like "valence" and "energy" are subjective and may not align with all listeners' perceptions
  • Western Music Bias: Training focused on Western musical concepts and terminology

Recommendations

  • Validate generated features against expected ranges
  • Use as a starting point rather than definitive feature values
  • Consider cultural and stylistic diversity when applying to diverse music catalogs
  • Implement post-processing to ensure valid JSON output if required (a minimal sketch follows this list)
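
The model emits a flat list of "feature": value pairs rather than a complete JSON object, so downstream code should wrap and validate the output. A minimal post-processing sketch, assuming the output format shown in the usage example above; the helper name and fallback behaviour are illustrative choices, not part of the model:

import json

def parse_features(raw: str) -> dict:
    """Wrap the generated '"key": value, ...' string in braces and parse it."""
    candidate = raw.strip()
    if not candidate.startswith("{"):
        candidate = "{" + candidate + "}"
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return {}  # caller decides how to handle malformed output

parsed = parse_features('"danceability": 0.9, "energy": 0.9, "valence": 0.9')
print(parsed)  # {'danceability': 0.9, 'energy': 0.9, 'valence': 0.9}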

Training Details

Training Data

Custom dataset of 4,206 examples pairing natural language music descriptions with Spotify audio features (an illustrative loading sketch follows the split below):

  • Training set: 3,364 examples
  • Validation set: 421 examples
  • Test set: 421 examples
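
An 80/10/10 split like the one above could be produced with the datasets library. A hedged sketch only; the data file name and split seed are assumptions, not details published with the model:

from datasets import load_dataset

# Hypothetical file of prompt/feature pairs; not distributed with the model
ds = load_dataset("json", data_files="spotify_prompt_feature_pairs.jsonl")["train"]

# 80% train, then split the remaining 20% evenly into validation and test
split = ds.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # roughly 3364, 421, 421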

Training Procedure

Training Hyperparameters

  • Training epochs: 5
  • Learning rate: 2e-4
  • Batch size: 32 (train), 16 (eval)
  • Gradient accumulation steps: 2
  • LR scheduler: Cosine with 5% warmup
  • Max sequence length: 256 tokens
  • Training regime: bf16 mixed precision
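
The hyperparameters listed above map naturally onto the Hugging Face Trainer stack. A minimal sketch, assuming Seq2SeqTrainingArguments was used; the output directory and any argument not listed above are placeholders:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-spotify-features",   # placeholder path
    num_train_epochs=5,
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                  # 5% warmup
    bf16=True,                          # bf16 mixed precision
)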

Speeds, Sizes, Times

  • Training time: ~58 minutes
  • Final training loss: 0.5579
  • Model size: ~892MB

Evaluation

Testing Data, Factors & Metrics

Testing Data

Held-out test split of 421 examples drawn from the same distribution as the training data: natural language music descriptions paired with Spotify audio features.

Metrics

  • Mean Absolute Error (MAE) between predicted and actual feature values
  • Mean Squared Error (MSE) for regression accuracy
  • Pearson correlation coefficients for individual features
  • Valid JSON ratio for output format correctness
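
A minimal sketch of how MAE and the valid-output ratio could be computed from parsed predictions, assuming the parse_features helper from the Recommendations section and parallel lists of predicted and reference feature dicts; the function and variable names are illustrative:

def evaluate_outputs(pred_dicts, ref_dicts):
    """Compute MAE over shared features and the ratio of parseable outputs."""
    abs_errors, valid = [], 0
    for pred, ref in zip(pred_dicts, ref_dicts):
        if pred:                         # empty dict means the output failed to parse
            valid += 1
        shared = set(pred) & set(ref)
        abs_errors.extend(abs(pred[k] - ref[k]) for k in shared)
    mae = sum(abs_errors) / len(abs_errors) if abs_errors else float("nan")
    return {"mae": mae, "valid_output_ratio": valid / len(ref_dicts)}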

Results

Example generations on held-out prompts show that the model maps common musical concepts to plausible feature values:

Prompt | Generated Features
"I need energetic dance music for a party" | "danceability": 0.9, "energy": 0.9, "valence": 0.9
"Play calm acoustic songs for studying" | "acousticness": 0.8, "energy": 0.2, "valence": 0.2
"Upbeat music for working out" | "danceability": 0.7, "energy": 0.8, "valence": 0.7
"Relaxing instrumental background music" | "acousticness": 0.3, "energy": 0.2, "instrumentalness": 0.8, "valence": 0.2
"Happy pop music for driving" | "danceability": 0.8, "energy": 0.8, "valence": 0.8

Technical Specifications

Model Architecture and Objective

  • Base Architecture: T5 (Text-To-Text Transfer Transformer)
  • Model Size: t5-base (220M parameters)
  • Objective: Sequence-to-sequence generation of audio features from text descriptions
  • Input Format: "prompt: {natural_language_description}"
  • Output Format: JSON-style audio feature values

Compute Infrastructure

Hardware

  • GPU with CUDA support
  • Mixed precision training (bf16)

Software

  • PyTorch with CUDA
  • Transformers library
  • Datasets library for data processing

Spotify Audio Features Reference

The model generates these Spotify audio features:

  • danceability (0.0-1.0): How suitable a track is for dancing
  • energy (0.0-1.0): Perceptual measure of intensity and power
  • valence (0.0-1.0): Musical positivity (happy vs sad)
  • acousticness (0.0-1.0): Confidence measure of acoustic nature
  • instrumentalness (0.0-1.0): Predicts absence of vocals
  • speechiness (0.0-1.0): Presence of spoken words
  • liveness (0.0-1.0): Presence of live audience
  • loudness (dB): Overall loudness, typically -60 to 0 dB
  • tempo (BPM): Estimated beats per minute
  • duration_ms: Track duration in milliseconds
  • key (0-11): Musical key (C=0, C♯/D♭=1, etc.)
  • mode (0-1): Modality (0=minor, 1=major)
  • time_signature (3-7): Time signature
  • popularity (0-100): Spotify popularity score
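
When validating generated values against the ranges above, a simple clamping pass keeps downstream consumers safe. A hedged sketch; the range table mirrors the list above and the function name is illustrative:

# Documented value ranges (min, max) for commonly generated features
FEATURE_RANGES = {
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "valence": (0.0, 1.0),
    "acousticness": (0.0, 1.0),
    "instrumentalness": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "liveness": (0.0, 1.0),
    "loudness": (-60.0, 0.0),
    "popularity": (0, 100),
}

def clamp_features(features: dict) -> dict:
    """Clip each known feature into its documented range; pass others through."""
    clamped = {}
    for name, value in features.items():
        if name in FEATURE_RANGES:
            lo, hi = FEATURE_RANGES[name]
            value = max(lo, min(hi, value))
        clamped[name] = value
    return clamped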

Citation

@misc{t5-spotify-features-generator,
  author = {afsagag},
  title = {T5 Spotify Features Generator: Fine-tuned T5 for Music Feature Prediction from Natural Language},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/afsagag/t5-spotify-features-generator}}
}

Model Card Authors

afsagag

Model Card Contact

Contact through Hugging Face profile: @afsagag
