Reuben's Multimodal Data Lab

non-profit

Reuben Data Lab

🏆 Work here was produced for the Uncharted Data Challenge hosted by Adaption Labs — credit to Adaptive Data by Adaption for organizing the hackathon.

We build open datasets for training and evaluating modern audio, speech, and multimodal models in areas where open data is scarce. Every release is open-sourced on Hugging Face with permissive licensing and rich metadata, and targets the three criteria the Uncharted Data Challenge cares about: underserved problem domains, scarce open-source data, and under-resourced languages.

Datasets

🎵 FMA Labeled — Multi-Attribute Music Dataset

29k Creative Commons tracks from the Free Music Archive, automatically annotated with lyrics, genre, sub-genres, mood, instruments, BPM, key, vocal type, energy, era, and audio quality using gemini-flash-latest. The paired audio and text are intended for music tagging, music-LM training, and auto-lyric research.
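As a rough illustration of what one annotation record might look like, the sketch below parses a JSON string into a typed record and applies two sanity checks. The field names, the `TrackAnnotation` and `parse_annotation` names, and the BPM bounds are all assumptions made for this example; the released dataset's actual columns and validation rules may differ.

```python
import json
from dataclasses import dataclass


# Hypothetical schema: field names mirror the attributes listed above,
# but are NOT taken from the released dataset.
@dataclass
class TrackAnnotation:
    genre: str
    sub_genres: list[str]
    mood: list[str]
    instruments: list[str]
    bpm: float
    key: str
    vocal_type: str
    energy: str
    era: str
    audio_quality: str
    lyrics: str = ""  # instrumental tracks may have no lyrics


def parse_annotation(raw: str) -> TrackAnnotation:
    """Parse one JSON annotation string and apply basic sanity checks."""
    data = json.loads(raw)
    # Unknown or missing keys raise TypeError here, which surfaces
    # malformed model output early.
    ann = TrackAnnotation(**data)
    if not ann.genre:
        raise ValueError("genre must not be empty")
    # Assumed plausibility bounds, chosen only for illustration.
    if not 20 <= ann.bpm <= 300:
        raise ValueError(f"implausible BPM: {ann.bpm}")
    return ann


example = json.dumps({
    "genre": "folk",
    "sub_genres": ["indie folk"],
    "mood": ["calm"],
    "instruments": ["acoustic guitar", "voice"],
    "bpm": 96,
    "key": "G major",
    "vocal_type": "male",
    "energy": "low",
    "era": "2010s",
    "audio_quality": "good",
})
annotation = parse_annotation(example)
print(annotation.genre, annotation.bpm)
```

Rejecting records that fail these checks, rather than silently repairing them, keeps labeling errors visible when annotations come from an automated model.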

🗣️ Multilingual Synthetic TTS (Qwen3)

~69k synthetic speech clips across 9 languages (en, ja, zh, ko, de, es, fr, ru, pt), generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a rotating pool of reference speakers. The clips cover conversational, informational, technical, emotional, and proverb-style utterances, and are useful for TTS fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.
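One way to picture a "rotating pool of reference speakers" is a round-robin assignment over every language and style combination. The sketch below is an assumption about how such a generation manifest could be built, not the lab's actual pipeline; `build_manifest`, the dictionary keys, and the speaker IDs are hypothetical, and no TTS model is called.

```python
from itertools import cycle, product

# Language codes and utterance styles as listed in the description above.
LANGUAGES = ["en", "ja", "zh", "ko", "de", "es", "fr", "ru", "pt"]
STYLES = ["conversational", "informational", "technical", "emotional", "proverb"]


def build_manifest(speakers, languages=LANGUAGES, styles=STYLES):
    """Pair each (language, style) combination with the next speaker in rotation."""
    if not speakers:
        raise ValueError("need at least one reference speaker")
    rotation = cycle(speakers)  # endless round-robin over the pool
    return [
        {"language": lang, "style": style, "reference_speaker": next(rotation)}
        for lang, style in product(languages, styles)
    ]


# Hypothetical speaker IDs, used only for illustration.
manifest = build_manifest(["spk_a", "spk_b", "spk_c"])
print(len(manifest))  # 45 entries: 9 languages x 5 styles
print(manifest[0])
```

A plain round-robin spreads the reference voices evenly across jobs; a real pipeline might instead sample speakers per language or per utterance.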

Focus Areas

Under-resourced languages — expanding speech and text coverage beyond English-only datasets.
Rich supervision — datasets ship with detailed structured metadata (genre/mood/BPM/key for music; language/style/voice for speech), not just audio + class labels.
Permissive licensing — Creative Commons / CC0 where possible; synthetic outputs released for open research.
Reproducibility — generation pipelines and labeling scripts are open-sourced alongside the data.

Tooling & Pipeline

Labeling: Google Gemini (gemini-flash-latest) via Flex and Batch APIs.
Speech synthesis: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot voice cloning.
Infra: Hyperbolic GPU rentals, custom stall-watchers for long-running multi-GPU jobs, Hugging Face Hub for distribution.
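The stall-watchers mentioned above are not described in detail, so the following is only a minimal sketch of the general idea: treat a job as stalled when its progress counter has not changed for longer than a timeout. The `StallWatcher` class, its parameters, and the 600-second timeout are assumptions made for this example, not the lab's implementation.

```python
import time
from typing import Callable, Optional


class StallWatcher:
    """Flags a job as stalled when its progress counter stops advancing."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_progress: Optional[int] = None
        self.last_change = clock()

    def observe(self, progress: int) -> bool:
        """Record the latest progress value; return True if the job looks stalled."""
        now = self.clock()
        if progress != self.last_progress:
            # Progress moved, so reset the stall timer.
            self.last_progress = progress
            self.last_change = now
            return False
        return (now - self.last_change) >= self.timeout_s


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
watcher = StallWatcher(timeout_s=600, clock=lambda: fake_now[0])
first = watcher.observe(100)     # first reading, not stalled
fake_now[0] = 300.0
second = watcher.observe(100)    # unchanged for 300 s, under the timeout
fake_now[0] = 700.0
third = watcher.observe(100)     # unchanged for 700 s, flagged as stalled
fourth = watcher.observe(101)    # progress resumed, timer resets
print(first, second, third, fourth)
```

In a real multi-GPU job, the progress value might be the number of clips written so far, and a stalled result could trigger an alert or a restart of the affected worker.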

Get In Touch

Hugging Face: @Reubencf
Datasets home: ReubenDataLab

More datasets coming soon as part of the Uncharted Data Challenge submission.

Collections 4

Reubencf/PolyglotAudio

Reubencf/multilingual-synthetic-tts

Reubencf/fma-labeled

Reubencf/streetview-global