Announcing the Synthetic Online Conversations Dataset (SOC)

Published August 12, 2025

TL;DR

  • The Synthetic Online Conversations Dataset (SOC) is a conversational dataset featuring multi-turn chats grounded in rich personas and chat topics. It simulates real online behavior: multi-message turns, variable delay between messages, and multimedia content.
  • As of today the first revision of the dataset (SOC-2508) has been released under the CC BY 4.0 License.
  • A multilingual revision has also been released as SOC-2508-MULTI.

Why another conversational dataset?

I'll answer this one right away: it is hard to build a dataset of genuinely human conversations.

You might pay people to chat and get permission to use their conversations, but even then:

  • People don't act naturally when they know their chats will be made public — they hardly open up or talk about themselves
  • Diversity gets expensive fast: getting a representative pool of ages, backgrounds, and conversation styles means recruiting a lot of different people

You could also use one of the many RP (RolePlay) datasets on Hugging Face, but these still have key limitations:

  • One message per turn only (no rapid-fire follow-ups or multi-part thoughts)
  • No realistic pacing (no delays, no sense of "typing..." or "responding later")
  • Multimedia elements are mostly absent (no gifs, images, or voice notes)

With SOC, we are trying to fill this gap! Realistic, flawed-when-necessary synthetic conversations for when you need human-like chat data without the cost, complexity, or ethical concerns of harvesting real conversations.

Are there known limitations?

I also want to address this right away. Yes, as of this initial release (August 2025 -> 2508) there are known limitations:

  • Message count skews higher than requested: While the chosen LLM (Qwen3-235B-A22B-Instruct-2507 for this release) was prompted to write around 1-3 messages per turn (with 1 being the most probable), it still wrote 3 or more messages most of the time;
  • Conversations want to end too early: The LLM was also asked to write an <end/> tag when it thought the conversation was over, but this often happened way too early (we had to programmatically remove the end tag for the first few turns, otherwise most chats would end with just one turn!).
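
For illustration, the early-turn cleanup can be as simple as the sketch below; the list-of-strings turn format and the `MIN_TURNS` threshold are assumptions for this example, not the exact pipeline code.

```python
# Hypothetical cleanup pass: strip premature <end/> tags from the first few
# turns so conversations don't collapse after a single exchange.
# MIN_TURNS and the list-of-strings turn format are assumptions.
MIN_TURNS = 4

def strip_premature_end(turns: list[str], min_turns: int = MIN_TURNS) -> list[str]:
    return [
        turn.replace("<end/>", "").strip() if i < min_turns else turn
        for i, turn in enumerate(turns)
    ]
```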

These limitations detract from the naturalness we're aiming for, so they are at the top of my to-fix list for the next release.

A fix may be as simple as using a different LLM to generate the data (like Kimi K2, which has exceptional instruction-following capabilities), but that could also drive the cost up quite a bit (by around 5x using Kimi K2!).

How was the dataset made?

Alright, the juicy part now. This project takes heavy inspiration from the methods shown in ConvoGen, with a few modifications.

  • Iterative sampling, with drift control
    Like in ConvoGen, we grow from a compact seed pool: every new, good example can be re-sampled to increase diversity. In practice, models can drift into repetitive or degraded loops, so we periodically reset the few-shot pool back to the original seeds (every ~50–100 generations, depending on seed size). This simple reset preserves variety without letting the form collapse (a minimal sketch of this loop follows the list).

  • Personas → experiences → chats
    Instead of jumping straight to conversations, we insert an “experience” step. Each experience pairs two personas and defines: their relationship (how they know each other), the situation (why they’re talking right now, and where, e.g. DMs, Discord, group chats, fandom threads, etc.), and a natural opening topic. That extra scaffold gives the chat model concrete intent and context, improving coherence while leaving room for the natural messiness of online talk (a possible schema is sketched below).
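
To make the drift-control idea concrete, here is a minimal sketch of the iterative sampling loop, assuming a simple list-based pool; `generate_example` is a stand-in for the actual LLM call.

```python
import random

SEED_POOL = ["seed example 1", "seed example 2", "seed example 3"]
RESET_EVERY = 75  # reset cadence; we use ~50-100 depending on seed size

def generate_example(few_shots: list[str]) -> str:
    # Placeholder for the real LLM call (Qwen3-235B-A22B-Instruct-2507 here)
    return f"new example conditioned on {len(few_shots)} shots"

def grow_pool(n_generations: int, k_shots: int = 3) -> list[str]:
    pool = list(SEED_POOL)
    outputs = []
    for step in range(1, n_generations + 1):
        few_shots = random.sample(pool, min(k_shots, len(pool)))
        example = generate_example(few_shots)
        outputs.append(example)
        pool.append(example)          # good examples get re-sampled later
        if step % RESET_EVERY == 0:
            pool = list(SEED_POOL)    # drift control: reset to the seeds
    return outputs
```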
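And here is one way the experience scaffold could be represented; the field names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Illustrative representation of an "experience"; field names are
# assumptions, not the dataset's actual schema.
@dataclass
class Experience:
    persona_a: str      # reference to the first persona
    persona_b: str      # reference to the second persona
    relationship: str   # how they know each other
    situation: str      # why they're talking right now
    platform: str       # where the chat happens (DMs, Discord, etc.)
    opening_topic: str  # natural entry point for the conversation

example = Experience(
    persona_a="19-year-old CS student",
    persona_b="21-year-old fandom mutual",
    relationship="mutuals in a fandom Discord",
    situation="a leaked trailer just dropped",
    platform="Discord DMs",
    opening_topic="reacting to the trailer",
)
```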

How we build personas

Personas aren’t free-form blurbs; they’re composed from weighted components so distributions feel realistic while staying varied (a minimal sampling sketch follows the list below).

  • Profession first, with age constraints
    We anchor on profession up front, using metadata to bound plausible ages (e.g., interns skew younger; late-career roles skew older). This way we lock in plausibility early and we don't end up with 19-year-old surgeons.

  • Life context + traits synthesis
    We sample a current life context (e.g., relocating, exam season, caring for a parent) and a pool of traits, then compress back to a handful that best fit the final character using an LLM.

  • Chatting style adapted by age and quirk
    A base “chat quirk” (emoji-heavy, bone-dry sarcasm, overly formal, etc.) is adapted to the persona’s age and role, so the same quirk lands differently for a 19-year-old student vs. a 52-year-old ER nurse.

  • Anti-clone few-shotting
    Each new persona is generated with a small set of recent examples as “do not imitate too closely” references, and we periodically reset references back to seed personas. Just as we said before, this preserves variety without letting the form collapse.
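
As a concrete (and heavily simplified) illustration of the profession-first, age-bounded sampling, here is a sketch; the component table and weights are invented for the example.

```python
import random

# Invented example metadata: profession -> (min_age, max_age, sampling weight).
# Age bounds keep implausible combinations (19-year-old surgeons) out early.
PROFESSIONS = {
    "intern": (18, 25, 3.0),
    "ER nurse": (24, 60, 2.0),
    "surgeon": (30, 67, 1.0),
}

def sample_persona(rng: random.Random) -> dict:
    names = list(PROFESSIONS)
    weights = [PROFESSIONS[p][2] for p in names]
    profession = rng.choices(names, weights=weights, k=1)[0]
    lo, hi = PROFESSIONS[profession][:2]
    age = rng.randint(lo, hi)  # age is bounded by the profession's metadata
    return {"profession": profession, "age": age}

print(sample_persona(random.Random(42)))
```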

How personas are paired into experiences

We don’t just toss random personas together. Pairing is weighted for believable coverage (a sketch of the weighting follows the list).

  • Age-aware pairing
    We precompute all pairs and weight selections by age similarity to encourage natural dyads (classmates, coworkers, sibling-age peers), while still allowing cross-generation dynamics in rarer cases.

  • Relationship templates and concrete triggers
    Relationship templates (e.g., “mutuals in a fandom Discord,” “neighbors coordinating on WhatsApp,” “ex-colleagues on LinkedIn”) are instantiated with sampled platforms/communities/fandoms. We add a conversation trigger (a tournament invite, a leaked trailer, a group-project deadline) to explain why the chat starts now.

  • Few-shot, with resets
    As with personas, experiences are generated iteratively using recent examples as guidance, plus periodic resets to seed experiences to avoid extreme error propagation.
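
Here is a minimal sketch of what the age-aware weighting could look like; the exponential decay and the `TAU` constant are assumptions, not the exact scheme we used.

```python
import math
import random
from itertools import combinations

# Assumed weighting: pairs close in age get exponentially more weight,
# but cross-generation pairs keep a nonzero chance of being drawn.
TAU = 8.0  # years; smaller values prefer similar ages more strongly

def sample_pair(personas: list[dict], rng: random.Random) -> tuple[dict, dict]:
    pairs = list(combinations(personas, 2))
    weights = [math.exp(-abs(a["age"] - b["age"]) / TAU) for a, b in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]

personas = [{"name": "Ana", "age": 19}, {"name": "Bo", "age": 21},
            {"name": "Cleo", "age": 52}]
print(sample_pair(personas, random.Random(0)))
```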

Chat generation details

Grounded in an experience, the model writes the conversation turn by turn (a hypothetical tag-parsing sketch follows the list below).

  • Multimedia, multi-message turns, and realistic pacing
    Turns can include 1–3 messages (sometimes more... check known limitations), with lightweight media tags (<image>, <gif>, <audio> and <video>) that contain descriptive content (e.g., <audio>audio transcription/description</audio>, <image>image description</image>, <gif>gif description</gif>), and optional <delay minutes=XX hours=XX days=XX> annotations to simulate human pauses.

  • Imperfect by design
    We explicitly allow typos, mid-thought edits, topic drift, and uneven effort—because real chats aren’t tidy.

  • Guardrails and cleanup
    We nudge the model away from premature endings (<end/>) and normalize edge cases to keep conversations substantial and usable.
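
Since the tag formats are simple, downstream consumers can pull them out with a couple of regexes. The sketch below is a hypothetical parser based on the formats described above, not an official one.

```python
import re

# Hypothetical parser for SOC's media and delay annotations; the regexes
# follow the tag formats described above.
MEDIA_RE = re.compile(r"<(image|gif|audio|video)>(.*?)</\1>", re.S)
DELAY_RE = re.compile(r"<delay((?:\s+\w+=\d+)*)\s*>")

def extract_annotations(message: str) -> dict:
    media = [{"type": m.group(1), "description": m.group(2)}
             for m in MEDIA_RE.finditer(message)]
    delays = []
    for m in DELAY_RE.finditer(message):
        attrs = dict(re.findall(r"(\w+)=(\d+)", m.group(1)))
        delays.append({k: int(v) for k, v in attrs.items()})
    return {"media": media, "delays": delays}

msg = "<delay minutes=45> sorry, was in class! <image>blurry photo of lecture notes</image>"
print(extract_annotations(msg))
```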

End-to-end pipeline (with diagram!)


Seed data → Persona generation (iterative + resets) → Experience generation (pairing + relationship + situation + trigger) → Chat generation (multi-turn, media tags, delays) → Optional multilingual pass

Scaling Translation with HF Jobs

A special thanks to Hugging Face for providing some compute credits to play with HF Jobs; they gave me one more reason to turn this project into a reality!

Creating the multilingual variant SOC-2508-MULTI turned out to be surprisingly straightforward thanks to Hugging Face Jobs. Literally a simple one-command solution!

Why HF Jobs for Translation?

When you're dealing with thousands of conversations across multiple languages, you need serious computational power. Traditional approaches would require:

  • Setting up GPU instances manually
  • Managing dependencies and environments
  • Handling infrastructure scaling and costs
  • Dealing with job orchestration and monitoring

HF Jobs eliminated all of this complexity. With a single command, I could spin up an A100 instance pre-configured with everything needed for large-scale translation.

The Translation Pipeline

The translation process used google/gemma-3n-E4B-it with vLLM's continuous batching for high-throughput inference. The pipeline handles the complex multilingual structure of SOC, preserving all the special tags while translating the actual conversational content.
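
In spirit, the core of the translation step looks something like the sketch below; the prompt template and sampling settings are assumptions for illustration, and the actual translate.uv.py may differ.

```python
from vllm import LLM, SamplingParams

# Sketch of the translation core: vLLM schedules all prompts with
# continuous batching internally. Prompt wording is an assumption.
llm = LLM(model="google/gemma-3n-E4B-it")
params = SamplingParams(temperature=0.2, max_tokens=2048)

def translate_batch(texts: list[str], language: str) -> list[str]:
    prompts = [
        f"Translate the following chat into {language}. "
        f"Keep all tags like <image>, <gif>, <audio>, <video>, "
        f"<delay ...> and <end/> exactly as they are.\n\n{t}"
        for t in texts
    ]
    outputs = llm.generate(prompts, params)
    return [o.outputs[0].text for o in outputs]
```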

Here's how simple it was to run:

```bash
hf jobs uv run --flavor a100-large translate.uv.py marcodsn/SOC-2508 marcodsn/SOC-2508-MULTI --languages fr es de it pt --token xxxxxxxxxxxx --timeout 2h
```

That's it. One command that:

  • Spins up an A100 GPU instance
  • Installs all dependencies via UV
  • Loads the dataset and model
  • Translates 1,180 conversations into 5 languages
  • Pushes the multilingual dataset back to the Hub

Cost Efficiency That Actually Works

The economics were also remarkable: ~$0.50 per language for the entire dataset. For context, that's translating 1,180 full conversations (with their rich personas, experiences, and multi-turn chats) for about the cost of a coffee (or 2 if you are in Italy like me 👀).

This pay-as-you-go model (billed to the second!) meant I only paid for the actual compute time used—no idle instances, no minimum commitments, no infrastructure overhead.

Wrapping up

To wrap up, SOC-2508 is our first big step toward realistic, imperfect, and human-like synthetic conversations — complete with quirks, delays, typos, and those “wait, where were we?” moments that make chats feel alive.

We know there’s still work to do: better control over message counts, more natural endings, and maybe even richer media integration in future versions. But even with its early rough edges, SOC already fills a gap in available conversational datasets: human-like conversations.

Both SOC-2508 and its companion SPB-2508 dataset are released under the CC BY 4.0 license, giving you the freedom to remix, adapt, and build on the data with attribution. The multilingual variant, SOC-2508-MULTI, extends the same conversations with structured translations into multiple languages while preserving tags, pacing markers, and schema.

If you dig into it and find things we can improve (or fun ways to use it!), I’d love to hear from you. Feedback, experiments, all are welcome!
