Frame classification for filled pauses

Model Details

This model classifies individual 20 ms frames of audio for the presence of filled pauses ("eee", "errm", ...).

Model Description

  • Developed by: Peter Rupnik, Nikola Ljubešić, Darinka Verdonik, Simona Majhenič
  • Funded by: MEZZANINE project
  • Model type: Wav2Vec2Bert for Audio Frame Classification
  • Language(s) (NLP): Trained and tested on Slovenian ROG-Artur; also evaluated on Croatian, Serbian, Polish, and Czech samples from the ParlaSpeech corpora
  • Finetuned from model: facebook/w2v-bert-2.0

Paper

Please cite the following paper:

```bibtex
@inproceedings{ljubesic-etal-2025-identifying,
    title = "Identifying Filled Pauses in Speech Across South and {W}est {S}lavic Languages",
    author = "Ljube{\v{s}}i{\'c}, Nikola  and Porupski, Ivan  and Rupnik, Peter",
    editor = "Piskorski, Jakub  and P{\v{r}}ib{\'a}{\v{n}}, Pavel  and Nakov, Preslav  and Yangarber, Roman  and Marcinczuk, Michal",
    booktitle = "Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.bsnlp-1.1/",
    doi = "10.18653/v1/2025.bsnlp-1.1",
    pages = "1--8",
    ISBN = "978-1-959429-57-9",
    abstract = "Filled pauses are among the most common paralinguistic features of speech, yet they are mainly omitted from transcripts. We propose a transformer-based approach for detecting filled pauses directly from the speech signal, fine-tuned on Slovenian and evaluated across South and West Slavic languages. Our results show that speech transformers achieve excellent performance in detecting filled pauses when evaluated in the in-language scenario. We further evaluate cross-lingual capabilities of the model on two closely related South Slavic languages (Croatian and Serbian) and two less closely related West Slavic languages (Czech and Polish). Our results reveal strong cross-lingual generalization capabilities of the model, with only minor performance drops. Moreover, error analysis reveals that the model outperforms human annotators in recall and F1 score, while trailing slightly in precision. In addition to evaluating the capabilities of speech transformers for filled pause detection across Slavic languages, we release new multilingual test datasets and make our fine-tuned model publicly available to support further research and applications in spoken language processing."
}
```

Training data

The model was trained on the human-annotated Slovenian speech corpus ROG-Artur. Recordings from the train split were segmented into chunks of at most 30 s.
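
The segmentation code is not part of this model card. As a minimal, hypothetical sketch (the helper chunk_waveform and the fixed 30 s step are our own illustration; the actual pipeline may, e.g., respect utterance boundaries), chunking a 16 kHz mono waveform could look like this:

```python
import numpy as np


def chunk_waveform(
    wav: np.ndarray, sr: int = 16_000, max_s: float = 30.0
) -> list[np.ndarray]:
    """Split a 1-D waveform into consecutive chunks of at most max_s seconds."""
    step = int(max_s * sr)
    return [wav[i : i + step] for i in range(0, len(wav), step)]


# Example: a 75 s recording yields chunks of 30 s, 30 s, and 15 s.
chunks = chunk_waveform(np.zeros(75 * 16_000))
assert [len(c) / 16_000 for c in chunks] == [30.0, 30.0, 15.0]
```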

Training Details

| hyperparameter       | value |
|----------------------|-------|
| learning rate        | 3e-5  |
| effective batch size | 16    |
| num train epochs     | 20    |
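
These hyperparameters map directly onto a standard transformers training setup. Below is a minimal, hypothetical sketch; the authors' actual training script is not included here, and the per-device batch size and accumulation steps are assumptions that merely reproduce the effective batch size of 16:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="w2vbert2-filled-pause",  # hypothetical output directory
    learning_rate=3e-5,
    per_device_train_batch_size=8,       # assumption: 8 per device ...
    gradient_accumulation_steps=2,       # ... * 2 steps = effective batch size 16
    num_train_epochs=20,
)
```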

Evaluation

Although the model outputs a sequence of 0s and 1s, one label per 20 ms frame, evaluation was performed at the event level: spans of consecutive 1s were bundled into a single event. A predicted event that partially overlaps a true event counts as a true positive. We report precision, recall, and F1-score of the positive class.
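
To make the scheme concrete, here is a minimal sketch of event-level scoring with partial-overlap matching (our own reconstruction for illustration, not the paper's exact evaluation script):

```python
def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Two [start, end] intervals overlap if neither ends before the other starts."""
    return a[0] < b[1] and b[0] < a[1]


def event_level_scores(true, pred):
    """Precision/recall/F1 of the positive class; any partial overlap is a match."""
    tp = sum(any(overlaps(p, t) for t in true) for p in pred)
    matched_true = sum(any(overlaps(t, p) for p in pred) for t in true)
    precision = tp / len(pred) if pred else 0.0
    recall = matched_true / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# One exact hit, one partial hit, one missed event, one spurious prediction:
print(event_level_scores(
    true=[(0.1, 0.3), (1.0, 1.2), (2.0, 2.4)],
    pred=[(0.1, 0.3), (1.15, 1.4), (3.0, 3.1)],
))  # precision = recall = 2/3
```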

Evaluation on ROG corpus

Results on the ROG-Artur test split:

| lang | postprocessing | recall | precision | F1    |
|------|----------------|--------|-----------|-------|
| SL   | none           | 0.973  | 0.914     | 0.943 |

Evaluation on ParlaSpeech corpora

Notice: the ParlaSpeech corpora are currently being enriched with new features. Follow our progress here: http://clarinsi.github.io/parlaspeech

For every language in the ParlaSpeech collection, 400 instances were sampled and annotated by human annotators.

Since the ParlaSpeech corpora are too big to be manually segmented the way ROG is, we observed a few failure modes at inference time and found that post-processing improves results. Some false positives were caused by improper audio segmentation, so discarding predictions that start at the very beginning or end at the very end of the audio can be beneficial. Another failure mode is the prediction of very short events, so very short predictions can be safely discarded as well.

With added postprocessing, the model achieves the following metrics:

| lang | postprocessing               | recall | precision | F1    |
|------|------------------------------|--------|-----------|-------|
| CZ   | drop_short_initial_and_final | 0.889  | 0.859     | 0.874 |
| HR   | drop_short_initial_and_final | 0.94   | 0.887     | 0.913 |
| PL   | drop_short_initial_and_final | 0.903  | 0.947     | 0.924 |
| RS   | drop_short_initial_and_final | 0.966  | 0.915     | 0.94  |

For details on post-processing, see the function frames_to_intervals in the code snippet below.

Example use:

```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification
from datasets import Dataset, Audio
import torch
import numpy as np
from pathlib import Path

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "classla/wav2vecbert2-filledPause"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# The path below is the authors' example file; replace it with your own audio:
ds = Dataset.from_dict(
    {
        "audio": [
            "/cache/peterr/mezzanine_resources/filled_pauses/data/dev/Iriss-J-Gvecg-P500001-avd_2082.293_2112.194.wav"
        ],
    }
).cast_column("audio", Audio(sampling_rate=16_000, mono=True))


def frames_to_intervals(
    frames: list[int],
    drop_short=True,
    drop_initial=True,
    drop_final=True,
    short_cutoff_s=0.08,
) -> list[tuple[float]]:
    """Transforms a list of ones or zeros, corresponding to annotations on frame
    levels, to a list of intervals ([start second, end second]).

    Allows for additional filtering on duration (false positives are often
    short) and start times (false positives starting at 0.0 are often an
    artifact of poor segmentation).

    :param list[int] frames: Input frame labels
    :param bool drop_short: Drop everything shorter than short_cutoff_s,
        defaults to True
    :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
    :param bool drop_final: Drop predictions ending at audio end, defaults to True
    :param float short_cutoff_s: Duration in seconds of shortest allowable
        prediction, defaults to 0.08

    :return list[tuple[float]]: List of intervals [start_s, end_s]
    """
    from itertools import pairwise
    import pandas as pd

    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Indices where the label changes; the first diff is NaN, so index 0 is
    # always included:
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    # Append a terminal boundary so that a positive span running to the very
    # end of the audio is not silently dropped by pairwise():
    boundaries = list(indices_of_change) + [len(frames)]
    for si, ei in pairwise(boundaries):
        # Segments whose label is 0 contain no filled pause:
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            continue
        results.append(
            (
                round(0.02 * si, 3),
                round(0.02 * ei, 3),
            )
        )
    if drop_short and (len(results) > 0):
        results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
    if drop_initial and (len(results) > 0):
        results = [i for i in results if i[0] != 0.0]
    if drop_final and (len(results) > 0):
        results = [i for i in results if i[1] != round(0.02 * len(frames), 3)]
    return results


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred = np.array(logits.cpu()).argmax(axis=-1)
    intervals = [frames_to_intervals(i) for i in y_pred]
    return {"y_pred": y_pred.tolist(), "intervals": intervals}


ds = ds.map(evaluator, batched=True)
print(ds["y_pred"][0])
# Prints a list of 20ms frames: [0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0....]
# with 0 indicating no filled pause detected in that frame

print(ds["intervals"][0])
# Prints the identified intervals as a list of [start_s, ends_s]:
# [[0.08, 0.28 ], ...]
