
Emergency Call Background Sound Classifier

This repository hosts a machine learning model for automatically classifying background sounds in emergency calls. The model is a Convolutional Neural Network (CNN) designed to help emergency dispatchers quickly identify 11 common environmental sounds; it was built with TensorFlow/Keras and trained on a subset of the ESC (Environmental Sound Classification) dataset.

Model Description

The model accepts a 2D Mel-spectrogram from a 3-5 second audio clip as input and outputs the probabilities for each of the following 11 sound classes:

  • siren
  • car_horn
  • chainsaw
  • crying_baby
  • dog
  • door_wood_knock
  • door_wood_creaks
  • engine
  • glass_breaking
  • rain
  • fireworks

Dataset

Source and Size

The dataset used in this project is a subset of ESC: Dataset for Environmental Sound Classification, a well-known benchmark for environmental sound research. This project uses 403 audio clips distributed among 11 classes selected from the full dataset. Each audio clip is 3 seconds long and is in .wav format.

Data Augmentation

To improve the model's robustness and generalization, the dataset was artificially augmented. From each original audio file, 10 new variations were generated using the following techniques:

  • Adding Noise
  • Low-pass & Band-pass Filters
  • Volume Adjustment (+6dB and -6dB)
  • Pitch Shifting
  • Time Stretching
  • Adding Echo
  • Distortion (Clipping)

This process resulted in a much larger and more diverse final training dataset, totaling over 4000 audio samples.
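
A minimal sketch of how such augmentations might be produced with librosa and NumPy is shown below. The specific parameters (noise level, gain, pitch step, stretch rate, echo delay, clipping threshold) are illustrative assumptions, not the values used in training, and the low-pass/band-pass filtering step (typically done with scipy.signal) is omitted for brevity.

import numpy as np
import librosa

def augment_clip(y, sr):
    """Return a few augmented variants of a waveform (illustrative parameters)."""
    variants = {}
    # Adding noise (assumed level of 0.5% of full scale)
    variants['noise'] = y + 0.005 * np.random.randn(len(y))
    # Volume adjustment (+6 dB and -6 dB)
    variants['louder'] = y * 10 ** (6 / 20)
    variants['quieter'] = y * 10 ** (-6 / 20)
    # Pitch shifting (assumed shift of +2 semitones)
    variants['pitch'] = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    # Time stretching (assumed rate of 0.9, i.e. 10% slower)
    variants['stretch'] = librosa.effects.time_stretch(y, rate=0.9)
    # Adding echo: a delayed, attenuated copy mixed back in (assumed 0.25 s delay)
    delay = int(0.25 * sr)
    echo = np.zeros_like(y)
    echo[delay:] = y[:-delay] * 0.5
    variants['echo'] = y + echo
    # Distortion via hard clipping (assumed gain and threshold)
    variants['clip'] = np.clip(y * 3.0, -0.8, 0.8)
    return variants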

Project Workflow

This project follows a standard machine learning workflow for audio:

1. Feature Extraction

Instead of using averaged feature vectors, this model uses the full 2D Mel-spectrogram for each audio clip. This allows the CNN to learn spatial and temporal patterns, much like analyzing an image.

  • Sample Rate: 16000 Hz
  • Number of Mel Bands: 128
  • Input Size: Each spectrogram is cropped or padded to a uniform size of (128, 188).
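
A minimal sketch of the per-clip feature extraction described above, consistent with the prediction code shown later in this README and assuming librosa's default hop length; the crop/pad step forces every spectrogram to the (128, 188) shape expected by the network.

import numpy as np
import librosa

N_MELS = 128
MAX_PAD_LEN = 188

def extract_features(audio_path):
    """Load a clip at 16 kHz and return a (128, 188) dB-scaled Mel-spectrogram."""
    y, sr = librosa.load(audio_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Crop or zero-pad the time axis to a fixed number of frames
    if mel_db.shape[1] > MAX_PAD_LEN:
        mel_db = mel_db[:, :MAX_PAD_LEN]
    else:
        pad = MAX_PAD_LEN - mel_db.shape[1]
        mel_db = np.pad(mel_db, ((0, 0), (0, pad)), mode='constant')
    return mel_db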

2. CNN Model Architecture

The model uses a Sequential architecture from Keras with the following details:

  1. Input Layer: Accepts an input shape of (128, 188, 1).
  2. Three Convolutional Blocks: Each block consists of a Conv2D layer (with 32, 64, and 128 filters, respectively), followed by MaxPooling2D and BatchNormalization to stabilize learning.
  3. Flatten Layer: Converts the output from the convolutional blocks into a 1D vector.
  4. Dense Layers: A Dense layer with 128 units (and Dropout of 0.5 for regularization) is followed by a Dense output layer with 11 units (corresponding to the number of classes) and a softmax activation.
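
A sketch of this architecture in Keras, following the layer description above; the kernel and pooling sizes (3x3 convolutions, 2x2 pooling) and padding are assumptions, since they are not specified here.

from tensorflow.keras import layers, models

def build_model(input_shape=(128, 188, 1), num_classes=11):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Three convolutional blocks: Conv2D -> MaxPooling2D -> BatchNormalization
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.BatchNormalization(),
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.BatchNormalization(),
        # Flatten and classify
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model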

3. Training

The model was trained with the following configuration:

  • Optimizer: adam
  • Loss Function: sparse_categorical_crossentropy
  • Callbacks:
    • EarlyStopping was used to automatically stop training if the validation loss did not improve, preventing overfitting.
    • ModelCheckpoint was used to save the best version of the model during training.
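
A minimal sketch of this training setup; the patience value, batch size, epoch count, checkpoint filename, and the X_train/y_train/X_val/y_val arrays are illustrative assumptions.

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Stop when validation loss stops improving (assumed patience of 10 epochs)
    EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
    # Keep only the best model seen during training (assumed filename)
    ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True),
]

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, batch_size=32,
                    callbacks=callbacks)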

Results & Evaluation

After training, the model was evaluated on a test set (20% of the total data) that it had never seen before.

The final accuracy on the test set reached ~92%.

Detailed Classification Report:

                  precision    recall  f1-score   support

        car_horn       1.00      0.95      0.98        88
        chainsaw       0.99      0.94      0.96        81
     crying_baby       0.99      0.98      0.98        86
             dog       0.99      0.99      0.99        88
door_wood_creaks       0.98      0.92      0.95        51
 door_wood_knock       0.98      0.96      0.97        48
          engine       0.85      0.45      0.59        88
       fireworks       0.96      0.99      0.97        88
  glass_breaking       0.99      0.98      0.98        88
            rain       0.62      0.98      0.76        84
           siren       0.97      1.00      0.98        88

        accuracy                           0.92       878
       macro avg       0.94      0.92      0.92       878
    weighted avg       0.93      0.92      0.92       878

The training history graph shows a healthy learning process without significant overfitting, thanks in large part to the use of Dropout, BatchNormalization, and EarlyStopping.
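
For reference, a report like the one above can be produced with scikit-learn's classification_report; this sketch assumes the test spectrograms and integer-encoded labels are held in X_test and y_test, and that label_encoder is the fitted encoder shipped with the model.

import numpy as np
from sklearn.metrics import classification_report

# Predict class indices for the held-out test set
y_pred = np.argmax(model.predict(X_test), axis=1)

# Map integer labels back to class names for a readable per-class report
print(classification_report(y_test, y_pred,
                            target_names=label_encoder.classes_))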

How to Use the Model

Here is an example of how to use this model with TensorFlow/Keras to predict the label of a new audio file.

import numpy as np
import librosa
import joblib
from tensorflow.keras.models import load_model

# Load the saved model and label encoder
model = load_model('path/to/your/best_model.keras')
label_encoder = joblib.load('path/to/your/cnn_label_encoder.pkl')

# Define the preprocessing and prediction function
def predict_sound(audio_path):
    """
    Function to predict the sound class from an audio file.
    """
    # Parameters must be the same as during training
    n_mels = 128
    max_pad_len = 188
    
    try:
        # 1. Load audio
        y, sr = librosa.load(audio_path, sr=16000)
        
        # 2. Extract Mel-spectrogram
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        mel_db = librosa.power_to_db(mel, ref=np.max)
        
        # 3. Ensure the size is uniform
        if mel_db.shape[1] > max_pad_len:
            mel_db = mel_db[:, :max_pad_len]
        else:
            pad_width = max_pad_len - mel_db.shape[1]
            mel_db = np.pad(mel_db, pad_width=((0, 0), (0, pad_width)), mode='constant')
        
        # 4. Add batch and channel dimensions
        mel_db = np.expand_dims(mel_db, axis=0)  # Add batch dimension
        mel_db = np.expand_dims(mel_db, axis=-1) # Add channel dimension
        
        # 5. Make a prediction
        probabilities = model.predict(mel_db)
        predicted_index = np.argmax(probabilities, axis=1)[0]
        predicted_label = label_encoder.inverse_transform([predicted_index])[0]
        
        return predicted_label, probabilities[0][predicted_index]

    except Exception as e:
        print(f"Error processing file: {e}")
        return None, None

# --- Example Usage ---
example_file = 'path/to/your/new_audio.wav'
label, confidence = predict_sound(example_file)

if label:
    print(f"File: {example_file}")
    print(f"Predicted Label: {label}")
    print(f"Confidence Level: {confidence:.2%}")

Citation

If you use the Maleo Environmental Classification model in your research or project, please cite the following:

BibTeX:

@article{Mardiana2025environmental_classification,
  title={{Maleo Environmental Classification}},
  author={Mardiana, Ardi and Irawan, Eka Tresna and Yanuari, Puri Dewi and Abdurahman, Dede},
  journal={Unpublished work},
  year={2025},
  url={https://huggingface.co/maleo-ai/environmental_classification}
}

Contact

For any inquiries or further information regarding this model, please contact the author: Ardi Mardiana.
