Emergency Call Background Sound Classifier
This repository hosts a machine learning model that automatically classifies background sounds in emergency calls. The model is a Convolutional Neural Network (CNN) built with TensorFlow/Keras, trained on a subset of the ESC (Environmental Sound Classification) dataset, and designed to help emergency dispatchers quickly identify 11 common environmental sounds.
Model Description
The model accepts the 2D Mel-spectrogram of a 3-5 second audio clip as input and outputs a probability for each of the following 11 sound classes:
- `siren`
- `car_horn`
- `chainsaw`
- `crying_baby`
- `dog`
- `door_wood_knock`
- `door_wood_creaks`
- `engine`
- `glass_breaking`
- `rain`
- `fireworks`
Dataset
Source and Size
The dataset used in this project is a subset of ESC: Dataset for Environmental Sound Classification, a well-known benchmark for environmental sound classification. This project specifically uses 403 audio clips, selected from the full dataset and distributed among the 11 classes listed above. Each audio clip is 3 seconds long and is in `.wav` format.
Data Augmentation
To improve the model's robustness and generalization, the dataset was artificially augmented. From each original audio file, 10 new variations were generated using the following techniques:
- Adding Noise
- Low-pass & Band-pass Filters
- Volume Adjustment (+6dB and -6dB)
- Pitch Shifting
- Time Stretching
- Adding Echo
- Distortion (Clipping)
This process resulted in a much larger and more diverse final training dataset: with the 10 augmented variations per clip, the 403 originals grow to over 4,000 audio samples.
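The augmentation code is not part of this card; below is a minimal sketch of a few of the listed transforms (noise, gain, pitch, stretch, echo, clipping), assuming `librosa`/NumPy and illustrative parameter values:

```python
import numpy as np
import librosa

def augment(y, sr):
    """Return a few illustrative augmented variations of waveform y."""
    variations = []
    # Adding noise (noise level is an assumed value)
    variations.append(y + 0.005 * np.random.randn(len(y)))
    # Volume adjustment: +6 dB and -6 dB
    variations.append(y * 10 ** (6 / 20))
    variations.append(y * 10 ** (-6 / 20))
    # Pitch shifting (step count is an assumed value)
    variations.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=2))
    # Time stretching (rate is an assumed value)
    variations.append(librosa.effects.time_stretch(y, rate=1.1))
    # Adding echo: mix in a delayed, attenuated copy of the signal
    delay = int(0.25 * sr)
    echo = np.zeros_like(y)
    echo[delay:] = 0.4 * y[:-delay]
    variations.append(y + echo)
    # Distortion (clipping)
    variations.append(np.clip(y, -0.3, 0.3))
    return variations
```

The low-pass and band-pass filters could be implemented with `scipy.signal`; they are omitted here for brevity.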
Project Workflow
This project follows a standard machine learning workflow for audio:
1. Feature Extraction
Instead of using averaged feature vectors, this model uses the full 2D Mel-spectrogram for each audio clip. This allows the CNN to learn spatial and temporal patterns, much like analyzing an image.
- Sample Rate: 16000 Hz
- Number of Mel Bands: 128
- Input Size: Each spectrogram is cropped or padded to a uniform size of `(128, 188)`.
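As a concrete sketch of this step, assuming `librosa` defaults (e.g. `hop_length=512`) for everything the card does not specify:

```python
import numpy as np
import librosa

# Load a clip at the training sample rate (16 kHz)
y, sr = librosa.load('example.wav', sr=16000)

# Compute the Mel-spectrogram and convert power to decibels
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # shape: (128, n_frames)

# Crop or zero-pad the time axis to a fixed 188 frames
max_pad_len = 188
if mel_db.shape[1] > max_pad_len:
    mel_db = mel_db[:, :max_pad_len]
else:
    mel_db = np.pad(mel_db, ((0, 0), (0, max_pad_len - mel_db.shape[1])))

print(mel_db.shape)  # (128, 188)
```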
2. CNN Model Architecture
The model uses a `Sequential` architecture from Keras with the following details:
- Input Layer: Accepts an input shape of `(128, 188, 1)`.
- Three Convolutional Blocks: Each block consists of `Conv2D` (with 32, 64, and 128 filters respectively), `MaxPooling2D`, and `BatchNormalization` to stabilize learning.
- Flatten Layer: Converts the output of the convolutional blocks into a 1D vector.
- Dense Layers: A `Dense` layer with 128 units (with `Dropout` of 0.5 for regularization) is followed by a `Dense` output layer with 11 units (one per class) and a `softmax` activation.
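The card does not specify kernel sizes, padding, or hidden activations; a minimal Keras sketch consistent with the description above, filling in those details as assumptions, might look like this:

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 188, 1), num_classes=11):
    # Kernel sizes, padding, and ReLU activations are assumed values
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Block 1: 32 filters
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.BatchNormalization(),
        # Block 2: 64 filters
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.BatchNormalization(),
        # Block 3: 128 filters
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.BatchNormalization(),
        # Classification head
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model
```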
3. Training
The model was trained with the following configuration:
- Optimizer: `adam`
- Loss Function: `sparse_categorical_crossentropy`
- Callbacks: `EarlyStopping` was used to automatically stop training if the validation loss did not improve, preventing overfitting, and `ModelCheckpoint` was used to save the best version of the model during training.
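Put together, a minimal training sketch under this configuration (batch size, epoch count, patience, and the placeholder data are illustrative assumptions):

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Placeholder data; in the real project these are the augmented
# Mel-spectrograms with shape (n_samples, 128, 188, 1)
X_train = np.random.rand(64, 128, 188, 1).astype('float32')
y_train = np.random.randint(0, 11, size=64)
X_val = np.random.rand(16, 128, 188, 1).astype('float32')
y_val = np.random.randint(0, 11, size=16)

model = build_model()  # from the architecture sketch above
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Stop when validation loss stops improving (patience is an assumed value)
    EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
    # Save the best-performing model seen so far
    ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True),
]

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, batch_size=32,
                    callbacks=callbacks)
```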
Results & Evaluation
After training, the model was evaluated on a test set (20% of the total data) that it had never seen before.
The final accuracy on the test set reached ~92%.
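A report in the format below can be generated with scikit-learn's `classification_report`; a minimal sketch, assuming `X_test`, `y_test`, and the fitted `label_encoder` come from the training pipeline:

```python
import numpy as np
from sklearn.metrics import classification_report

# X_test, y_test, and label_encoder are assumed from the training pipeline
y_pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, y_pred,
                            target_names=label_encoder.classes_))
```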
Detailed Classification Report:
```
                  precision    recall  f1-score   support

        car_horn       1.00      0.95      0.98        88
        chainsaw       0.99      0.94      0.96        81
     crying_baby       0.99      0.98      0.98        86
             dog       0.99      0.99      0.99        88
door_wood_creaks       0.98      0.92      0.95        51
 door_wood_knock       0.98      0.96      0.97        48
          engine       0.85      0.45      0.59        88
       fireworks       0.96      0.99      0.97        88
  glass_breaking       0.99      0.98      0.98        88
            rain       0.62      0.98      0.76        84
           siren       0.97      1.00      0.98        88

        accuracy                           0.92       878
       macro avg       0.94      0.92      0.92       878
    weighted avg       0.93      0.92      0.92       878
```
The training history graph shows a healthy learning process without significant overfitting, thanks in large part to the use of `Dropout`, `BatchNormalization`, and `EarlyStopping`.
How to Use the Model
Here is an example of how to use this model with TensorFlow/Keras to predict the label of a new audio file.
```python
import numpy as np
import librosa
import joblib
from tensorflow.keras.models import load_model

# Load the saved model and label encoder
model = load_model('path/to/your/best_model.keras')
label_encoder = joblib.load('path/to/your/cnn_label_encoder.pkl')

def predict_sound(audio_path):
    """Predict the sound class of an audio file."""
    # Parameters must match those used during training
    n_mels = 128
    max_pad_len = 188

    try:
        # 1. Load audio at the training sample rate
        y, sr = librosa.load(audio_path, sr=16000)

        # 2. Extract the Mel-spectrogram (in dB)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        mel_db = librosa.power_to_db(mel, ref=np.max)

        # 3. Crop or zero-pad the time axis to a uniform length
        if mel_db.shape[1] > max_pad_len:
            mel_db = mel_db[:, :max_pad_len]
        else:
            pad_width = max_pad_len - mel_db.shape[1]
            mel_db = np.pad(mel_db, pad_width=((0, 0), (0, pad_width)), mode='constant')

        # 4. Add batch and channel dimensions: (1, 128, 188, 1)
        mel_db = np.expand_dims(mel_db, axis=0)
        mel_db = np.expand_dims(mel_db, axis=-1)

        # 5. Make a prediction
        probabilities = model.predict(mel_db)
        predicted_index = np.argmax(probabilities, axis=1)[0]
        predicted_label = label_encoder.inverse_transform([predicted_index])[0]
        return predicted_label, probabilities[0][predicted_index]
    except Exception as e:
        print(f"Error processing file: {e}")
        return None, None

# --- Example Usage ---
example_file = 'path/to/your/new_audio.wav'
label, confidence = predict_sound(example_file)
if label is not None:
    print(f"File: {example_file}")
    print(f"Predicted Label: {label}")
    print(f"Confidence Level: {confidence:.2%}")
```
Citation
If you use the Maleo Environmental Classification model in your research or project, please cite the following:
BibTeX:
```bibtex
@article{Mardiana2025environmental_classification,
  title={{Maleo Environmental Classification}},
  author={Mardiana, Ardi and Irawan, Eka Tresna and Yanuari, Puri Dewi and Abdurahman, Dede},
  journal={Unpublished work},
  year={2025},
  url={https://huggingface.co/maleo-ai/environmental_classification}
}
```
Contact
For any inquiries or further information regarding this model, please contact the author: Ardi Mardiana.