Audio classification

Hey everyone!
I need to classify audio recordings of machine sounds so I can tell whether the system has a problem (knocks, grinding, clicks, etc.) or is working properly. I also need to label and test about 100 audio files.

Which type of model works best for this task? Are there pretrained models that can be fine-tuned? If not, which approach do you think would work best?

I have already tried the following: for each audio file, I made a spectrogram and fine-tuned a YOLOv8 model to detect the differences. However, this did not give me the accuracy I was looking for, probably because the dataset was too small.

Thanks in advance.

7 Likes

Check out the winning entries of the DCASE challenge from the last three years. You should pick up at least some tips there.

8 Likes

What is the total duration of your samples? How many are of the machine working properly and how many are not?

Do you know how many types of failure sounds there are, or do you need to find out? I’ve written a script that takes an audio file, extracts features such as MFCCs, spectral contrast, and chroma, and then runs faiss k-means over a range of cluster counts (I have set 2–10) to find the best number of clusters (this part I’m not happy with yet), and so on. I can put it on GitHub if you’re interested.
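A minimal sketch of the cluster-count search described above, in case it helps. This is not the script mentioned in the post: it swaps faiss for scikit-learn's `KMeans`, picks the count with the highest silhouette score (one common criterion, chosen here as an assumption), and uses synthetic blobs in place of real per-clip feature vectors. The function name `best_cluster_count` is made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_cluster_count(features, k_range=range(2, 11), seed=0):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scores = {}
    for k in k_range:
        # Cluster the feature vectors into k groups.
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
        # Silhouette is in [-1, 1]; higher means tighter, better separated clusters.
        scores[k] = silhouette_score(features, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for per-clip feature vectors (assumption): three separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.vstack([c + 0.5 * rng.standard_normal((40, 2)) for c in centers])

best_k, scores = best_cluster_count(features)
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

On real recordings the feature matrix would be one row per clip (for example, MFCC statistics), and it is usually worth standardizing the columns before clustering so that no single feature dominates the distances.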

The first thing that came to mind was unsupervised deep learning, which I have read about being used in a similar way (have you looked on arXiv?), but that might take some time.

7 Likes

So image classification of the spectrograms? How long are the audio samples?

5 Likes

They are around 3 minutes each.

4 Likes

I think your sample size is too small, especially if you want to avoid overfitting. Since the recordings are long, can you split them up? You could use clustering to see whether there are distinct periods, or just split them at random. My other suggestion is to use time series classification instead: extract audio features such as MFCC, chroma, spectral, and maybe even rhythmic features (the librosa library for Python), then run time series classification on them and see whether it produces better results.
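A rough sketch of the splitting idea, using only the standard library. The 10-second window, 5-second hop, 16 kHz sample rate, and the helper names `window_bounds` and `split_by_recording` are all assumptions for illustration. The one point worth keeping whatever the parameters: segments cut from the same recording should all land on the same side of the train/test split, otherwise the test score is inflated by near-duplicate audio.

```python
import random
from collections import defaultdict


def window_bounds(n_samples, sr, win_s=10.0, hop_s=5.0):
    """Return (start, end) sample indices of fixed-length, overlapping windows."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return [(s, s + win) for s in range(0, n_samples - win + 1, hop)]


def split_by_recording(segments, test_fraction=0.2, seed=0):
    """Split (recording_id, start, end) segments so no recording is in both sets."""
    by_rec = defaultdict(list)
    for seg in segments:
        by_rec[seg[0]].append(seg)
    rec_ids = sorted(by_rec)
    random.Random(seed).shuffle(rec_ids)
    # Hold out whole recordings, not individual segments.
    n_test = max(1, round(test_fraction * len(rec_ids)))
    test_ids = set(rec_ids[:n_test])
    train = [s for r in rec_ids if r not in test_ids for s in by_rec[r]]
    test = [s for r in rec_ids if r in test_ids for s in by_rec[r]]
    return train, test


# 104 recordings of 3 minutes each; the 16 kHz sample rate is an assumption.
sr = 16_000
segments = [
    (rec_id, start, end)
    for rec_id in range(104)
    for start, end in window_bounds(180 * sr, sr)
]
train, test = split_by_recording(segments)
print(len(segments), len(train), len(test))
```

Each `(start, end)` pair can then be used to slice the loaded waveform before feature extraction. With labelled data you would probably also want the held-out recordings to contain both normal and faulty examples, which this sketch does not enforce.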

5 Likes

Why not try a WaveNet?

4 Likes

I thought this model was for voice generation, isn’t it?

3 Likes

I have 104 samples, 3 minutes each.
There are 3-4 different malfunction sounds, but first I want to train a model just to separate normal audio from audio with malfunction sounds.

I would be very grateful if you shared a link to your script on GitHub. You’ve got an interesting approach.

I haven’t looked on arXiv, just Google. I also tried my idea with YOLO, but there are problems with the audio itself: some recordings contain noise and some are of poor quality, so I think it’s worth preprocessing them before sending them to the model.

2 Likes

Since I thought Whisper was a speech transcription model, I didn’t think in that direction, but I’ll try it now. Thank you.

How large a dataset did you need to get your score?