I am new to the ML world and I'm looking for some ideas and a sanity check.
I'm developing an embedded system that will detect the presence of a kick drum in audio streamed from a microphone. The goal is a real-time kick drum detector using a typical ML classification approach.
The issue I'm having is a discrepancy in performance between the model running on my PC and the same model running on the target microcontroller. I have verified that the audio preprocessing on the micro is the same as in my Python script. I've "validated" that the convolutional neural network running on the uC performs the same as on the desktop. So this leads me to question the dataset that I'm training with.
My instinct is telling me that the issue is with the dataset I'm using. I have dozens of songs in .wav format along with their respective labels (I've verified these are good) that I'm using to train the model. The audio for the uC comes from a microphone sampled by a 16-bit ADC. I'm guessing that this difference is where the issue is coming from.
I've tried augmenting the audio files to simulate the effects of being recorded through a microphone, namely quantization noise, amplifier noise, background noise and various volume levels. Unfortunately this makes minimal improvement, so here I am making this post.
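For reference, a minimal sketch of the kind of augmentation described above. The function name `augment`, the gain and noise ranges, and the use of white Gaussian noise as a stand-in for amplifier noise are all assumptions for illustration, not the actual pipeline. Background noise would need real recordings mixed in and is left out here.

```python
import numpy as np

def augment(audio, rng, gain_db_range=(-20.0, 0.0),
            noise_db_range=(-60.0, -30.0), bits=16):
    """Return an augmented copy of `audio` (float samples in [-1, 1]).

    Hypothetical helper: the ranges are illustrative guesses, not measured
    values from the target hardware.
    """
    # Random volume change, expressed in dB relative to the original level.
    gain_db = rng.uniform(*gain_db_range)
    out = audio * 10.0 ** (gain_db / 20.0)

    # Additive white noise as a crude stand-in for amplifier/mic noise.
    noise_db = rng.uniform(*noise_db_range)
    noise_std = 10.0 ** (noise_db / 20.0)
    out = out + rng.normal(0.0, noise_std, size=out.shape)

    # Quantize to the ADC resolution (16 bits by default), with clipping.
    scale = 2 ** (bits - 1)
    out = np.clip(out, -1.0, 1.0 - 1.0 / scale)
    return np.round(out * scale) / scale

# Example: augment one second of a 60 Hz tone sampled at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2.0 * np.pi * 60.0 * t)
noisy = augment(clean, rng)
```

Calling `augment` several times per file with the same `rng` gives several differently degraded copies of each song.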
I feel that my next step is to record all of the songs using the target embedded system and then use those recordings to train my model. I'm hoping to get some opinions about the issue I'm seeing and whether I need to spend the time to record all the songs using the uC. Any advice is appreciated.
Please let me know if details about the system/model are needed.
Is it the intention to connect the mic directly to the uC or through a PA system? Are you sure that the audio for the uC is good? Have you done a few test recordings on the uC and fed them to the PC?
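One way to act on the test-recording suggestion is to compute a few level statistics on a clip captured by the uC and on the matching training .wav, then compare them. The sketch below is only an illustration: the helper name `level_stats`, the frame length, and the use of the 10th percentile of frame RMS as a noise floor estimate are assumptions, and it expects samples already loaded as floats in [-1, 1].

```python
import numpy as np

def level_stats(audio, frame_len=1024):
    """Rough level statistics for float audio in [-1, 1] (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float64)

    # Split into non-overlapping frames and compute RMS per frame.
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    frame_rms = np.sqrt(np.mean(frames ** 2, axis=1))

    def to_db(x):
        # Floor the value to avoid log10(0) on digital silence.
        return float(20.0 * np.log10(max(float(x), 1e-12)))

    return {
        "rms_dbfs": to_db(np.sqrt(np.mean(audio ** 2))),
        "peak_dbfs": to_db(np.max(np.abs(audio))),
        # The quietest frames approximate the noise floor (an assumption).
        "noise_floor_dbfs": to_db(np.percentile(frame_rms, 10)),
        # A large DC offset is a common ADC-side difference from .wav files.
        "dc_offset": float(np.mean(audio)),
    }

# Example with a synthetic 100 Hz tone; in practice pass the uC capture
# and the corresponding training clip, then compare the two dicts.
t = np.arange(16000) / 16000.0
stats = level_stats(0.5 * np.sin(2.0 * np.pi * 100.0 * t))
print(stats)
```

If the two sets of numbers differ by much more than the augmentation ranges cover, that would point at the capture path rather than the model.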
The uC is quite powerful, so I'm using the floating-point engine. The spectrogram going into the NN is similar on both the uC and the Python script training the NN.
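To put a number on "similar", the spectrogram computed on the uC can be dumped and compared against the Python one for the same input samples. This is a sketch under assumptions: `compare_spectrograms` is a hypothetical helper, both arrays are assumed to have the same shape (frames by bins), and the metrics chosen are just a reasonable starting point.

```python
import numpy as np

def compare_spectrograms(ref, test):
    """Compare two spectrograms of identical shape (hypothetical helper)."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    if ref.shape != test.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {test.shape}")

    diff = test - ref
    return {
        # Worst-case difference in any single bin.
        "max_abs_diff": float(np.max(np.abs(diff))),
        # RMS error relative to the RMS of the reference.
        "rel_rms_error": float(
            np.sqrt(np.mean(diff ** 2)) / (np.sqrt(np.mean(ref ** 2)) + 1e-12)
        ),
        # Overall correlation across all bins.
        "correlation": float(np.corrcoef(ref.ravel(), test.ravel())[0, 1]),
    }

# Example with synthetic data standing in for the two spectrograms.
rng = np.random.default_rng(0)
ref_spec = rng.random((50, 40))
same = compare_spectrograms(ref_spec, ref_spec)
shifted = compare_spectrograms(ref_spec, ref_spec + 0.1)
```

A constant offset like the `shifted` case keeps the correlation high while the error metrics grow, which is the kind of level mismatch a noise floor or gain difference could produce.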
There are differences though. One is the noise floor of the uC compared to the CD-quality .wav files used to train. Also, the sound level is constant in a .wav file, while it will vary in a real-world environment. I've augmented every .wav file to make several copies, each with different levels of noise and volume. I thought this would help, but it had minimal effect.