Bird Audio Detection: tips on building robust detectors

Recently we launched the Bird Audio Detection Challenge, providing new datasets and a contest to improve the state of the art in general-purpose bird detection in audio. In our previous post we talked about the problem of generalisation, demonstrating that automated detectors often have problems generalising beyond the previously-seen data conditions. In this post I’d like to share some signal-processing and machine-learning tips for building systems that can adapt to new conditions.

Signal processing tips:

Filtering. It’s quite common to filter out frequency bands that you’re not interested in. For example, a lot of birdsong analysis starts by filtering out frequencies below about 500 Hz or 1000 Hz, because a lot of the lower frequency end is dominated by sound caused by humans (traffic, aeroplane noise, speech) and also by wind noise. Removing the energy in these bands removes distractions from the signal. BUT! Don’t forget that we have a wide range of birds in mind here – if you filter out too many frequencies, you might lose some frequencies containing energy from birds.
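Here is a minimal sketch of that kind of high-pass filtering, assuming SciPy is available. The function name `highpass`, the 500 Hz cutoff and the filter order are illustrative choices of mine, not settings prescribed by the challenge.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass(audio, sample_rate, cutoff_hz=500.0, order=4):
    """Attenuate energy below cutoff_hz with a zero-phase Butterworth filter."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, audio)

# Toy signal (hypothetical): 100 Hz "traffic rumble" plus a 3 kHz "bird" tone.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
rumble = np.sin(2 * np.pi * 100 * t)
bird = 0.5 * np.sin(2 * np.pi * 3000 * t)
filtered = highpass(rumble + bird, sample_rate)
```

Raising `cutoff_hz` removes more low-frequency noise, but it also risks cutting into the calls of lower-pitched birds, so it is worth trying more than one value.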
Noise reduction. Background noise in these outdoor datasets varies quite a lot. In general it’s a good idea to remove it… but how? To be honest most noise reduction systems will not be able to fully remove noise from such diverse scenarios, but some basic noise reduction will probably help to get the signal standing out from the background a bit. A common thing to do is estimate the noise energy at each frequency – assuming that it’s constant through time – and then try to subtract it away. You can do this by spectral median filtering. Alternatively, if you are working with spectrograms, there are various noise-reduction methods designed for images, some of which are appropriate for spectrograms – check in your favourite toolbox what they actually do…
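One simple reading of the "estimate and subtract" idea is sketched below. It assumes you already have a non-negative spectrogram, and it uses the median over time in each frequency band as the noise estimate. The function name `subtract_median_noise` and the toy data are hypothetical.

```python
import numpy as np

def subtract_median_noise(spec):
    """Subtract each frequency band's median over time, clipping at zero.

    spec: non-negative magnitude (or power) spectrogram, shape (n_freq, n_frames).
    """
    noise_profile = np.median(spec, axis=1, keepdims=True)  # one value per band
    return np.maximum(spec - noise_profile, 0.0)

# Toy example: a steady noise floor that differs per band, plus one brief "call".
rng = np.random.default_rng(0)
noise_floor = np.array([[5.0], [2.0], [1.0], [0.5]])
spec = noise_floor + 0.1 * rng.random((4, 200))
spec[2, 50:60] += 3.0  # short event in band 2
cleaned = subtract_median_noise(spec)
```

The median works here because the event is short compared with the recording. A bird that sings through most of a file would pull the median up and be partly subtracted away.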
Normalising the data. For example:
- Normalising the amplitude of each sound file is a very simple and very common step: e.g. so that each file’s audio data fits in +-1, or so that it has zero mean and unit variance.
- Spectral whitening is a more involved idea – it makes sure that each frequency band in an audio file has a consistent amount of energy. (You can do this by applying LPC filtering, or via a spectrogram analysis.) The good thing about this is that it takes away any frequency emphasis that might be caused by the microphone or other aspects of the environment.
- Cepstral mean subtraction – this is something used in speech processing to reduce stationary channel effects in the recording, such as the microphone’s frequency response and, to some extent, short reverberation.
One thing to watch out for when normalising is what to do with a signal that is pure zeros or pure DC – a lot of normalisations could give divide-by-zero errors in this case. Probably best to just skip those, leaving them at zero.
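The sketch below shows the two amplitude normalisations with a guard for silent or constant signals, plus a rough spectrogram-based stand-in for spectral whitening (not the LPC approach). All function names and the `eps` threshold are assumptions of mine.

```python
import numpy as np

def peak_normalise(audio, eps=1e-12):
    """Scale audio so its largest absolute sample is 1; silence stays at zero."""
    peak = np.max(np.abs(audio))
    if peak < eps:
        return np.zeros_like(audio, dtype=float)
    return audio / peak

def standardise(audio, eps=1e-12):
    """Shift to zero mean and unit variance; constant (DC) signals become zeros."""
    centred = audio - np.mean(audio)
    std = np.std(centred)
    if std < eps:
        return np.zeros_like(centred, dtype=float)
    return centred / std

def whiten_bands(spec, eps=1e-12):
    """Rough whitening: scale each band of a spectrogram to unit mean energy.

    spec: non-negative spectrogram, shape (n_freq, n_frames). Empty bands stay zero.
    """
    band_energy = np.mean(spec, axis=1, keepdims=True)
    safe_energy = np.where(band_energy < eps, 1.0, band_energy)
    return spec / safe_energy
```

For example, `standardise(np.full(10, 3.0))` returns ten zeros rather than raising a warning or producing NaNs.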
Choice of representation. Most people working in this area will work with a “default” representation of an audio signal as a spectrogram or as MFCCs. Are these optimal? In general people don’t seem to get a massive leap in performance by spending a lot of time tweaking the representation. However, here’s one recent example: Sandsten et al (2016) tested a standard Hann-windowed spectrogram against a multitaper spectrogram, and found a consistent moderate benefit in their classification results.

Now here’s a question: in which order should you apply these steps? Does it matter?

Machine learning tips:

Not every detector is based on machine learning, but it’s a powerful way to go beyond hand-crafted decision rules and make use of existing data examples.

In machine learning research, there are lots of discussions about generalisability and robustness. A useful reference is this textbook: Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. “Covariate shift” is a term referring to a specific type of generalisation problem, very relevant here.

More generally here are some suggestions:

Multi-condition training / data augmentation. If you’re training a neural net or something else, it often helps to include data from various conditions in the training procedure. In the BAD Challenge we specify exactly which datasets you may use, and you may not include other types of dataset – BUT you can use data augmentation to create modified copies of the data in various ways. You could apply low-bitrate MP3 compression to the audio files, add noise, or superimpose files on top of each other (take care of what that implies for labelling – should your augmented data items have the same annotations as the files they’re based on?). The hope is that the classifier will therefore learn better to focus on the aspects of the signal that really matter.
There are programming tools to help you degrade your audio files! For Python there’s muda and for MATLAB there’s the Audio Degradation Toolbox.
- Note that, again, it might make a difference if you do data augmentation before or after the preprocessing mentioned above. If you do data augmentation beforehand, then that’s a closer match to the idea that you’re simply generating extra data points to feed into the pipeline, which is fair enough – but is there much point adding noise, for example, if you’re then going to apply noise reduction? If you do it afterward (i.e. just before you apply your classifier) then this exposes the classifier to a wider range of “irrelevant modifications”.
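As a minimal sketch of two of the augmentations mentioned above (adding noise and superimposing files), using plain NumPy rather than muda or the Audio Degradation Toolbox. The function names are hypothetical. The labelling rule in `mix`, that a mixture contains a bird if either source does, is my assumption about presence/absence labels, so check that it matches your annotations.

```python
import numpy as np

def add_noise(audio, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def mix(audio_a, label_a, audio_b, label_b):
    """Superimpose two equal-length clips; label is 1 if either clip has a bird."""
    mixed = 0.5 * (audio_a + audio_b)
    return mixed, int(label_a or label_b)

rng = np.random.default_rng(0)
t = np.arange(44100) / 44100
bird_clip = np.sin(2 * np.pi * 3000 * t)      # stand-in for a clip labelled 1
empty_clip = 0.01 * rng.normal(size=t.shape)  # stand-in for a clip labelled 0
noisy_clip = add_noise(bird_clip, snr_db=10, rng=rng)
mixed_clip, mixed_label = mix(bird_clip, 1, empty_clip, 0)
```

The noisy copy keeps the label of the file it came from, whereas the mixed copy needs its label recomputed.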
Self-adaptation. We’ve already told you that the public data and the private testing data come from different recording scenarios. So, even after all the normalisation, we should expect the test set to differ somewhat from the data you’ve been developing with. But your algorithm gets to see all the test data. What if it could automatically tweak its own settings to give itself the best chance of working well, even when it doesn’t know what the true answers are? This is studied in speech recognition, for example, where speakers’ voices might be different from those in the training data, so researchers discuss speaker adaptation techniques. There’s a large amount of speech recognition research, so maybe you’ll find some useful tips from speaker adaptation.
One weird trick used in machine learning is as follows: first run your algorithm and get its predicted outputs; then pretend that those outputs are the actual groundtruth, and use them to tune some algorithm parameters (e.g. re-train the classifier); then re-run the algorithm and take its new decisions as the final output. This relies on the fact that in many cases, the algorithm’s first run – although imperfect – is getting results roughly along the right lines, and so is giving a “weak training signal” that is, one hopes, better than nothing.
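This trick is often called self-training or pseudo-labelling. Below is a toy sketch of the three steps, using a one-feature nearest-centroid classifier that I made up for illustration. The synthetic data, the size of the shift and the choice to re-train on training plus pseudo-labelled test data are all assumptions.

```python
import numpy as np

def fit_centroids(x, y):
    """Nearest-centroid 'classifier': store the mean feature value per class."""
    return np.array([x[y == 0].mean(), x[y == 1].mean()])

def predict(centroids, x):
    """Assign each point to the class with the closer centroid."""
    return np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)

rng = np.random.default_rng(0)
# Labelled training data: one feature, two classes.
x_train = np.concatenate([rng.normal(0.0, 0.5, 1000), rng.normal(2.0, 0.5, 1000)])
y_train = np.concatenate([np.zeros(1000, int), np.ones(1000, int)])
# Test data from a shifted recording condition (true labels used only for scoring).
x_test = np.concatenate([rng.normal(0.8, 0.5, 1000), rng.normal(2.8, 0.5, 1000)])
y_test = y_train.copy()

# Step 1: first pass with the classifier trained on labelled data only.
first_pass = predict(fit_centroids(x_train, y_train), x_test)
# Step 2: pretend those predictions are ground truth and re-train.
x_all = np.concatenate([x_train, x_test])
y_all = np.concatenate([y_train, first_pass])
adapted = fit_centroids(x_all, y_all)
# Step 3: re-run and keep the new decisions as the final output.
final = predict(adapted, x_test)

acc_first = np.mean(first_pass == y_test)
acc_final = np.mean(final == y_test)
```

In this toy setup the decision boundary moves toward the shifted test data. If the first pass were mostly wrong, the same procedure would reinforce its mistakes.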
Regularisation. Machine-learning methods often can learn anything they want about a dataset, which is good because it means they can discover unknown ways to make predictions, but it’s bad because it means they can rely too heavily on the patterns in their training data. A good machine learning algorithm is one that avoids overfitting. For many methods you can do this by regularisation, which is essentially adding a soft constraint on the learnt parameters so that they can’t simply fit as tightly as they like to the training data. (If you’re a Bayesian, you can also think of using a strong prior on the parameter values.)
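To make the "soft constraint" concrete, here is a small sketch using linear least squares with an L2 penalty (ridge regression) rather than a real bird detector. The function name `fit_ridge`, the penalty strength and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_ridge(features, targets, l2_strength):
    """Minimise ||Xw - y||^2 + l2_strength * ||w||^2 in closed form."""
    n_features = features.shape[1]
    gram = features.T @ features + l2_strength * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ targets)

rng = np.random.default_rng(0)
features = rng.normal(size=(30, 20))                   # few examples, many features
targets = features[:, 0] + 0.5 * rng.normal(size=30)   # only feature 0 matters
w_free = fit_ridge(features, targets, l2_strength=0.0)
w_reg = fit_ridge(features, targets, l2_strength=10.0)
```

The penalised weights `w_reg` have a smaller norm than `w_free`, so the model is less able to chase noise in the 30 training examples. In the Bayesian reading, this L2 penalty corresponds to a zero-mean Gaussian prior on the weights.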
Combining models. Imagine you have one detector that’s good at detecting high-pitched twittery birds, and one that’s good at detecting lower-pitched birds. You can combine their outputs simply by averaging them together – or perhaps by a weighted average, weighted according to how much you trust their outputs. This is the basic idea of model averaging, also known as late fusion, and it’s often used to get a few percentage-points better performance by people working on machine-learning challenges. The models could be explicitly designed for different purposes (as I suggested above), or they could simply be multiple attempts to solve the problem. A more sophisticated alternative to averaging is to combine the outputs of the various detectors using yet another machine-learning algorithm.
One good thing about combining models is it’s also an easy way to get outputs that are not just 1 or 0. Remember that for the challenge we ideally want systems that output a value BETWEEN 1 and 0 depending on their confidence. If you’re working with methods that can only give you a black-or-white yes/no answer, then averaging over more than one system can at least give you a bit more nuance in the output numbers.
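A minimal sketch of late fusion by (weighted) averaging, with three hypothetical detectors that only give yes/no answers. The function name and the weights are illustrative assumptions.

```python
import numpy as np

def weighted_average(predictions, weights=None):
    """Late fusion: combine per-detector outputs, shape (n_detectors, n_files)."""
    predictions = np.asarray(predictions, dtype=float)
    return np.average(predictions, axis=0, weights=weights)

# Three hypothetical detectors giving hard 0/1 decisions for four files.
hard_decisions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
fused = weighted_average(hard_decisions)
trusted = weighted_average(hard_decisions, weights=[0.5, 0.25, 0.25])
```

Even though every input is 0 or 1, the fused outputs take intermediate values wherever the detectors disagree, which gives the graded confidence the challenge asks for.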

But remember: if you’re developing a system to perform detection, the crucial thing to do is dig below the headline score. Don’t just add/remove processing steps and look at the numbers that pop out. Find some examples that your system gets wrong (false positives and false negatives) and see if you can identify any tendencies that tell you why it’s doing what it’s doing.
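One way to start digging is to list the files your system gets wrong, most confident mistakes first, and then go and listen to them. The sketch below assumes you have a score per file and a 0/1 label per file. The function name, the 0.5 threshold and the file names are hypothetical.

```python
import numpy as np

def list_errors(file_ids, scores, labels, threshold=0.5):
    """Return (false_positives, false_negatives), most confident mistakes first."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    decisions = scores >= threshold
    # False positives: detector said "bird" but the label says none.
    fp = [i for i in np.argsort(-scores) if decisions[i] and labels[i] == 0]
    # False negatives: detector said "no bird" but the label says there is one.
    fn = [i for i in np.argsort(scores) if not decisions[i] and labels[i] == 1]
    return [file_ids[i] for i in fp], [file_ids[i] for i in fn]

file_ids = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]
scores = [0.9, 0.2, 0.7, 0.1, 0.6]
labels = [1, 1, 0, 0, 0]
false_pos, false_neg = list_errors(file_ids, scores, labels)
```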

> More about the Bird Audio Detection challenge