Bird Audio Detection: baseline tests – and the problem of generalisation

Recently we launched the Bird Audio Detection Challenge, providing new datasets and a contest to improve the state of the art in general-purpose bird detection in audio.

So what is the state of the art? For a thorough answer, you can read our survey paper recently presented at the IEEE MLSP workshop 2016. For a numerical estimate of quality, we can take the data we’ve published, apply a couple of off-the-shelf methods, and see what happens…

An important thing we want to encourage through this challenge is generalisation. This means we want a detector that works across a wide range of species, and across a wide range of environments, without having to be manually tweaked each time it's used in a different context. That's why we published two very different datasets for you to develop your ideas on. One dataset (ff1010) is miscellaneous field recordings from all around the world, containing birds about 25% of the time along with various other background sounds; the other (warblr) is crowdsourced mobile-phone recordings from all around the UK, collected via our bird classification app Warblr, containing birds about 75% of the time and often with the sound of people mixed in. (In the final test we'll also have recordings from a site of high scientific interest: the Chernobyl Exclusion Zone.)

So here’s a question: if we take an existing machine-learning method, and “train” it using one of those datasets, can it generalise well enough to make sense of the other dataset?

We've looked at this question using two audio classifiers:

  1. The first is the classic “MFCCs and GMMs” method, which has been used in more research papers than we can count. It often serves as a standard (if not great) baseline against which other methods are compared, but in some cases it may itself be good enough for the job. We used our open-source Python implementation of it, called smacpy.
  2. The second is a variant of a method that we found useful for species classification: spherical k-means feature learning, followed by a Random Forest classifier. It is described in more detail in our journal paper on large-scale classification of bird sounds. (It is labelled “skfl” in the plot below.)
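To make the baseline concrete, here is a minimal sketch of the decision scheme behind an MFCC+GMM detector: fit one Gaussian mixture model per class on the pooled feature frames, then label a clip by whichever model gives its frames the higher total log-likelihood. This is not smacpy itself; for a self-contained example, synthetic 13-dimensional frames stand in for real MFCCs (which would normally come from an audio front-end such as librosa):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def fake_clip(mean, n_frames=100):
    # Stand-in for a clip's MFCC frames: one row per frame, 13 coefficients.
    return mean + rng.normal(size=(n_frames, 13))

bird_clips = [fake_clip(1.0) for _ in range(20)]
nobird_clips = [fake_clip(-1.0) for _ in range(20)]

# One GMM per class, fitted on all frames pooled across the training clips.
gmm_bird = GaussianMixture(n_components=4, random_state=0).fit(np.vstack(bird_clips))
gmm_nobird = GaussianMixture(n_components=4, random_state=0).fit(np.vstack(nobird_clips))

def detect(clip):
    # Classify a whole clip by comparing average per-frame log-likelihoods.
    return gmm_bird.score(clip) > gmm_nobird.score(clip)
```

Note how little the model asks of the data: each class is summarised by a handful of Gaussians over single frames, with no temporal structure at all, which is part of why this baseline is inflexible.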

So how well did they perform? Here’s a graph:

[Figure: plot of the performance of the two classifiers, discussed below]


To understand the AUC (Area Under the ROC Curve) values on the y-axis: 100% is what we want, perfect detection, while 50% is the value you get from “pure ignorance” (e.g. flipping a coin to decide how to label each file).
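Those two reference points are easy to verify numerically. The toy scores below are invented for illustration: random scores land near 50% AUC, while scores that perfectly separate the classes give exactly 100%:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)   # ground truth: bird present or not

coin_scores = rng.random(1000)           # "pure ignorance": random scores
perfect_scores = labels.astype(float)    # a detector that scores every file correctly

auc_coin = roc_auc_score(labels, coin_scores)        # hovers around 0.5
auc_perfect = roc_auc_score(labels, perfect_scores)  # exactly 1.0
```

A useful property of AUC is that it depends only on how the files are ranked by score, not on any particular decision threshold.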

The first thing you notice is that our “modern” classifier (in yellow and green) performs much better than the baseline MFCC+GMM classifier (orange and blue). OK, so far so good. In fact in the first column the modern classifier gets 100% performance – but that’s to be expected, since in the first column we test each system using the training data that it’s already seen.

Why doesn't the MFCC+GMM classifier get 100% even on data it has already seen? Essentially, it's not powerful or flexible enough to learn a model that fully captures all the yes/no decisions it was shown, so it's showing a bit of a limitation. But when we look at the second column – testing the classifier on the same type of data it was trained with, but on new, unseen items – the “mediocre” performance continues at about the same level. The benefit of this inflexible classifier is that it has strongly avoided “overfitting” to any particular set of datapoints, even though it's not great at the task.

The more powerful classifier does take a bit of a hit when we test it on unseen data – down from 100% to about 85%. I wouldn't really have expected it to get 100% anyway, but you can see that if you'd only run the first test you might have got a false impression! Even though its performance drops to about 85%, it still outperforms the baseline classifier by a substantial margin.

This middle column is the kind of result you typically see quoted in research on classifiers. Train it on some data, test it on some similar-but-held-out data. But the third column is where it gets even more interesting!

In the third column, we test each classifier using data from the other dataset (warblr if we trained on ff1010; or ff1010 if we trained on warblr). Importantly, we’re still asking the same question (“Is there any bird or not?”), but the data represents different conditions under which we might ask that question. It certainly contains types of foreground and background sound that the classifier has never been exposed to during its training!
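The three-column evaluation protocol can be sketched in a few lines. This is a toy reconstruction, not our actual experiment: a Random Forest stands in for the trained detector, and two synthetic "datasets" for the same task stand in for ff1010 and warblr, with a deliberate shift in their feature statistics playing the role of different recording conditions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_dataset(offset, n=400):
    # Same bird/no-bird task, but "recording conditions" shift the features.
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 10)) + y[:, None] * 1.5 + offset
    return X, y

X_a, y_a = make_dataset(offset=0.0)   # conditions the detector was trained in
X_b, y_b = make_dataset(offset=2.0)   # the "other dataset": shifted conditions

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_a, y_a)

auc_same = roc_auc_score(y_a, clf.predict_proba(X_a)[:, 1])    # data already seen
auc_cross = roc_auc_score(y_b, clf.predict_proba(X_b)[:, 1])   # unseen conditions
```

Even in this toy setup, the cross-conditions AUC falls below the within-conditions AUC: the classifier has implicitly learned the recording conditions along with the task.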

This is a test of generalisation, and you can see that the baseline MFCC+GMM classifier (“smacpy”) really falls at this hurdle. Its quality drops right down to 60% or lower – pretty close to the 50% of “pure ignorance”! The modern skfl classifier takes a hit too, falling to just below 80%. Not a major catastrophe, but below the performance we want from a general-purpose detector out in the field.

The two datasets represent related but different data-collection scenarios, and it's not very surprising that a standard algorithm fails to generalise to scenarios it has never been told anything about. In fact, a more “powerful” algorithm may exhibit this issue even more clearly, because it has the freedom to use the input data in all kinds of ways that might not be advisable! This issue has recently received a lot of attention in the “deep learning” field, where it has been found (among other things) that many different machine learning systems can easily be fooled by examples that lie outside the domain for which they were trained.

So what can we do about this? There are various techniques that might be useful, and in designing this challenge with bird audio we don’t want to pin things down about what approaches might work well. Maybe it’s possible to normalise the data, removing the factors that make the sound scenarios different from one another. Maybe it’s best not to lean too heavily on machine learning. Maybe it’s best to learn from existing work on how to build more generalisable systems. Maybe it’s best to use a few different methods and combine the results. In a future post we will go into this in more detail, outlining approaches that some researchers have been exploring.
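One of the maybes above – combining the results of a few different methods – can be as simple as averaging the scores of several classifiers ("late fusion"). The sketch below uses toy data and two arbitrarily chosen scikit-learn models purely for illustration; it is not a recommendation of these particular models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 8))
y = (X[:, :4].sum(axis=1) + rng.normal(size=600) > 0).astype(int)  # toy labels
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
lr = LogisticRegression().fit(X_tr, y_tr)

# Late fusion: average the two models' bird-probabilities per file.
avg_scores = (rf.predict_proba(X_te)[:, 1] + lr.predict_proba(X_te)[:, 1]) / 2
auc_avg = roc_auc_score(y_te, avg_scores)
```

The hope with such ensembles is that different models fail in different ways, so their combined score is more robust than either alone, though this is by no means guaranteed under dataset shift.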

If you’re developing a system to perform detection, the crucial thing to do is dig below the headline score. Don’t just look at the numbers. Find some examples that your system gets wrong (false positives and false negatives) and see if you can identify any tendencies that tell you why it’s doing what it’s doing.
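Digging below the headline score can start very simply: given ground-truth labels and your system's decisions, pull out the indices of the misclassified files and go listen to them. The labels and predictions below are invented for illustration; in practice they come from your own evaluation run:

```python
import numpy as np

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = bird present
preds  = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # your detector's decisions

false_positives = np.where((preds == 1) & (labels == 0))[0]  # said "bird", wasn't
false_negatives = np.where((preds == 0) & (labels == 1))[0]  # missed a bird
```

Listening to the files at those indices is where the real diagnosis happens: you may find, say, that the false positives are all wind noise, or the false negatives are all distant calls.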

> More about the Bird Audio Detection challenge


Announcing the Bird Audio Detection Challenge

On behalf of the organising committee and the IEEE AASP TC,* we invite you to consider the first Bird Audio Detection Challenge:

Detecting bird sounds in audio is an important basic task in various applications. The current generation of software tools requires manual work from the user: choosing the algorithm, tuning the settings, and post-processing the results. This is holding bioacoustics back from embracing its “big data” era: let's make this better!

We look forward to participation from students, researchers and developers. We have prizes of £500 and €500 for the best submissions.

Please share this with anyone who may be interested. This is a cross-disciplinary challenge, and to solve it we need to reach people in many different subject areas.

* the AASP TC is the Audio & Acoustic Signal Processing Technical Committee of the IEEE Signal Processing Society


Three Awards at ISMIR 2016!

Researchers at the Machine Listening Lab at QMUL won three of the five awards at the 2016 International Society for Music Information Retrieval Conference (ISMIR 2016)!

Congratulations to everyone for their hard work!


Singing transcription project started

A new collaborative project that will address the problem of automatic transcription of multiple singers has been launched by Queen Mary University of London and the Federal University of Rio Grande do Sul (UFRGS – Brazil). The £24k project, entitled “Automatic transcription of polyphonic music recordings”, is funded by the Royal Academy of Engineering through its Newton Research Collaboration Programme, and will last 12 months.

The project involves Dr Rodrigo Schramm (UFRGS) and Dr Emmanouil Benetos (QMUL), and its main goals are to:
1) Propose and develop new technologies to perform automatic transcription of audio recordings generated by multiple singers.
2) Establish a collaborative network between researchers from UFRGS/Brazil and C4DM/UK.
3) Introduce new technologies for music information retrieval which can assist music education in Brazil, especially in the context of e-learning.


Summer Launch Event – Tue 28th June

To celebrate the new Machine Listening Lab, we’re pleased to announce an afternoon with two invited seminars from distinguished researchers – specialising in speech and in natural sound:

  • 14:00: Prof Yiannis Stylianou (University of Crete, Greece)
    Adaptive sinusoidal modeling
  • 14:50: Dr Mike Wood (University of Salford, Greater Manchester)
    Listening to alienated life: Acoustic ecology in Chernobyl
  • […break…]
  • 16:30: Icebreak event for QMUL researchers – get to know other researchers working in this field.

You can come to any or all of these events.

Location: Bancroft Road room BR4.02 (fourth floor). (Building #10 on the QMUL campus map.)

(For the icebreak event, QMUL researchers please register with the organisers.)


HORSE 2016!

Research workshop, QMUL, London
Monday 19 September 2016
Location TBD

Have you uncovered a “horse” in your domain? Or perhaps discovered a “Potemkin village”?

Call for presentations here.


Project DaCaRyH has begun!

Read all about it: Data science for the study of Calypso-rhythm through history (DaCaRyH)


Funded PhD, deadline April 1st: “Assessing cattle welfare using their vocalisations”

Funded PhD: “Assessing cattle welfare using their vocalisations” – supervised by Dr Alan McElligott. The application deadline is April 1st.


MLLab at ICASSP 2016

At the ICASSP 2016 conference this week:
