Best paper award at 2017 AES Conference on Semantic Audio

As part of the 2017 AES Conference on Semantic Audio, the paper “Automatic transcription of a cappella recordings from multiple singers” by Rodrigo Schramm and Emmanouil Benetos received the conference’s Best Paper Award. A postprint of the paper can be found here.

Posted in Publications

MLLab research in the IEEE/ACM TASLP special issue on Sound Scene and Event Analysis

Two papers authored by members of the Machine Listening Lab have been published in a special issue of the IEEE/ACM Transactions on Audio, Speech and Language Processing on “Sound Scene and Event Analysis”:

Posted in Publications

Seminar: Mauricio Álvarez, Sequential latent force models for segmenting motor primitives

As part of the C4DM seminar series, the Machine Listening Lab and the Centre for Intelligent Sensing jointly present Mauricio Álvarez giving a talk about Sequential latent force models for segmenting motor primitives.

  • Date and Time: Wednesday, 24th May 2017, at 4:00pm
  • Place: Room GC 2.22, Graduate Centre, Queen Mary University of London, Mile End Road, London E1 4NS. (Directions)

Motor primitives are basic representations of human motion that, in a similar way to phonemes in a language, can be used to compose complex movements used for imitation learning in humanoid robotics. The first step when using motor primitives in imitation learning consists of defining a basic vocabulary of motor skills, according to a particular task that the humanoid robot is supposed to perform. Such vocabularies are usually learned from multivariate time course data. In this talk, I will describe two alternatives for segmentation of motor primitives from multivariate time course data that involve the use of latent force models. A latent force model encodes a dynamic motor primitive in the form of a kernel function that can be used as the covariance function of a Gaussian process. I will describe how latent force models can be used on their own, or in combination with hidden Markov models for segmenting motion templates.

Dr. Álvarez received a B.Eng. degree in Electronics Engineering with Honours from Universidad Nacional de Colombia in 2004, an M.Eng. in Electrical Engineering from Universidad Tecnológica de Pereira, Colombia, in 2006, and a Ph.D. in Computer Science from The University of Manchester, UK, in 2011. After finishing his Ph.D., he joined the Department of Electrical Engineering at Universidad Tecnológica de Pereira as a faculty member, a post he held until December 2016. In January 2017 he was appointed Lecturer in Machine Learning in the Department of Computer Science at the University of Sheffield, UK.

Dr. Álvarez is interested in machine learning in general, its interplay with mathematics and statistics, and its applications. In particular, his research interests include probabilistic models, kernel methods, and stochastic processes. He works on the development of new approaches and the application of Machine Learning in areas that include applied neuroscience, systems biology, and humanoid robotics.

Posted in Events

“Machine Learning Methods in Bioacoustics” – Call for Abstracts, IBAC 2017


We are pleased to announce a symposium on “Machine Learning Methods in Bioacoustics”, to be held as part of the 2017 International Bioacoustics Congress (Haridwar, India, 8-13 October 2017).

To submit an abstract, see the IBAC 2017 website. Please ALSO send an e-mail with the title of your contribution before 30th May (deadline extended from 30th April).

Please forward this announcement to anyone who may be interested. We aim for a broad representation, across the diverse fields of practitioners who have an interest in using/developing machine learning methods for animal sounds.

  • Symposium chair: Dan Stowell
  • Deadline for abstracts: 30th May (extended from 30th April)

Posted in Uncategorized

EUSIPCO Special Session on Bird Audio Signal Processing

Bird Audio Signal Processing

Special Session at EUSIPCO 2017

28 August – 2 September 2017, Kos Island, Greece

Theme and scope

This session will bring together research on an application domain of growing recent interest, and of high practical importance: signal processing and machine learning applied to the sounds of birds. Acoustic monitoring of species is an increasingly crucial tool in tracking population declines and migration movements affected by climate change. Detailed signal processing can also reveal scientific understanding of the evolutionary mechanisms operating on bird acoustic communication. What is needed is a set of tools for scalable and fully-automatic detection and analysis across a wide variety of bird sounds.

Workshops such as Listening in the Wild 2013/2015, the BirdCLEF Challenge 2014/2015/2016 and a special session at InterSpeech 2016 demonstrate the growing and active community in this area. Our session builds on this momentum, providing a focused European session. One component of this special session will be the outcomes of the Bird Audio Detection Challenge, which provided new datasets and saw bird detection algorithms developed by more than 20 teams from around the world. The session will also invite new research contributions in the broader emerging topic of bird audio signal processing.

For information about how to submit your paper please see the EUSIPCO website.


Important Dates

  • Paper submission: February 17, 2017
  • Decision notifications: May 25, 2017
  • Camera-ready papers: June 17, 2017


Organisers

  • Dr Dan Stowell, Queen Mary University of London, UK.
  • Prof. Hervé Glotin, Scaled Acoustic BioDiversity Dept, Univ. of Toulon & Inst. Univ. de France.
  • Prof. Yannis Stylianou, Computer Science Dept, University of Crete.
  • Dr Mike Wood, University of Salford, Greater Manchester, UK.
Posted in Bird, Events

Bird Audio Detection: tips on building robust detectors

Recently we launched the Bird Audio Detection Challenge, providing new datasets and a contest to improve the state of the art in general-purpose bird detection in audio. In our previous post we talked about the problem of generalisation, demonstrating that automated detectors often have problems generalising beyond the previously-seen data conditions. In this post I’d like to share some signal-processing and machine-learning tips for building systems that can adapt to new conditions.

Signal processing tips:

  • Filtering. It’s quite common to filter out frequency bands that you’re not interested in. For example, a lot of birdsong analysis starts by filtering out frequencies below about 500 Hz or 1000 Hz, because a lot of the lower frequency end is dominated by sound caused by humans (traffic, aeroplane noise, speech) and also by wind noise. Removing the energy from these is removing distractions from the signal. BUT! Don’t forget that we have a wide range of birds in mind here – if you filter out too many frequencies, you might lose some frequencies containing energy from birds.
  • Noise reduction. Background noise in these outdoor datasets varies quite a lot. In general it’s a good idea to remove it… but how? To be honest, most noise reduction systems will not be able to fully remove noise from such diverse scenarios, but some basic noise reduction will probably help the signal stand out from the background a bit. A common approach is to estimate the noise energy at each frequency – assuming that it’s constant through time – and then try to subtract it away. You can do this by spectral median filtering. Alternatively, if you are working with spectrograms, there are various noise-reduction methods designed for images, some of which are appropriate for spectrograms – check what your favourite toolbox actually does.
  • Normalising the data. For example:
    • Normalising the amplitude of each sound file is a very simple and very common step: e.g. so that each file’s audio data fits in +-1, or so that it has zero mean and unit variance.
    • Spectral whitening is a more involved idea – it makes sure that each frequency band in an audio file has a consistent amount of energy. (You can do this by applying LPC filtering, or via a spectrogram analysis.) The good thing about this is that it takes away any frequency emphasis that might be caused by the microphone or other aspects of the environment.
    • Cepstral mean subtraction – this is something used in speech processing to remove the reverberation effect in the recording.

    One thing to watch out for when normalising is what to do with a signal that is pure zeros or pure DC – a lot of normalisations could give divide-by-zero errors in this case. It is probably best just to skip those, leaving them at zero.

  • Choice of representation. Most people working in this area will work with a “default” representation of an audio signal as a spectrogram or as MFCCs. Are these optimal? In general people don’t seem to get a massive leap in performance by spending a lot of time tweaking the representation. However, here’s one recent example: Sandsten et al (2016) tested a standard Hann-windowed spectrogram against a multitaper spectrogram, and found a consistent moderate benefit in their classification results.
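Two of the steps above – spectral median filtering for noise reduction, and simple per-file amplitude normalisation – can be sketched in a few lines of numpy. This is an illustrative sketch under our own naming, not production code:

```python
import numpy as np


def normalise_amplitude(x):
    """Scale a 1-D signal so it fits in [-1, 1]; all-zero signals are
    left untouched to avoid a divide-by-zero."""
    peak = np.max(np.abs(x))
    return x if peak == 0 else x / peak


def median_noise_reduction(spec):
    """Spectral median filtering: estimate each frequency band's noise
    floor as its median over time (assuming the noise is roughly
    constant), subtract it, and clip at zero.
    `spec` is a (frequencies x time frames) magnitude spectrogram."""
    noise_floor = np.median(spec, axis=1, keepdims=True)
    return np.maximum(spec - noise_floor, 0.0)
```

The median (rather than the mean) makes the noise-floor estimate robust to brief loud events such as bird calls.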

Now here’s a question: in which order should you apply these steps? Does it matter?


Machine learning tips:

Not every detector is based on machine learning, but it’s a powerful way to go beyond hand-crafted decision rules and make use of existing data examples.

In machine learning research, there are lots of discussions about generalisability and robustness. A useful reference is this textbook: Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. “Covariate shift” is a term referring to a specific type of generalisation problem, very relevant here.
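A standard remedy for covariate shift, discussed at length in that textbook, is importance weighting: re-weight each training example by how likely it looks under the test conditions. A toy numpy sketch (the Gaussian densities and function names here are invented purely for illustration):

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian, used here as a stand-in distribution."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))


def importance_weights(x_train, mu_train, mu_test, sigma=1.0):
    """w(x) = p_test(x) / p_train(x): training points that resemble
    test-condition data get up-weighted."""
    return gaussian_pdf(x_train, mu_test, sigma) / gaussian_pdf(x_train, mu_train, sigma)


# Toy example: training features centred at 0, test features centred at 1.
x_train = np.array([-1.0, 0.0, 1.0, 2.0])
w = importance_weights(x_train, mu_train=0.0, mu_test=1.0)
weighted_mean = np.sum(w * x_train) / np.sum(w)
# the weighted estimate shifts towards the test distribution's centre
```

In practice the test-time density ratio must itself be estimated from data, which is where most of the difficulty lies.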

More generally here are some suggestions:

  • Multi-condition training / data augmentation. If you’re training a neural net or something else, it often helps to include data from various conditions in the training procedure. In the BAD Challenge we specify exactly which datasets you may use, and you may not include other types of dataset – BUT you can use data augmentation to create modified copies of the data in various ways. You could apply low-bitrate MP3 compression to the audio files, add noise, or superimpose files on top of each other (take care over what that implies for labelling – should your augmented data items have the same annotations as the files they’re based on?). The hope is that the trained system will then learn to focus on the aspects of the signal that really matter.
    There are programming tools to help you degrade your audio files! For python there’s muda and for matlab there’s the audio degradation toolbox.

    • Note that, again, it might make a difference if you do data augmentation before or after the preprocessing mentioned above. If you do data augmentation beforehand, then that’s a closer match to the idea that you’re simply generating extra data points to feed into the pipeline, which is fair enough – but is there much point adding noise, for example, if you’re then going to apply noise reduction? If you do it afterward (i.e. just before you apply your classifier) then this exposes the classifier to a wider range of “irrelevant modifications”.
  • Self-adaptation. We’ve already told you that the public data and the private testing data come from different recording scenarios. So, even after all the normalisation we probably expect there to be some differences (in the test set) versus the data you’ve been developing with. But your algorithm gets to see all the test data. What if it could automatically tweak its own settings to give itself the best chance of working well, even when it doesn’t know what the true answers are? This is studied in speech recognition, for example, where speakers’ voices might be different from those in the training data, so researchers discuss speaker adaptation techniques. There’s a large amount of speech recognition research, so maybe you’ll find some useful tips from speaker adaptation.
    One weird trick used in machine learning is as follows: first run your algorithm and get its predicted outputs; then pretend that those outputs are the actual groundtruth, and use them to tune some algorithm parameters (e.g. re-train the classifier); then re-run the algorithm and take its new decisions as the final output. This relies on the fact that in many cases, the algorithm’s first run – although imperfect – is getting results roughly along the right lines, and so is giving a “weak training signal” that is, one hopes, better than nothing.
  • Regularisation. Machine-learning methods often can learn anything they want about a dataset, which is good because it means they can discover unknown ways to make predictions, but it’s bad because it means they can rely too heavily on the patterns in their training data. A good machine learning algorithm is one that avoids overfitting. For many methods you can do this by regularisation, which is essentially adding a soft constraint on the learnt parameters so that they can’t simply fit as tightly as they like to the training data. (If you’re a Bayesian, you can also think of using a strong prior on the parameter values.)
  • Combining models. Imagine you have one detector that’s good at detecting high-pitched twittery birds, and one that’s good at detecting lower-pitched birds. You can combine their outputs simply by averaging them together – or perhaps by a weighted average, weighted according to how much you trust their outputs. This is the basic idea of model averaging, also known as late fusion, and it’s often used to gain a few percentage points of performance by people working on machine-learning challenges. The models could be explicitly designed for different purposes (as suggested above), or they could simply be multiple attempts to solve the problem. Alternatively, going beyond simple averaging, you could combine the outputs of the various detectors using yet another machine-learning algorithm.
    One good thing about combining models is it’s also an easy way to get outputs that are not just 1 or 0. Remember that for the challenge we ideally want systems that output a value BETWEEN 1 and 0 depending on their confidence. If you’re working with methods that can only give you a black-or-white yes/no answer, then averaging over more than one system can at least give you a bit more nuance in the output numbers.
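The self-adaptation trick described above – predict, pretend the predictions are groundtruth, re-fit, predict again – can be sketched with a deliberately simple nearest-centroid model (our own stand-in chosen for brevity; any re-trainable classifier would work the same way):

```python
import numpy as np


def nearest_centroid_predict(X, centroids):
    """Label each row of X with the index of its nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(d, axis=1)


def self_adapt(X_train, y_train, X_test, n_rounds=3):
    """Pseudo-labelling sketch: fit class centroids on labelled training
    data, then repeatedly re-estimate them from the test set's own
    predicted labels, letting the model drift towards test conditions."""
    centroids = np.stack([X_train[y_train == k].mean(axis=0) for k in (0, 1)])
    for _ in range(n_rounds):
        pseudo = nearest_centroid_predict(X_test, centroids)
        for k in (0, 1):
            # keep the old centroid if a class received no test points
            if np.any(pseudo == k):
                centroids[k] = X_test[pseudo == k].mean(axis=0)
    return nearest_centroid_predict(X_test, centroids)
```

The sketch only works when the first-pass predictions are roughly right; if they are badly wrong, the re-fitting step can reinforce the errors instead.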

But remember: if you’re developing a system to perform detection, the crucial thing to do is dig below the headline score. Don’t just add/remove processing steps and look at the numbers that pop out. Find some examples that your system gets wrong (false positives and false negatives) and see if you can identify any tendencies that tell you why it’s doing what it’s doing.

> More about the Bird Audio Detection challenge

Posted in Bird, Uncategorized

Bird Audio Detection: baseline tests – and the problem of generalisation

Recently we launched the Bird Audio Detection Challenge, providing new datasets and a contest to improve the state of the art in general-purpose bird detection in audio.

So what is the state of the art? For a thorough answer, you can read our survey paper recently presented at the IEEE MLSP workshop 2016. For a numerical estimate of quality, we can take the data we’ve published, apply a couple of off-the-shelf methods, and see what happens…

An important thing we want to encourage through this challenge is generalisation. This means we want a detector that works across a wide range of species, and across a wide range of environments – without having to be manually tweaked each time it’s used in a different context. That’s why we published two very different datasets for you to develop your ideas on. One dataset is miscellaneous field recordings (containing birds about 25% of the time, and various other background sounds) from all around the world, while one is crowdsourced mobile-phone recordings from our bird classification app Warblr (containing birds about 75% of the time, and often with the sound of people mixed in) from all around the UK. (In the final test we’ll also have recordings from a site of high scientific interest: the Chernobyl Exclusion Zone.)

So here’s a question: if we take an existing machine-learning method, and “train” it using one of those datasets, can it generalise well enough to make sense of the other dataset?

We’ve looked at this question using two audio classifiers which we have here:

  1. The first is the classic “MFCCs and GMMs” method, which has been used in too many research papers to mention. It’s often a standard not-great baseline against which other methods are compared, but in some cases it may itself be good enough for the job. We used our open-source Python implementation of this called smacpy.
  2. The second is a variant of the method that we found useful for species classification: spherical k-means feature learning, followed by a Random Forest classifier. It is described in more detail in our journal paper on large-scale classification of bird sounds. (Labelled as “skfl” in the plot below.)
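For readers unfamiliar with the MFCC+GMM recipe, the classification rule is: fit one generative model per class to frame-level features, then label a clip with whichever class gives its frames the higher total likelihood. A minimal numpy sketch, substituting a single diagonal Gaussian per class for a full mixture (smacpy itself fits proper GMMs to MFCC frames):

```python
import numpy as np


class GaussianClassifier:
    """Toy stand-in for the MFCC+GMM recipe: one diagonal Gaussian per
    class, clip labelled by the higher summed log-likelihood."""

    def fit(self, features_by_class):
        # features_by_class: {label: (n_frames, n_dims) feature array}
        self.params = {}
        for label, X in features_by_class.items():
            self.params[label] = (X.mean(axis=0), X.var(axis=0) + 1e-6)
        return self

    def _loglik(self, X, mean, var):
        # total log-likelihood of all frames under a diagonal Gaussian
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var)

    def predict(self, X):
        scores = {lbl: self._loglik(X, m, v) for lbl, (m, v) in self.params.items()}
        return max(scores, key=scores.get)
```

The inflexibility of such a model is exactly why it tends to plateau at mediocre accuracy, as the results below show.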

So how well did they perform? Here’s a graph:

Plot of performance of classifiers, discussed below


To understand the AUC value (Area Under the ROC Curve) on the y-axis: 100% is what we want, perfect detection, while 50% is the value you get for “pure ignorance” (e.g. flipping a coin to decide how to label each file).
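To make the AUC number concrete: it equals the probability that a randomly chosen positive file is scored higher than a randomly chosen negative one (ties counting half), which is why a constant or coin-flip scorer sits at 50%. A small sketch of that definition:

```python
import numpy as np


def auc(labels, scores):
    """Area under the ROC curve, computed directly as the probability
    that a random positive outranks a random negative (ties count half)."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Note that AUC depends only on the ranking of the scores, not on their absolute values, so it rewards confidence-graded outputs over hard yes/no decisions.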

The first thing you notice is that our “modern” classifier (in yellow and green) performs much better than the baseline MFCC+GMM classifier (orange and blue). OK, so far so good. In fact in the first column the modern classifier gets 100% performance – but that’s to be expected, since in the first column we test each system using the training data that it’s already seen.

Why doesn’t the MFCC+GMM classifier get 100% on the data it’s already seen? Essentially, it’s not powerful/flexible enough to create a map that fully captures all the yes/no decisions it has seen. So of course that’s showing a bit of a limitation. But when we look at the second column – testing the classifier using the same type of data as it was trained with, but using new unseen items – the “mediocre” performance continues at about the same level. The benefit of this inflexible classifier is that it has strongly avoided “overfitting” to any particular set of datapoints, even though it’s showing that it’s not great at the task.

The more powerful classifier does take a bit of a hit when we test it on unseen data – down from 100% to about 85%. I wouldn’t really have expected it to get 100% anyway, but you can see that if you’d only run the first test you might get a false impression…! Even at 85% it still outperforms the other classifier by a substantial margin.

This middle column is the kind of result you typically see quoted in research on classifiers. Train it on some data, test it on some similar-but-held-out data. But the third column is where it gets even more interesting!

In the third column, we test each classifier using data from the other dataset (warblr if we trained on ff1010; or ff1010 if we trained on warblr). Importantly, we’re still asking the same question (“Is there any bird or not?”), but the data represents different conditions under which we might ask that question. It certainly contains types of foreground and background sound that the classifier has never been exposed to during its training!

This is a test of generalisation, and you can see that the baseline MFCC+GMM classifier (“smacpy”) really falls at this hurdle. Its quality falls right down to 60% or lower (pretty close to the 50% of “pure ignorance”!). The modern skfl classifier takes a bit of a hit too, falling to just below 80%. Not a major catastrophe, but below the performance we want to see for a general-purpose detector out in the field.

The two datasets represent related but different types of data collection scenario, and it’s not very surprising that a standard algorithm doesn’t know that we want it to generalise to those other related scenarios, not having been told anything about those other scenarios. In fact, a more “powerful” algorithm may often exhibit this issue more clearly, because it has the freedom to make use of the input data in all kinds of ways that might not be advisable! This issue has recently had a lot of attention in the “deep learning” field, where it has been found (among other things) that many different machine learning systems can be easily fooled by examples that lie outside the domain for which they were trained.

So what can we do about this? There are various techniques that might be useful, and in designing this challenge with bird audio we don’t want to pin things down about what approaches might work well. Maybe it’s possible to normalise the data, removing the factors that make the sound scenarios different from one another. Maybe it’s best not to lean too heavily on machine learning. Maybe it’s best to learn from existing work on how to build more generalisable systems. Maybe it’s best to use a few different methods and combine the results. In a future post we will go into this in more detail, outlining approaches that some researchers have been exploring.

If you’re developing a system to perform detection, the crucial thing to do is dig below the headline score. Don’t just look at the numbers. Find some examples that your system gets wrong (false positives and false negatives) and see if you can identify any tendencies that tell you why it’s doing what it’s doing.

> More about the Bird Audio Detection challenge

Posted in Bird, Uncategorized

Announcing the Bird Audio Detection Challenge

On behalf of the organising committee and the IEEE AASP TC,* we invite you to consider the first Bird Audio Detection Challenge:

Detecting bird sounds in audio is an important basic task in various applications. The current generation of software tools require manual work from the user: to choose the algorithm, to set the settings, and to post-process the results. This is holding bioacoustics back in embracing its “big data” era: let’s make this better!

We look forward to participation from students, researchers and developers. We have prizes of £500 and €500 for the best submissions.

Please share this to anyone who may be interested. This is a cross-disciplinary challenge, and to solve it we need to reach people in many different subject areas.

* the AASP TC is the Audio & Acoustic Signal Processing Technical Committee of the IEEE Signal Processing Society

Posted in Bird, Uncategorized

Three Awards at ISMIR 2016!

Researchers at the Machine Listening Lab at QMUL won three of five awards at the 2016 International Society for Music Information Retrieval Conference (ISMIR 2016)!

Congratulations to everyone for their hard work!

Posted in Uncategorized

Singing transcription project started

A new collaborative project that will address the problem of automatic transcription of multiple singers has been launched by Queen Mary University of London and the Federal University of Rio Grande do Sul (UFRGS – Brazil). The £24k project, entitled “Automatic transcription of polyphonic music recordings”, is funded by the Royal Academy of Engineering through its Newton Research Collaboration Programme, and will last 12 months.

The project involves Dr Rodrigo Schramm (UFRGS) and Dr Emmanouil Benetos (QMUL), and its main goals are to:
1) Propose and develop new technologies to perform automatic transcription of audio recordings generated by multiple singers.
2) Establish a collaborative network between researchers from UFRGS/Brazil and C4DM/UK.
3) Introduce new technologies for music information retrieval which can assist music education in Brazil, especially in the context of e-learning.

Posted in Uncategorized