Machine Listening Lab symposium 2018

On 29th March 2018 we held a Machine Listening Lab symposium to gather together people across the university who are doing work in machine listening.



Katerina Kosta (Jukedeck)

The programme of talks included:

  • Emmanouil Benetos (QMUL):
    Machine listening for music and everyday sounds: the year ahead
  • Rob Lachlan (QMUL):
    Learning about bird song learning via inference from population-level variation
  • Katerina Kosta (invited guest speaker):
    Creating music at Jukedeck
  • Michael McLoughlin (QMUL):
    Sea to Farm: Bioacoustics in Animal Behaviour and Welfare

Plus 12 lightning talks from students and postdocs, and an update from Rodrigo Schramm, visiting from Brazil, on his work since spending time with us last year.


Rodrigo Schramm (UFRGS)

Thank you to everyone who took part!



(Funded by Dan Stowell EPSRC Research Fellowship EP/L020505/1)

Posted in Uncategorized

Many C4DM papers accepted for ICASSP 2018

C4DM researchers have had great success this year with papers accepted for ICASSP 2018, the IEEE International Conference on Acoustics, Speech and Signal Processing. Most of these papers are led by C4DM PhD students, on MIR and Machine Listening topics:

  • “A Deeper Look At Gaussian Mixture Model Based Anti-Spoofing Systems” by Bhusan Chettri and Bob L. Sturm
  • “Towards Complete Polyphonic Music Transcription: Integrating Multi-Pitch Detection and Rhythm Quantization” by Eita Nakamura, Emmanouil Benetos, Kazuyoshi Yoshii, and Simon Dixon
  • “Polyphonic Music Sequence Transduction With Meter-Constrained LSTM Networks” by Adrien Ycart and Emmanouil Benetos
  • “Feature Design Using Audio Decomposition for Intelligent Control of the Dynamic Range Compressor” by Di Sheng and György Fazekas
  • “Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction” by Daniel Stoller, Sebastian Ewert, Simon Dixon
  • “Similarity Measures For Vocal-Based Drum Sample Retrieval Using Deep Convolutional Auto-Encoders” by Adib Mehrabi, Keunwoo Choi, Simon Dixon, Mark Sandler
  • “Shift-Invariant Kernel Additive Modelling for Audio Source Separation” by Delia Fano Yela, Sebastian Ewert, Ken O’Hanlon, Mark B. Sandler
  • “Improved detection of semi-percussive onsets in audio using temporal reassignment” by K. O’Hanlon and M.B. Sandler

See you all in Calgary!

Posted in Publications

Machine Listening Lab 2017: The year in review

2017 has been a fascinating year for the Machine Listening Lab. Here are the headlines!

Grant success and projects:

  • Rob Lachlan, along with David Clayton and Dan Stowell, was awarded a BBSRC grant for a £659,000 project to study “Machine Learning for Bird Song Learning” (BB/R008736/1).
  • Emmanouil Benetos was awarded an EPSRC first grant for a £122,299 project to study “Integrating sound and context recognition for acoustic scene analysis” (EP/R01891X/1).
  • Bob Sturm (with co-investigator Oded Ben-Tal of Kingston University) was awarded AHRC follow-on funding (AH/R004706/1) for a £70,000 project titled, “Engaging three user communities with applications and outcomes of computational music creativity”.
  • Emmanouil Benetos is co-investigator (Co-I) for the European Training Network “New Frontiers in Music Information Processing” (MIP-Frontiers), with Simon Dixon as PI (and network coordinator) and Mark Sandler as Co-I. The budget is €819,863 for QMUL, €3,937,088 total. Duration: April 2018 – March 2022.
  • The LiveQuest project began – a collaboration between QMUL and 4 other institutions in the UK and China, to develop IoT sensing devices to aid with chicken welfare monitoring. The project is led by QMUL’s Yue Gao; on the machine listening side, Becky Stewart and Alan McElligott are co-investigators.
  • The Machine Listening Lab received an NVIDIA GPU grant for a Titan Xp GPU (RRP £1,149).


  • HORSE 2017, the second workshop on “horses” in applied machine learning, was organised and led by Bob Sturm at QMUL: a one-day event with a range of international speakers in machine learning.
  • QMUL Festival of Ideas (June 29 2017) – Dan Stowell gave a public talk on “Can we decode the Dawn Chorus?”, and the Machine Listening Lab gave a three-part concert (Sturm’s folk-rnn, Ewert’s one-handed Gould, Stowell’s thrush nightingale transcription). It was attended by staff from many different departments around the college and was named as one of the highlights of the festival.
  • MLLab members led sessions at international research conferences:
    Bob Sturm co-organised the ML4Audio workshop @ NIPS 2017 (USA).
    Dan Stowell organised and chaired special sessions at EUSIPCO (Greece), IBAC (India), and chaired a session at DCASE (Germany).
    Emmanouil Benetos was Programme Co-chair (with Emmanuel Vincent) for the DCASE 2017 Workshop. Also Programme Committee member (i.e. meta-reviewer) for ISMIR 2017.
  • Emmanouil Benetos was an invited keynote speaker at the Digital Musicology Symposium, London, September 2017.
  • 30 teams from around the world took part in the Bird Audio Detection Challenge, led by Dan Stowell. Many of the best performing methods were presented at EUSIPCO 2017.
  • Bob Sturm organised generative music concerts featuring many algorithms and composers, at venues around London: Partnerships in May, and Music in the Age of Artificial Creation in November.


New people, and farewells

This year we said farewell to Sebastian Ewert – a founding member and co-leader of the MLLab, he has now moved on to Spotify where he will be a Senior Research Scientist. We also said farewell to Alan McElligott, an affiliated academic of the MLLab, who has moved on to Roehampton University where he is a Reader in Animal Behaviour.

Michael McLoughlin joined us as a postdoc on the LiveQuest farm chicken welfare technology project mentioned above. Welcome!

Other news from the MLLab:

  • Two chapters in a new Springer textbook on Sound Scene Analysis: one written by Dan Stowell, and one lead-authored by Emmanouil Benetos (with Stowell and Plumbley)
  • Dan Stowell appeared live on the flagship morning shows of BBC Radio 4 and the BBC World Service (March 20th), talking about birdsong and machine listening.
  • Bob Sturm appeared on French national television (Canal+, Nov 18th) discussing whether artificial intelligence would take over from music artists.
  • Dan Stowell and Emmanouil Benetos were invited visitors to Beijing University of Posts and Telecommunications (BUPT) under its “International Academic Talents” programme.

Visiting researchers:
Rodrigo Schramm (UFRGS, Brazil, Aug 2016 – Aug 2017)
Mina Mounir (KU Leuven, Belgium, May 2017)
Hanna Pamula (AGH, Poland, June 2017 – August 2017)
Andrew McLeod (University of Edinburgh, Aug 2017)
Qing Zhou (Xi’an Jiaotong University, China, Oct 2017 – March 2018)

Journal articles:

D. Stowell, E. Benetos, and L. F. Gill, “On-bird Sound Recordings: Automatic Acoustic Recognition of Activities and Contexts“, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1193-1206, Jun. 2017.

E. Benetos, G. Lafay, M. Lagrange and M. D. Plumbley, “Polyphonic Sound Event Tracking using Linear Dynamical Systems“, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1266-1277, Jun. 2017.

S Wang, S Ewert, S Dixon, “Identifying Missing and Extra Notes in Piano Recordings Using Score-Informed Dictionary Learning“, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol 25, no. 6, pp. 1877-1889, Jun. 2017.

F Spielmann, A Helmlinger, J Simonnot, T Fillon, G Pellerin, BL Sturm, “Zoom arrière : L’ethnomusicologie à l’ère du Big Data”, Cahiers d’ethnomusicologie

BL Sturm, O Ben-Tal, “Bringing the models back to music practice: The evaluation of deep learning approaches to music transcription modelling and generation”, Journal of Creative Music Systems

S Abdallah, E Benetos, N Gold, S Hargreaves, T Weyde, D Wolff, “The digital music lab: A big data infrastructure for digital musicology“, Journal on Computing and Cultural Heritage (JOCCH) 10 (1), 2

Posted in Uncategorized

MLLab contributions to new Springer book on Sound Scene Analysis

MLLab members contributed two chapters to an upcoming book published by Springer on “Computational Analysis of Sound Scenes and Events“. The book, which is edited by Tuomas Virtanen, Mark D. Plumbley and Dan Ellis, will be published on 20 October 2017. The two book chapters contributed by MLLab members are:

  • E. Benetos, D. Stowell, and M. D. Plumbley, “Approaches to complex sound scene analysis“, in Computational Analysis of Sound Scenes and Events, T. Virtanen, M. D. Plumbley, and D. P. W. Ellis (eds.), Springer, Oct. 2017.
  • D. Stowell, “Computational Bioacoustic Scene Analysis“, in Computational Analysis of Sound Scenes and Events, T. Virtanen, M. D. Plumbley, and D. P. W. Ellis (eds.), Springer, Oct. 2017.
Posted in Publications

Best paper award at 2017 AES Conference on Semantic Audio

As part of the 2017 AES Conference on Semantic Audio, the paper “Automatic transcription of a cappella recordings from multiple singers” by Rodrigo Schramm and Emmanouil Benetos received the conference’s Best Paper Award. A postprint of the paper can be found here.

Posted in Publications

MLLab research in the IEEE/ACM TASLP special issue on Sound Scene and Event Analysis

Two papers authored by members of the Machine Listening Lab have been published in a special issue of the IEEE/ACM Transactions on Audio, Speech, and Language Processing on “Sound Scene and Event Analysis”:

Posted in Publications

Seminar: Mauricio Álvarez, Sequential latent force models for segmenting motor primitives

As part of the C4DM seminar series, the Machine Listening Lab and the Centre for Intelligent Sensing jointly present Mauricio Álvarez giving a talk about Sequential latent force models for segmenting motor primitives.

  • Date and Time: Wednesday, 24th May 2017, at 4:00pm
  • Place: Room GC 2.22, Graduate Centre, Queen Mary University of London, Mile End Road, London E1 4NS. (Directions)

Motor primitives are basic representations of human motion that, in a similar way to phonemes in a language, can be used to compose complex movements used for imitation learning in humanoid robotics. The first step when using motor primitives in imitation learning consists of defining a basic vocabulary of motor skills, according to a particular task that the humanoid robot is supposed to perform. Such vocabularies are usually learned from multivariate time course data. In this talk, I will describe two alternatives for segmentation of motor primitives from multivariate time course data that involve the use of latent force models. A latent force model encodes a dynamic motor primitive in the form of a kernel function that can be used as the covariance function of a Gaussian process. I will describe how latent force models can be used on their own, or in combination with hidden Markov models for segmenting motion templates.

Dr. Álvarez received a degree in Electronics Engineering (B. Eng.) with Honours from Universidad Nacional de Colombia in 2004, a master’s degree in Electrical Engineering (M. Eng.) from Universidad Tecnológica de Pereira, Colombia, in 2006, and a Ph.D. degree in Computer Science from The University of Manchester, UK, in 2011. After finishing his Ph.D., Dr. Álvarez joined the Department of Electrical Engineering at Universidad Tecnológica de Pereira, Colombia, where he was a faculty member until December 2016. Since January 2017, he has been a Lecturer in Machine Learning in the Department of Computer Science at the University of Sheffield, UK.

Dr. Álvarez is interested in machine learning in general, its interplay with mathematics and statistics, and its applications. In particular, his research interests include probabilistic models, kernel methods, and stochastic processes. He works on the development of new approaches and the application of Machine Learning in areas that include applied neuroscience, systems biology, and humanoid robotics.

Posted in Events

“Machine Learning Methods in Bioacoustics” – Call for Abstracts, IBAC 2017


We are pleased to announce a symposium on “Machine Learning Methods in Bioacoustics”, to be held as part of the 2017 International Bioacoustics Congress (Haridwar, India, 8-13 October 2017).

To submit an abstract, see: – Please ALSO send an e-mail with the title of your contribution to before 30th May (deadline extended from 30th April).

Please forward this announcement to anyone who may be interested. We aim for a broad representation, across the diverse fields of practitioners who have an interest in using/developing machine learning methods for animal sounds.

  • Symposium chair: Dan Stowell
  • Deadline for abstracts: 30th May (extended from 30th April)
Posted in Uncategorized

EUSIPCO Special Session on Bird Audio Signal Processing

Bird Audio Signal Processing

Special Session at EUSIPCO 2017

28 August – 2 September 2017, Kos Island, Greece

Theme and scope

This session will bring together research on an application domain of growing recent interest, and of high practical importance: signal processing and machine learning applied to the sounds of birds. Acoustic monitoring of species is an increasingly crucial tool in tracking population declines and migration movements affected by climate change. Detailed signal processing can also reveal scientific understanding of the evolutionary mechanisms operating on bird acoustic communication. What is needed is a set of tools for scalable and fully-automatic detection and analysis across a wide variety of bird sounds.

Workshops such as Listening in the Wild 2013/2015, the BirdClef Challenge 2014/2015/2016 and a special session at InterSpeech 2016 demonstrate the growing and active community in this area. Our session builds on this momentum, providing a focused European session. One component of this special session will be the outcomes of the Bird Audio Detection Challenge, which provided new datasets and saw bird detection algorithms developed by more than 20 teams from around the world. The session will also invite new research contributions in the broader emerging topic of bird audio signal processing.

For information about how to submit your paper please see the EUSIPCO website.


Important Dates

  • Paper submission: February 17, 2017
  • Decision notifications: May 25, 2017
  • Camera-ready papers: June 17, 2017


  • Dr Dan Stowell, Queen Mary University of London, London, UK.
  • Prof. Hervé Glotin, Scaled Acoustic BioDiversity Dept, Univ. of Toulon & Inst. Univ. de France.
  • Prof. Yannis Stylianou, Computer Science Dept, University of Crete.
  • Dr Mike Wood, University of Salford, Greater Manchester, UK.
Posted in Bird, Events

Bird Audio Detection: tips on building robust detectors

Recently we launched the Bird Audio Detection Challenge, providing new datasets and a contest to improve the state of the art in general-purpose bird detection in audio. In our previous post we talked about the problem of generalisation, demonstrating that automated detectors often have problems generalising beyond the previously-seen data conditions. In this post I’d like to share some signal-processing and machine-learning tips for building systems that can adapt to new conditions.

Signal processing tips:

  • Filtering. It’s quite common to filter out frequency bands that you’re not interested in. For example, a lot of birdsong analysis starts by filtering out frequencies below about 500 Hz or 1000 Hz, because the lower frequency range is often dominated by human-caused sound (traffic, aeroplane noise, speech) and by wind noise. Removing this energy removes distractions from the signal. BUT! Don’t forget that we have a wide range of birds in mind here – if you filter out too many frequencies, you might lose some frequencies containing energy from birds.
  • Noise reduction. Background noise in these outdoor datasets varies quite a lot. In general it’s a good idea to remove it… but how? To be honest, most noise reduction systems will not fully remove noise from such diverse scenarios, but some basic noise reduction will probably help the signal stand out from the background. A common approach is to estimate the noise energy at each frequency – assuming that it’s constant through time – and then subtract it away. You can do this by spectral median filtering. Alternatively, if you are working with spectrograms, there are various noise-reduction methods designed for images, some of which are appropriate for spectrograms – check in your favourite toolbox what they actually do.
  • Normalising the data. For example:
    • Normalising the amplitude of each sound file is a very simple and very common step: e.g. so that each file’s audio data fits in +-1, or so that it has zero mean and unit variance.
    • Spectral whitening is a more involved idea – it makes sure that each frequency band in an audio file has a consistent amount of energy. (You can do this by applying LPC filtering, or via a spectrogram analysis.) The good thing about this is that it takes away any frequency emphasis that might be caused by the microphone or other aspects of the environment.
    • Cepstral mean subtraction – a technique used in speech processing to reduce channel and reverberation effects in a recording.

    One thing to watch out for when normalising is what to do with a signal that is pure zeros or pure DC – many normalisations would give divide-by-zero errors in this case. It’s probably best to skip those files, leaving them at zero.

  • Choice of representation. Most people working in this area use a “default” representation of the audio signal: a spectrogram or MFCCs. Are these optimal? In general, people don’t seem to get a massive leap in performance by spending a lot of time tweaking the representation. However, here’s one recent example: Sandsten et al. (2016) tested a standard Hann-windowed spectrogram against a multitaper spectrogram, and found a consistent moderate benefit in their classification results.
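To make the signal-processing steps above concrete, here’s a minimal NumPy sketch of one possible preprocessing pipeline – a crude high-pass filter, spectral median noise subtraction, then normalisation. The cutoff, frame size and hop are illustrative choices, not tuned recommendations:

```python
import numpy as np

def preprocess(audio, sr, cutoff_hz=500.0):
    """Toy pipeline: high-pass filter, spectral median noise
    subtraction, then normalise. Returns a magnitude spectrogram."""
    # 1. Crude FFT-domain high-pass: zero out bins below the cutoff.
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    spectrum[freqs < cutoff_hz] = 0.0
    audio = np.fft.irfft(spectrum, n=len(audio))

    # 2. Spectral median filtering: frame the signal, estimate the
    # noise magnitude per frequency as the median over time, subtract.
    frame, hop = 1024, 512
    n_frames = 1 + (len(audio) - frame) // hop
    window = np.hanning(frame)
    stft = np.array([np.fft.rfft(window * audio[i * hop:i * hop + frame])
                     for i in range(n_frames)])
    mags = np.abs(stft)
    noise = np.median(mags, axis=0)        # per-frequency noise estimate
    mags = np.maximum(mags - noise, 0.0)   # subtract, floor at zero

    # 3. Normalise, guarding against all-zero signals.
    peak = np.max(mags)
    return mags / peak if peak > 0 else mags
```

Note the guard on the final division: an all-zero input passes through untouched rather than triggering a divide-by-zero, as discussed under normalisation above.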

Now here’s a question: in which order should you apply these steps? Does it matter?


Machine learning tips:

Not every detector is based on machine learning, but it’s a powerful way to go beyond hand-crafted decision rules and make use of existing data examples.

In machine learning research, there are lots of discussions about generalisability and robustness. A useful reference is this textbook: Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. “Covariate shift” is a term referring to a specific type of generalisation problem, very relevant here.
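As a toy illustration of one covariate-shift recipe (importance weighting – this sketch is mine, not taken from the textbook): estimate how much more likely each training example would be under the test conditions, and weight the training loss accordingly. Here with an invented one-dimensional feature and simple histogram density estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D feature (e.g. mean spectral centroid per clip) for
# training clips and unlabelled test clips, from shifted distributions.
train_x = rng.normal(0.0, 1.0, 2000)
test_x = rng.normal(0.5, 1.0, 2000)

# Estimate the density ratio p_test(x) / p_train(x) with shared bins.
bins = np.linspace(-4, 5, 30)
p_train, _ = np.histogram(train_x, bins=bins, density=True)
p_test, _ = np.histogram(test_x, bins=bins, density=True)
ratio = p_test / np.maximum(p_train, 1e-12)

# Importance weight for each training example: upweight clips that
# look like the test conditions, downweight those that do not.
idx = np.clip(np.digitize(train_x, bins) - 1, 0, len(ratio) - 1)
weights = ratio[idx]
```

A weighted classifier trained with these per-example weights would then pay more attention to the training clips that resemble the deployment conditions.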

More generally here are some suggestions:

  • Multi-condition training / data augmentation. If you’re training a neural net or something else, it often helps to include data from various conditions in the training procedure. In the BAD Challenge we specify exactly which datasets you may use, and you may not include other types of dataset – BUT you can use data augmentation to create modified copies of the data in various ways. You could apply low-bitrate MP3 compression to the audio files, add noise, or superimpose files on top of each other (take care over what that implies for labelling – should your augmented data items have the same annotations as the files they’re based on?). The hope is that the model will then learn to focus on the aspects of the signal that really matter.
    There are programming tools to help you degrade your audio files! For python there’s muda and for matlab there’s the audio degradation toolbox.

    • Note that, again, it might make a difference if you do data augmentation before or after the preprocessing mentioned above. If you do data augmentation beforehand, then that’s a closer match to the idea that you’re simply generating extra data points to feed into the pipeline, which is fair enough – but is there much point adding noise, for example, if you’re then going to apply noise reduction? If you do it afterward (i.e. just before you apply your classifier) then this exposes the classifier to a wider range of “irrelevant modifications”.
  • Self-adaptation. We’ve already told you that the public data and the private testing data come from different recording scenarios. So, even after all the normalisation we probably expect there to be some differences (in the test set) versus the data you’ve been developing with. But your algorithm gets to see all the test data. What if it could automatically tweak its own settings to give itself the best chance of working well, even when it doesn’t know what the true answers are? This is studied in speech recognition, for example, where speakers’ voices might be different from those in the training data, so researchers discuss speaker adaptation techniques. There’s a large amount of speech recognition research, so maybe you’ll find some useful tips from speaker adaptation.
    One weird trick used in machine learning is as follows: first run your algorithm and get its predicted outputs; then pretend that those outputs are the actual groundtruth, and use them to tune some algorithm parameters (e.g. re-train the classifier); then re-run the algorithm and take its new decisions as the final output. This relies on the fact that in many cases, the algorithm’s first run – although imperfect – is getting results roughly along the right lines, and so is giving a “weak training signal” that is, one hopes, better than nothing.
  • Regularisation. Machine-learning methods can often learn anything they want about a dataset, which is good because it means they can discover unknown ways to make predictions, but bad because it means they can rely too heavily on the patterns in their training data. A good machine learning algorithm is one that avoids overfitting. For many methods you can achieve this by regularisation, which essentially adds a soft constraint on the learnt parameters so that they can’t fit as tightly as they like to the training data. (If you’re a Bayesian, you can also think of this as using a strong prior on the parameter values.)
  • Combining models. Imagine you have one detector that’s good at detecting high-pitched twittery birds, and one that’s good at detecting lower-pitched birds. You can combine their outputs simply by averaging them together – or perhaps by a weighted average, weighted according to how much you trust each one. This is the basic idea of model averaging, also known as late fusion, and it’s often used by people working on machine-learning challenges to gain a few percentage points of performance. The models could be explicitly designed for different purposes (as suggested above), or they could simply be multiple attempts to solve the problem. Going beyond simple averaging, you could alternatively combine the outputs of the various detectors using yet another machine-learning algorithm.
    One good thing about combining models is that it’s also an easy way to get outputs that are not just 1 or 0. Remember that for the challenge we ideally want systems that output a value BETWEEN 0 and 1 depending on their confidence. If you’re working with methods that can only give a black-or-white yes/no answer, then averaging over more than one system can at least give you a bit more nuance in the output numbers.
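A minimal sketch of late fusion, with made-up scores for two hypothetical detectors:

```python
import numpy as np

# Hypothetical per-file bird-presence scores in [0, 1] from two
# detectors: one tuned to high-pitched calls, one to low-pitched calls.
scores_high = np.array([0.9, 0.2, 0.6, 0.1])
scores_low = np.array([0.7, 0.1, 0.2, 0.8])

# Simple late fusion: unweighted average of the two outputs.
fused = (scores_high + scores_low) / 2

# Weighted fusion: trust the high-pitch detector more (weights sum to 1).
w = 0.7
fused_weighted = w * scores_high + (1 - w) * scores_low

# Even averaging hard yes/no decisions yields graded confidences:
hard_a = np.array([1, 0, 1, 0])
hard_b = np.array([1, 0, 0, 1])
graded = (hard_a + hard_b) / 2  # values in {0, 0.5, 1}
```

The last two lines show the point made above: two detectors that each output only 0 or 1 still give you three levels of confidence once averaged.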

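The self-adaptation trick described a couple of bullets back can be sketched with a toy threshold “classifier”: run it once, treat its decisions as pseudo-labels, re-fit the threshold, then run again. The score distributions here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical raw outputs of a pretrained model on unlabelled test
# clips (higher = more bird-like), from two overlapping clusters.
test_scores = np.concatenate([rng.normal(0.3, 0.1, 200),   # mostly non-bird
                              rng.normal(0.8, 0.1, 200)])  # mostly bird

# Pass 1: threshold at a generic default to get provisional labels.
pseudo_labels = test_scores >= 0.5

# "Re-train" on the pseudo-labels: place the threshold midway between
# the two class means implied by the provisional decisions.
mid = (test_scores[pseudo_labels].mean()
       + test_scores[~pseudo_labels].mean()) / 2

# Pass 2: re-run with the adapted threshold for the final decisions.
final = test_scores >= mid
```

This only works because the first pass is roughly right – the pseudo-labels act as the “weak training signal” mentioned above, and a badly wrong first pass would reinforce its own mistakes.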
But remember: if you’re developing a system to perform detection, the crucial thing to do is dig below the headline score. Don’t just add/remove processing steps and look at the numbers that pop out. Find some examples that your system gets wrong (false positives and false negatives) and see if you can identify any tendencies that tell you why it’s doing what it’s doing.
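As a sketch of that kind of error analysis (file names, scores and labels invented): pull out the false positives and false negatives, and rank the false positives by confidence so you can listen to the worst offenders first:

```python
import numpy as np

# Hypothetical per-file detector scores and binary ground truth.
files = np.array(["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"])
scores = np.array([0.9, 0.8, 0.3, 0.6, 0.1])
truth = np.array([1, 0, 1, 0, 0])

preds = scores >= 0.5

# False positives: detector says bird, annotation says none; and vice versa.
fp_mask = preds & (truth == 0)
false_pos = files[fp_mask]
false_neg = files[~preds & (truth == 1)]

# Rank false positives by score, most confident mistakes first.
worst_first = files[fp_mask][np.argsort(-scores[fp_mask])]
```

Listening to the clips in `worst_first` is usually far more informative than staring at the aggregate score.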

> More about the Bird Audio Detection challenge

Posted in Bird, Uncategorized