Authors: Jens Madsen
Abstract: With the ever-growing popularity and availability of digital music through streaming services and digital download, making sense of the millions of songs, is ever more pertinent. However the traditional approach of creating music systems has treated songs like items in a store, like books and movies. However music is special, having origins in a number of evolutionary adaptations.
The fundamental needs and goals of a users use of music, was investigated to create the next generation of music systems. People listen to music to regulate their mood and emotions was found to be the most important fundamental reason.
(Mis)matching peoples mood with the emotions expressed in music was found to be an essential underlying mechanism, people use to regulate their emotions.
This formed the basis and overall goal of the thesis, to investigate how to create a predictive model of emotions expressed in music. To use in the next generation of music systems.
The thesis was divided into three main topics involved in creating a predictive model 1) Elicitation of emotion, 2) Audio representation and 3) Modelling framework, associating the emotion and audio representation, allowing to predict the emotions expressed in music.
The traditional approach of quantifying musical stimuli on the valence and arousal representation of emotions using continuous or likert scales was questioned. An outline of a number of bias and the so-called confidence effect when using bipolar scales led to the use of relative scales in the form of pairwise comparisons. One issue with pairwise comparisons is the scaling, this was solved using an active learning approach through a Gaussian Process model.
Traditional audio representation disregards all temporal information in audio features used for modelling the emotions expressed in music. Therefore a probabilistic feature representation framework was introduced enabling both temporal and non-temporal aspects to be coded in discrete and continuous features. Generative models are estimated for each feature time-series and used in a discriminative setting using the Probability Product Kernel (PPK) allowing the use of this approach in any kernel machine.
To model the pairwise comparisons directly, a Generalized Linear Model, a kernel extension and a Gaussian Process model were used. These models can predict the ranking of songs on the valence and arousal dimensions directly. Furthermore use of the PPK allowed to find optimal combinations of both feature and feature representation using Multiple Kernel Learning.