Show simple item record

dc.contributor.supervisor: Miranda, Eduardo
dc.contributor.author: Venkatesh, Satvik
dc.contributor.other: School of Society and Culture (en_US)
dc.date.accessioned: 2022-12-19T10:09:16Z
dc.date.available: 2022-12-19T10:09:16Z
dc.date.issued: 2022
dc.identifier: 10577651 (en_US)
dc.identifier.uri: http://hdl.handle.net/10026.1/20092
dc.description.abstract:

Audio segmentation divides an audio signal into homogeneous sections, such as music and speech. It is useful as a preprocessing step for indexing, storing, and modifying audio recordings, radio broadcasts, and TV programmes. Machine learning models for audio segmentation are generally trained on copyrighted material, which cannot be shared across research groups. Furthermore, annotating these datasets is a time-consuming and expensive task. In this thesis, we present a novel approach that artificially synthesises data resembling radio signals. We replicate the workflow of a radio DJ in mixing audio and investigate parameters such as fade curves and audio ducking. Using this approach, we obtained state-of-the-art performance for music-speech detection on in-house and public datasets. After demonstrating the efficacy of training-set synthesis, we investigated how audio ducking of background music affects the precision and recall of the machine learning algorithm. Interestingly, the minimum level of audio ducking preferred by the algorithm was similar to that preferred by human listeners. Furthermore, our proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative. This project also proposes a novel deep learning system called You Only Hear Once (YOHO), inspired by the YOLO algorithm widely adopted in computer vision. We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification. The relative improvement in F-measure of YOHO, compared to the state-of-the-art Convolutional Recurrent Neural Network, ranged from 1% to 6% across multiple datasets. Because YOHO predicts acoustic boundaries directly, inference and post-processing are 6 times faster than frame-based classification. Furthermore, we investigated domain generalisation methods such as transfer learning and adversarial training, and demonstrated that these methods helped our algorithm perform better in unseen domains.

In addition to audio segmentation, another objective of this project is to explore real-time radio remixing. This is a step towards building a customised radio and, consequently, integrating it with the listener's schedule. The system would remix music from the user's personal playlist and play snippets of diary reminders at appropriate transition points. The intelligent remixing is governed by the underlying audio segmentation and other deep learning methods. We also explore how individuals can communicate with intelligent mixing systems through non-technical language, and demonstrate that word embeddings help in understanding representations of semantic descriptors.
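The DJ-style mixing workflow described in the abstract (overlaying speech on background music with audio ducking and a fade curve) can be sketched in a few lines. This is a minimal illustrative re-implementation of the general idea, not the thesis code; the function name, parameters, and defaults are all assumptions.

```python
import numpy as np

def synthesise_radio_mix(speech, music, sr=22050, duck_db=-15.0, fade_s=0.5):
    """Overlay speech on background music, ducking the music by `duck_db`
    and fading the ducking in over `fade_s` seconds (hypothetical sketch)."""
    n = min(len(speech), len(music))
    speech, music = speech[:n], music[:n]
    duck_gain = 10.0 ** (duck_db / 20.0)        # convert dB attenuation to linear gain
    gain = np.full(n, duck_gain)                # music held at the ducked level
    fade = int(fade_s * sr)
    # linear fade curve from full level down to the ducked level
    gain[:fade] = np.linspace(1.0, duck_gain, fade)
    return speech + gain * music
```

Segment labels for training (speech active, music active) come for free, since the synthesiser knows exactly where each source starts and stops — which is the point of training-set synthesis over manual annotation.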

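YOHO's core idea, per the abstract, is to regress acoustic boundaries directly rather than classify every frame. One common way to realise this (the sketch below is a hedged reconstruction of the idea, not the thesis implementation) is to divide the output timeline into cells and, for each cell and class, predict a presence score plus relative start and stop positions:

```python
import numpy as np

def yoho_targets(events, duration, n_cells, classes=("music", "speech")):
    """Convert (label, onset, offset) annotations in seconds into YOHO-style
    regression targets of shape (n_cells, n_classes, 3):
    [presence, relative start, relative stop] per cell and class.
    Illustrative reconstruction; names and layout are assumptions."""
    cell = duration / n_cells
    y = np.zeros((n_cells, len(classes), 3))
    for label, onset, offset in events:
        k = classes.index(label)
        first = int(onset // cell)
        last = int(np.ceil(offset / cell)) - 1
        for i in range(first, min(last, n_cells - 1) + 1):
            c0, c1 = i * cell, (i + 1) * cell
            y[i, k, 0] = 1.0                           # class is present in this cell
            y[i, k, 1] = (max(onset, c0) - c0) / cell  # start, relative to the cell
            y[i, k, 2] = (min(offset, c1) - c0) / cell # stop, relative to the cell
    return y
```

Because each boundary is read directly off the regression output, post-processing reduces to rescaling a handful of cell-relative values, instead of smoothing and thresholding a dense per-frame classification sequence — consistent with the speed-up reported in the abstract.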
dc.language.iso: en
dc.publisher: University of Plymouth
dc.rights: Attribution 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/
dc.subject: audio segmentation (en_US)
dc.subject: sound event detection (en_US)
dc.subject: deep learning (en_US)
dc.subject: music-speech detection (en_US)
dc.subject: intelligent remixing (en_US)
dc.subject: radio (en_US)
dc.subject: audio classification (en_US)
dc.subject: music information retrieval (en_US)
dc.subject.classification: PhD (en_US)
dc.title: Deep Learning for Audio Segmentation and Intelligent Remixing (en_US)
dc.type: Thesis
plymouth.version: publishable (en_US)
dc.identifier.doi: http://dx.doi.org/10.24382/778
dc.rights.embargoperiod: No embargo (en_US)
dc.type.qualification: Doctorate (en_US)
rioxxterms.funder: Engineering and Physical Sciences Research Council (en_US)
rioxxterms.identifier.project: Radio Me: Real-time Radio Remixing for people with mild to moderate dementia who live alone, incorporating Agitation Reduction, and Reminders (en_US)
rioxxterms.version: NA
plymouth.orcid.id: http://orcid.org/0000-0001-5244-3020 (en_US)


Files in this item


This item appears in the following Collection(s)


Attribution 3.0 United States
Except where otherwise noted, this item's license is described as Attribution 3.0 United States

All items in PEARL are protected by copyright law.
Author manuscripts deposited to comply with open access mandates are made available in accordance with publisher policies. Please cite only the published version using the details provided on the item record or document. In the absence of an open licence (e.g. Creative Commons), permissions for further reuse of content should be sought from the publisher or author.