Show simple item record

dc.contributor.author: Tveita, Sebastian Rojas
dc.date.accessioned: 2019-09-18T06:35:33Z
dc.date.available: 2019-09-18T06:35:33Z
dc.date.issued: 2019-06-29
dc.date.submitted: 2019-06-28T22:00:08Z
dc.identifier.uri: https://hdl.handle.net/1956/20860
dc.description.abstract: With machine learning rising to prominence over the last decades, many companies are researching how it can be applied in their products or production, and some have done so with considerable success. This thesis proposes a solution for integrating speaker recognition into a video-editing system. The proposed solution is a proof-of-concept pipeline hooked into a web-based video-editing system made by a company called Vizrt [vrt]. The pipeline takes a video and performs speaker diarization and classification on its audio track. To achieve this, two types of models are applied: Gaussian Mixture Models to create a Universal Background Model, and models applying the i-vector approach for use in the clustering. From the results of the machine learning algorithms, the pipeline produces timecodes that are sent to the video-editing system; these timecodes show where the different speakers are talking. This information is presented to the user in the system UI, where the user has the option of correcting the results of the diarization and classification. The pipeline also uses the results of the algorithms to provide further functionality: with the generated timecodes, it can extract training data partitioned by speaker. This training data is saved and can later be used to generate new models for the different speakers. These models can be used in later runs through the pipeline to recognize known speakers, and can be improved by gathering more training data for those speakers. The thesis shows how machine learning can be applied in a pipeline to partition an audio track without any prior trained model. This information could save time in a video-editing process, or in the process of creating training data. The pipeline also has the potential to be expanded with further functionality, which would require it to be further integrated into the video-editing system. (en_US)
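The abstract's UBM-plus-clustering approach can be illustrated with a minimal, self-contained sketch. This is not the thesis's pipeline: it uses synthetic feature vectors in place of real audio features (such as MFCCs), a scikit-learn `GaussianMixture` as the UBM, and the UBM's average posterior occupancies as a crude per-segment embedding standing in for i-vectors; the frame rate and segment length are arbitrary assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Hypothetical stand-in for acoustic features: one 20-dim row per frame.
# Two synthetic "speakers" with clearly different feature distributions.
spk_a = rng.normal(0.0, 1.0, size=(200, 20))
spk_b = rng.normal(3.0, 1.0, size=(200, 20))
frames = np.vstack([spk_a, spk_b])

# Universal Background Model: a GMM trained on pooled speech data.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(frames)

# Split the "audio" into fixed-length segments and summarize each one
# by the UBM's average posterior occupancy (a crude i-vector stand-in).
seg_len = 50
segments = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
embeddings = np.array([ubm.predict_proba(seg).mean(axis=0) for seg in segments])

# Cluster segment embeddings into speakers, then emit timecodes,
# assuming (arbitrarily) 100 feature frames per second.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
fps = 100
for i, lab in enumerate(labels):
    start, end = i * seg_len / fps, (i + 1) * seg_len / fps
    print(f"speaker_{lab}: {start:.2f}s - {end:.2f}s")
```

In the thesis's setting, the emitted timecodes would be sent to the editing system's UI for user correction, and the per-speaker segments would be saved as training data for later speaker models.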
dc.language.iso: eng
dc.publisher: The University of Bergen (en_US)
dc.rights: Copyright the Author. All rights reserved
dc.title: Speaker recognition implemented as part of a video-editing system
dc.type: Master thesis
dc.date.updated: 2019-06-28T22:00:08Z
dc.rights.holder: Copyright the Author. All rights reserved (en_US)
dc.description.degree: Master's thesis in informatics (en_US)
dc.description.localcode: INF399
dc.description.localcode: MAMN-PROG
dc.description.localcode: MAMN-INF
dc.subject.nus: 754199
fs.subjectcode: INF399
fs.unitcode: 12-12-0

