Speaker recognition implemented as part of a video-editing system
Master's thesis
Permanent link
https://hdl.handle.net/1956/20860
Publication date
2019-06-29
Abstract
With machine learning rising to prominence over the last decades, many companies are researching how it can be applied to their products or production, some with considerable success. This thesis proposes a solution for integrating speaker recognition into a video-editing system. The proposed solution is a proof-of-concept pipeline hooked into a web-based video-editing system made by the company Vizrt. The pipeline takes a video and performs speaker diarization and classification on its audio track. To achieve this, two types of models are applied: Gaussian Mixture Models, used to create a Universal Background Model, and models based on the i-vector approach, used for clustering.

From the results of the machine learning algorithms, the pipeline produces timecodes indicating where different speakers are talking and sends them to the video-editing system. This information is presented to the user in the system UI, where the user can correct the results of the diarization and classification. The pipeline also reuses these results to provide further functionality: using the generated timecodes, it can extract training data partitioned by speaker. This training data is saved and can later be used to generate new models for the individual speakers. These models can then be applied in later runs through the pipeline to recognize known speakers, and they can be improved as more training data is gathered for those speakers.

The thesis shows how machine learning can be applied in a pipeline to partition an audio track without any previously trained model. This information could save time in a video-editing process, or in the process of creating training data. The pipeline also has the potential to be expanded with further functionality.
This would require the pipeline to be further integrated into the video-editing system.
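To illustrate the core idea of the pipeline — cluster per-frame audio features by speaker and convert the frame labels into timecodes — the following is a minimal, self-contained sketch. It is not the thesis implementation: synthetic 2-D features stand in for real audio embeddings, and a naive k-means replaces the GMM-UBM/i-vector models; the function names (`kmeans`, `labels_to_timecodes`) and the 0.5-second frame length are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Naive Lloyd's k-means; stands in for the i-vector clustering step."""
    # Naive deterministic init: first and last frames as initial centers
    # (an assumption for this sketch, not a robust initialization).
    centers = X[[0, -1]].astype(float)
    for _ in range(iters):
        # Assign each frame to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def labels_to_timecodes(labels, frame_sec=0.5):
    """Merge consecutive frames with the same speaker label into
    (start_sec, end_sec, speaker) segments — the 'timecodes' sent
    to the video-editing system in the thesis pipeline."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start * frame_sec, i * frame_sec, int(labels[start])))
            start = i
    return segments

# Synthetic "embeddings": 20 frames from speaker A, then 20 from speaker B.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(3.0, 0.1, (20, 2))])

segments = labels_to_timecodes(kmeans(X, 2))
print(segments)  # two segments, one per speaker
```

Run on these well-separated synthetic features, the sketch yields two segments covering 0–10 s and 10–20 s, one per speaker cluster. The real pipeline would feed MFCC-style features extracted from the video's audio into the UBM/i-vector models instead, but the label-to-timecode conversion step is structurally the same.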