eONPUIR

The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space


dc.contributor.author Matychenko, Anastasiia
dc.contributor.author Матиченко, Анастасія Денисівна
dc.contributor.author Polyakova, Marina
dc.contributor.author Полякова, Марина Вячеславівна
dc.date.accessioned 2023-07-13T21:41:18Z
dc.date.available 2023-07-13T21:41:18Z
dc.date.issued 2023-07-03
dc.identifier.citation Matychenko, A., Polyakova, M. (2023). The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space. Herald of Advanced Information Technology, Vol. 6, N 2, p. 115–127. en
dc.identifier.citation Matychenko, A. The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space / A. Matychenko, M. Polyakova // Herald of Advanced Information Technology = Вісн. сучас. інформ. технологій. – Odesa, 2023. – Vol. 6, N 2. – P. 115–127. en
dc.identifier.issn 2663-0176
dc.identifier.issn 2663-7731
dc.identifier.uri http://dspace.opu.ua/jspui/handle/123456789/13968
dc.description.abstract A literature analysis identified the main methods for speaker identification from speech signals: statistical methods based on a Gaussian mixture model and a universal background model, and neural network methods, in particular those using convolutional or Siamese neural networks. The main characteristics of these methods are recognition performance, the number of parameters, and training time. Convolutional neural networks achieve high recognition performance, but they have many more parameters than statistical methods, although fewer than Siamese neural networks. A large number of parameters requires a large training set, which is not always available to the researcher. In addition, despite the effectiveness of convolutional neural networks, model size and inference efficiency remain important for devices with limited computing power, such as peripheral or mobile devices. Tuning the structure of existing convolutional neural networks is therefore a relevant research topic. In this work, we performed a structural tuning of an existing convolutional neural network based on the VGGNet architecture for speaker identification in the space of mel frequency cepstrum coefficients. The aim was to reduce the number of network parameters and, as a result, the training time, while keeping recognition performance sufficient (correct recognition above 95 %). The network obtained by structural tuning has fewer layers than the basic network. The ReLU activation function is replaced by the related Leaky ReLU function with a parameter of 0.1. The number of filters and the kernel sizes in the convolutional layers are changed, and the kernel size of the max pooling layer is increased.
It is proposed to average the output of each convolution so that the two-dimensional convolution results can be fed to a fully connected layer with the Softmax activation function. The experiment showed that the proposed neural network has 29 % fewer parameters than the basic neural network, while speaker recognition performance is almost the same. In addition, the training time of the proposed and basic neural networks was evaluated on five datasets of audio recordings corresponding to different numbers of speakers. The training time of the proposed network was 10-39 % lower than that of the basic neural network. The results show that structural tuning of the convolutional neural network is advisable for devices with limited computing power, namely peripheral or mobile devices. en
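The abstract names two of the structural changes: Leaky ReLU with a parameter of 0.1, and averaging each convolution output before the fully connected Softmax layer. The sketch below illustrates both in plain Python and shows why averaging can shrink the final layer. It is not the authors' implementation: the function names and the shapes (64 feature maps of 8 x 8, 10 speakers) are hypothetical and are not taken from the paper.

```python
import math


def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: keep positive inputs, scale negative inputs by alpha."""
    return x if x > 0 else alpha * x


def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (a list of rows) to its mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical shapes, chosen only for illustration (not from the paper).
channels, height, width, speakers = 64, 8, 8, 10

# Flattening feeds every activation to the dense layer;
# averaging feeds one value per feature map.
flatten_params = dense_params(channels * height * width, speakers)
gap_params = dense_params(channels, speakers)

if __name__ == "__main__":
    print("dense layer after flatten:  ", flatten_params)
    print("dense layer after averaging:", gap_params)
```

With these assumed shapes the final layer alone drops from 40970 to 650 parameters. The 29 % reduction reported in the abstract refers to the whole network and also reflects the other changes listed there.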
dc.language.iso en_US en
dc.publisher Nauka i Tekhnika en
dc.subject VGGNet en
dc.subject Speaker identification en
dc.subject convolutional neural network en
dc.subject mel frequency cepstrum coefficients en
dc.subject structural tuning en
dc.subject deep learning en
dc.title The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space en
dc.title.alternative Структурне налаштування згорткової нейронної мережі для ідентифікації дикторів у просторі мелчастотних кепстральних коефіцієнтів en
dc.type Article en
opu.citation.journal Herald of Advanced Information Technology en
opu.citation.volume 6 en
opu.citation.firstpage 115 en
opu.citation.lastpage 127 en
opu.citation.issue 2 en
opu.staff.id https://orcid.org/0009-0009-7894-4734 en
opu.staff.id https://orcid.org/0000-0001-7229-7657 en

