The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space

Matychenko, Anastasiia; Матиченко, Анастасія Денисівна; Polyakova, Marina; Полякова, Марина Вячеславівна

dc.contributor.author	Matychenko, Anastasiia
dc.contributor.author	Матиченко, Анастасія Денисівна
dc.contributor.author	Polyakova, Marina
dc.contributor.author	Полякова, Марина Вячеславівна
dc.date.accessioned	2023-07-13T21:41:18Z
dc.date.available	2023-07-13T21:41:18Z
dc.date.issued	2023-07-03
dc.identifier.citation	Matychenko, A., Polyakova, M. (2023). The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space. Herald of Advanced Information Technology, Vol. 6, N 2, p. 115–127.	en
dc.identifier.citation	Matychenko, A. The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space / A. Matychenko, M. Polyakova // Herald of Advanced Information Technology = Вісн. сучас. інформ. технологій. – Odesa, 2023. – Vol. 6, N 2. – P. 115–127.	en
dc.identifier.issn	2663-0176
dc.identifier.issn	2663-7731
dc.identifier.uri	http://dspace.opu.ua/jspui/handle/123456789/13968
dc.description.abstract	A literature analysis identified the main methods for speaker identification from speech signals: statistical methods based on a Gaussian mixture model and a universal background model, and neural network methods, in particular those using convolutional or Siamese neural networks. The main characteristics of these methods are recognition performance, the number of parameters, and training time. Convolutional neural networks achieve high recognition performance, but they have far more parameters than statistical methods, although fewer than Siamese neural networks. A large number of parameters requires a large training set, which is not always available to the researcher. In addition, despite the effectiveness of convolutional neural networks, model size and inference efficiency remain important for devices with limited computing power, such as peripheral or mobile devices. Tuning the structure of existing convolutional neural networks is therefore a relevant research topic. In this work, we performed a structural tuning of an existing convolutional neural network based on the VGGNet architecture for speaker identification in the space of mel frequency cepstrum coefficients. The aim was to reduce the number of network parameters and, as a result, the training time, while keeping recognition performance sufficient (correct recognition above 95 %). The network obtained by structural tuning has fewer layers than the basic network. The ReLU activation function was replaced by the related Leaky ReLU function with a parameter of 0.1. The number of filters and the kernel sizes in the convolutional layers were changed, and the kernel size of the max pooling layer was increased. 
It is proposed to average the result of each convolution so that the two-dimensional convolution results can be fed to a fully connected layer with the Softmax activation function. The experiment showed that the proposed neural network has 29 % fewer parameters than the basic neural network, while speaker recognition performance is almost the same. In addition, the training time of the proposed and basic neural networks was evaluated on five datasets of audio recordings corresponding to different numbers of speakers. The training time of the proposed network was 10-39 % shorter than that of the basic neural network. The results show that structural tuning of the convolutional neural network is advisable for devices with limited computing resources, namely peripheral or mobile devices.	en
dc.language.iso	en_US	en
dc.publisher	Nauka i Tekhnika	en
dc.subject	VGGNet	en
dc.subject	Speaker identification	en
dc.subject	convolutional neural network	en
dc.subject	mel frequency cepstrum coefficients	en
dc.subject	structural tuning	en
dc.subject	deep learning	en
dc.title	The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space	en
dc.title.alternative	Структурне налаштування згорткової нейронної мережі для ідентифікації дикторів у просторі мелчастотних кепстральних коефіцієнтів	en
dc.type	Article	en
opu.citation.journal	Herald of Advanced Information Technology	en
opu.citation.volume	6	en
opu.citation.firstpage	115	en
opu.citation.lastpage	127	en
opu.citation.issue	2	en
opu.staff.id	https://orcid.org/0009-0009-7894-4734	en
opu.staff.id	https://orcid.org/0000-0001-7229-7657	en
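
Illustrative note on the abstract. The abstract mentions a Leaky ReLU activation with a parameter of 0.1 and the averaging of each convolution result before a fully connected Softmax layer. The sketch below shows, in plain Python, how these pieces might look and why averaging each feature map shrinks the final layer. It reads "averaging of the results of each convolution" as global average pooling, which is an assumption, and the shapes (64 feature maps of 8 x 8, 10 speakers) are hypothetical values chosen for illustration, not the layer sizes of the published network.

```python
import math


def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: keep positive inputs, scale negative inputs by alpha."""
    return x if x > 0 else alpha * x


def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (a list of rows) to its mean value."""
    return [
        sum(sum(row) for row in fmap) / sum(len(row) for row in fmap)
        for fmap in feature_maps
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_params(n_inputs, n_outputs):
    """Parameter count of a fully connected layer: weights plus biases."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical shapes (assumptions, not taken from the paper).
n_maps, height, width, n_speakers = 64, 8, 8, 10

# Output layer fed by all flattened activations.
flatten_head = dense_params(n_maps * height * width, n_speakers)

# Output layer fed by one averaged value per feature map.
pooled_head = dense_params(n_maps, n_speakers)

print(flatten_head, pooled_head)
```

Under these assumed shapes, the pooled output layer needs one input per feature map instead of one per activation, which is the kind of saving that can lower the parameter count of the whole network. The 29 % reduction reported in the abstract refers to the authors' full architecture and cannot be derived from this sketch.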