The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space

Matychenko, Anastasiia; Матиченко, Анастасія Денисівна; Polyakova, Marina; Полякова, Марина Вячеславівна

Please use this identifier to cite or link to this item: http://dspace.opu.ua/jspui/handle/123456789/13968

Title:	The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space
Other titles:	Структурне налаштування згорткової нейронної мережі для ідентифікації дикторів у просторі мелчастотних кепстральних коефіцієнтів
Authors:	Matychenko, Anastasiia; Матиченко, Анастасія Денисівна; Polyakova, Marina; Полякова, Марина Вячеславівна
Keywords:	VGGNet; speaker identification; convolutional neural network; mel frequency cepstrum coefficients; structural tuning; deep learning
Issue date:	3-Jul-2023
Publisher:	Nauka i Tekhnika
Citation:	Matychenko, A., Polyakova, M. (2023). The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space. Herald of Advanced Information Technology, Vol. 6, N 2, p. 115–127. Matychenko, A. The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space / A. Matychenko, M. Polyakova // Herald of Advanced Information Technology = Вісн. сучас. інформ. технологій. – Odesa, 2023. – Vol. 6, N 2. – P. 115–127.
Abstract:	A literature review identified the main methods for speaker identification from speech signals: statistical methods based on a Gaussian mixture model and a universal background model, and neural network methods, in particular those using convolutional or Siamese neural networks. The main characteristics of these methods are recognition performance, the number of parameters, and training time. Convolutional neural networks achieve high recognition performance, but they have many more parameters than statistical methods, although fewer than Siamese neural networks. A large number of parameters requires a large training set, which is not always available to the researcher. In addition, despite the effectiveness of convolutional neural networks, model size and inference efficiency remain important for devices with limited computing power, such as peripheral or mobile devices. Tuning the structure of existing convolutional neural networks is therefore a relevant research topic. In this work, we performed a structural tuning of an existing convolutional neural network based on the VGGNet architecture for speaker identification in the space of mel frequency cepstrum coefficients. The aim was to reduce the number of network parameters and, as a result, the training time, while keeping recognition performance sufficient (correct recognition above 95 %). The network obtained by structural tuning has fewer layers than the basic network. The ReLU activation function is replaced by the related Leaky ReLU function with a parameter of 0.1. The number of filters and the kernel sizes in the convolutional layers are changed, and the kernel size of the max pooling layer is increased. The output of each convolution is averaged so that the two-dimensional convolution results can be fed to a fully connected layer with the Softmax activation function. In the experiment, the proposed network had 29 % fewer parameters than the basic network, with almost the same speaker recognition performance. The training time of both networks was also evaluated on five datasets of audio recordings with different numbers of speakers; the training time of the proposed network was 10-39 % shorter than that of the basic network. The results show that structural tuning of the convolutional neural network is advisable for devices with limited computing power, namely peripheral or mobile devices.
URI (Uniform Resource Identifier):	http://dspace.opu.ua/jspui/handle/123456789/13968
ISSN:	2663-0176 2663-7731
Appears in collections:	2023, Vol. 6, № 2
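
Illustrative sketch (not part of the record): the abstract mentions a Leaky ReLU activation with a parameter of 0.1 and averaging the result of each convolution before a fully connected Softmax layer. Read as global average pooling, which is an assumption about the authors' design, that classification head could look roughly like the pure-Python code below. All names, shapes, and weights here are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: keep positive values, scale negative values by alpha."""
    return x if x > 0 else alpha * x


def global_average_pool(feature_maps):
    """Collapse each 2-D feature map (a list of rows) to its mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_maps, weights, biases):
    """Leaky ReLU -> per-map averaging -> dense layer -> softmax.

    weights[k][c] is a hypothetical weight from channel c to speaker k.
    """
    activated = [
        [[leaky_relu(v) for v in row] for row in fmap] for fmap in feature_maps
    ]
    pooled = global_average_pool(activated)
    logits = [
        sum(w * p for w, p in zip(w_row, pooled)) + b
        for w_row, b in zip(weights, biases)
    ]
    return softmax(logits)


def dense_params(inputs, outputs):
    """Trainable parameters of a dense layer: one weight per pair plus biases."""
    return inputs * outputs + outputs


# Toy input: two 2x2 feature maps and three hypothetical speakers.
maps = [
    [[1.0, -2.0], [3.0, 0.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probs = classify(maps, weights, biases)
print(probs)
```

One reason averaging can reduce model size: with 64 channels of size 4x4 and 10 speakers (made-up numbers), `dense_params(64, 10)` counts the dense layer after averaging, while `dense_params(64 * 4 * 4, 10)` counts the same layer after flattening the maps instead.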

Files in this item:

File	Description	Size	Format
161-Article Text-205-1-10-20230709.pdf		719.92 kB	Adobe PDF


All items in this electronic archive are protected by copyright, with all rights reserved.