Three language political leaning text classification using natural language processing methods

Kosiv, Yurii; Косів, Юрій Андрійович; Косив, Юрий Андреевич; Yakovyna, Vitaliy; Яковина, Віталій Степанович; Яковина, Виталий Степанович

dc.contributor.author	Kosiv, Yurii
dc.contributor.author	Косів, Юрій Андрійович
dc.contributor.author	Косив, Юрий Андреевич
dc.contributor.author	Yakovyna, Vitaliy
dc.contributor.author	Яковина, Віталій Степанович
dc.contributor.author	Яковина, Виталий Степанович
dc.date.accessioned	2023-05-03T21:26:20Z
dc.date.available	2023-05-03T21:26:20Z
dc.date.issued	2022-12-28
dc.identifier.citation	Kosiv, Yu., Yakovyna, V. (2022). Three language political leaning text classification using natural language processing methods. Applied Aspects of Information Technology, Vol. 5, N 4, p. 359–370.	en
dc.identifier.citation	Kosiv, Yu. Three language political leaning text classification using natural language processing methods / Yu. Kosiv, V. Yakovyna // Applied Aspects of Information Technology = Прикладні аспекти інформ. технологій. – Odesa, 2022. – Vol. 5, N 4. – P. 359–370.	en
dc.identifier.issn	2617-4316
dc.identifier.issn	2663-7723
dc.identifier.uri	http://dspace.opu.ua/jspui/handle/123456789/13470
dc.description.abstract	This article addresses the problem of classifying the political leaning of a text resource. First, ten studies on the topic were analyzed in detail and their methodologies compared. The sources were compared by problem-solving method, the type of learning carried out, evaluation metrics, and vectorization technique. This showed that machine learning algorithms and neural networks, together with the TF-IDF and Word2Vec vectorization methods, were used most often to solve the problem. Next, various models for classifying whether textual information is pro-Ukrainian or pro-Russian were built on a dataset of social media users' messages about the events of the large-scale Russian invasion of Ukraine from February 24, 2022. The problem was solved using the machine learning algorithms Support Vector Machines, Decision Tree, Random Forest, Naïve Bayes classifier, eXtreme Gradient Boosting, and Logistic Regression (LR); the neural networks Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and BERT; the techniques for unbalanced data Random Oversampling, Random Undersampling, SMOTE, and SMOTETomek; and stacking ensembles of models. Among the machine learning algorithms, LR performed best, with a macro F1-score of 0.7966 when features were produced by TF-IDF vectorization and 0.7933 with BoW. Among the neural networks, the best macro F1-score of 0.76 was obtained using CNN and LSTM. Applying data balancing techniques failed to improve the results of the machine learning algorithms. Next, ensembles of models built from machine learning algorithms were determined. Two of the constructed ensembles achieved the same macro F1-score of 0.7966 as LR. Both of these ensembles used TF-IDF vectorization and the B-NBC meta-model; their base models were SVC, NuSVC, and LR in one case and SVC and LR in the other. Thus, three classifiers, namely the LR machine learning algorithm and two ensembles of models, which were defined as combinations of existing methods of solving the problem, demonstrated the largest macro F1-score of 0.7966. The obtained models can be used for a detailed review of various news publications according to their political leaning, and this information can help people recognize that they are isolated in a filter bubble.	en
dc.description.abstract	This article solves the problem of classifying the political leaning of a text resource. First, a detailed analysis of ten studies on the topic was carried out in the form of a comparison of their toolkits. The literature sources were compared by problem-solving method, the learning carried out, evaluation metrics, and vectorization method. It was thus determined that machine learning algorithms and neural networks, as well as the TF-IDF and Word2Vec feature representations, were used most often to solve the problem. Next, various models were built to classify whether textual information is pro-Ukrainian or pro-Russian, based on a dataset containing social media users' messages about the events of the full-scale Russian invasion of Ukraine from February 24, 2022. The problem was solved using the machine learning algorithms Support Vector Machines, Decision Tree, Random Forest, Naïve Bayes classifier, eXtreme Gradient Boosting, and Logistic Regression; the neural networks Convolutional Neural Networks, Long short-term memory, and BERT; the techniques for working with unbalanced data Random Oversampling, Random Undersampling, SMOTE, and SMOTETomek; and stacking ensembles of models. Among the machine learning algorithms, LR performed best, showing a macro F1-score of 0.7966 when the features were transformed by TF-IDF vectorization and 0.7933 with BoW. Among the neural networks, the best macro F1-score of 0.76 was obtained with CNN and LSTM. Applying data balancing techniques did not improve the results of the machine learning algorithms. Next, ensembles of models consisting of machine learning algorithms were determined. Two of the constructed ensembles reached the same macro F1-score of 0.7966 as LR. These ensembles consisted of TF-IDF vectorization, the B-NBC meta-model, and the base models SVC, NuSVC, LR and SVC, LR, respectively. Thus, three classifiers, namely the LR machine learning algorithm and two ensembles of models, which were defined by combining existing methods of solving the problem of classifying the political leaning of a text resource, demonstrated the largest macro F1-score of 0.7966. The obtained models can be used for a detailed review of various news publications by political leaning, and this information can help identify being inside an information bubble.	en
dc.language.iso	en	en
dc.publisher	Odessa Polytechnic National University	en
dc.subject	Text classification	en
dc.subject	political leaning	en
dc.subject	machine learning algorithms	en
dc.subject	neural networks	en
dc.subject	ensembles of models	en
dc.subject	natural language processing	en
dc.subject	Класифікація тексту	en
dc.subject	політична забарвленість	en
dc.subject	нейронні мережі	en
dc.subject	ансамблі моделей	en
dc.subject	обробка природної мови	en
dc.title	Three language political leaning text classification using natural language processing methods	en
dc.title.alternative	Класифікація політичної забарвленості тексту трьома мовами з використанням методів опрацювання природної мови	en
dc.type	Article	en
opu.citation.journal	Applied Aspects of Information Technology	en
opu.citation.volume	5	en
opu.citation.firstpage	359	en
opu.citation.lastpage	370	en
opu.citation.issue	4	en
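
The abstract describes a stacking ensemble built on TF-IDF features, with SVC and LR base models and a B-NBC meta-model, evaluated by macro F1-score. The following is a minimal sketch of that kind of pipeline using scikit-learn, not the authors' implementation. Several points are assumptions: B-NBC is read here as a Bernoulli Naive Bayes classifier, the base models pass hard 0/1 predictions to the meta-model, all hyperparameters are library defaults or arbitrary choices, and the toy corpus of neutral placeholder words is hypothetical and unrelated to the paper's dataset.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_stacking_pipeline():
    """TF-IDF features -> (SVC, LR) base models -> Bernoulli NB meta-model."""
    base_models = [
        ("svc", SVC(kernel="linear")),
        ("lr", LogisticRegression(max_iter=1000)),
    ]
    stack = StackingClassifier(
        estimators=base_models,
        final_estimator=BernoulliNB(),  # assumption: B-NBC = Bernoulli NB
        stack_method="predict",         # assumption: meta-model sees 0/1 labels
        cv=3,                           # small fold count for the toy corpus
    )
    return make_pipeline(TfidfVectorizer(), stack)


# Hypothetical toy corpus: two classes built from neutral placeholder words.
words_0 = ["alpha", "amber", "apple"]
words_1 = ["beta", "blue", "berry"]
train_texts = [f"{a} {b}" for a in words_0 for b in words_0]
train_texts += [f"{a} {b}" for a in words_1 for b in words_1]
train_labels = [0] * 9 + [1] * 9

test_texts = ["alpha apple", "blue berry", "amber", "beta"]
test_labels = [0, 1, 0, 1]

model = build_stacking_pipeline()
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)

# Macro F1 averages the per-class F1 scores with equal weight per class,
# so a minority class counts as much as a majority class.
macro_f1 = f1_score(test_labels, predictions, average="macro")
print(f"macro F1 on the toy test set: {macro_f1:.4f}")
```

Macro averaging is a natural fit for the unbalanced data the abstract mentions, because a classifier that ignores the smaller class is penalized even if its overall accuracy stays high.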