Data normalization methods to improve the quality of classification in the breast cancer diagnostic system

Polyakova, Marina; Полякова, Марина В`ячеславівна; Полякова, Марина Вячеславовна; Krylov, Viktor; Крилов, Віктор Миколайович; Крылов, Виктор Николаевич

Applied Aspects of Information Technology = Прикладні аспекти інформаційних технологій, 2022, Vol. 5, No. 1

dc.contributor.author	Polyakova, Marina
dc.contributor.author	Полякова, Марина В`ячеславівна
dc.contributor.author	Полякова, Марина Вячеславовна
dc.contributor.author	Krylov, Viktor
dc.contributor.author	Крилов, Віктор Миколайович
dc.contributor.author	Крылов, Виктор Николаевич
dc.date.accessioned	2022-04-20T21:56:13Z
dc.date.available	2022-04-20T21:56:13Z
dc.date.issued	2021-03-17
dc.identifier.citation	Polyakova, M., Krylov, V. (2022). Data normalization methods to improve the quality of classification in the breast cancer diagnostic system. Applied Aspects of Information Technology, Vol. 5, N 1, p. 55–63.	en
dc.identifier.citation	Polyakova, M. Data normalization methods to improve the quality of classification in the breast cancer diagnostic system / M. Polyakova, V. Krylov // Applied Aspects of Information Technology = Прикладні аспекти інформ. технологій. – Odesa, 2022. – Vol. 5, N 1. – P. 55–63.	en
dc.identifier.issn	2617-4316
dc.identifier.issn	2663-7723
dc.identifier.uri	http://dspace.opu.ua/jspui/handle/123456789/12501
dc.description.abstract	In oncology diagnostic systems, images of cells obtained from breast biopsy are often described by statistical and geometric features. To classify the values of these features, as presented in particular in the Wisconsin Diagnostic Breast Cancer dataset, the literature has used a naive Bayesian classifier, the k-nearest neighbors method, neural networks, and ensembles of decision trees. The classification results obtained with these methods differ mainly within the limits of statistical error. The choice of classifier is determined by the shape of the clusters and the presence of outliers in the data, and both are significantly affected by data preparation, in particular by the method used to normalize the feature values. Normalization is defined here as transforming the feature values to a certain interval. Differences in the intervals of feature values can lead to implicit weighting of the features during classification. After feature extraction and normalization, data belonging to the same class may be split into several clusters as a result of feature space distortion. To assign such data to one class, the distance between them must be greater than the internal scatter of the data in each cluster. Therefore, in addition to normalization, data preparation can include decorrelation and orthogonalization of the features, for example by principal component analysis, which selects feature projections with better class separation. To improve the quality of classification, the article therefore applies data preparation methods, namely data normalization and principal component analysis. It is shown that standard, robust, or min-max (minimax) normalization of the cell feature vectors is advisable when the k-nearest neighbors classifier or a naive Bayesian classifier is selected. 
When cell feature vectors from breast biopsy images were classified with an ensemble of decision trees, normalization did not improve the quality of classification. Reducing the dimension of the feature space by principal component analysis is advisable only for the k-nearest neighbors method. With a naive Bayesian classifier and with ensembles of decision trees, the transition to principal components reduces the quality of classification. The results obtained in the article make it possible to choose data preparation methods for a specific problem.	en
dc.description.abstract	У системах діагностування онкології отримані в результаті біопсії молочної залози зображення клітин часто ідентифікують статистичними і геометричними ознаками. Для класифікації значень цих ознак, представлених, зокрема, в тестовій базі Wisconsin Diagnostic Breast Cancer, в літературі використовувалися наївний байєсівський класифікатор, метод k-найближчих сусідів, нейронні мережі і ансамблі дерев рішень. Помічено, що результати класифікації, отримані із застосуванням цих методів, в основному, відрізняються в межах статистичної похибки. На форму кластерів та наявність викидів даних суттєво впливає підготовка даних, зокрема метод нормалізації значень їх ознак. Під нормалізацією розуміється приведення значень ознак до певного інтервалу. Різниця в інтервалах значень ознак може призвести до неявного зважування ознак під час класифікації об’єктів. Після виділення ознак та їх нормалізації множина даних, що належать одному класу, може бути розбитою на декілька кластерів у результаті спотворення ознакового простору. Для виділення таких даних в один клас відстань між ними має бути більшою за внутрішній розкид даних у кожному з кластерів. Тому крім нормалізації підготовка даних може включати декореляцію та ортогоналізацію ознак, наприклад, за допомогою аналізу головних компонентів, який обирає проекції ознак з кращим розподілом класів. Отже для підвищення якості класифікації в роботі використовувалися методи нормалізації даних і метод аналізу даних за допомогою головних компонент. Показано, що доцільно використовувати стандартне, робастне або мінімаксне нормування векторів ознак клітин, якщо обраний класифікатор k-найближчих сусідів або наївний байєсівський класифікатор. Якщо класифікація векторів ознак клітин на зображеннях біопсії молочної залози проводилася за допомогою ансамблю дерев рішень, застосування нормалізації не дало підвищення якості класифікації. 
Скорочення розмірності простору ознак шляхом аналізу головних компонент доцільно проводити тільки для методу k-найближчих сусідів. При використанні наївного байєсівського класифікатора і ансамблів дерев рішень перехід до головних компонентів знижує якість класифікації. Використовуючи результати проведеного експерименту, дослідник може вибрати методи підготовки даних для конкретного завдання.	en
dc.language.iso	en	en
dc.publisher	Odessa National Polytechnic University	en
dc.subject	Data normalization	en
dc.subject	principal component analysis	en
dc.subject	naive Bayesian classifier	en
dc.subject	k-nearest neighbors method	en
dc.subject	ensembles of decision trees	en
dc.subject	cascade forest	en
dc.subject	deep forest	en
dc.subject	нормалізація даних	en
dc.subject	аналіз головних компонент	en
dc.subject	наївний баєсівський класифікатор	en
dc.subject	метод k-найближчих сусідів	en
dc.subject	ансамблі дерев рішень	en
dc.subject	каскадний ліс	en
dc.subject	глибокий ліс	en
dc.title	Data normalization methods to improve the quality of classification in the breast cancer diagnostic system	en
dc.title.alternative	Методи нормалізації даних для покращення якості класифікації у системі діагностики онкології молочної залози	en
dc.type	Article	en
opu.citation.journal	Applied Aspects of Information Technology	en
opu.citation.volume	5	en
opu.citation.firstpage	55	en
opu.citation.lastpage	63	en
opu.citation.issue	1	en
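
The comparison described in the abstract (normalization method versus classifier, with and without principal component analysis) can be sketched roughly as follows. This is a minimal illustration, not the authors' experimental setup. It assumes scikit-learn and its bundled copy of the Wisconsin Diagnostic Breast Cancer data (`load_breast_cancer`). A random forest stands in for the ensembles of decision trees, since the cascade (deep) forest named in the keywords is not part of scikit-learn. The number of neighbors, the number of principal components, and the 5-fold cross-validation are arbitrary choices made for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# WDBC data as shipped with scikit-learn: 569 samples, 30 features.
X, y = load_breast_cancer(return_X_y=True)

# Normalization methods named in the abstract; None means raw features.
scalers = {
    "none": None,
    "standard": StandardScaler,
    "robust": RobustScaler,
    "min-max": MinMaxScaler,
}

# Factories so that every evaluation gets a fresh, unfitted classifier.
# The random forest is a stand-in (assumption) for the tree ensembles.
classifiers = {
    "kNN": lambda: KNeighborsClassifier(n_neighbors=5),
    "naive Bayes": lambda: GaussianNB(),
    "random forest": lambda: RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
}


def evaluate(scaler_cls, make_clf, n_components=None):
    """Mean 5-fold cross-validated accuracy of scaler -> (PCA) -> classifier."""
    steps = []
    if scaler_cls is not None:
        steps.append(scaler_cls())
    if n_components is not None:
        steps.append(PCA(n_components=n_components))
    steps.append(make_clf())
    # The scaler and PCA are fitted inside each fold, so no test data leaks
    # into the normalization statistics.
    model = make_pipeline(*steps)
    return cross_val_score(model, X, y, cv=5).mean()


# Accuracy for every (normalization, classifier) pair without PCA.
results = {
    (s_name, c_name): evaluate(s_cls, make_clf)
    for s_name, s_cls in scalers.items()
    for c_name, make_clf in classifiers.items()
}

# Same classifiers on 5 principal components of standardized features.
pca_results = {
    c_name: evaluate(StandardScaler, make_clf, n_components=5)
    for c_name, make_clf in classifiers.items()
}

for (s_name, c_name), acc in results.items():
    print(f"{c_name:14s} {s_name:9s} accuracy = {acc:.3f}")
for c_name, acc in pca_results.items():
    print(f"{c_name:14s} standard + PCA(5) accuracy = {acc:.3f}")
```

Wrapping each scaler and classifier in a single pipeline keeps the comparison fair, because every combination is scored on the same folds. The printed accuracies depend on these illustrative settings and are not expected to reproduce the figures reported in the article.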