Noise-adaptive synthetic oversampling technique
Applied Intelligence, No. 11, 2021 (Vol. 51, pp. 7827-7836)
ISSN: 0924669X
DOI: 10.1007/s10489-021-02341-2
Document category:
Article
English
Keywords: Machine learning; Class imbalance problems; Empirical experiments; Hybrid techniques; Imbalanced dataset; Machine learning models; Number of samples; Oversampling technique; Predictive performance; Large dataset
English abstract
In the field of supervised learning, the problem of class imbalance is one of the most difficult problems, and has attracted a great deal of research attention in recent years. In an imbalanced dataset, minority classes are those that contain very small numbers of data samples, while the remaining classes have a very large number of data samples. This type of imbalance reduces the predictive performance of machine learning models. There are currently three approaches for dealing with the class imbalance problem: algorithm-level, data-level, and ensemble-based approaches. Of these, data-level approaches are the most widely used, and consist of three sub-categories: under-sampling, oversampling, and hybrid techniques. Oversampling techniques generate synthetic samples for the minority class to balance an imbalanced dataset. However, existing oversampling approaches do not have a strategy for handling noise samples in imbalanced and noisy datasets, which leads to a reduction in the predictive performance of machine learning models. This study therefore proposes a noise-adaptive synthetic oversampling technique (NASOTECH) to deal with the class imbalance problem in imbalanced and noisy datasets. The noise-adaptive synthetic oversampling (NASO) strategy is first introduced, which is used to identify the number of samples generated for each sample in the minority class, based on the concept of the noise ratio. Next, the NASOTECH algorithm is proposed, based on the NASO strategy, to handle the class imbalance problem in imbalanced and noisy datasets. Finally, empirical experiments are conducted on several synthetic and real datasets to verify the effectiveness of the proposed approach. The experimental results confirm that NASOTECH outperforms three state-of-the-art oversampling techniques in terms of accuracy and geometric mean (G-mean) on imbalanced and noisy datasets. 
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
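
The abstract reports results in terms of accuracy and geometric mean (G-mean) but does not spell out the formula. As a minimal sketch, assuming the common binary-classification definition (the square root of sensitivity times specificity), the metric could be computed as below. The function name `g_mean` and the `positive` parameter are illustrative and do not come from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.

    Assumption: `positive` marks the minority class; every other label
    is treated as the majority (negative) class.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# A classifier that always predicts the majority class scores 80% accuracy
# on this 2-vs-8 dataset, yet its G-mean is zero.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
majority_only_g_mean = g_mean(y_true, y_pred)
```

Because G-mean drops to zero whenever one class is entirely misclassified, it tends to be more informative than accuracy alone on imbalanced datasets, which is presumably why the abstract cites both.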