Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features

Budhi, Gregorius Satia; Chiong, Raymond; Wang, Zuli

doi:10.1007/s11042-020-10299-5

Author(s)

Budhi, Gregorius Satia

Chiong, Raymond

Wang, Zuli

Publication Date

2021

Abstract

Fraudulent online sellers often collude with reviewers to garner fake reviews for their products. This act undermines the trust of buyers in product reviews, and potentially reduces the effectiveness of online markets. Being able to accurately detect fake reviews is, therefore, critical. In this study, we investigate several preprocessing and textual-based featuring methods along with machine learning classifiers, including single and ensemble models, to build a fake review detection system. Given the nature of product review data, where the number of fake reviews is far smaller than that of genuine reviews, we look into the results of each class in detail in addition to the overall results. We recognise from our preliminary analysis that, owing to imbalanced data, there is a high imbalance between the accuracies for different classes (e.g., 1.3% for the fake review class and 99.7% for the genuine review class), despite the overall accuracy looking promising (around 89.7%). We propose two dynamic random sampling techniques that are possible for textual-based featuring methods to solve this class imbalance problem. Our results indicate that both sampling techniques can improve the accuracy of the fake review class: for balanced datasets, the accuracies can be improved to a maximum of 84.5% and 75.6% for random under- and over-sampling, respectively. However, the accuracies for genuine reviews decrease to 75% and 58.8% for random under- and over-sampling, respectively. We also discover that, for smaller datasets, the Adaptive Boosting ensemble model outperforms other single classifiers, whereas, for larger datasets, the performance improvement from ensemble models is insignificant compared to the best results obtained by single classifiers.
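
The class imbalance idea in the abstract can be illustrated with a minimal sketch of plain random under- and over-sampling, together with a per-class accuracy check. This is not the authors' "dynamic" sampling method, which the abstract does not specify; the function names `random_resample` and `per_class_accuracy`, the `mode` and `seed` parameters, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def random_resample(samples, labels, mode="under", seed=0):
    """Balance classes by plain random under- or over-sampling.

    Illustrative only: not the dynamic technique proposed in the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    sizes = [len(items) for items in by_class.values()]
    # Under-sampling shrinks every class to the smallest class size;
    # over-sampling grows every class to the largest class size.
    target = min(sizes) if mode == "under" else max(sizes)

    balanced = []
    for y, items in by_class.items():
        if mode == "under":
            chosen = rng.sample(items, target)  # drawn without replacement
        else:
            # Keep every original item, then draw extras with replacement.
            chosen = items + rng.choices(items, k=target - len(items))
        balanced.extend((x, y) for x in chosen)

    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted items within each true class."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


# Toy data: 10 fake and 90 genuine reviews (hypothetical proportions).
reviews = list(range(100))
classes = ["fake"] * 10 + ["genuine"] * 90
under_x, under_y = random_resample(reviews, classes, mode="under")
over_x, over_y = random_resample(reviews, classes, mode="over")
```

In a sketch like this, resampling would normally be applied to the training split only, so that the per-class accuracies are still measured on data with the original class proportions.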

Citation

Multimedia Tools and Applications, v.80, p. 13079-13097

ISSN

1573-7721

1380-7501

Link

https://hdl.handle.net/1959.11/61394

Language

en

Publisher

Springer New York LLC

Title

Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features

Type of document

Journal Article

Entity Type

Publication

