Classification of English-Language Racial Hate Speech Using Random Forest
Keywords:
Hate Speech, Racial, Random Forest, Text Mining, TF-IDF
Abstract
Hate speech on social media is increasingly widespread, particularly speech with racial undertones. This study classifies English-language comments as racial hate speech or non-hate speech using the Random Forest algorithm. The dataset, obtained from Kaggle, consists of 7,492 comments labeled in two classes: hate speech and non-hate speech. The research process comprised data preprocessing (cleansing, case folding, tokenizing, filtering, and stemming), term weighting with TF-IDF, and an 80%/20% split of the dataset into training and testing sets. The classification model was built with Random Forest and evaluated with a confusion matrix. Testing on 200 test samples yielded 66 True Negatives (TN), 65 True Positives (TP), 30 False Positives (FP), and 39 False Negatives (FN). Based on these results, the model achieved an accuracy of 65.25%, precision of 68.42%, recall of 62.50%, and F1-score of 65.34%. These results indicate that Random Forest delivers moderate performance in detecting racial hate speech, with 69 of the 200 test comments still misclassified. The study points to the potential of Random Forest as a component of automatic hate speech detection systems and may serve as a reference for developing safer and more inclusive content moderation technologies.
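The evaluation metrics named in the abstract follow directly from the four confusion-matrix counts. The sketch below recomputes them from the reported counts using the standard definitions; the function name `confusion_metrics` and the zero-division guards are illustrative assumptions, not part of the study's own code.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Hypothetical helper for illustration; hate speech is treated as the
    positive class, as in the abstract.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts reported in the abstract for the 200 test samples.
metrics = confusion_metrics(tp=65, tn=66, fp=30, fn=39)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Run on the reported counts, this gives a precision of 68.42% and a recall of 62.50%, matching the abstract. The recomputed accuracy (131/200 = 65.50%) and F1-score (130/199 ≈ 65.33%) differ slightly from the reported 65.25% and 65.34%, which may reflect rounding or a transcription difference in the reported figures.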





