Text clustering for analyzing scientific article using pre-trained language model and k-means algorithm

Firdaus Firdaus, Siti Nurmaini, Novi Yusliani, Muhammad Naufal Rachmatullah, Annisa Darmawahyuni, Yesi Novaria Kunang, Muhammad Fachrurrozi, Risky Armansyah

Abstract


Text clustering is a data mining technique that can be used to analyze scientific articles. Indonesian journals accredited by SINTA publish in two languages, Indonesian and English. This is the first research to focus on clustering Indonesian and English texts jointly. In this research, bidirectional encoder representations from transformers (BERT) and IndoBERT are used to represent text data as fixed-length feature vectors. BERT and IndoBERT are pre-trained language models (PLMs) that produce vector representations accounting for the position and context of words in a sentence. The articles are clustered with the K-Means algorithm, which has good convergence behavior and adapts to new examples, both of which help clustering performance. The best k-value for the K-Means algorithm is selected using the silhouette score, the elbow method, and the Davies-Bouldin index (DBI). The experiment shows that the silhouette score yields the best k-value for clustering the articles, with a mean score of 0.597, compared with 0.425 for the elbow method and 0.412 for the DBI. Therefore, the silhouette score optimizes the performance of PLMs and the K-Means algorithm in analyzing scientific articles to determine whether they are in scope or out of scope.
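The k-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D points stand in for BERT/IndoBERT embeddings, the helper names (kmeans, silhouette, davies_bouldin) are hypothetical, and the deterministic farthest-point initialisation is an assumption made only to keep the example reproducible. The elbow method is left out because it relies on visually judging an inertia curve.

```python
import math


def kmeans(points, k, n_iter=20):
    """Lloyd's k-means with deterministic farthest-point initialisation (assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def group(points, labels):
    """Collect points by cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means better-separated clusters."""
    clusters = group(points, labels)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower means more compact, better-separated clusters."""
    clusters = group(points, labels)
    cents = {lab: tuple(sum(axis) / len(m) for axis in zip(*m))
             for lab, m in clusters.items()}
    scatter = {lab: sum(math.dist(p, cents[lab]) for p in m) / len(m)
               for lab, m in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


# Toy 2-D points standing in for BERT/IndoBERT embeddings (assumption).
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]

sil = {k: silhouette(POINTS, kmeans(POINTS, k)) for k in range(2, 6)}
dbi = {k: davies_bouldin(POINTS, kmeans(POINTS, k)) for k in range(2, 6)}
best_k_sil = max(sil, key=sil.get)  # silhouette: pick the highest score
best_k_dbi = min(dbi, key=dbi.get)  # DBI: pick the lowest score
print(best_k_sil, best_k_dbi)
```

On this toy data both criteria select k = 3, matching the three groups of points. On real embeddings the criteria can disagree, which is the situation the comparison in the abstract addresses.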

Keywords


Bidirectional encoder representations from transformers; IndoBERT; K-means algorithm; Pre-trained language model; Text clustering


DOI: https://doi.org/10.11591/eei.v14i5.9670


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Bulletin of Electrical Engineering and Informatics (BEEI)
ISSN: 2089-3191, e-ISSN: 2302-9285
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).