Profiling DNA Sequence of SARS-Cov-2 Virus Using Machine Learning Algorithm
Lailil Muflikhah, Muh. Arif Rahman, Agus Wahyu Widodo
Abstract
Coronavirus disease 2019 (COVID-19) spreads rapidly because it is infectious. The disease is caused by a virus, classified here as a DNA virus, with highly diverse genetics. This study proposes a feature extraction method that uses k-mers to obtain nucleotide frequencies in protein-coding sequences. To profile viral DNA sequences, the study groups them by similarity across countries using hierarchical k-means, in which the results of hierarchical clustering are averaged to obtain the initial cluster centers. The experimental results give silhouette, purity, and entropy values of 0.867, 0.208, and 0.892, respectively. Gini index feature selection is then applied to find the components that characterize each country. The selected components are evaluated with an ensemble method, Random Forest. The experiments showed high performance in terms of sensitivity, accuracy, specificity, and area under the curve (AUC).
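The k-mer feature extraction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `kmer_frequencies`, the choice of k, the A/C/G/T alphabet, and the decision to skip windows containing ambiguous bases are all assumptions made for illustration, since the abstract does not state them.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return the relative frequency of every k-mer over `alphabet`.

    Hypothetical helper: windows containing characters outside
    `alphabet` (for example the ambiguity code N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    # Fixed vocabulary so that every sequence maps to the same features.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[kmer] for kmer in vocabulary)
    return {
        kmer: (counts[kmer] / total if total else 0.0)
        for kmer in vocabulary
    }


# Toy sequence, not real viral data.
profile = kmer_frequencies("ATGCGATGA", k=2)
```

Each sequence becomes a vector of length 4^k (16 entries for k = 2), which could then be passed to a clustering or classification step. Using a fixed vocabulary rather than only the observed k-mers keeps the feature vectors aligned across sequences.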
Keywords
COVID-19; DNA sequence; Feature extraction; k-mer; Random Forest
DOI:
https://doi.org/10.11591/eei.v11i2.3487
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Bulletin of Electrical Engineering and Informatics (BEEI), ISSN: 2089-3191, e-ISSN: 2302-9285. This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).