Improving genomic classification via Pearson-based SNP selection: a comparison of k-NN, SVM, and random forest
Prihanto Ngesti Basuki, Sri Yulianto Joko Prasetyo, Adi Setiawan
Abstract
Accurate genomic classification is vital for precision health and population studies, yet high-dimensional single-nucleotide polymorphism (SNP) data (p>>n) amplify noise, redundancy, and overfitting. This study evaluates a simple, model-independent Pearson-based selection that ranks SNPs by feature–label correlation, and assesses k-nearest neighbors (k-NN), linear support vector machine (SVM), and random forest (RF) under leakage-free stratified Monte Carlo cross-validation (MCCV). Performance increases monotonically with |r|: the strongest tiers reach ≈99–100% accuracy; SVM leads in mid tiers (RF second), while k-NN is competitive mainly at the extremes. A matched-dimensionality PCA-120 baseline (TRAIN-only) attains parity for SVM/RF and trails slightly for k-NN at the 10% test size. With 120-SNP panels, prediction medians are ≈0.30 ms (SVM), 1.81–1.83 ms (k-NN), and 34–35 ms (RF), supporting CPU-only deployment. A consensus panel combining correlation evidence with principal component analysis (PCA) selection frequency yields interpretable Top-20/Top-120 subsets and |r|-based operating thresholds. Overall, Pearson-based selection provides a transparent, reproducible baseline for small-sample SNP classification, offering accuracy competitive with PCA at lower computational complexity and straightforward extensions to broader cohorts and multi-omics integration.
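The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`pearson_abs`, `stratified_split`, `top_snps`), the toy genotype matrix, and the 0/1/2 genotype coding are assumptions made for illustration. The sketch draws one stratified Monte Carlo split, then ranks SNP columns by absolute Pearson correlation with the label using training rows only, which is one way to keep the selection leakage-free.

```python
import math
import random


def pearson_abs(x, y):
    """Absolute Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column or label: treat as uninformative
    return abs(sxy / math.sqrt(sxx * syy))


def stratified_split(labels, test_frac, rng):
    """One Monte Carlo draw: hold out test_frac of each class for testing."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        k = max(1, round(test_frac * len(idx)))
        test += idx[:k]
        train += idx[k:]
    return sorted(train), sorted(test)


def top_snps(X, y, train_idx, k):
    """Rank SNP columns by |r| on TRAIN rows only; return top-k column indices."""
    y_tr = [y[i] for i in train_idx]
    scores = []
    for j in range(len(X[0])):
        col = [X[i][j] for i in train_idx]
        scores.append((pearson_abs(col, y_tr), j))
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest |r| first
    return [j for _, j in scores[:k]]


# Hypothetical toy data: 20 samples, 5 SNPs coded 0/1/2, binary label.
rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = []
for label in y:
    row = [rng.choice([0, 1, 2]) for _ in range(5)]
    row[2] = 2 * label  # column 2 is made fully informative on purpose
    X.append(row)

train_idx, test_idx = stratified_split(y, 0.1, rng)
selected = top_snps(X, y, train_idx, k=1)
print(selected)
```

In a full MCCV run this split-then-select step would be repeated over many random draws, with the classifier fitted on the selected training columns and scored on the held-out rows. With a binary label, the Pearson coefficient computed here coincides with the point-biserial correlation listed in the keywords.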
Keywords
Ancestry inference; High-dimensional data (small-n, high-p); Leakage-free evaluation; Monte Carlo cross-validation; Point-biserial correlation; Principal component analysis; Receiver operating-characteristic area under the curve
DOI:
https://doi.org/10.11591/eei.v15i1.9087
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Bulletin of Electrical Engineering and Informatics (BEEI), ISSN: 2089-3191, e-ISSN: 2302-9285. This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).