Improving genomic classification via Pearson-based SNP selection: a comparison of k-NN, SVM, and random forest

Prihanto Ngesti Basuki, Sri Yulianto Joko Prasetyo, Adi Setiawan

Abstract


Accurate genomic classification is vital for precision health and population studies, yet high-dimensional single-nucleotide polymorphism (SNP) data (p ≫ n) amplify noise, redundancy, and overfitting. This study evaluates a simple, model-independent Pearson-based selection that ranks SNPs by feature–label correlation, and assesses k-nearest neighbors (k-NN), linear support vector machine (SVM), and random forest (RF) under leakage-free stratified Monte Carlo cross-validation (MCCV). Performance increases monotonically with |r|: the strongest tiers reach ≈99–100% accuracy; SVM leads in mid tiers (RF second), while k-NN is competitive mainly at the extremes. A matched-dimensionality PCA-120 baseline (TRAIN-only) attains parity for SVM/RF and trails slightly for k-NN at the 10% test size. With 120-SNP panels, prediction medians are ≈0.30 ms (SVM), 1.81–1.83 ms (k-NN), and 34–35 ms (RF), supporting CPU-only deployment. A consensus panel combining correlation evidence with principal component analysis (PCA) selection frequency yields interpretable Top-20/Top-120 subsets and |r|-based operating thresholds. Overall, Pearson-based selection provides a transparent, reproducible baseline for small-sample SNP classification, offering accuracy competitive with PCA at lower computational complexity and straightforward extensions to broader cohorts and multi-omics integration.
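
The selection-inside-the-split idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, the data are synthetic toy genotypes (0/1/2) rather than the study's SNPs, and a small k-NN stands in for the three classifiers compared in the paper. With a binary label, Pearson's r between a SNP and the label coincides with the point-biserial correlation. The point of the sketch is that SNPs are ranked by |r| using training rows only, so the held-out rows never influence the selection.

```python
import math
import random

def pearson_abs(x, y):
    """Absolute Pearson correlation between one feature column and the labels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no signal
    return abs(sxy / math.sqrt(sxx * syy))

def stratified_split(y, test_frac, rng):
    """Return (train_idx, test_idx), drawing test_frac of each class at random."""
    train, test = [], []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        k = max(1, round(test_frac * len(idx)))
        test += idx[:k]
        train += idx[k:]
    return train, test

def select_top_k(X, y, rows, k):
    """Rank SNP columns by |r| computed on the given rows only; ties go to the lower index."""
    labels = [y[i] for i in rows]
    scores = [(pearson_abs([X[i][j] for i in rows], labels), j)
              for j in range(len(X[0]))]
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:k]]

def knn_predict(X, y, train, cols, query, k=3):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    dists = sorted((sum((X[i][j] - query[j]) ** 2 for j in cols), i) for i in train)
    votes = [y[i] for _, i in dists[:k]]
    return max(set(votes), key=votes.count)

def mccv_accuracy(X, y, n_splits=20, test_frac=0.2, top_k=5, seed=0):
    """Mean test accuracy over repeated stratified random splits (MCCV)."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_splits):
        train, test = stratified_split(y, test_frac, rng)
        cols = select_top_k(X, y, train, top_k)  # selection sees TRAIN rows only
        hits = sum(knn_predict(X, y, train, cols, X[i]) == y[i] for i in test)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)

def make_toy_data(n_per_class=20, n_snps=30, n_informative=3, seed=0):
    """Synthetic genotypes: the first n_informative SNPs track the label, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            noise = [rng.choice((0, 1, 2)) for _ in range(n_snps - n_informative)]
            X.append([2 * label] * n_informative + noise)
            y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    print("selected SNPs:", select_top_k(X, y, list(range(len(y))), 3))
    print("MCCV accuracy:", mccv_accuracy(X, y, top_k=3))
```

Running the ranking on the full data set before splitting would let test labels shape the panel and inflate accuracy, which is the leakage the study's protocol is designed to avoid. The same loop structure would apply with a linear SVM or random forest in place of the k-NN stand-in.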

Keywords


Ancestry inference; High-dimensional data (small-n, high-p); Leakage-free evaluation; Monte Carlo cross-validation; Point-biserial correlation; Principal component analysis; Receiver operating-characteristic area under the curve

DOI: https://doi.org/10.11591/eei.v15i1.9087

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Bulletin of Electrical Engineering and Informatics (BEEI)
ISSN: 2089-3191, e-ISSN: 2302-9285
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).