


Volume 20 No 10 (2022)
CatBoost encoded Shannon Entropy based feature reduction model for 16S rDNA sequences using Ensemble algorithms
M Meharunnisa, M Sornam and B Ramesh
Abstract
Most of the genes in the bacterial genome are still unknown or only partially
understood. Machine learning algorithms take an inordinate amount of time to
make predictions when such uninformative features are included. The K-mer based sliding window method is
the most commonly used classification approach in terms of speed and accuracy, but
the method ignores position specific information. The purpose of this paper is to
extract the most informative features from multiple sequence aligned 16S sequences
and then classify them using ensemble algorithms such as the XGBoost
Classifier, AdaBoost, Bagging Classifier and Random Forest. CatBoost Encoded Shannon
Entropy (CBSE) is a novel feature reduction technique that was developed by combining Categorical Boosting and Shannon Entropy techniques to extract
informative features. Following that, the reduced dataset is used to train the bagging
and boosting ensemble algorithms. A systematic comparison of different Shannon
entropy thresholds and four different K-mer methods was performed. The models are
evaluated using various classification metrics such as accuracy, F1-Score, and
execution time (in seconds). The results show that Random Forest combined
with CBSE achieves a high accuracy of 98% and an F1-score of 98% on the testing data using only 442 features,
compared with the original 2011 features. The statistical analysis
confirms that CBSE-based ensemble algorithms outperform the K-mer based
ensemble technique.
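To illustrate the entropy-based reduction step described in the abstract, the following is a minimal sketch and not the authors' implementation. It computes the Shannon entropy of each column of a multiple sequence alignment and keeps the columns whose entropy exceeds a threshold. The function names, the toy alignment, the threshold value, and the choice to keep high-entropy (variable) columns are all assumptions made for illustration; the CatBoost encoding step of CBSE is not shown.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def select_columns(aligned, threshold):
    """Indices of alignment columns whose entropy exceeds the threshold.

    Assumption: variable (high-entropy) positions are treated as the
    informative ones; fully conserved columns have entropy 0 and are dropped.
    """
    columns = zip(*aligned)  # transpose: one tuple of symbols per position
    return [i for i, col in enumerate(columns) if column_entropy(col) > threshold]


# Hypothetical toy alignment; all sequences have equal length after alignment.
aligned = ["ACGT-", "ACGA-", "ATGC-", "ATGG-"]

kept = select_columns(aligned, threshold=0.5)  # threshold chosen arbitrarily
reduced = ["".join(seq[i] for i in kept) for seq in aligned]

print(kept)
print(reduced)
```

In this sketch the reduced sequences keep position-specific information, because each retained feature corresponds to a fixed alignment column rather than to a K-mer count.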
Keywords
Categorical Encoding, Shannon Entropy, K-mer Encoding, Ensemble Algorithms
Copyright
Copyright © Neuroquantology
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Articles published in the Neuroquantology are available under Creative Commons Attribution Non-Commercial No Derivatives Licence (CC BY-NC-ND 4.0). Authors retain copyright in their work and grant IJECSE right of first publication under CC BY-NC-ND 4.0. Users have the right to read, download, copy, distribute, print, search, or link to the full texts of articles in this journal, and to use them for any other lawful purpose.