Volume 20 No 10 (2022)
Cat Boost encoded Shannon Entropy based feature reduction model for 16S rDNA sequences using Ensemble algorithms
M Meharunnisa, M Sornam and B Ramesh
Abstract
Most of the genes in the bacterial genome are still unknown or only partially understood. Machine learning algorithms take an inordinate amount of time to make predictions when they are given such uninformative features. The K-mer based sliding window method is the most commonly used classification approach in terms of speed and accuracy, but it ignores position-specific information. The purpose of this paper is to extract the most informative features from multiple sequence aligned 16S sequences and then classify them using ensemble algorithms such as XGBoost Classifier, AdaBoost, Bagging Classifier and Random Forest. CatBoost Encoded Shannon Entropy (CBSE) is a novel feature reduction technique developed by combining Categorical Boosting and Shannon Entropy techniques to extract informative features. The reduced dataset is then used to train the bagging and boosting ensemble algorithms. A systematic comparison of different Shannon entropy thresholds and four different K-mer methods was performed. The models are evaluated using classification metrics such as accuracy, F1-score, and execution time (in seconds). The results show that Random Forest combined with CBSE achieves an accuracy of 98% and an F1-score of 98% on testing data using only 442 features, compared with the original 2011 features. The statistical analysis confirms that CBSE-based ensemble algorithms outperform the K-mer based ensemble technique.
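The entropy-based reduction step can be illustrated with a minimal sketch. It computes the Shannon entropy H = -Σ p log2 p of each column of a multiple sequence alignment and keeps the columns whose entropy exceeds a threshold. The abstract does not specify the threshold values, the direction of the cut, or how the CatBoost encoding is combined with the entropy score, so the function names, the "keep columns above the threshold" rule, and the toy alignment below are all assumptions made for illustration, and the CatBoost encoding step is omitted.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def select_columns(aligned, threshold):
    """Return (kept column indices, per-column entropies).

    Assumption: a column is kept when its entropy is strictly greater
    than the threshold, i.e. fully conserved columns are discarded.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    entropies = [
        column_entropy([seq[i] for seq in aligned]) for i in range(length)
    ]
    kept = [i for i, h in enumerate(entropies) if h > threshold]
    return kept, entropies


# Toy alignment (hypothetical data, not taken from the paper).
aligned = ["ACGT-A", "ACGTTA", "ACCTGA", "ACCT-A"]
kept, entropies = select_columns(aligned, threshold=0.5)

# Reduced dataset: only the selected positions of every sequence.
reduced = [[seq[i] for i in kept] for seq in aligned]
print(kept)     # [2, 4]
print(reduced)  # [['G', '-'], ['G', 'T'], ['C', 'G'], ['C', '-']]
```

Because each retained feature is tied to a fixed alignment position, this representation preserves the position-specific information that K-mer counts discard. The reduced columns are still categorical, so they would need to be encoded numerically before being passed to an ensemble classifier.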
Keywords
Categorical Encoding, Shannon Entropy, K-mer Encoding, Ensemble Algorithms
Copyright
Copyright © Neuroquantology

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Articles published in the Neuroquantology are available under Creative Commons Attribution Non-Commercial No Derivatives Licence (CC BY-NC-ND 4.0). Authors retain copyright in their work and grant IJECSE right of first publication under CC BY-NC-ND 4.0. Users have the right to read, download, copy, distribute, print, search, or link to the full texts of articles in this journal, and to use them for any other lawful purpose.