2013 8th International Symposium on Health Informatics and Bioinformatics, HIBIT 2013, Ankara, Turkey, 25 - 27 September 2013
Motif extraction from protein sequences has long been a challenging task in bioinformatics. Class-specific motifs, which occur frequently in one class but only rarely in the others, can be used for highly accurate classification of protein sequences. In this study, we present a new scoring-based method for class-specific n-gram motif selection using reduced amino acid alphabets. Cohesin protein sequences, which interact with dockerin modules to assemble the cellulosome, a multi-enzyme complex that degrades cellulose (the most common and abundant organic polymer), are used for class-specific motif selection, and the selected motifs are then given to the J48 and SVM algorithms as features. Classification results are examined across various n-gram sizes, reduced amino acid alphabets, and numbers of features. The best result, a training accuracy of 98.61% and a test accuracy of 94.54%, was obtained using the Gbmr14 alphabet, 5 features per family, 4-gram motifs, and the J48 algorithm. The proposed technique can be generalized to other protein families. © 2013 IEEE.
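The abstract does not state the scoring formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simple score (the fraction of a class's sequences containing an n-gram, minus the highest such fraction in any other class). The grouping `TOY_GROUPS` is a hypothetical reduced alphabet chosen for illustration and is not the Gbmr14 alphabet; the function names `reduce_sequence`, `select_motifs`, and `featurize` are likewise illustrative.

```python
from collections import Counter

# Hypothetical toy reduced alphabet (NOT Gbmr14): each group of amino acids
# is represented by the first letter of the group.
TOY_GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
REDUCE = {aa: group[0] for group in TOY_GROUPS for aa in group}


def reduce_sequence(seq):
    """Map a protein sequence onto the reduced alphabet ('X' for unknowns)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def ngrams(seq, n):
    """Return the set of distinct n-grams occurring in a sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def select_motifs(classes, n=4, top_k=5):
    """Pick the top_k class-specific n-grams for each class.

    classes maps a class label to a list of raw sequences.
    Assumed score: in-class presence fraction minus the largest
    presence fraction observed in any other class.
    """
    freq = {}
    for label, seqs in classes.items():
        counts = Counter()
        for s in seqs:
            counts.update(ngrams(reduce_sequence(s), n))
        freq[label] = {g: c / len(seqs) for g, c in counts.items()}

    selected = {}
    for label, table in freq.items():
        others = [other for other in freq if other != label]

        def score(g):
            outside = max((freq[o].get(g, 0.0) for o in others), default=0.0)
            return table[g] - outside

        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(table, key=lambda g: (-score(g), g))
        selected[label] = ranked[:top_k]
    return selected


def featurize(seq, motifs, n=4):
    """Binary feature vector: 1 if the motif occurs in the reduced sequence."""
    present = ngrams(reduce_sequence(seq), n)
    return [1 if m in present else 0 for m in motifs]
```

Feature vectors produced this way could then be passed to a decision tree or SVM implementation of the reader's choice; the choice of score, tie-breaking, and binary (rather than count-based) features are all assumptions of this sketch.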