Application of Feature Selection Technology Based on Incremental of Diversity in Prediction of Flexible Regions from Protein Sequences

[Vol. 14, Issue 9]

Author(s):

Suqing Yang, Shisai Hu, Ying Zhang* and Jun Lv

Pages 642-647 (6)

Abstract:


Background: The flexibility of a protein structure is often related to the function of the protein. Feature selection (FS) is critical to many machine learning applications that deal with small samples and high-dimensional data. To predict flexible regions from protein sequences, it is important to build a machine learning method on an effective feature selection technique. Such a method may also provide new insight into the protein folding process.

Method: First, the frequencies of k-spaced amino acid pairs are taken as a representation of the local sequences. Second, these representations are processed by feature selection based on increment of diversity (FSID) to reduce their dimensionality. Finally, logistic regression is applied to integrate the selected features into a scheme that discriminates flexible regions from rigid ones (referred to as FSID_FRP).
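
The first two steps can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: "k-spaced" is taken to mean two residues separated by exactly k intervening positions (so k = 0 gives adjacent pairs), and the increment of diversity is computed with the definition commonly used in the wider literature, D = N log N - sum(n_i log n_i) and ID(X, Y) = D(X + Y) - D(X) - D(Y). The function names are hypothetical, and how FSID ranks or thresholds features is not shown here.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kspaced_pair_counts(segment, k):
    """Count residue pairs separated by exactly k intervening positions.

    Assumption: k = 0 means adjacent residues (ordinary dipeptides).
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(segment) - k - 1):
        pair = segment[i] + segment[i + k + 1]
        if pair in counts:  # skip non-standard residues such as 'X'
            counts[pair] += 1
    return counts


def kspaced_pair_frequencies(segment, k):
    """Normalise the pair counts so that they sum to 1 (or stay all zero)."""
    counts = kspaced_pair_counts(segment, k)
    total = sum(counts.values())
    return {p: (c / total if total else 0.0) for p, c in counts.items()}


def diversity(counts):
    """D = N log N - sum(n_i log n_i), with 0 log 0 taken as 0."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return n * math.log(n) - sum(c * math.log(c) for c in counts if c > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two aligned count vectors."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


# Usage on a made-up segment (not data from the article).
freqs = kspaced_pair_frequencies("GAGPG", k=1)
```

With this definition, two count vectors with the same composition give an increment of zero, and the value grows as the compositions diverge, which is what makes the measure usable as a similarity score between a feature's distribution in the flexible and rigid classes.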

Results: From a set of 66 sequences, 74 features are selected, comprising 26 flexible patterns and 48 rigid patterns. Most of the flexible patterns are associated with glycine or proline, whereas the rigid patterns are associated with leucine or valine. Using the FSID_FRP method, which applies logistic regression to the 74-feature representation, we obtained 79.41% accuracy and an MCC of 0.51. The results of the FSID_FRP method are comparable to those of the FlexRP method, which uses 95 features.
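
For reference, accuracy and the Matthews correlation coefficient (MCC) are both computed from the binary confusion matrix. The sketch below uses the standard formulas; the function name and the counts in the usage line are hypothetical and are not taken from the article.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary flexible/rigid confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts, chosen only to illustrate the call.
acc, mcc = accuracy_and_mcc(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC stays informative when the two classes are unbalanced, which is why the two values are usually reported together.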

Conclusion: FSID, a simple feature selection method, is shown to be very efficient in predicting the flexible and rigid regions of protein sequences. It is better suited to small-sample classification than the entropy-based feature selection method. The proposed FSID_FRP method achieved approximately 80% prediction accuracy and stronger generalization ability.

Keywords:

Feature selection, increment of diversity, k-spaced amino acid pairs, logistic regression, protein flexible regions, protein sequences.

Affiliation:

College of Science, Inner Mongolia University of Technology, Hohhot 010051
