Proteins are among the most important components of the cell, so knowing their exact function is of great significance. However, the function of a large number of proteins is still unknown. In addition, biologists today insist on a hierarchical organization of the living world, and hence of protein databases as well. Many protein classification algorithms have been proposed for determining protein function, but only a few of them take these hierarchical structures into consideration. The Gene Ontology (GO) is a protein and gene database structured as a controlled hierarchical vocabulary of terms that describe protein functions. This paper introduces a new hierarchical multi-label protein classifier that uses the relationships among the GO terms. First, protein descriptors are extracted from the structural coordinates stored in Protein Data Bank (PDB) files. Then, a modified C4.5 algorithm is applied to select the most appropriate descriptor features for protein classification based on the GO hierarchy. An evaluation of this approach is presented, and the results show that the hierarchical structure of GO is important for improving classification accuracy at higher levels.
C4.5 Classification, Gene Ontology, Protein function prediction.
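
To illustrate what using the relationships among GO terms can mean for a multi-label classifier, the following is a minimal sketch, not the method of this paper. It assumes a small hand-written fragment of a GO-like hierarchy and expands a protein's annotated terms with all of their ancestors, then groups the resulting labels by level. The names `PARENTS`, `with_ancestors`, `depth`, and `labels_by_level` are hypothetical and introduced only for this example.

```python
from typing import Dict, Iterable, Set

# Toy, hand-written fragment of a GO-like hierarchy (an assumption for
# illustration): each term maps to the set of its direct parent terms.
PARENTS: Dict[str, Set[str]] = {
    "molecular_function": set(),
    "binding": {"molecular_function"},
    "catalytic_activity": {"molecular_function"},
    "protein_binding": {"binding"},
    "hydrolase_activity": {"catalytic_activity"},
}


def with_ancestors(terms: Iterable[str],
                   parents: Dict[str, Set[str]] = PARENTS) -> Set[str]:
    """Return the given terms plus every ancestor reachable via parent links."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited through another path
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed


def depth(term: str, parents: Dict[str, Set[str]] = PARENTS) -> int:
    """Length of the longest path from the term up to a root term."""
    direct = parents.get(term, ())
    return 0 if not direct else 1 + max(depth(p, parents) for p in direct)


def labels_by_level(terms: Iterable[str],
                    parents: Dict[str, Set[str]] = PARENTS) -> Dict[int, Set[str]]:
    """Group the ancestor-expanded label set by level in the hierarchy."""
    levels: Dict[int, Set[str]] = {}
    for term in with_ancestors(terms, parents):
        levels.setdefault(depth(term, parents), set()).add(term)
    return levels


if __name__ == "__main__":
    # A hypothetical protein annotated with two leaf terms.
    print(labels_by_level({"protein_binding", "hydrolase_activity"}))
```

Grouping labels by level in this way makes it possible to report accuracy separately for each level of the hierarchy, which is one way an evaluation across levels could be organized.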