Monday, February 10, 2020
Efficiency of Clustering Algorithms for Mining Large Biological Databases: Research Paper
Clustering algorithms for biological sequence data are categorized into partitioning, hierarchical and graph-based techniques. The graph-based and hierarchical techniques are the most widely used of the three. Partitioning techniques are used in other disciplines but less often in gene sequence clustering, so there is no substantial theory on whether partitioning methods are efficient for this task. This study analyzes four clustering algorithms using four large protein sequence data sets. The analysis highlights the weaknesses of the four algorithms and proposes a new algorithm informed by those shortcomings.

Introduction

Today, there are more than one million protein sequences (Sasson et al., 2002), so bioinformatics needs ways to identify meaningful patterns in them in order to understand their functions. For a long time, protein and gene sequences have been analyzed, compared and grouped using alignment methods. According to Cai et al. (2000), alignment methods are algorithms that arrange RNA, DNA and protein sequences to detect similarities that may result from evolutionary, functional or structural relationships between the sequences. Mount (2002) asserts that sequences are compared and clustered using pair-wise alignment methods, which are of two types: global and local. The local alignment algorithm proposed by Smith and Waterman (Bolten et al., 2001) is used to identify amino acid patterns that have been conserved in protein sequences. The global alignment algorithm proposed by Needleman and Wunsch (Bolten et al., 2001) tries to align as many characters of the entire sequences as possible. Pair-wise alignment is, however, expensive when comparing and clustering a large protein data set.
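To make the difference between the two pair-wise alignment types concrete, the following is a minimal Python sketch, not taken from the paper. It scores a global alignment (in the style of Needleman and Wunsch) and a local alignment (in the style of Smith and Waterman) with the same dynamic-programming recurrence. The function names `alignment_score` and `all_pairs_scores` and the match, mismatch and gap values are illustrative assumptions; real protein comparisons would normally use a substitution matrix rather than a flat match score.

```python
from itertools import combinations


def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Score the best pair-wise alignment of sequences a and b.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    The scoring values are illustrative assumptions, not a real
    substitution matrix.
    """
    # prev holds the previous row of the dynamic-programming table.
    if local:
        prev = [0] * (len(b) + 1)
    else:
        prev = [j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    # Global: score of aligning both full sequences. Local: best cell seen.
    return best if local else prev[-1]


def all_pairs_scores(seqs, **kwargs):
    """Align every sequence against every other: n*(n-1)/2 alignments."""
    return {
        (i, j): alignment_score(seqs[i], seqs[j], **kwargs)
        for i, j in combinations(range(len(seqs)), 2)
    }
```

Calling `all_pairs_scores` on a list of n sequences performs n(n-1)/2 alignments, and each alignment fills a table with one cell per pair of residues, so the total work grows quickly with the size of the data set.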
The cost arises because every protein in a data set is compared with all the other proteins in that data set, so a very large number of comparisons is performed (Bolten et al., 2001). This calls into question the efficiency of pair-wise alignment methods for comparing and clustering large protein data sets. Neither local nor global pair-wise alignment takes the size of the data set into account, in particular data sets so large that they may overwhelm the computer's memory.

Han & Kamber (2000) argue that unsupervised learning aims to identify a sensible partition or natural pattern in a data set with the help of a distance function. Biology and the life sciences have extensively exploited clustering techniques in sequence analysis to classify similar sequences into protein or gene families (Galperin & Koonin, 2001). Currently, protein sequences can be grouped by similar patterns using various readily available sequencing and clustering methods. As mentioned earlier, these methods can be grouped as graph-based, partitioning and hierarchical methods. The graph-based and hierarchical methods in particular have been used consecutively or together to complement each other, as argued by Sasson et al. (2002), Sperisen & Pagni (2005), Essoussi & Fayech (2007) and Enright & Ouzounis (2000). In protein comparison and sequence clustering, there are very few instances in which partitioning techniques have been used. For instance, Guralnik & Karypis (2001) proposed an algorithm or sequencing method-on the