MiPred:

Classification of real and pseudo microRNA precursors using random forest prediction model with combined features


Introduction

Because existing experimental techniques make it difficult to detect miRNAs systematically across a genome, computational methods play an important role in miRNA identification. miRNA genes have been reported to be conserved in both primary sequence and secondary structure, so comparative genomics methods have been used to find novel miRNAs in specific animals and plants. Although these methods are valuable for predicting new miRNAs, they cannot identify novel miRNAs that lack known close homologs, whether because of limited data or because the miRNAs have diverged during evolution. Furthermore, the miRNAs of a species that has no closely related sequenced species cannot be studied with comparative genomics approaches. Ab initio methods for miRNA prediction are therefore in high demand. Almost all pre-miRNAs fold into a characteristic stem-loop hairpin structure, and this hairpin provides a key clue for the ab initio prediction of pre-miRNAs.

MiPred first decides whether the input RNA sequence is a pre-miRNA-like hairpin. If it is, the random forest (RF) classifier then predicts whether it is a real pre-miRNA or a pseudo one.
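The two-stage decision described above can be illustrated with a minimal Python sketch. It is not MiPred's actual code (which is a Perl script), and everything in it is an assumption made for illustration: the secondary structure is assumed to be given in dot-bracket notation, "pre-miRNA-like hairpin" is approximated as a single stem-loop with at least min_pairs base pairs (a hypothetical threshold), and rf_predict is a hypothetical stand-in for the trained random forest classifier.

```python
import re

def is_single_hairpin(structure, min_pairs=1):
    """Return True if a dot-bracket structure is one stem-loop.

    Assumption for illustration: a hairpin has every '(' before every ')',
    so no second stem opens after the first one starts closing.
    min_pairs is a hypothetical threshold, not a value taken from MiPred.
    """
    if not re.fullmatch(r"[.(]*[.)]*", structure):
        return False
    pairs = structure.count("(")
    return pairs == structure.count(")") and pairs >= min_pairs

def classify(sequence, structure, rf_predict, min_pairs=1):
    """Stage 1: hairpin filter. Stage 2: real vs pseudo pre-miRNA.

    rf_predict is a hypothetical callable standing in for the RF model;
    it takes (sequence, structure) and returns True for a real pre-miRNA.
    """
    if not is_single_hairpin(structure, min_pairs):
        return "not a pre-miRNA-like hairpin"
    if rf_predict(sequence, structure):
        return "real pre-miRNA"
    return "pseudo pre-miRNA"
```

For example, classify(seq, "((((...))))", model) reaches the classifier stage, while a structure with two separate stems such as "((..))((..))" is rejected in the first stage.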

 


Input

Users can enter an RNA sequence (uppercase or lowercase) in FASTA, GCG, GenBank or EMBL format. All characters other than the four nucleotide bases adenine (A), guanine (G), cytosine (C) and uracil (U) are removed from the sequence. Because dinucleotide shuffling (1000 shuffles) is time-consuming, a large batch of queries can take a long time, so we allow only three queries per run. Users with large batches have two alternatives: (a) request the source code of our method (a Perl script) and run it on their local computers; or (b) send their sequences to us at <jiangpeng1105@seu.edu.cn>, and we will run them on our computers and return the results by e-mail.
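The input rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the server's own Perl code: it handles FASTA input only, and the function names parse_fasta, clean_rna and prepare_queries are hypothetical. The cleaning rule (keep only A, C, G and U, case-insensitively) and the limit of three queries per run are taken from the description above.

```python
import re

MAX_QUERIES = 3  # the page allows only three queries in one run

def parse_fasta(text):
    """Split FASTA text into (header, raw_sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def clean_rna(raw):
    """Uppercase the sequence and drop every character except A, C, G, U."""
    return re.sub(r"[^ACGU]", "", raw.upper())

def prepare_queries(text):
    """Parse, enforce the query limit, and clean each sequence."""
    records = parse_fasta(text)
    if len(records) > MAX_QUERIES:
        raise ValueError("at most %d queries are allowed per run" % MAX_QUERIES)
    return [(header, clean_rna(seq)) for header, seq in records]
```

Note that under this rule a DNA-style T is dropped rather than converted to U, because the description lists only the four RNA bases as accepted characters.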

Output

After the analysis, the results are shown in a user-friendly format.

An output example:
