Coding Potential Calculator distinguishes protein-coding from non-coding RNAs based on the sequence features of the input transcripts. The previous version, CPC1, is widely used by researchers worldwide. To better serve the scientific community, we have updated CPC1 to CPC2, which discriminates coding from non-coding transcripts faster and more accurately. We provide an online version of CPC2 here.
CPC2 accepts RNA transcript sequences as input. Both fasta format and GTF/GFF/BED format are supported.
Size requirement: Fewer than 100000 lines in the input box; no line limit in batch mode. Maximum allowable upload file size is 50 Mb.
Name requirement: Each sequence name must begin with the ‘>’ symbol. Any ID characters after the first blank character are discarded in the results.
Sequence requirement: Only characters in DNA and RNA sequences are allowed.
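The name and sequence requirements above can be checked locally before submitting. The following is a minimal sketch, not the server's own validation: the function name `check_fasta` is hypothetical, and the allowed alphabet is assumed to be the IUPAC nucleotide codes, which may be narrower or wider than what the server accepts.

```python
import re

# Assumption: "characters in DNA and RNA sequences" is read here as the
# IUPAC nucleotide codes (A, C, G, T, U plus ambiguity codes such as N).
ALLOWED = re.compile(r"^[ACGTURYSWKMBDHVN]+$", re.IGNORECASE)
MAX_INPUT_BOX_LINES = 100000  # input-box limit stated above


def check_fasta(text):
    """Return (records, problems) for FASTA text meant for the input box."""
    lines = text.splitlines()
    problems = []
    if len(lines) >= MAX_INPUT_BOX_LINES:
        problems.append("input has %d lines; the limit is fewer than %d"
                        % (len(lines), MAX_INPUT_BOX_LINES))

    records = {}
    current_id = None
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                problems.append("line %d: empty sequence name" % number)
                current_id = None
                continue
            # Only the text before the first blank is kept as the ID.
            current_id = fields[0]
            records[current_id] = ""
        elif current_id is None:
            problems.append("line %d: sequence data without a valid '>' name line" % number)
        elif not ALLOWED.match(line):
            problems.append("line %d: characters outside the nucleotide alphabet" % number)
        else:
            records[current_id] += line
    return records, problems
```

An empty `problems` list means the text passed these particular checks; it does not guarantee that the server will accept the input.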
BED6, BED12, GTF, and GFF formats are supported.
Size requirement: Fewer than 50000 lines. Maximum allowable upload file size is 50 Mb.
GTF/GFF/BED files for the following genome assemblies are allowed:
Human (hg38), Human (hg19), Chimpanzee (panTro4), Mouse (mm10), Rat (rn6), Zebrafish (danRer7), Xenopus (xenTro3), Fruitfly (dm6)
NOTE: Input in BED format is processed more slowly. Ensure your BED file is NO larger than 50 Mb and 10,000 lines.
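The size limits above can be checked before uploading. This is a rough sketch under stated assumptions: "50 Mb" is read as 50 * 1024 * 1024 bytes, "less than 50000 lines" is read as a strict limit, and the helper names `check_limits` and `measure` are hypothetical.

```python
import os

# Assumptions: "50 Mb" means 50 * 1024 * 1024 bytes; the 50000-line limit
# is strict ("less than"), the 10,000-line BED limit is inclusive ("no larger").
MAX_BYTES = 50 * 1024 * 1024
MAX_LINES = 50000
MAX_BED_LINES = 10000


def check_limits(n_bytes, n_lines, fmt):
    """Return a list of limit violations for a GTF/GFF/BED upload."""
    problems = []
    if n_bytes > MAX_BYTES:
        problems.append("file is larger than 50 Mb")
    if fmt.lower() == "bed":
        if n_lines > MAX_BED_LINES:
            problems.append("BED file has more than 10,000 lines")
    elif n_lines >= MAX_LINES:
        problems.append("file does not have fewer than 50,000 lines")
    return problems


def measure(path):
    """Return (size in bytes, number of lines) for a file on disk."""
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return os.path.getsize(path), n_lines
```

For a local file, `check_limits(*measure(path), "gtf")` combines the two helpers.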
There are 2 ways to submit lncRNA sequences:
Paste lncRNA sequences into the large input box on the home page.
Upload a FASTA file through the batch operation.
Also check the reverse complement strand:
Coding potential of both positive strand RNA and negative strand RNA will be assessed.
NOTE: The reverse complement strand check is NOT part of the CPC 2 algorithm. We make the service available here to provide additional information for users. Switching on this option might slow processing a little.
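For readers who want to see which sequence is assessed as the negative strand, the reverse complement can be computed as in the sketch below. The function name is illustrative, and writing the output in the DNA alphabet (U is complemented to A, and A to T) is an arbitrary choice for this example, not a statement about how the server does it.

```python
# Complement table covering DNA, RNA and IUPAC ambiguity codes.
# U is treated like T; the result uses the DNA alphabet.
_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence for DNA input; for RNA input, U comes back as T.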
The CPC 2 results HTML view gives an overview of the coding status of the input sequences. Each row corresponds to one input sequence. The columns show the sequence ID, the coding/noncoding classification label, the coding probability (the "distance" to the SVM classification hyperplane in the feature space), the scores of three features (putative peptide length, Fickett TESTCODE score, putative isoelectric point), the ORF integrity, and the "Details" link (as described later).
The raw data of the CPC 2 output can be downloaded by clicking the "Download the result" button.
The raw data contains 9 tab-separated columns, and each line holds the result for one input sequence. For example:
The columns show the sequence ID, putative peptide length, Fickett score, isoelectric point, ORF start position, the integrity of the ORF, coding probability, and the coding/noncoding classification label.
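A downloaded result file can be read with a few lines of Python. The sketch below is only a starting point: the field names are hypothetical labels derived from the column description, and because that description names fewer columns than the stated total, the list should be adjusted to match the actual file. Skipping lines that start with `#` is also an assumption about a possible header line.

```python
# Hypothetical field names based on the column description; adjust them to
# the real file. Unnamed trailing fields are kept under "extra".
COLUMNS = [
    "id", "peptide_length", "fickett_score", "isoelectric_point",
    "orf_start", "orf_integrity", "coding_probability", "label",
]


def parse_results(text, columns=COLUMNS):
    """Parse tab-separated result lines into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and an assumed '#' header line
        fields = line.split("\t")
        row = dict(zip(columns, fields))
        row["extra"] = fields[len(columns):]
        rows.append(row)
    return rows
```

All values are left as strings so that no assumption is made about number formatting in the file.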
The results details consist of a summary paragraph, a graphical view of where your sequence falls in each feature distribution, and additional evidence for coding potential.
We provide a graphical view of the three features calculated by CPC 2 and their distributions in known coding and noncoding RNAs.
The colored background areas of the three images depict the overall distribution of each feature in coding and noncoding RNAs. The blue area refers to features calculated from known noncoding transcripts in the Ensembl database. The yellow area refers to features calculated from mRNAs in the RefSeq database with SwissProt annotation.
Move the mouse along the edge of an area to see the distribution frequency within a specific interval. Click the “Noncoding/Coding” button above the graph to show or hide the distribution area of noncoding/coding RNAs, as the screenshot shows.
The black column shows the feature value of your input sequence and its position within the background distribution. Whether it points to the coding or the noncoding distribution depends on the classification given by CPC 2.
You can download the graph by clicking on the right button.
To further help users assess the coding ability of input RNA sequences, we provide BLAST searches against SwissProt, RNAdb, and lncRNAdb as a supplementary module of the CPC 2 web server.
Parameters for SwissProt:
blastx -evalue 1e-10 -ungapped -num_threads 2 -outfmt 6 -comp_based_stats F -query seq.fasta -db swissprot -out output_swissprot
Parameters for RNAdb and lncRNAdb:
blastn -num_threads 2 -outfmt 6 -query seq.fasta -db rnadb.fa -out output_rnadb
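To reproduce these searches locally, the two command lines above can be assembled and run from Python. This is a sketch that assumes NCBI BLAST+ is installed and on the PATH and that the databases have been formatted locally; the function names, default database names, and file names are placeholders taken from or modeled on the commands above.

```python
import subprocess


def blastx_swissprot_args(query, out, db="swissprot", threads=2):
    """Argument list mirroring the SwissProt command shown above."""
    return ["blastx", "-evalue", "1e-10", "-ungapped",
            "-num_threads", str(threads), "-outfmt", "6",
            "-comp_based_stats", "F",
            "-query", query, "-db", db, "-out", out]


def blastn_rna_args(query, out, db="rnadb.fa", threads=2):
    """Argument list mirroring the RNAdb/lncRNAdb command shown above."""
    return ["blastn", "-num_threads", str(threads), "-outfmt", "6",
            "-query", query, "-db", db, "-out", out]


def run(args):
    """Run a BLAST command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(args, check=True)
```

For example, `run(blastx_swissprot_args("seq.fasta", "output_swissprot"))` runs the first command. Passing the arguments as a list avoids shell quoting problems with file names.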
Users can view CPC 2 evidence results in raw data format and in HTML format.
We provide a graphical view of the predicted peptide and a summary of the BLAST results. This page shows the positions of the putative ORF and the BLAST HSPs (high-scoring segment pairs).
Hovering over the ORF and BLAST HSP color bars shows details of the predicted ORF and the BLAST hits, such as the e-value of each hit.
BLAST results in tabular format:
1. | qseqid | query (e.g., gene) sequence id |
2. | sseqid | subject (e.g., reference genome) sequence id |
3. | pident | percentage of identical matches |
4. | length | alignment length |
5. | mismatch | number of mismatches |
6. | gapopen | number of gap openings |
7. | qstart | start of alignment in query |
8. | qend | end of alignment in query |
9. | sstart | start of alignment in subject |
10. | send | end of alignment in subject |
11. | evalue | expect value |
12. | bitscore | bit score |
13. | database | the database searched for this query |
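The thirteen columns listed above can be loaded into typed records for filtering or sorting. The sketch below assumes the fields are tab-separated, as in standard BLAST tabular output (`-outfmt 6`), with the database name as the thirteenth field; the names `Hit`, `parse_hits`, and `best_hit` are illustrative only.

```python
from collections import namedtuple

Hit = namedtuple("Hit", [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "database",
])


def parse_hits(text):
    """Parse 13-column tab-separated BLAST lines into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        f = line.split("\t")
        if len(f) != 13:
            raise ValueError("expected 13 fields, got %d" % len(f))
        hits.append(Hit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]), f[12],
        ))
    return hits


def best_hit(hits):
    """Return the hit with the lowest e-value, or None if there are no hits."""
    return min(hits, key=lambda h: h.evalue) if hits else None
```

Converting the numeric columns up front means hits can be compared by e-value or bit score directly, without string comparisons.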
When a user submits input sequences to the CPC2 web server, the server assigns a unique Task ID to the request. After the calculation finishes, users can go to the Batch Page and use the Task ID to retrieve their results from the CPC2 server.
In the updated CPC2, we employ a novel discriminative model based on intrinsic sequence features, which effectively addresses two issues, speed and accuracy: CPC2 runs two orders of magnitude faster than CPC1 and is the fastest among the popular tools, and it is also more accurate than CPC1. Last but not least, the updated model in CPC2, like that in CPC1, is species-neutral, making it feasible for the ever-growing number of non-model-organism transcriptomes.