PANNZER (Protein ANNotation with Z-scoRE) is a fully automated service for functional annotation of prokaryotic and eukaryotic proteins of unknown function. The tool is designed to predict the functional description (DE) and GO classes. PANNZER2 processes bacterial proteomes in minutes and eukaryotic proteomes in an hour.

Contact

If you have any questions about the PANNZER program, please do not hesitate to ask Petri Toronen or Liisa Holm: firstname.lastname(at)helsinki.fi


STEP 1 - Enter your input protein sequence(s).

Paste protein sequences in FASTA format:


or upload a FASTA file:

STEP 2 - Enter the scientific name of the organism:

Start typing, then select from list of alternatives. If your organism is not in the list, select a close relative.

STEP 3 - Select interactive or batch processing:

Interactive mode shows results for the first 10 sequences. Use the batch queue for larger input sets up to 100,000 sequences. You may leave an e-mail address for notification when the batch job has finished.

Interactive
Batch queue. E-mail address:

STEP 4 - Submit your job

The results will appear in a new window.
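
The choice between interactive and batch processing in STEP 3 depends on how many sequences the input contains. The following is a minimal sketch, not part of PANNZER, that counts FASTA records and suggests a mode. The function names `count_fasta_records` and `suggest_mode` are hypothetical; the limits of 10 and 100,000 sequences are taken from the text above.

```python
import io

INTERACTIVE_LIMIT = 10      # interactive mode shows results for the first 10 sequences
BATCH_LIMIT = 100000        # upper bound stated for the batch queue


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def suggest_mode(n_sequences):
    """Suggest a processing mode for a given number of sequences."""
    if n_sequences <= INTERACTIVE_LIMIT:
        return "interactive"
    if n_sequences <= BATCH_LIMIT:
        return "batch"
    return "split input"


# Small in-memory example; a real file would be opened with open(path).
example = io.StringIO(">seq1\nMKT\n>seq2\nMAA\n")
n = count_fasta_records(example)
print(n, suggest_mode(n))
```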

PANNZER2 is able to perform high-throughput annotation of ten thousand sequences per hour. It uses the SANSparallel server to search for homologous proteins in the Uniprot database.


The PANNZER2 stand-alone version is now available! To download the PANNZER2 program package, click here.

System requirements

  1. Linux OS
  2. Python (https://www.python.org), modules:
    • numpy
    • scipy
    • fastcluster
    • requests
    • Standard modules:
      • ConfigParser
      • getopt
      • operator
      • os
      • math
      • random
      • socket
      • re
      • sys
      • SocketServer
      • threading
      • signal
      • time
      • argparse
  3. Perl
  4. Internet connection
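
Before installing, it may be useful to check that the third-party Python modules listed above can be imported. This is a minimal sketch under that assumption; the helper name `missing_modules` is hypothetical and not part of the distribution package. It uses only `__import__`, so it should behave the same under whichever Python interpreter you intend to run the package with.

```python
# Third-party modules named in the requirements list above.
REQUIRED_MODULES = ["numpy", "scipy", "fastcluster", "requests"]


def missing_modules(names):
    """Return the module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required third-party modules were found.")
```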

Installation

  1. Download SANSPANZ2.tar.gz to /home/you
  2. cd /home/you
  3. tar -zxvf SANSPANZ2.tar.gz

Testing

  1. cd /home/you/SANSPANZ2
  2. python runsanspanz.py -R -o ",DE.out,GO.out,anno.out" -s "Macaca mulatta" < testdata/querysequences.fasta
  3. perl anno2html.pl "Test result" < anno.out > anno.html
The python command (step 2) calls network servers for sequence search and the GO database. It writes description prediction details to DE.out, GO prediction details to GO.out, and the annotation summary to anno.out. The perl command (step 3) converts the annotation summary to colour-coded nested HTML tables. More test cases can be found in examples.csh and reference outputs in the testresults folder.

Here are re-annotation results for 66 reference proteomes from Ensembl.

Figure 1: Overview of SANSPANZ (combination of SANSparallel and PANNZER).

The web server is the recommended way of using PANNZER. Standalone scripts (Download tab) are provided for the sake of transparency. Pannzer2 (this version) is a thousand times faster and more user-friendly than Pannzer1.

Interface | Ease of use | Speed
Pannzer2 web server | ++ | +(+)
Pannzer2 standalone with remote databases | + | +
Pannzer2 standalone with local databases | - | ++
Pannzer1 standalone | -- | --
Pannzer1 at CSC (discontinued) | + | -
Pannzer2 at CSC (integrated to Chipster) | + | +

Web server

The query form and results are self-explanatory. Inputs are query sequences in FASTA format, the scientific name of the organism from which the query sequences come, and optionally an email address for notification when a batch job finishes. The web server's submission form is under the Annotate tab.

Stand-alone script

The script is available from the Download tab. Inputs for the examples of running runsanspanz.py below are in the testdata/ subdirectory of the distribution package and outputs can be found in the testresults/ subdirectory.

Simple usage with remote databases

The distribution package includes a simple script, runsanspanz.py, that can perform specific tasks. Help text is printed when the script is run without arguments. The script accesses remote sequence search and GO database servers, so no databases or servers need to be installed locally. The most frequently used command-line options are listed below.

Option | Argument | Default
-f | input format, either FASTA or tab | FASTA
-i | input file | STDIN
-m | method | Pannzer
-o | output files (csv) | STDOUT,method.out_1,method.out_2,...
-s | species | automatically parsed from Uniprot style header
-R | [a flag to send server requests to remote servers] | use local servers
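
When runsanspanz.py is driven from another Python program, the options listed above can be assembled into an argument list. The following is a minimal sketch; the helper `build_command` and its keyword names are hypothetical, and only the flags -R, -m, -s, -i, -f and -o from the list above are used.

```python
def build_command(method="Pannzer", species=None, input_file=None,
                  input_format=None, outputs=None, remote=True,
                  script="runsanspanz.py"):
    """Assemble an argument list for runsanspanz.py from the listed options."""
    cmd = ["python", script]
    if remote:
        cmd.append("-R")                       # send requests to remote servers
    cmd += ["-m", method]
    if species:
        cmd += ["-s", species]
    if input_file:
        cmd += ["-i", input_file]
    if input_format:
        cmd += ["-f", input_format]
    if outputs:
        cmd += ["-o", ",".join(outputs)]       # -o takes a comma-separated list
    return cmd


cmd = build_command(species="Macaca mulatta",
                    input_file="testdata/querysequences.fasta",
                    outputs=["", "DE.out", "GO.out", "anno.out"])
print(cmd)
```

The list can be passed to `subprocess.run(cmd, check=True)` from the SANSPANZ2 directory. Passing arguments as a list avoids shell quoting problems with species names that contain spaces.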

This example uses default options for functional annotation of macaque proteins using Pannzer:

python runsanspanz.py -R -s "Macaca mulatta" < testdata/querysequences.fasta
perl anno2html.pl < Pannzer.out_3 > predictions.html
The output is written to STDOUT and files named Pannzer.out_1, Pannzer.out_2, Pannzer.out_3. Pannzer.out_1 contains details of the description (DE) prediction. Pannzer.out_2 contains details of the GO prediction. Pannzer.out_3 is a summary of all predicted annotations, which is converted to HTML by the anno2html.pl script.

Here, we direct outputs to files with descriptive names:

python runsanspanz.py -R -m Pannzer -s "Macaca mulatta" -i testdata/querysequences.fasta -o ",DE.out,GO.out,anno.out"
You can store the sequence search result from SANSparallel in a tabular format for later analysis. Below, the second command runs Pannzer using inputs in tabular format.
python runsanspanz.py -R -m SANS -s "Macaca mulatta" -i testdata/querysequences.fasta -o sans.tab
python runsanspanz.py -R -m Pannzer -i sans.tab -f tab -o ",DE.out1,GO.out1,anno.out1"
Pannzer2 reports predictions for descriptions (DE) and for GO terms using four predictors (RM3, ARGOT, JAC, HYGE). You can restrict the number of predictors used. ARGOT has been one of the top predictors in our benchmarks.
python runsanspanz.py -R -m Pannzer -i sans.tab -f tab -o ",DE.out2,GO.out2,anno.out2" --PANZ_PREDICTOR "DE,ARGOT"
Pannzer2 applies strict filtering criteria to the sequence neighborhood, including default query and sbjct coverage of 70 %. The flag --PANZ_FILTER_PERMISSIVE relaxes the coverage criteria so that 70 % coverage is required of query or sbjct, but not both. This can result in more sequences getting predicted annotations (at the expense of accuracy).
python runsanspanz.py -R --PANZ_FILTER_PERMISSIVE -m Pannzer -i sans.tab -f tab -o ",DE.out3,GO.out3,anno.out3"
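
The difference between the strict and the permissive coverage criteria can be illustrated as follows. This is a sketch of the rule as described above, not Pannzer2's actual implementation; the function name `passes_coverage`, the use of fractions rather than percentages, and the inclusive comparison (`>=`) are assumptions.

```python
def passes_coverage(query_cov, sbjct_cov, threshold=0.70, permissive=False):
    """Apply the coverage rule to one hit.

    Strict mode requires both query and sbjct coverage to reach the
    threshold; permissive mode requires only one of them.
    """
    if permissive:
        return query_cov >= threshold or sbjct_cov >= threshold
    return query_cov >= threshold and sbjct_cov >= threshold


# A hit covering 90 % of the query but only 40 % of the sbjct sequence:
print(passes_coverage(0.90, 0.40))                   # strict rule
print(passes_coverage(0.90, 0.40, permissive=True))  # permissive rule
```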
BLAST is more sensitive than SANSparallel at detecting distantly related proteins, but these are removed by Pannzer2 when filtering the sequence neighborhood, and BLAST is orders of magnitude slower than SANSparallel. If anyone desperately wants to use BLAST, we provide a script that converts BLAST output to our tabular format. The database must be Uniprot. Edit the variable $BLAST_EXE on line 11 of runblast.pl to correspond to your local environment.
perl runblast.pl testdata/querysequences.fasta testdata/uniprot.fasta 40 > blast.tab
python runsanspanz.py -R -m BestInformativeHit -i testdata/blast.tab -f tab -o ',--' 2> err
python runsanspanz.py -R -m Pannzer -i testdata/blast.tab -f tab -o ',DE.out4,GO.out4,anno.out4'
Pannzer expects protein sequences as input. We provide a utility script that extracts ORFs (longer than 80 aa) from nucleotide sequences:
perl longestorf.pl 80 < nucleotidesequences.fa > orfs.fasta
python runsanspanz.py -R -m Pannzer -i orfs.fasta -f FASTA --PANZ_PREDICTOR ARGOT -o ',,,orfs.anno'

Parameter configuration

Parameters can be changed using command line options, a configuration file, or within a Python script.

Command line options are shown by

python runsanspanz.py -h
A configuration file can be specified on the command line:
python runsanspanz.py -a myparameters.cfg
The configuration file need only define values for a subset of parameters. Other parameters will remain at their default values.

The current configuration can be written out to a file for later reuse or editing:

import Parameters
glob=Parameters.WorkSpace()
glob.writeConfigFile('myparameters.cfg')
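
The idea that a configuration file overrides only the parameters it defines can be sketched with the standard library. The file format shown here is an assumption: the ConfigParser entry in the requirements suggests an INI-style file, but the section name `parameters`, the default values and the helper `load_parameters` are hypothetical and are not taken from the SANSPANZ code. Inspect a file written by writeConfigFile for the real layout.

```python
import configparser

# Illustrative defaults only; the real defaults are defined by SANSPANZ.
DEFAULTS = {"CONN_HOSTNAME": "localhost", "CONN_PORTNO": "50002"}


def load_parameters(cfg_text, section="parameters"):
    """Merge values from an INI-style text over the defaults."""
    parser = configparser.ConfigParser()
    parser.optionxform = str          # keep parameter names case-sensitive
    parser.read_string(cfg_text)
    params = dict(DEFAULTS)
    if parser.has_section(section):
        params.update(parser.items(section))
    return params


# Only CONN_PORTNO is defined, so CONN_HOSTNAME keeps its default value.
print(load_parameters("[parameters]\nCONN_PORTNO = 50010\n"))
```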

Local database servers

SANSPANZ predicts functional descriptions and Gene Ontology (GO) terms using summary statistics calculated from the sequence neighborhood of the query sequence and database background frequencies. This leads to the complex architecture and web of dependencies depicted in Figure 1. There are two client-server interactions. SANSPANZ is a client sending requests to the SANSparallel and DictServer servers.

Server | Purpose | Parameters
SANSPANZ/DictServer.py [OPTIONS] | Server providing database statistics and GO associations when SANSPANZ is run locally. | CONN_HOSTNAME, CONN_PORTNO
SANSparallel server | Server providing sequence similarity search results when SANSPANZ is run locally. | CONN_SANSHOST, CONN_SANSPORT

We run both servers on localhost: the SANSparallel server at port 54321 and DictServer at port 50002. The servers are started as follows:

module add openmpi-x86_64
nohup mpirun -np 17 -output-filename /data/uniprot/u SANSparallel.2/server /data/uniprot/uniprot 54321 uniprot.Dec2015 &
nohup python ./SANSPANZ/DictServer.py --CONN_PORTNO 50002 &
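
After starting the servers, a quick way to confirm that something is listening on the two ports is to attempt a TCP connection. This is a minimal sketch, assuming the host and port numbers from the commands above; the helper `port_is_open` is hypothetical, and the check only shows that a port accepts connections, not that the server behind it answers queries correctly.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Port numbers as used in the start-up commands above.
for name, port in [("SANSparallel", 54321), ("DictServer", 50002)]:
    status = "accepting connections" if port_is_open("localhost", port) else "not reachable"
    print(name, port, status)
```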

Updating local databases

It is imperative that DictServer tables refer to the same database release that is searched by SANSparallel. Therefore, use both remote SANSparallel and DictServer servers, or install both locally.

We mirror Uniprot and GO-associations from EMBL-EBI:

rsync -u -a "rsync.ebi.ac.uk::pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz \
pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz \
pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz" /data/uniprot
gunzip gene_association.goa_uniprot.gz
Taxonomic distance is a component of Pannzer's regression models. The taxonomy table contains the lineage of all sequences in Uniprot.
cd /data/uniprot
curl "http://www.uniprot.org/taxonomy?query=*&compress=yes&format=tab" > taxonomy-all.tab.gz
gunzip taxonomy-all.tab.gz
The OBO file is part of the Gene Ontology (GO) and is required for GO prediction.
wget http://www.berkeleybop.org/ontologies/go/go-basic.obo
GO assignments are taken from the GOA file:
# GO dictionaries
rm -f goa_uniprot_all.gaf
wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz
gunzip goa_uniprot_all.gaf.gz
We include inverted ec2go and kegg2go mappings in GO prediction outputs:
rm -f ec2go kegg2go
wget http://geneontology.org/external2go/ec2go
perl external2go.pl < ec2go > ec2go.tab
wget http://geneontology.org/external2go/kegg2go
perl external2go.pl < kegg2go > kegg2go.tab
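
Inverting an external2go mapping means turning "external identifier to GO term" lines into a lookup from GO identifier to external identifiers. The following is a sketch of that idea, assuming the usual external2go line layout (`EC:1.1.1.1 > GO:term name ; GO:0004022`, with comment lines starting with `!`). The function name `invert_external2go` is hypothetical, and the output format of external2go.pl is not described here, so this sketch returns a dictionary rather than reproducing that script's table.

```python
from collections import defaultdict


def invert_external2go(lines):
    """Map each GO identifier to the external identifiers that point to it."""
    go_to_external = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):      # skip blanks and comments
            continue
        left, sep, right = line.partition(" > ")
        if not sep:
            continue                              # not a mapping line
        go_id = right.rsplit(" ; ", 1)[-1].strip()
        if go_id.startswith("GO:"):
            go_to_external[go_id].append(left.strip())
    return dict(go_to_external)


# Illustrative input lines in external2go layout.
sample = [
    "!comment line",
    "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022",
    "EC:1.1.1.2 > GO:alcohol dehydrogenase (NADP+) activity ; GO:0008106",
]
print(invert_external2go(sample))
```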
Uniprot is indexed for the SANSparallel server using scripts that are part of the SANSparallel distribution package:
perl /home/luholm/sansserver/saisformatdb.pl uniprot /data/uniprot/uniprot_sprot.fasta.gz /data/uniprot/uniprot_trembl.fasta.gz
Database statistics and dictionary tables (for DictServer) are composed using utility scripts/programs in the SANSPANZ/uniprot folder:
# update counts
cd SANSPANZ/
perl -pe 's/^\S+\s+//'  /data/uniprot/uniprot.phr | perl -pe 's/ \w{2}=.*$//' | python runsanspanz.py -m Cleandesc -f tab -c 'desc' | cut -f 2 > x
sort x | uniq -c > /data/uniprot/uniprot.desc.uc.counts
perl -pe 's/ +/\n/g' x | sort | uniq -c > /data/uniprot/uniprot.word.uc.counts
perl sumcol.pl 1 < /data/uniprot/uniprot.desc.uc.counts > /data/uniprot/nprot
perl sumcol.pl 1 < /data/uniprot/uniprot.word.uc.counts > /data/uniprot/nwordtotal
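
The counting steps above (`sort | uniq -c` on whole descriptions and on single words, followed by column sums) can be expressed compactly in Python. This is an illustrative sketch operating on an in-memory list of already cleaned descriptions; the function name `description_statistics` is hypothetical, and it assumes that sumcol.pl 1 sums the count column.

```python
from collections import Counter


def description_statistics(descriptions):
    """Return description counts, word counts and their totals."""
    desc_counts = Counter(descriptions)                  # like: sort x | uniq -c
    word_counts = Counter(word
                          for desc in descriptions
                          for word in desc.split())      # one word per "line"
    nprot = sum(desc_counts.values())                    # total descriptions
    nwordtotal = sum(word_counts.values())               # total words
    return desc_counts, word_counts, nprot, nwordtotal


cleaned = ["PROTEIN KINASE", "PROTEIN KINASE", "ABC TRANSPORTER"]
desc_counts, word_counts, nprot, nwordtotal = description_statistics(cleaned)
print(desc_counts.most_common(1), nprot, nwordtotal)
```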
The following procedure computes the weight (information content) of each node given the GO structure in the obo file (go-basic.obo) and the counts of GO assignments in the database (Uniprot). We use all evidence codes to get more representative counts, and add one pseudocount to every GO term so that they have a defined weight.
# GO hierarchy and information content of GO terms
python runsanspanz.py -m obo -i /data/uniprot/go-basic.obo -o 'obo.tab,'
cut -f 2,5 /data/uniprot/goa_uniprot_all.gaf | python runsanspanz.py -m gaf2propagated -f tab -c "qpid goid" -o ",/data/uniprot/godict.txt" 2> err
cut -f 2 /data/uniprot/godict.txt | sort | uniq -c | perl -pe 's/^\s+//' | perl -pe 's/ /\t/' > godict_counts
python runsanspanz.py -m BayesIC -i godict_counts -f tab -c "qpid propagated" --eval_OBOTAB obo.tab -o ",obo_with_ic.tab" 
# GO propagated parent list using gorelatives
grep '^id: GO:' /data/uniprot/go-basic.obo | perl -pe 's/id: GO://' > /data/uniprot/go.list
./gorelatives -b go-basic.obo -q /data/uniprot/go.list -r isa,partof,altid -d parents -l > /data/uniprot/go_data
grep ^UniProtKB /data/uniprot/goa_uniprot_all.gaf | cut -f 2,5 | perl generate_godict.pl /data/uniprot/go_data obo_with_ic.tab ec2go.tab kegg2go.tab > /data/uniprot/mergeGO.out
DictServer reads mergeGO.out (glob.param['DATA_GOIDELIC']) and godict.txt (glob.param['DATA_GODICT']).
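
The two ideas behind the procedure above, propagating GO assignments to ancestor terms and deriving a weight from pseudocounted frequencies, can be sketched on a toy hierarchy. This is a simplified illustration: it computes a plain information content, -log2 of the pseudocounted term frequency, and not the conditional weighting performed by the BayesIC method. The term names, the function names and the exact pseudocount formula are assumptions.

```python
import math
from collections import Counter


def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def propagated_counts(annotations, parents):
    """Count proteins per term after propagating assignments to ancestors."""
    counts = Counter()
    for terms in annotations.values():
        propagated = set()
        for term in terms:
            propagated |= ancestors(term, parents)
        counts.update(propagated)
    return counts


def information_content(counts, all_terms, n_proteins):
    """IC = -log2((count + 1) / (n_proteins + 1)); one pseudocount per term."""
    return {term: -math.log2((counts.get(term, 0) + 1) / (n_proteins + 1))
            for term in all_terms}


# Toy hierarchy: B is a child of A, A is a child of root, C is unused.
parents = {"A": ["root"], "B": ["A"], "C": ["root"]}
annotations = {"p1": ["B"], "p2": ["A"], "p3": ["root"]}
counts = propagated_counts(annotations, parents)
print(information_content(counts, ["root", "A", "B", "C"], len(annotations)))
```

Thanks to the pseudocount, the unused term C still receives a finite weight.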

The species list is consulted by a CGI script, which is part of the web server, to give hints of the scientific names matching the prefix entered by the user:

cut -f 3 /data/uniprot/taxonomy-all.tab > specieslist
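
A prefix lookup of this kind can be done efficiently on a sorted list with binary search. The following is a minimal sketch, not the CGI script used by the web server; the function name `species_hints` is hypothetical, and matching is case-sensitive here for simplicity.

```python
import bisect


def species_hints(sorted_names, prefix, limit=10):
    """Return up to `limit` names from a sorted list that start with prefix."""
    start = bisect.bisect_left(sorted_names, prefix)   # first candidate
    hints = []
    for i in range(start, len(sorted_names)):
        name = sorted_names[i]
        if not name.startswith(prefix) or len(hints) >= limit:
            break
        hints.append(name)
    return hints


# In practice the names would be read from the specieslist file and sorted.
names = sorted(["Homo sapiens", "Macaca fascicularis",
                "Macaca mulatta", "Mus musculus"])
print(species_hints(names, "Macaca"))
```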

Evaluation of GO predictions

The runsanspanz.py script can be used to evaluate the correctness of GO predictions using metrics similar to those of the CAFA competition. The evaluation metrics are
(1) F = 2 * pr * re / (pr + re) = 2 * TP / (T + P)

(2) J = TP / (T + P - TP)

(3) S = 1/ne * sqrt(ru^2 + mi^2) = 1/ne * sqrt( (wT-wTP)^2 + (wP-wTP)^2 )

(4) wF = 2 * wTP / (wT + wP)

(5) wJ = wTP / (wT + wP - wTP)
where pr is precision, re is recall, ru is remaining uncertainty and mi is misinformation. T, P and TP refer to [the cardinality of sets of] nodes in the GO hierarchy. T are the GO terms associated with a query protein in the reference of truth. P are the predicted GO terms above a score threshold tau. TP is the intersection of T and P ("true positives"). T, P and TP give a unit weight to each node. wT, wP and wTP weigh the nodes by their information content conditional on the parent nodes in the GO hierarchy being also true/predicted (Clark and Radivojac 2013). Fmax, wFmax, Jmax, wJmax and Smin are the optimum values reached at any score threshold. Smin is normalized by the number of proteins with predictions, ne.
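
The set-based metrics can be illustrated for a single protein. This is a minimal sketch of formulas (1), (2), (4) and (5), not the GOevaluation implementation. The function names are hypothetical, the weighted variant simply sums a per-term weight rather than using conditional information content, the threshold comparison is taken as inclusive, and aggregation over many proteins is left out.

```python
def go_metrics(truth, predicted, weight=None):
    """Compute F, J, ru and mi for one protein.

    With weight=None every term has unit weight (F and J); with a
    dict of per-term weights the same formulas give wF and wJ.
    """
    def w(term):
        return 1.0 if weight is None else weight[term]

    wT = sum(w(t) for t in truth)
    wP = sum(w(t) for t in predicted)
    wTP = sum(w(t) for t in truth & predicted)
    f = 2 * wTP / (wT + wP) if wT + wP > 0 else 0.0
    j = wTP / (wT + wP - wTP) if wT + wP - wTP > 0 else 0.0
    return {"F": f, "J": j, "ru": wT - wTP, "mi": wP - wTP}


def f_max(truth, scores, thresholds):
    """Best F over a list of score thresholds tau."""
    best = 0.0
    for tau in thresholds:
        predicted = {term for term, s in scores.items() if s >= tau}
        best = max(best, go_metrics(truth, predicted)["F"])
    return best


truth = {"a", "b", "c"}
predicted = {"b", "c", "d"}
print(go_metrics(truth, predicted))
print(f_max(truth, {"a": 0.1, "b": 0.9, "c": 0.8, "d": 0.7}, [0.1, 0.5, 0.75, 0.85]))
```

In the first call TP = 2 and T = P = 3, so F = 2 * 2 / (3 + 3) and J = 2 / (3 + 3 - 2).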

Predicted GO annotations are compared to a reference of truth. Below, the truth is extracted from the GOA file based on experimental evidence. Your actual test set is probably a subset of this.

cut -f 2,5,7 /data/uniprot/goa_uniprot_all.gaf | egrep 'EXP|IDA|IPI|IMP|IGI|IEP' | python runsanspanz.py -R -m gaf2propagated -f tab -c "qpid goid evidenceCode" -o ",goa_truth_propagated"
Generate predictions using the Pannzer method. Assuming that predictions and the reference of truth have matching protein accession numbers, evaluation metrics are output by the GOevaluation method:
python runsanspanz.py -R -m gaf2propagated -f tab -i testdata/gotest_truth.gaf -o ",gotest_truth_propagated" -c "qpid goid"
python runsanspanz.py -R -m GOevaluation -f tab -o ',,eval1' -i testdata/GO.out --eval_SCOREFUNCTIONS "RM3_PPV ARGOT_PPV JAC_PPV HYGE_PPV" --eval_TRUTH testdata/gotest_truth_propagated
As in CAFA, we provide a naive method that uses the base frequency of GO terms as the probability of prediction:
python runsanspanz.py -R -m naive -i testdata/target_identifiers.list -o ',naive_predictions.out' -c 'qpid' -f tab
python runsanspanz.py -R -m GOevaluation -i testdata/naive_predictions.out -f tab -o ',,eval2' --eval_SCOREFUNCTIONS "frequency" --eval_TRUTH testdata/gotest_truth_propagated  2> err
paste eval1 eval2

Evaluation of DE predictions

DSM is a description similarity measure, defined as the cosine similarity of TF-IDF weighted word vectors. It can be used to compare DE predictions by Pannzer to a reference of truth (e.g., the original descriptions). The following example first runs Pannzer to generate DE predictions, then prepares a reference of truth, and finally runs the evaluation:
python runsanspanz.py -R -m Pannzer --PANZ_PREDICTOR DE -i human_subset.fasta -f FASTA -o 'human_subset.sans,DE.out,,' 
grep '^>' human_subset.fasta| perl -pe 's/ /\t/' | perl -pe 's/^>//' | python runsanspanz.py -R -m wordweights -c "qpid desc" -f tab -k 1000 > human_subset_de.tab
python runsanspanz.py -R -m DE_evaluation -i DE.out -f tab --eval_TRUTH human_subset_de.tab -o ",DE.eval" 
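
The principle of DSM, cosine similarity between TF-IDF weighted word vectors, can be sketched as follows. This is an illustration, not the DE_evaluation code: the exact weighting and tokenization used by Pannzer are not given here, so raw term frequency, an IDF of log(N/df), upper-casing and whitespace tokenization are all assumptions, as are the function names and the toy document frequencies.

```python
import math
from collections import Counter


def tfidf_vector(description, doc_freq, n_docs):
    """Weight each word by term frequency times log(n_docs / document frequency)."""
    tf = Counter(description.upper().split())
    return {word: count * math.log(n_docs / doc_freq.get(word, 1))
            for word, count in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[word] * v.get(word, 0.0) for word in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dsm(desc_a, desc_b, doc_freq, n_docs):
    """Description similarity of two description strings."""
    return cosine(tfidf_vector(desc_a, doc_freq, n_docs),
                  tfidf_vector(desc_b, doc_freq, n_docs))


# Toy document frequencies for a database of 1000 descriptions.
doc_freq = {"PROTEIN": 500, "KINASE": 50, "TRANSPORTER": 20, "ABC": 10}
print(dsm("protein kinase", "kinase", doc_freq, 1000))
print(dsm("protein kinase", "ABC transporter", doc_freq, 1000))
```

Because the common word PROTEIN has a low weight, "protein kinase" and "kinase" score close to 1, while descriptions without shared words score 0.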

Customization

Pannzer2 is implemented in the SANSPANZ framework. New functionalities are easily implemented by creating new operator classes.

PANNZER and SANS references:

  1. Koskinen P, Holm L (2012) SANS: High-throughput retrieval of protein sequences allowing 50 % mismatches. Bioinformatics 28, i438-i443
  2. Radivojac, P., Clark, W., Oron, T., Schnoes, A., Wittkop, A., Sokolov, A., Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., Pandey, G., Yunes, G., Talwalkar, A., Repo, S., Souza, M., Piovesan, D., Casadio, R., Wang, Z., Cheng, Z., Fang, H., Gough, J., Koskinen, P., Törönen, P., Nokso-Koivisto, J., Holm, L., Cozzetto, D., Buchan, D., Bryson, K., Jones, D., Limaye, B., Inamdar, H., Datta, A., Manjari, S., Joshi, R., Chitale, M., Kihara, D., Lisewski, A.M., Erdin, S., Venner, E., Lichtarge, O., Rentzsch, R., Yang, H., Romero, A., Bhat, P., Paccanaro, A., Hamp, T., Kassner, R., Seemayer, S., Vicedo, E., Schaefer, C., Achten, D., Auer, F., Boehm, A., Braun, T., Hecht, M., Heron, M., Honigschmid, P., Hopf, T., Kaufmann, S., Kiening, M., Krompass, D., Landerer, C., Mahlich, Y., Roos, M., Björne, J., Salakoski, T., Wong, A., Shatkay, H., Gatzmann, F., Sommer, I., Wass, M., Sternberg, M., Skunca, N., Supek, F., Bošnjak, M., Panov, P., Dzeroski, S., Šmuc, T., Kourmpetis, Y., van Dijk, A., ter Braak, C., Zhou, Y., Gong, Q., Dong, X., Tian, W., Falda, M., Fontana, P., Lavezzo, E., Di Camillo, B., Toppo, S., Lan, L., Djuric, N., Guo, Y., Vucetic, S., Bairoch, A., Linial, M., Babbitt, P., Brenner, S., Orengo, O., Rost, B., Mooney, S. & Friedberg, I. (2013) A large-scale evaluation of computational protein function prediction. Nature Methods 10, 221-227
  3. P Koskinen, P Törönen, J Nokso-Koivisto, L Holm (2015) PANNZER - High-throughput functional annotation of uncharacterized proteins in an error-prone environment. Bioinformatics 31 (10), 1544-1552
  4. Somervuo P, Holm L (2015) SANSparallel: interactive homology search against Uniprot. Nucl. Acids Res. 43, W24-W29

Genome sequencing projects that use PANNZER in functional annotation:

Eukaryotes

-Melitaea cinxia (Glanville fritillary butterfly) (publication)
-Pusa hispida saimensis (Saimaa ringed seal) [unpublished]
-Betula pendula (Silver birch) (publication)

Bacteria

-Pectobacterium wasabiae SCC3193 (publication)
-Dickeya solani (publication)
-Propionibacterium freudenreichii DSM 20271T (publication)
-Neorhizobium galegae (publication)
-Staphylococcus chromogenes [unpublished]
-Pectobacterium carotovorum [unpublished]
-Lactobacillus oligofermentans 22743 (publication)
-Lactococcus piscium MKFS47 (publication)
-Leuconostoc gasicomitatum KG1-16 [unpublished]

Transcriptome sequencing projects that use PANNZER in functional annotation:

Eukaryotes

-Podosphaera plantaginis (publication)
-Taphrina betulina [unpublished]
-Malagasy dung beetles (multiple species) [unpublished]
-Penicillium (multiple species) [unpublished]
-Gerbera [unpublished]
-Conium [unpublished]
-Pinus sylvestris [unpublished]
-Actinodium [unpublished]
-Hydrangea [unpublished]
-Viburnum [unpublished]