PANNZER (Protein ANNotation with Z-scoRE) is a fully automated service for functional annotation of prokaryotic and eukaryotic proteins of unknown function. The tool is designed to predict the functional description (DE) and GO classes.

PANNZER2 processes bacterial proteomes in minutes and eukaryotic proteomes in an hour. You can use AAI-profiler to summarize a proteome's species neighbors and reveal taxonomic identity or contamination.

23 November - 31 December 2018: Principal investigators are invited to take part in Biocenter Finland's 3rd User Survey. Click yes to Bioinformatics (first row in Technology Platform Services), then find Holm group as last entry in Bioinformatics service units.

History

  • 2018: A tutorial was published on YouTube
  • 2018: PANNZER2 paper is published
  • 2017: PANNZER2 poster talk presented at ISMB-2017
  • 2017: PANNZER2 installed at the Center for Scientific Computing (CSC)
  • 2016: PANNZER2 web server and stand-alone program available
  • 2015: PANNZER code was refactored and Blast replaced by SANSparallel to create PANNZER2
  • 2015: PANNZER program published, please find the paper here
  • 2014: PANNZER ported to the Center for Scientific Computing (CSC), Finland's supercomputer center
  • 2014: The stand-alone version of PANNZER was made available
  • 2013: PANNZER was selected as the third best method in Critical Assessment of (protein) Function Annotation (CAFA 2011) challenge. Read more here.
  • 2012: PANNZER poster selected to the F1000 database can be found here

Contact

If you have any questions about the PANNZER program, please do not hesitate to ask Petri Toronen or Liisa Holm: firstname.lastname(at)helsinki.fi


Note that if you are submitting multiple sequences, then each sequence must have a unique protein identifier. Sequences must be in FASTA format.
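
Duplicate identifiers can be caught before submission. The following is a minimal sketch, not part of the PANNZER distribution; it assumes that the identifier is the first whitespace-delimited token after ">" on each header line, and the function name duplicate_ids is hypothetical.

```python
from collections import Counter


def duplicate_ids(fasta_text):
    """Return the sorted identifiers that occur on more than one header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # ignore header lines that carry no identifier
                ids.append(fields[0])
    return sorted(name for name, n in Counter(ids).items() if n > 1)


# Made-up sequences for illustration only.
example = (
    ">seq1 first protein\nMKV\n"
    ">seq2 second protein\nMLT\n"
    ">seq1 repeated identifier\nMAA\n"
)
print(duplicate_ids(example))
```

An empty list means that every header line starts with a distinct identifier.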

STEP 1 - Enter your input protein sequence(s).

Paste protein sequences in FASTA format (example, text area limited to 10M characters):


or upload a FASTA file:

or paste the checksum of a recent batch job:

STEP 2 - Optional inputs.

Job title:

Scientific name of query species:

Sequence filtering
Minimum query coverage AND/OR minimum sbjct coverage:

Minimum alignment length:

Sequence identity range (minimum and maximum):

DE prediction

Output only one DE per query

Form factor cutoff for informative/non-informative hits:

GO prediction

Scoring function: Argot, Blast2GO, Frequency, Hypergeometric, Jaccard or RM3

Remove redundant GO terms

Blast2GO threshold:

STEP 3 - Select interactive or batch processing:

Interactive mode shows results for the first 10 sequences. Use the batch queue for larger input sets up to 100,000 sequences. You may leave an e-mail address for notification when the batch job has finished.

Interactive
Batch queue. E-mail address:

STEP 4 - Submit your job.

The results will appear in a new window.

PANNZER2 is able to perform high-throughput annotation of ten thousand sequences per hour. It uses the SANSparallel server to search for homologous proteins in the Uniprot database.


The PANNZER2 stand-alone version is now available! To download the PANNZER2 program package, click here.

System requirements

  1. Linux OS
  2. Python (https://www.python.org), modules:
    • numpy
    • scipy
    • fastcluster
    • requests
    • Standard modules:
      • ConfigParser
      • getopt
      • operator
      • os
      • math
      • random
      • socket
      • re
      • sys
      • SocketServer
      • threading
      • signal
      • time
      • argparse
  3. Perl
    • Switch module required for HTML output (anno2html.pl)
  4. Internet connection

Installation

  1. Download SANSPANZ.3.tar.gz to /home/you
  2. cd /home/you
  3. tar -zxvf SANSPANZ.3.tar.gz

Testing

  1. cd /home/you/SANSPANZ.3
  2. python runsanspanz.py -R -o ",DE.out,GO.out,anno.out" -s "Macaca mulatta" < testdata/querysequences.fasta
  3. perl anno2html.pl "Test result RM3" argot < anno.out > anno.html
The runsanspanz.py command (step 2) calls network servers for sequence search and the GO database. It writes description prediction details to DE.out, GO prediction details to GO.out, and the annotation summary to anno.out. The anno2html.pl command (step 3) converts the annotation summary to nested, colour-coded HTML tables. More test cases can be found in examples.csh and reference outputs in the testresults folder.

Here are re-annotation results for 78 reference proteomes from Ensembl (2017_04).

Here are re-annotation results for 66 reference proteomes from Ensembl (December 2015).

Figure 1: Overview of SANSPANZ (combination of SANSparallel and PANNZER).

The web server is the recommended way of using PANNZER. Standalone scripts (Download tab) are provided for the sake of transparency. Pannzer2 (this version) is a thousand times faster and more user-friendly than Pannzer1.

Interface                                   Ease of use   Speed
Pannzer2 web server                         ++            +(+)
Pannzer2 standalone with remote databases   +             +
Pannzer2 standalone with local databases    -             ++
Pannzer1 standalone                         --            --
Pannzer1 at CSC (discontinued)              +             -
Pannzer2 at CSC (integrated to Chipster)    +             +

Web server

The query form is self-explanatory. Inputs are query sequences in FASTA format, the scientific name of the organism from which the query sequences come, and optionally an email address for notification when a batch job finishes. The scientific name is used by the RM3 scoring function to calculate taxonomic distances. If the field is left blank, PANNZER2 will automatically substitute a good guess of the query species. The web server's submission form is under the Annotate tab. Precomputed example results can be browsed under the Examples tab.

The query form has two options with slightly different outputs. Interactive jobs are limited to ten sequences and results are shown for the Argot predictor, which uses an information content weighted sum. Batch jobs are limited to 100,000 sequences. The result page shows the submission and start times, the number of query sequences processed so far, and the finish time. Toggle parameters to change the predictor from Argot (default) to RM3 (a regression model similar to that of the original Pannzer article), Blast2GO, hypergeometric (HYGE) or Jaccard (JAC). If the query sequences are fragments, it can be beneficial to change the sequence filtering to accept hits that fulfil the coverage criterion for query OR sbjct (the default is AND). The HTML summary is paginated to report 1000 queries per file. The HTML pages are meant for browsing. The table has columns for gene symbol (GN), description (DE), three Gene Ontologies (GO) and an inverse mapping from the highest scoring GO prediction to enzyme classes (EC) or pathways (KEGG).

Text outputs are available for downloading. The downloadable files named "GO prediction details" and "DE prediction details" contain intermediate results for each query. The "annotations" file is meant to be parsed by programs. It uses a context-sensitive grammar, as explained below. The file has six tab-separated columns labelled qpid, type, score, PPV, id and desc. The first column (qpid) always contains the identifier of the query sequence. The second column (type) can take the following values, and the score, PPV, id and desc columns change meaning accordingly:

Stand-alone script

The script is available from the Download tab. Inputs for the examples of running runsanspanz.py below are in the testdata/ subdirectory of the distribution package and outputs can be found in the testresults/ subdirectory.

Simple usage with remote databases

The distribution package includes a simple script that can perform specific tasks. Help text is printed when the script is run without arguments. The script accesses remote sequence search and GO database servers, so no databases or servers need to be installed locally. The most frequently used command-line options are listed below.

Option   argument                                             default
-f       input format, either FASTA or tab                    FASTA
-i       input file                                           STDIN
-m       method                                               Pannzer
-o       output files (csv)                                   STDOUT,method.out_1,method.out_2,...
-s       species                                              automatically parsed from Uniprot style header
-R       [a flag to send server requests to remote servers]   use local servers
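
To illustrate what parsing the species from a Uniprot style header can look like, here is a minimal sketch. It relies on the OS= field of UniProt FASTA headers; how runsanspanz.py itself extracts the species is not shown here, so the regular expression and the function name species_from_header are assumptions, and the example header is made up.

```python
import re

# OS=<organism name> runs until the next two-letter KEY= field or the end of the line.
OS_FIELD = re.compile(r"\bOS=(.+?)(?=\s+[A-Z]{2}=|$)")


def species_from_header(header):
    """Return the organism name from a UniProt-style FASTA header, or None."""
    match = OS_FIELD.search(header)
    return match.group(1).strip() if match else None


# Made-up header for illustration only.
header = ">sp|X00000|DEMO_MACMU Hypothetical protein OS=Macaca mulatta OX=9544 GN=DEMO PE=4 SV=1"
print(species_from_header(header))
```

Headers without an OS= field return None, in which case the species would have to be given explicitly with -s.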

This example uses default options for functional annotation of macaque proteins using Pannzer:

python runsanspanz.py -R -s "Macaca mulatta" < testdata/querysequences.fasta
perl anno2html.pl < Pannzer.out_3 > predictions.html
The output is written to STDOUT and files named Pannzer.out_1, Pannzer.out_2, Pannzer.out_3. Pannzer.out_1 contains details of the description (DE) prediction. Pannzer.out_2 contains details of the GO prediction. Pannzer.out_3 is a summary of all predicted annotations, which is converted to HTML by the anno2html.pl script.
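
The annotation summary can be post-processed with a few lines of Python. The sketch below assumes that the summary file has the six tab-separated columns qpid, type, score, PPV, id and desc of the web server's "annotations" file and that a header row may be present; the function name read_annotations and the sample rows are hypothetical.

```python
from collections import defaultdict

COLUMNS = ["qpid", "type", "score", "PPV", "id", "desc"]


def read_annotations(lines):
    """Group the rows of a six-column, tab-separated annotation file by query id."""
    by_query = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS) or fields == COLUMNS:
            continue  # skip malformed lines and a header row, if present
        by_query[fields[0]].append(dict(zip(COLUMNS, fields)))
    return by_query


# Made-up rows for illustration only; real type values and ids will differ.
sample = [
    "qpid\ttype\tscore\tPPV\tid\tdesc\n",
    "query1\ttypeA\t0.9\t0.8\t\texample description\n",
    "query1\ttypeB\t0.7\t0.6\tid123\texample term\n",
    "query2\ttypeA\t0.5\t0.4\t\tanother description\n",
]
annotations = read_annotations(sample)
for qpid, rows in annotations.items():
    print(qpid, [row["type"] for row in rows])
```

To read a real file, pass an open file handle instead of the sample list, for example read_annotations(open("Pannzer.out_3")). All values are kept as strings because the meaning of the score, PPV, id and desc columns depends on the type column.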

Here, we direct outputs to files with descriptive names:

python runsanspanz.py -R -m Pannzer -s "Macaca mulatta" -i testdata/querysequences.fasta -o ",DE.out,GO.out,anno.out"
You can store the sequence search result from SANSparallel in a tabular format for later analysis. Below, the second command runs Pannzer using inputs in tabular format.
python runsanspanz.py -R -m SANS -s "Macaca mulatta" -i testdata/querysequences.fasta -o sans.tab
python runsanspanz.py -R -m Pannzer -i sans.tab -f tab -o ",DE.out1,GO.out1,anno.out1"
Pannzer2 reports predictions for descriptions (DE) and for GO terms using four predictors (RM3, ARGOT, JAC, HYGE). You can restrict the number of predictors used. ARGOT has been one of the top predictors in our benchmarks.
python runsanspanz.py -R -m Pannzer -i sans.tab -f tab -o ",DE.out2,GO.out2,anno.out2" --PANZ_PREDICTOR "DE,ARGOT"
Pannzer2 applies strict filtering criteria to the sequence neighborhood, including default query and sbjct coverage of 70 %. The flag --PANZ_FILTER_PERMISSIVE relaxes the coverage criteria so that 70 % coverage is required of query or sbjct, but not both. This can result in more sequences getting predicted annotations (at the expense of accuracy).
python runsanspanz.py -R --PANZ_FILTER_PERMISSIVE -m Pannzer -i sans.tab -f tab -o ",DE.out3,GO.out3,anno.out3"
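
The difference between the strict and the permissive coverage filter can be sketched as a predicate. This is an illustration, not Pannzer2's code: the function names are hypothetical, and coverage is assumed to be the aligned length divided by the sequence length.

```python
def coverage(aligned_length, sequence_length):
    """Fraction of a sequence covered by the alignment (assumed definition)."""
    return aligned_length / sequence_length


def passes_coverage(query_cov, sbjct_cov, min_cov=0.7, permissive=False):
    """Strict mode requires both coverages; permissive mode requires either one."""
    if permissive:
        return query_cov >= min_cov or sbjct_cov >= min_cov
    return query_cov >= min_cov and sbjct_cov >= min_cov


# A 100 aa query fragment aligned over 90 residues to a 400 aa database protein.
qcov = coverage(90, 100)
scov = coverage(90, 400)
print(passes_coverage(qcov, scov))
print(passes_coverage(qcov, scov, permissive=True))
```

In this made-up case the hit is rejected by the strict filter because the sbjct coverage is low, but accepted by the permissive filter, which is why the permissive setting helps when the query sequences are fragments.
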
BLAST is more sensitive than SANSparallel at detecting distantly related proteins, but these are removed by Pannzer2 when filtering the sequence neighborhood, and BLAST is orders of magnitude slower than SANSparallel. If anyone desperately wants to use BLAST, we provide a script that converts BLAST output to our tabular format. The database must be Uniprot. Edit the variable $BLAST_EXE on line 11 of runblast.pl to correspond to your local environment.
perl runblast.pl testdata/querysequences.fasta testdata/uniprot.fasta 40 > blast.tab
python runsanspanz.py -R -m BestInformativeHit -i testdata/blast.tab -f tab -o ',--' 2> err
python runsanspanz.py -R -m Pannzer -i testdata/blast.tab -f tab -o ',DE.out4,GO.out4,anno.out4'
Pannzer expects protein sequences as input. We provide a utility script that extracts ORFs (longer than 80 aa) from nucleotide sequences:
perl longestorf.pl 80 < nucleotidesequences.fa > orfs.fasta
python runsanspanz.py -R -m Pannzer -i orfs.fasta -f FASTA --PANZ_PREDICTOR ARGOT -o ',,,orfs.anno'

Parameter configuration

Parameters can be changed using command line options, a configuration file, or within a Python script.

Command line options are shown by

python runsanspanz.py -h
A configuration file can be specified on the command line:
python runsanspanz.py -a myparameters.cfg
The configuration file need only define values for a subset of parameters. Other parameters will remain at their default values.

The current configuration can be written out to a file for later reuse or editing:

import Parameters
glob=Parameters.WorkSpace()
glob.writeConfigFile('myparameters.cfg')

Local database servers

SANSPANZ predicts functional descriptions and Gene Ontology (GO) terms using summary statistics calculated from the sequence neighborhood of the query sequence and database background frequencies. This leads to the complex architecture and web of dependencies depicted in Figure 1. There are two client-server interactions. SANSPANZ is a client sending requests to the SANSparallel and DictServer servers.

Server                             Purpose                                                                                  Parameters
SANSPANZ/DictServer.py [OPTIONS]   Server providing database statistics and GO associations when SANSPANZ is run locally.   CONN_HOSTNAME, CONN_PORTNO
SANSparallel server                Server providing sequence similarity search results when SANSPANZ is run locally.        CONN_SANSHOST, CONN_SANSPORT

We run the SANSparallel server on localhost at port 54321 and DictServer at port 50002. The servers are started as follows:

module add openmpi-x86_64
nohup mpirun -np 17 -output-filename /data/uniprot/u SANSparallel.2/server /data/uniprot/uniprot 54321 uniprot.Dec2015 &
nohup python ./SANSPANZ/DictServer.py --CONN_PORTNO 50002 &
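
Before running SANSPANZ against local servers, it can be useful to check that both ports accept connections. The following is a minimal sketch using only the Python standard library; it tests TCP reachability and nothing else, so it does not verify that the servers answer queries correctly. The function name port_is_open is hypothetical; the port numbers are the ones used above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable or timed out
        return False


if __name__ == "__main__":
    for name, port in (("SANSparallel", 54321), ("DictServer", 50002)):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(name, port, state)
```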

Updating local databases

It is imperative that DictServer tables refer to the same database release that is searched by SANSparallel. Therefore, use both remote SANSparallel and DictServer servers, or install both locally.

We mirror Uniprot and GO-associations from EMBL-EBI:

rsync -u -a "rsync.ebi.ac.uk::pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz \
pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz \
pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz" /data/uniprot
gunzip gene_association.goa_uniprot.gz
Taxonomic distance is a component of Pannzer's regression models. The taxonomy table contains the lineage of all sequences in Uniprot.
cd /data/uniprot
curl "http://www.uniprot.org/taxonomy?query=*&compress=yes&format=tab" > taxonomy-all.tab.gz
gunzip taxonomy-all.tab.gz
The OBO file is part of the Gene Ontology (GO) and is required for GO prediction.
wget http://www.berkeleybop.org/ontologies/go/go-basic.obo
GO assignments are taken from the GOA file:
# GO dictionaries
rm -f goa_uniprot_all.gaf
wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz
gunzip goa_uniprot_all.gaf.gz
We include inverted ec2go and kegg2go mappings in GO prediction outputs:
rm -f ec2go kegg2go
wget http://geneontology.org/external2go/ec2go
perl external2go.pl < ec2go > ec2go.tab
wget http://geneontology.org/external2go/kegg2go
perl external2go.pl < kegg2go > kegg2go.tab
Uniprot is indexed for the SANSparallel server using scripts that are part of the SANSparallel distribution package:
perl /home/luholm/sansserver/saisformatdb.pl uniprot /data/uniprot/uniprot_sprot.fasta.gz /data/uniprot/uniprot_trembl.fasta.gz
Database statistics and dictionary tables (for DictServer) are composed using utility scripts/programs in the SANSPANZ/uniprot folder:
# update counts
cd SANSPANZ/
perl -pe 's/^\S+\s+//'  /data/uniprot/uniprot.phr | perl -pe 's/ \w{2}=.*$//' | python runsanspanz.py -m Cleandesc -f tab -c 'desc' | cut -f 2 > x
sort x | uniq -c > /data/uniprot/uniprot.desc.uc.counts
perl -pe 's/ +/\n/g' x | sort | uniq -c > /data/uniprot/uniprot.word.uc.counts
perl sumcol.pl 1 < /data/uniprot/uniprot.desc.uc.counts > /data/uniprot/nprot
perl sumcol.pl 1 < /data/uniprot/uniprot.word.uc.counts > /data/uniprot/nwordtotal
The following procedure computes the weight (information content) of each node given the GO structure in the obo file (go-basic.obo) and the counts of GO assignments in the database (Uniprot). We use all evidence codes to get more representative counts, and add one pseudocount to every GO term so that they have a defined weight.
# GO hierarchy and information content of GO terms
python runsanspanz.py -m obo -i /data/uniprot/go-basic.obo -o 'obo.tab,'
cut -f 2,5 /data/uniprot/goa_uniprot_all.gaf | python runsanspanz.py -m gaf2propagated -f tab -c "qpid goid" -o ",/data/uniprot/godict.txt" 2> err
cut -f 2 /data/uniprot/godict.txt | sort | uniq -c | perl -pe 's/^\s+//' | perl -pe 's/ /\t/' > godict_counts
python runsanspanz.py -m BayesIC -i godict_counts -f tab -c "qpid propagated" --eval_OBOTAB obo.tab -o ",obo_with_ic.tab" 
# GO propagated parent list using gorelatives
grep '^id: GO:' /data/uniprot/go-basic.obo | perl -pe 's/id: GO://' > /data/uniprot/go.list
./gorelatives -b go-basic.obo -q /data/uniprot/go.list -r isa,partof,altid -d parents -l > /data/uniprot/go_data
grep ^UniProtKB /data/uniprot/goa_uniprot_all.gaf | cut -f 2,5 | perl generate_godict.pl /data/uniprot/go_data obo_with_ic.tab ec2go.tab kegg2go.tab > /data/uniprot/mergeGO.out
DictServer reads mergeGO.out (glob.param['DATA_GOIDELIC']) and godict.txt (glob.param['DATA_GODICT']).
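
The effect of the pseudocount on the weights can be illustrated with a simplified calculation. The sketch below computes an unconditional information content, log2((N + 1) / (count + 1)), from propagated counts; it is not the BayesIC operator, which also takes the GO structure into account. The function name, the formula's denominator and the term labels are assumptions made for illustration.

```python
import math


def information_content(counts, total):
    """Unconditional IC per term with one pseudocount: log2((total + 1) / (count + 1))."""
    return {term: math.log2((total + 1) / (n + 1)) for term, n in counts.items()}


# Made-up propagated counts; total is the number of annotated proteins.
counts = {"root_term": 999, "common_term": 499, "unseen_term": 0}
ic = information_content(counts, total=999)
for term, value in ic.items():
    print(term, round(value, 3))
```

Without the pseudocount, a term with zero assignments would have an undefined (infinite) weight; with it, the term gets the largest finite weight in the table.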

The species list is consulted by a CGI script, which is part of the web server, to give hints of the scientific names matching the prefix entered by the user:

cut -f 3 /data/uniprot/taxonomy-all.tab > specieslist
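
Prefix lookup in such a species list can be sketched with a binary search over the sorted names. This is an illustration of the idea only, not the web server's CGI script; the function name prefix_hints is hypothetical, matching is case-sensitive, and the list is assumed to be sorted.

```python
import bisect


def prefix_hints(sorted_names, prefix, limit=10):
    """Return up to `limit` names from a sorted list that start with `prefix`."""
    start = bisect.bisect_left(sorted_names, prefix)
    hints = []
    for name in sorted_names[start:start + limit]:
        if not name.startswith(prefix):
            break  # names are sorted, so no later name can match
        hints.append(name)
    return hints


names = sorted(["Mus musculus", "Macaca mulatta", "Macaca fascicularis", "Zea mays"])
print(prefix_hints(names, "Macaca"))
```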

Evaluation of GO predictions

The runsanspanz.py script can be used to evaluate the correctness of GO predictions using metrics similar to those used in the CAFA competition. The evaluation metrics are
(1) F = 2 * pr * re / (pr + re) = 2 * TP / (T + P)

(2) J = TP / (T + P - TP)

(3) S = 1/ne * sqrt(mi^2 + ru^2) = 1/ne * sqrt( (wP-wTP)^2 + (wT-wTP)^2 )

(4) wF = 2 * wTP / (wT + wP)

(5) wJ = wTP / (wT + wP -wTP)
where pr is precision, re is recall, ru is remaining uncertainty and mi is misinformation. T, P and TP refer to [the cardinality of sets of] nodes in the GO hierarchy. T are the GO terms associated with a query protein in the reference of truth. P are the predicted GO terms above a score threshold tau. TP are the intersection of T and P ("true positives"). T, P and TP give a unit weight to each node. wT, wP and wTP weigh the nodes by their information content conditional on the parent nodes in the GO hierarchy being also true/predicted (Clark and Radivojac 2013). Fmax, wFmax, Jmax, wJmax and Smin are the optimum value reached at any score threshold. Smin is normalized by the number of proteins with predictions, ne.
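
Formulas (1) to (5) can be illustrated for a single protein as follows. This is a minimal sketch, not the GOevaluation method: the function names are hypothetical, the weights are supplied as a fixed per-term dictionary rather than derived from the GO hierarchy, a term counts as predicted when its score is at least tau, and ne is taken as 1. How the metrics are aggregated over many proteins is not covered.

```python
import math


def overlap_metrics(truth, predicted, weight=None):
    """Return (F, J, S) for one protein; unit weights unless `weight` is given."""
    def total(terms):
        return sum(1.0 if weight is None else weight[t] for t in terms)

    w_t, w_p, w_tp = total(truth), total(predicted), total(truth & predicted)
    f = 2 * w_tp / (w_t + w_p) if w_t + w_p > 0 else 0.0
    union = w_t + w_p - w_tp
    j = w_tp / union if union > 0 else 0.0
    # misinformation = wP - wTP, remaining uncertainty = wT - wTP, ne = 1
    s = math.sqrt((w_p - w_tp) ** 2 + (w_t - w_tp) ** 2)
    return f, j, s


def best_f(truth, scored_predictions, weight=None):
    """Scan all score thresholds tau and return (Fmax, tau) for one protein."""
    best = (0.0, None)
    for tau in sorted(set(scored_predictions.values())):
        predicted = {t for t, score in scored_predictions.items() if score >= tau}
        f, _, _ = overlap_metrics(truth, predicted, weight)
        if f > best[0]:
            best = (f, tau)
    return best


# Made-up terms and scores for illustration only.
truth = {"A", "B", "C"}
scores = {"B": 0.9, "C": 0.6, "D": 0.3}
print(overlap_metrics(truth, set(scores)))
print(best_f(truth, scores))
```

Passing a weight dictionary gives the weighted variants wF and wJ from the same function, since formulas (4) and (5) differ from (1) and (2) only in the weights.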

Predicted GO annotations are compared to a reference of truth. Below, the truth is extracted from GOA file based on experimental evidence. Your actual test set is probably a subset of this.

cut -f 2,5,7 /data/uniprot/goa_uniprot_all.gaf | egrep 'EXP|IDA|IPI|IMP|IGI|IEP' | python runsanspanz.py -R -m gaf2propagated -f tab -c "qpid goid evidenceCode" -o ",goa_truth_propagated"
Generate predictions using the Pannzer method. Assuming that predictions and the reference of truth have matching protein accession numbers, evaluation metrics are output by the GOevaluation method:
python runsanspanz.py -R -m gaf2propagated -f tab -i testdata/gotest_truth.gaf -o ",gotest_truth_propagated" -c "qpid goid"
python runsanspanz.py -R -m GOevaluation -f tab -o ',,eval1' -i testdata/GO.out --eval_SCOREFUNCTIONS "RM3_PPV ARGOT_PPV JAC_PPV HYGE_PPV" --eval_TRUTH testdata/gotest_truth_propagated
As in CAFA, we provide a naive method that uses the base frequency of GO terms as the probability of prediction:
python runsanspanz.py -R -m naive -i testdata/target_identifiers.list -o ',naive_predictions.out' -c 'qpid' -f tab
python runsanspanz.py -R -m GOevaluation -i testdata/naive_predictions.out -f tab -o ',,eval2' --eval_SCOREFUNCTIONS "frequency" --eval_TRUTH testdata/gotest_truth_propagated  2> err
paste eval1 eval2

Evaluation of DE predictions

DSM is a description similarity measure, defined as the cosine similarity of TF-IDF weighted word vectors. It can be used to compare DE predictions by Pannzer to a reference of truth (e.g., the original descriptions). The following example first runs Pannzer to generate DE predictions, then prepares a reference of truth, and finally runs the evaluation:
python runsanspanz.py -R -m Pannzer --PANZ_PREDICTOR DE -i human_subset.fasta -f FASTA -o 'human_subset.sans,DE.out,,' 
grep '^>' human_subset.fasta| perl -pe 's/ /\t/' | perl -pe 's/^>//' | python runsanspanz.py -R -m wordweights -c "qpid desc" -f tab -k 1000 > human_subset_de.tab
python runsanspanz.py -R -m DE_evaluation -i DE.out -f tab --eval_TRUTH human_subset_de.tab -o ",DE.eval" 
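
The idea behind DSM, cosine similarity of TF-IDF weighted word vectors, can be sketched in a few lines. This is an illustration and not the DE_evaluation method: lower-cased whitespace tokenisation, the smoothed IDF formula and all function names are assumptions, and the word weights are computed from a tiny made-up corpus rather than from Uniprot.

```python
import math
from collections import Counter


def tokens(description):
    return description.lower().split()


def idf_weights(corpus):
    """Smoothed inverse document frequency of each word in a list of descriptions."""
    n = len(corpus)
    doc_freq = Counter(word for desc in corpus for word in set(tokens(desc)))
    return {word: math.log((1 + n) / (1 + df)) + 1.0 for word, df in doc_freq.items()}


def tfidf_vector(description, idf, default_idf):
    term_freq = Counter(tokens(description))
    return {word: n * idf.get(word, default_idf) for word, n in term_freq.items()}


def dsm(desc_a, desc_b, idf):
    """Cosine similarity of the TF-IDF vectors of two descriptions."""
    default_idf = max(idf.values(), default=1.0)  # unseen words count as rare
    u = tfidf_vector(desc_a, idf, default_idf)
    v = tfidf_vector(desc_b, idf, default_idf)
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Made-up descriptions for illustration only.
corpus = ["serine protein kinase", "protein kinase C", "sugar transporter", "hypothetical protein"]
idf = idf_weights(corpus)
print(dsm("serine protein kinase", "protein kinase C", idf))
print(dsm("protein kinase C", "sugar transporter", idf))
```

Frequent words such as "protein" receive a low weight, so two descriptions that share only such words score lower than two that share a rare word.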

Customization

Pannzer2 is implemented in the SANSPANZ framework. New functionalities are easily implemented by creating new operator classes.

PANNZER and SANS references:

  1. Koskinen P, Holm L (2012) SANS: High-throughput retrieval of protein sequences allowing 50 % mismatches. Bioinformatics 28, i438-i443
  2. Radivojac, P., Clark, W., Oron, T., Schnoes, A., Wittkop, A., Sokolov, A., Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., Pandey, G., Yunes, G., Talwalkar, A., Repo, S., Souza, M., Piovesan, D., Casadio, R., Wang, Z., Cheng, Z., Fang, H., Gough, J., Koskinen, P., Törönen, P., Nokso-Koivisto, J., Holm, L., Cozzetto, D., Buchan, D., Bryson, K., Jones, D., Limaye, B., Inamdar, H., Datta, A., Manjari, S., Joshi, R., Chitale, M., Kihara, D., Lisewski, A.M., Erdin, S., Venner, E., Lichtarge, O., Rentzsch, R., Yang, H., Romero, A., Bhat, P., Paccanaro, A., Hamp, T., Kassner, R., Seemayer, S., Vicedo, E., Schaefer, C., Achten, D., Auer, F., Boehm, A., Braun, T., Hecht, M., Heron, M., Honigschmid, P., Hopf, T., Kaufmann, S., Kiening, M., Krompass, D., Landerer, C., Mahlich, Y., Roos, M., Björne, J., Salakoski, T., Wong, A., Shatkay, H., Gatzmann, F., Sommer, I., Wass, M., Sternberg, M., Skunca, N., Supek, F., Bošnjak, M., Panov, P., Dzeroski, S., Šmuc, T., Kourmpetis, Y., van Dijk, A., ter Braak, C., Zhou, Y., Gong, Q., Dong, X., Tian, W., Falda, M., Fontana, P., Lavezzo, E., Di Camillo, B., Toppo, S., Lan, L., Djuric, N., Guo, Y., Vucetic, S., Bairoch, A., Linial, M., Babbitt, P., Brenner, S., Orengo, O., Rost, B., Mooney, S. & Friedberg, I. (2013) "A large-scale evaluation of computational protein function prediction". Nature Methods 10, 221-227
  3. P Koskinen, P Törönen, J Nokso-Koivisto, L Holm (2015) PANNZER - High-throughput functional annotation of uncharacterized proteins in an error-prone environment. Bioinformatics 31 (10), 1544-1552
  4. Somervuo P, Holm L (2015) SANSparallel: interactive homology search against Uniprot. Nucl. Acids Res. 43, W24-W29
  5. Toronen P, Medlar A, Holm L (2018) PANNZER2: A rapid functional annotation webserver. Nucl. Acids Res. 46, W84-W88

Genome sequencing projects that use PANNZER in functional annotation:

Eukaryotes

-Betula pendula (Silver birch) (publication)
-Melitaea cinxia (Glanville fritillary butterfly) (publication)
-Pusa hispida saimensis (Saimaa ringed seal) [unpublished]
-Zea mays (maize) evaluation

Bacteria

-Dickeya solani (publication)
-Halorubrum sp. PV6 [unpublished]
-Lactobacillus oligofermentans 22743 (publication)
-Lactococcus piscium MKFS47 (publication)
-Leptospirillum ferriphilum (publication)
-Leuconostoc gasicomitatum KG1-16 (publication)
-Mycobacterium avium (publication)
-Neorhizobium galegae (publication)
-Pectobacterium carotovorum [publication]
-Pectobacterium wasabiae SCC3193 (publication)
-Propionibacterium freudenreichii DSM 20271T (publication)
-Staphylococcus chromogenes [unpublished]

Transcriptome sequencing projects that use PANNZER in functional annotation:

Eukaryotes

-Actinodium [unpublished]
-Conium [unpublished]
-Fragaria vesca (publication)
-Gerbera [unpublished]
-Hydrangea [unpublished]
-Malagasy dung beetles (multiple species) [unpublished]
-Penicillium (multiple species) [unpublished]
-Phlebia radiata (publication)
-Pinus sylvestris [unpublished]
-Podosphaera plantaginis (publication)
-Taphrina betulina [unpublished]
-Viburnum [unpublished]