PANNZER (Protein ANNotation with Z-scoRE) is a fully automated service for functional annotation of prokaryotic and eukaryotic proteins of unknown function. The tool is designed to predict the functional description (DE) and GO classes. PANNZER2 processes bacterial proteomes in minutes and eukaryotic proteomes in an hour.


  • 2017: PANNZER2 poster talk presented at ISMB-2017
  • 2017: PANNZER2 installed at the Center for Scientific Computing (CSC)
  • 2016: PANNZER2 web server and stand-alone program available
  • 2015: PANNZER code was refactored and Blast replaced by SANSparallel to create PANNZER2
  • 2015: PANNZER program published, please find the paper here
  • 2014: PANNZER ported to the Center for Scientific Computing (CSC), Finland's supercomputer center
  • 2014: The stand-alone version of PANNZER was made available
  • 2013: PANNZER was selected as the third best method in Critical Assessment of (protein) Function Annotation (CAFA 2011) challenge. Read more here.
  • 2012: PANNZER poster selected to the F1000 database can be found here


If you have any questions about the PANNZER program, please do not hesitate to ask Petri Toronen or Liisa Holm: firstname.lastname(at)

STEP 1 - Enter your input protein sequence(s).

Paste protein sequences in FASTA format (example, text area limited to 10M characters):

or upload a FASTA file:

or paste the checksum of a recent batch job:

STEP 2 - Optional inputs.

Job title:

Scientific name of query species:

Sequence filtering
Minimum query coverage:
minimum sbjct coverage:

Minimum alignment length:

Sequence identity between and

DE prediction

Output only one DE per query

Form factor cutoff for informative/non-informative hits:

GO prediction

Scoring function: Argot Blast2GO Frequency Hypergeometric Jaccard RM3

Remove redundant GO terms

Blast2GO threshold:

STEP 3 - Select interactive or batch processing:

Interactive mode shows results for the first 10 sequences. Use the batch queue for larger input sets up to 100,000 sequences. You may leave an e-mail address for notification when the batch job has finished.

Batch queue. E-mail address:

STEP 4 - Submit your job.

The results will appear in a new window.

PANNZER2 is able to perform high-throughput annotation of ten thousand sequences per hour. It uses the SANSparallel server to search for homologous proteins in the Uniprot database.

The PANNZER2 stand-alone version is now available! To download the PANNZER2 program package, click here.

System requirements

  1. Linux OS
  2. Python (, modules:
    • numpy
    • scipy
    • fastcluster
    • requests
    • Standard modules:
      • ConfigParser
      • getopt
      • operator
      • os
      • math
      • random
      • socket
      • re
      • sys
      • SocketServer
      • threading
      • signal
      • time
      • argparse
  3. Perl
    • Switch module required for HTML output (
  4. Internet connection


  1. Download SANSPANZ.3.tar.gz to /home/you
  2. cd /home/you
  3. tar -zxvf SANSPANZ.3.tar.gz


  1. cd /home/you/SANSPANZ.3
  2. python -R -o ",DE.out,GO.out,anno.out" -s "Macaca mulatta" < testdata/querysequences.fasta
  3. perl "Test result RM3" RM3 < anno.out > anno.html
The first command calls network servers for sequence search and GO database. It writes description prediction details to DE.out, GO prediction details to GO.out, and annotation summary to anno.out. The second command converts annotation summary to nested HTML tables with colour code. More test cases can be found in examples.csh and reference outputs in the testresults folder.
Here are re-annotation results for 78 reference proteomes from Ensembl (2017_04).

Here are re-annotation results for 66 reference proteomes from Ensembl (December 2015).

Figure 1: Overview of SANSPANZ (combination of SANSparallel and PANNZER).

The web server is the recommended way of using PANNZER. Standalone scripts (Download tab) are provided for the sake of transparency. Pannzer2 (this version) is a thousand times faster and more user-friendly than Pannzer1.

Interface                                   Ease of use   Speed
Pannzer2 web server                         ++            +(+)
Pannzer2 standalone with remote databases   +             +
Pannzer2 standalone with local databases    -             ++
Pannzer1 standalone                         --            --
Pannzer1 at CSC (discontinued)              +             -
Pannzer2 at CSC (integrated to Chipster)    +             +

Web server

The query form is self-explanatory. Inputs are query sequences in FASTA format, the scientific name of the organism from which the query sequences come, and optionally an email address for notification when a batch job finishes. The scientific name is used by the RM3 scoring function to calculate taxonomic distances. If the field is left blank, PANNZER2 will automatically substitute a good guess of the query species. The web server's submission form is under the Annotate tab. Precomputed example results can be browsed under the Examples tab.

The query form has two options with slightly different outputs. Interactive jobs are limited to ten sequences and results are shown for the Argot predictor, which uses an information content weighted sum. Batch jobs are limited to 100,000 sequences. The result page shows the submission and start times, how many query sequences have been processed, and the finish time. Toggle parameters to change predictors from Argot (default) to RM3 (a regression model similar to the one in the original Pannzer article), Blast2GO, hypergeometric (HYGE) or Jaccard (JAC). If the query sequences are fragments, it can be beneficial to change the sequence filtering to accept hits that fulfil the coverage criterion for query OR sbjct (the default is AND). The HTML summary is paginated to report 1000 queries per file. The HTML pages are meant for browsing. The table has columns for gene symbol (GN), description (DE), three Gene Ontologies (GO) and an inverse mapping from the highest scoring GO prediction to enzyme classes (EC) or pathways (KEGG).

Text outputs are available for downloading. The downloadable files named "GO prediction details" and "DE prediction details" contain intermediate results for each query. The "annotations" file is meant for parsing. It uses a context-sensitive grammar, as explained below. The file has six tab-separated columns labelled qpid, type, score, PPV, id and desc. The first column (qpid) always contains the identifier of the query sequence. The second column (type) can take the following values, and the score, PPV, id and desc columns change meaning accordingly:
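As an illustration of the six-column layout described above, here is a minimal Python sketch that groups the rows of an annotations file by query identifier. It assumes the file starts with a header row whose first field is qpid; the function name read_annotations and the sample rows (including the type values TYPE_A and TYPE_B) are hypothetical and only serve the example.

```python
import csv
import io
from collections import defaultdict

# Column labels as given in the text above.
COLUMNS = ["qpid", "type", "score", "PPV", "id", "desc"]


def read_annotations(handle):
    """Group rows of a six-column, tab-separated annotation file by query id."""
    by_query = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip malformed lines and the (assumed) header row.
        if len(fields) != len(COLUMNS) or fields[0] == "qpid":
            continue
        row = dict(zip(COLUMNS, fields))
        by_query[row["qpid"]].append(row)
    return dict(by_query)


# Hypothetical input: the type values and numbers are made up.
example = io.StringIO(
    "qpid\ttype\tscore\tPPV\tid\tdesc\n"
    "Q1\tTYPE_A\t0.9\t0.8\tX1\tfirst annotation\n"
    "Q1\tTYPE_B\t0.5\t0.4\tX2\tsecond annotation\n"
)
annotations = read_annotations(example)
print(len(annotations["Q1"]))
```

Because the meaning of score, PPV, id and desc depends on the type column, the sketch keeps all values as strings and leaves their interpretation to the caller.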

Stand-alone script

The script is available from the Download tab. Inputs for the examples of running below are in the testdata/ subdirectory of the distribution package and outputs can be found in the testresults/ subdirectory.

Simple usage with remote databases

The distribution package includes a simple script that can perform specific tasks. Help text is printed when the script is run without arguments. The script accesses remote sequence search and GO database servers, so no databases or servers need to be installed locally. The most frequently used command-line options are listed below.

Option   Description                                            Default
-f       input format, either FASTA or tab                      FASTA
-i       input file                                             STDIN
-o       output files (csv)                                     STDOUT,method.out_1,method.out_2,...
-s       species                                                automatically parsed from Uniprot style header
-R       [a flag to send server requests to remote servers]     use local servers

This example uses default options for functional annotation of macaque proteins using Pannzer:

python -R -s "Macaca mulatta" < testdata/querysequences.fasta
perl < Pannzer.out_3 > predictions.html
The output is written to STDOUT and files named Pannzer.out_1, Pannzer.out_2, Pannzer.out_3. Pannzer.out_1 contains details of the description (DE) prediction. Pannzer.out_2 contains details of the GO prediction. Pannzer.out_3 is a summary of all predicted annotations, which is converted to HTML by the script.

Here, we direct outputs to files with descriptive names:

python -R -m Pannzer -s "Macaca mulatta" -i testdata/querysequences.fasta -o ",DE.out,GO.out,anno.out"
You can store the sequence search result from SANSparallel in a tabular format for later analysis. Below, the second command runs Pannzer using inputs in tabular format.
python -R -m SANS -s "Macaca mulatta" -i testdata/querysequences.fasta -o
python -R -m Pannzer -i -f tab -o ",DE.out1,GO.out1,anno.out1"
Pannzer2 reports predictions for descriptions (DE) and for GO terms using four predictors (RM3, ARGOT, JAC, HYGE). You can restrict the number of predictors used. ARGOT has been one of the top predictors in our benchmarks.
python -R -m Pannzer -i -f tab -o ",DE.out2,GO.out2,anno.out2" --PANZ_PREDICTOR "DE,ARGOT"
Pannzer2 applies strict filtering criteria to the sequence neighborhood, including default query and sbjct coverage of 70 %. The flag --PANZ_FILTER_PERMISSIVE relaxes the coverage criteria so that 70 % coverage is required of query or sbjct, but not both. This can result in more sequences getting predicted annotations (at the expense of accuracy).
python -R --PANZ_FILTER_PERMISSIVE -m Pannzer -i -f tab -o ",DE.out3,GO.out3,anno.out3"
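The difference between the strict (AND) and permissive (OR) coverage criteria can be sketched in a few lines of Python. The function name passes_coverage and its arguments are hypothetical; coverages are assumed to be fractions between 0 and 1, and the other filters (alignment length, sequence identity) are left out.

```python
def passes_coverage(query_cov, sbjct_cov, min_cov=0.7, permissive=False):
    """Return True if a hit satisfies the coverage criterion.

    Strict mode requires both query and sbjct coverage to reach min_cov;
    permissive mode accepts the hit if either one does.
    """
    if permissive:
        return query_cov >= min_cov or sbjct_cov >= min_cov
    return query_cov >= min_cov and sbjct_cov >= min_cov


# A fragment that aligns over its full length to part of a longer protein.
print(passes_coverage(0.95, 0.40))
print(passes_coverage(0.95, 0.40, permissive=True))
```

The example hit is rejected under the default criterion and accepted under the permissive one, which is the situation the flag is meant for when the queries are fragments.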
BLAST is more sensitive than SANSparallel at detecting distantly related proteins, but these are removed by Pannzer2 when filtering the sequence neighborhood, and BLAST is orders of magnitude slower than SANSparallel. If anyone desperately wants to use BLAST, we provide a script that converts BLAST output to our tabular format. The database must be Uniprot. Edit the variable $BLAST_EXE on line 11 of to correspond to your local environment.
perl testdata/querysequences.fasta testdata/uniprot.fasta 40 >
python -R -m BestInformativeHit -i testdata/ -f tab -o ',--' 2> err
python -R -m Pannzer -i testdata/ -f tab -o ',DE.out4,GO.out4,anno.out4'
Pannzer expects protein sequences as input. We provide a utility script that extracts ORFs (longer than 80 aa) from nucleotide sequences:
perl 80 < nucleotidesequences.fa > orfs.fasta
python -R -m Pannzer -i orfs.fasta -f FASTA -o ',,,orfs.anno' --PANZ_PREDICTOR ARGOT
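For readers who want to see what ORF extraction involves, the following Python sketch translates a nucleotide sequence in all six reading frames and keeps the stretches between stop codons that are longer than a minimum length. This is an assumption about the general technique, not a description of the utility script shipped with the package, whose exact ORF definition (for example, whether a start codon is required) is not stated here. The function names are hypothetical and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def extract_orfs(dna, min_length=80):
    """Return stop-to-stop peptide segments longer than min_length residues."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    orfs = []
    for strand in (dna, reverse_complement):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) > min_length:
                    orfs.append(segment)
    return orfs


# Toy sequence; a low threshold is used so that something is reported.
toy = "ATG" + "GCT" * 5 + "TAA"
print(extract_orfs(toy, min_length=3))
```

With the default threshold of 80 residues the toy sequence yields no ORFs, which mirrors the length cutoff used in the command above.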

Parameter configuration

Parameters can be changed using command line options, a configuration file, or within a Python script.

Command line options are shown by

python -h
A configuration file can be specified on the command line:
python -a myparameters.cfg
The configuration file need only define values for a subset of parameters. Other parameters will remain at their default values.
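The idea that a configuration file overrides only the parameters it mentions can be sketched with Python's configparser module. The section name parameters, the default values and the function load_parameters are hypothetical; only the parameter names PANZ_PREDICTOR and CONN_PORTNO are taken from this page, and the real file layout may differ.

```python
import configparser

# Hypothetical defaults, for illustration only.
DEFAULTS = {
    "PANZ_PREDICTOR": "DE,RM3,ARGOT,JAC,HYGE",
    "CONN_PORTNO": "50002",
}


def load_parameters(text, section="parameters"):
    """Merge values from a config string over the defaults."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(text)
    params = dict(DEFAULTS)
    if parser.has_section(section):
        params.update(parser.items(section))
    return params


# A config file that defines a single parameter.
config_text = "[parameters]\nPANZ_PREDICTOR = DE,ARGOT\n"
params = load_parameters(config_text)
print(params["PANZ_PREDICTOR"], params["CONN_PORTNO"])
```

Only PANZ_PREDICTOR changes in this example; CONN_PORTNO keeps its default because the file does not mention it.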

The current configuration can be written out to a file for later reuse or editing:

import Parameters

Local database servers

SANSPANZ predicts functional descriptions and Gene Ontology (GO) terms using summary statistics calculated from the sequence neighborhood of the query sequence and database background frequencies. This leads to the complex architecture and web of dependencies depicted in Figure 1. There are two client-server interactions. SANSPANZ is a client sending requests to the SANSparallel and DictServer servers.

Server                 Description                                                                                Parameters
SANSPANZ/ [OPTIONS]    Server providing database statistics and GO associations when SANSPANZ is run locally.    CONN_HOSTNAME, CONN_PORTNO
SANSparallel server    Server providing sequence similarity search results when SANSPANZ is run locally.         CONN_SANSHOST, CONN_SANSPORT

We run both servers on localhost: the SANSparallel server at port 54321 and DictServer at port 50002. The servers are started as follows:

module add openmpi-x86_64
nohup mpirun -np 17 -output-filename /data/uniprot/u SANSparallel.2/server /data/uniprot/uniprot 54321 uniprot.Dec2015 &
nohup python ./SANSPANZ/ --CONN_PORTNO 50002 &

Updating local databases

It is imperative that DictServer tables refer to the same database release that is searched by SANSparallel. Therefore, use both remote SANSparallel and DictServer servers, or install both locally.

We mirror Uniprot and GO-associations from EMBL-EBI:

rsync -u -a " \
pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz \
pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz" /data/uniprot
gunzip gene_association.goa_uniprot.gz
Taxonomic distance is a component of Pannzer's regression models. The taxonomy table contains the lineage of all sequences in Uniprot.
cd /data/uniprot
curl "*&compress=yes&format=tab" >
The OBO file is part of the Gene Ontology (GO) and is required for GO prediction.
GO assignments are taken from the GOA file:
# GO dictionaries
rm -f goa_uniprot_all.gaf
gunzip goa_uniprot_all.gaf.gz
We include inverted ec2go and kegg2go mappings in GO prediction outputs:
rm -f ec2go kegg2go
perl < ec2go >
perl < kegg2go >
Uniprot is indexed for the SANSparallel server using scripts that are part of the SANSparallel distribution package:
perl /home/luholm/sansserver/ uniprot /data/uniprot/uniprot_sprot.fasta.gz /data/uniprot/uniprot_trembl.fasta.gz
Database statistics and dictionary tables (for DictServer) are composed using utility scripts/programs in the SANSPANZ/uniprot folder:
# update counts
perl -pe 's/^\S+\s+//'  /data/uniprot/uniprot.phr | perl -pe 's/ \w{2}=.*$//' | python -m Cleandesc -f tab -c 'desc' | cut -f 2 > x
sort x | uniq -c > /data/uniprot/uniprot.desc.uc.counts
perl -pe 's/ +/\n/g' x | sort | uniq -c > /data/uniprot/uniprot.word.uc.counts
perl 1 < /data/uniprot/uniprot.desc.uc.counts > /data/uniprot/nprot
perl 1 < /data/uniprot/uniprot.word.uc.counts > /data/uniprot/nwordtotal
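The sort | uniq -c steps above amount to frequency counts of whole descriptions and of individual words. A minimal Python sketch of the same counting is shown below, assuming the descriptions have already been cleaned and that words are separated by spaces; the function name and the sample descriptions are hypothetical.

```python
from collections import Counter


def count_descriptions_and_words(descriptions):
    """Count how often each description and each word occurs."""
    desc_counts = Counter(descriptions)
    word_counts = Counter(
        word for desc in descriptions for word in desc.split()
    )
    return desc_counts, word_counts


# Made-up, already cleaned descriptions.
cleaned = ["PROTEIN KINASE", "PROTEIN KINASE", "ABC TRANSPORTER"]
desc_counts, word_counts = count_descriptions_and_words(cleaned)
print(desc_counts["PROTEIN KINASE"], word_counts["PROTEIN"])
```

For a database the size of Uniprot the shell pipeline is the more practical choice, since it does not need to hold all descriptions in memory.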
The following procedure computes the weight (information content) of each node given the GO structure in the obo file (go-basic.obo) and the counts of GO assignments in the database (Uniprot). We use all evidence codes to get more representative counts, and add one pseudocount to every GO term so that they have a defined weight.
# GO hierarchy and information content of GO terms
python -m obo -i /data/uniprot/go-basic.obo -o ','
cut -f 2,5 /data/uniprot/goa_uniprot_all.gaf | python -m gaf2propagated -f tab -c "qpid goid" -o ",/data/uniprot/godict.txt" 2> err
cut -f 2 /data/uniprot/godict.txt | sort | uniq -c | perl -pe 's/^\s+//' | perl -pe 's/ /\t/' > godict_counts
python -m BayesIC -i godict_counts -f tab -c "qpid propagated" --eval_OBOTAB -o "," 
# GO propagated parent list using gorelatives
grep '^id: GO:' /data/uniprot/go-basic.obo | perl -pe 's/id: GO://' > /data/uniprot/go.list
./gorelatives -b go-basic.obo -q /data/uniprot/go.list -r isa,partof,altid -d parents -l > /data/uniprot/go_data
grep ^UniProtKB /data/uniprot/goa_uniprot_all.gaf | cut -f 2,5 | perl /data/uniprot/go_data > /data/uniprot/mergeGO.out
DictServer reads mergeGO.out (glob.param['DATA_GOIDELIC']) and godict.txt (glob.param['DATA_GODICT']).
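To illustrate why a pseudocount gives every GO term a defined weight, here is a simplified Python sketch. It computes a plain (unconditional) information content from propagated counts, taking the largest padded count as the reference so that the root term gets weight zero. This is a simplification under stated assumptions and not the BayesIC procedure used above; the function name and the counts are hypothetical.

```python
import math


def information_content(counts, all_terms):
    """Unconditional information content with one pseudocount per term.

    counts: propagated annotation counts per GO term (terms may be missing).
    all_terms: every GO term in the ontology.
    """
    padded = {term: counts.get(term, 0) + 1 for term in all_terms}
    reference = max(padded.values())  # the root has the largest count
    return {
        term: -math.log2(n / reference) for term, n in padded.items()
    }


# Made-up counts; "unseen" has no annotations in the database.
ic = information_content(
    {"root": 7, "child": 3}, ["root", "child", "unseen"]
)
print(ic["child"], ic["unseen"])
```

Without the pseudocount the unseen term would have a count of zero and its logarithm would be undefined.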

The species list is consulted by a CGI script, which is part of the web server, to give hints of the scientific names matching the prefix entered by the user:

cut -f 3 /data/uniprot/ > specieslist
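The hinting behaviour can be sketched as a prefix lookup in a sorted list of names. The sketch below is an assumption about the general approach, not the CGI script itself; the function name suggest_species is hypothetical and matching is case-sensitive for simplicity.

```python
from bisect import bisect_left


def suggest_species(prefix, sorted_names, limit=10):
    """Return up to limit names from a sorted list that start with prefix."""
    start = bisect_left(sorted_names, prefix)
    hits = []
    for name in sorted_names[start:]:
        if len(hits) == limit or not name.startswith(prefix):
            break
        hits.append(name)
    return hits


names = sorted(["Macaca mulatta", "Macaca fascicularis", "Homo sapiens"])
print(suggest_species("Macaca", names))
```

Binary search finds the first candidate quickly, and the scan stops as soon as a name no longer matches the prefix.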

Evaluation of GO predictions

The script can be used to evaluate the correctness of GO predictions using metrics similar to those of the CAFA competition. The evaluation metrics are
(1) F = 2 * pr * re / (pr + re) = 2 * TP / (T + P)

(2) J = TP / (T + P - TP)

(3) S = 1/ne * sqrt(mi^2 + ru^2) = 1/ne * sqrt( (wP-wTP)^2 + (wT-wTP)^2 )

(4) wF = 2 * wTP / (wT + wP)

(5) wJ = wTP / (wT + wP -wTP)
where pr is precision, re is recall, ru is remaining uncertainty and mi is misinformation. T, P and TP refer to [the cardinality of sets of] nodes in the GO hierarchy. T is the set of GO terms associated with a query protein in the reference of truth. P is the set of predicted GO terms above a score threshold tau. TP is the intersection of T and P ("true positives"). T, P and TP give a unit weight to each node. wT, wP and wTP weigh the nodes by their information content conditional on the parent nodes in the GO hierarchy being also true/predicted (Clark and Radivojac 2013). Fmax, wFmax, Jmax, wJmax and Smin are the optimum values reached at any score threshold. Smin is normalized by the number of proteins with predictions, ne.
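The five formulas above can be checked on a toy example with a short Python sketch. It evaluates the metrics at a single score threshold for one set of true and predicted terms; the function name go_metrics, the term names and the weights are hypothetical, and the weights are simply taken as given rather than computed from the GO hierarchy.

```python
from math import sqrt


def go_metrics(truth, predicted, weight, n_proteins=1):
    """Compute F, J, S, wF and wJ from sets of true and predicted GO terms."""
    true_positives = truth & predicted
    T, P, TP = len(truth), len(predicted), len(true_positives)
    wT = sum(weight[term] for term in truth)
    wP = sum(weight[term] for term in predicted)
    wTP = sum(weight[term] for term in true_positives)

    def ratio(numerator, denominator):
        # Define the metric as 0 when there is nothing to compare.
        return numerator / denominator if denominator else 0.0

    return {
        "F": ratio(2 * TP, T + P),
        "J": ratio(TP, T + P - TP),
        "S": sqrt((wP - wTP) ** 2 + (wT - wTP) ** 2) / n_proteins,
        "wF": ratio(2 * wTP, wT + wP),
        "wJ": ratio(wTP, wT + wP - wTP),
    }


# Made-up terms and weights.
weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
metrics = go_metrics({"a", "b", "c"}, {"b", "c", "d"}, weights)
print(metrics)
```

To obtain Fmax, wFmax, Jmax, wJmax and Smin, the same calculation would be repeated for each score threshold tau and the best value kept.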

Predicted GO annotations are compared to a reference of truth. Below, the truth is extracted from GOA file based on experimental evidence. Your actual test set is probably a subset of this.

cut -f 2,5,7 /data/uniprot/goa_uniprot_all.gaf | egrep 'EXP|IDA|IPI|IMP|IGI|IEP' | python -R -m gaf2propagated -f tab -c "qpid goid evidenceCode" -o ",goa_truth_propagated"
Generate predictions using the Pannzer method. Assuming that predictions and the reference of truth have matching protein accession numbers, evaluation metrics are output by the GOevaluation method:
python -R -m gaf2propagated -f tab -i testdata/gotest_truth.gaf -o ",gotest_truth_propagated" -c "qpid goid"
python -R -m GOevaluation -f tab -o ',,eval1' -i testdata/GO.out --eval_SCOREFUNCTIONS "RM3_PPV ARGOT_PPV JAC_PPV HYGE_PPV" --eval_TRUTH testdata/gotest_truth_propagated
As in CAFA, we provide a naive method that uses the base frequency of GO terms as the probability of prediction:
python -R -m naive -i testdata/target_identifiers.list -o ',naive_predictions.out' -c 'qpid' -f tab
python -R -m GOevaluation -i testdata/naive_predictions.out -f tab -o ',,eval2' --eval_SCOREFUNCTIONS "frequency" --eval_TRUTH testdata/gotest_truth_propagated  2> err
paste eval1 eval2

Evaluation of DE predictions

DSM is a description similarity measure, defined as the cosine similarity of TF-IDF weighted word vectors. It can be used to compare DE predictions by Pannzer to a reference of truth (e.g., the original descriptions). The following example first runs Pannzer to generate DE predictions, prepares a reference of truth, and finally runs the evaluation:
python -R -m Pannzer --PANZ_PREDICTOR DE -i human_subset.fasta -f FASTA -o 'human_subset.sans,DE.out,,' 
grep '^>' human_subset.fasta| perl -pe 's/ /\t/' | perl -pe 's/^>//' | python -R -m wordweights -c "qpid desc" -f tab -k 1000 >
python -R -m DE_evaluation -i DE.out -f tab --eval_TRUTH -o ",DE.eval" 
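The cosine similarity of TF-IDF weighted word vectors can be sketched in plain Python as follows. The exact weighting used by the DE_evaluation method is not specified here, so the logarithmic IDF, the lower-casing, the treatment of unknown words and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vector(description, doc_freq, n_docs):
    """Weight each word by term frequency times log inverse document frequency."""
    term_freq = Counter(description.lower().split())
    return {
        word: count * math.log(n_docs / doc_freq.get(word, 1))
        for word, count in term_freq.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dsm(predicted, truth, doc_freq, n_docs):
    """Description similarity between a predicted and a true description."""
    return cosine(
        tfidf_vector(predicted, doc_freq, n_docs),
        tfidf_vector(truth, doc_freq, n_docs),
    )


# Made-up document frequencies in a database of 100 descriptions.
doc_freq = {"protein": 90, "kinase": 10, "transporter": 5}
print(dsm("Protein kinase", "Serine protein kinase", doc_freq, 100))
```

Common words such as "protein" get a low weight, so agreement on rare, informative words dominates the score.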


Pannzer2 is implemented in the SANSPANZ framework. New functionalities are easily implemented by creating new operator classes.

PANNZER and SANS references:

  1. Koskinen P, Holm L (2012) SANS: High-throughput retrieval of protein sequences allowing 50 % mismatches. Bioinformatics 28, i438-i443
  2. Radivojac, P., Clark, W., Oron, T., Schnoes, A., Wittkop, A., Sokolov, A., Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., Pandey, G., Yunes, G., Talwalkar, A., Repo, S., Souza, M., Piovesan, D., Casadio, R., Wang, Z., Cheng, Z., Fang, H., Gough, J., Koskinen, P., Törönen, P., Nokso-Koivisto, J., Holm, L., Cozzetto, D., Buchan, D., Bryson, K., Jones, D., Limaye, B., Inamdar, H., Datta, A., Manjari, S., Joshi, R., Chitale, M., Kihara, D., Lisewski, A.M., Erdin, S., Venner, E., Lichtarge, O., Rentzsch, R., Yang, H., Romero, A., Bhat, P., Paccanaro, A., Hamp, T., Kassner, R., Seemayer, S., Vicedo, E., Schaefer, C., Achten, D., Auer, F., Boehm, A., Braun, T., Hecht, M., Heron, M., Honigschmid, P., Hopf, T., Kaufmann, S., Kiening, M., Krompass, D., Landerer, C., Mahlich, Y., Roos, M., Björne, J., Salakoski, T., Wong, A., Shatkay, H., Gatzmann, F., Sommer, I., Wass, M., Sternberg, M., Skunca, N., Supek, F., Bošnjak, M., Panov, P., Dzeroski, S., Šmuc, T., Kourmpetis, Y., van Dijk, A., ter Braak, C., Zhou, Y., Gong, Q., Dong, X., Tian, W., Falda, M., Fontana, P., Lavezzo, E., Di Camillo, B., Toppo, S., Lan, L., Djuric, N., Guo, Y., Vucetic, S., Bairoch, A., Linial, M., Babbitt, P., Brenner, S., Orengo, O., Rost, B., Mooney, S. & Friedberg, I. (2013) "A large-scale evaluation of computational protein function prediction". Nature Methods 10, 221-227
  3. P Koskinen, P Törönen, J Nokso-Koivisto, L Holm (2015) PANNZER - High-throughput functional annotation of uncharacterized proteins in an error-prone environment. Bioinformatics 31 (10), 1544-1552
  4. Somervuo P, Holm L (2015) SANSparallel: interactive homology search against Uniprot. Nucl. Acids Res. 43, W24-W29

Genome sequencing projects that use PANNZER in functional annotation:


-Melitaea cinxia (Glanville fritillary butterfly) (publication)
-Pusa hispida saimensis (Saimaa ringed seal) [unpublished]
-Betula pendula (Silver birch) (publication)


-Pectobacterium wasabiae SCC3193 (publication)
-Dickeya solani (publication)
-Propionibacterium freudenreichii DSM 20271T (publication)
-Neorhizobium galegae (publication)
-Staphylococcus chromogenes [unpublished]
-Pectobacterium carotovorum (publication)
-Lactobacillus oligofermentans 22743 (publication)
-Lactococcus piscium MKFS47 (publication)
-Leuconostoc gasicomitatum KG1-16 [unpublished]

Transcriptome sequencing projects that use PANNZER in functional annotation:


-Podosphaera plantaginis (publication)
-Taphrina betulina [unpublished]
-Malagasy dung beetles (multiple species) [unpublished]
-Penicillium (multiple species) [unpublished]
-Gerbera [unpublished]
-Conium [unpublished]
-Pinus sylvestris [unpublished]
-Actinodium [unpublished]
-Hydrangea [unpublished]
-Viburnum [unpublished]