SANSPANZ framework: For developers

SANSPANZ is a framework for homology based protein function prediction and evaluation of GO predictions. It is written in Python and has a modular design that is easy to extend.

SANSPANZ provides access methods to two core servers: SANSparallel, for sequence similarity search, and the DictServer, a key/value store of metadata related to protein annotations (GO annotations, GO class frequencies, description line word frequencies, species taxonomy, etc). SANSPANZ client scripts access the servers via TCP (local servers) or HTTP protocols (remote servers). The remote servers are maintained by the authors (Holm group).

SANSPANZ accepts query sequences in FASTA format or tabular format similar to BLAST output format 6. The main loop breaks the input data into batches and, in the case of raw protein sequences, calls the SANSparallel server to get the sequence neighbourhood for each query. For all database hits, the DictServer is called to retrieve annotation metadata.

Data processing is based around a two-dimensional data matrix containing individuals (rows) and attributes (columns). Columns and rows can be appended to SpreadSheets by classes that inherit from the Operator class. Operators are defined as either row-wise (per database hit) or block-wise (per input query) depending on the nature of the statistic being calculated. Within SANSPANZ, all data processing is performed by Operators which can even be chained together to create composite Operators.

The idea behind the SANSPANZ framework is that new functionality can be developed by simply writing a new operator class, which is called from the standard workflow (Figure 1).

Overview of the class library
Generic script
Worked example
Implementing and evaluating novel scoring functions for function prediction
1. Scoring function
2. Evaluation
Summary: how to write your own SANSPANZ application

Overview of the class library

There is help text in the Python code. The main classes are listed below.

Parameters Defines the workspace with spreadsheets, dictionaries, and parameters

Runner The main loop handling the input data stream

DictServer Handles communications with the core servers

SpreadSheet A data object

operators Subclassed from myoperator. Modifies the contents of spreadsheets. The only part a developer needs to touch.

Parameters	Defines the workspace with spreadsheets, dictionaries, and parameters
Runner	The main loop handling the input data stream
DictServer	Handles communications with the core servers
SpreadSheet	A data object
operators	Subclassed from myoperator. Modifies the contents of spreadsheets. The only part a developer needs to touch.

The operators can call utility functions including the following modules:

XMLParser Specific function to parse SANSparallel results

HYGE Functions for calculating hypergeometric statistics

GSZ Functions for calculating the Gene Set Z-score statistic

Read_and_Print Functions for loading dictionaries from files

PannzerFunctions Specific functions used by the Pannzer predictor.

GO_dist_functions A collection of GO distance related functions.

XMLParser	Specific function to parse SANSparallel results
HYGE	Functions for calculating hypergeometric statistics
GSZ	Functions for calculating the Gene Set Z-score statistic
Read_and_Print	Functions for loading dictionaries from files
PannzerFunctions	Specific functions used by the Pannzer predictor.
GO_dist_functions	A collection of GO distance related functions.

Figure 1. The main loop of SANSPANZ.

Generic script

This script will apply an operator to the input data stream. All parameters are stored in the dictionary glob.param. Default values can be overwritten by command line options or a configuration file. The glob object handle is passed to all operators. The appropriate data structures are created when Runner instantiates the operator. LazyRunner has two obligatory arguments: the input file ("--" for STDIN) and a list of output files ("--" for STDOUT). The default input format is FASTA.

import Parameters, Runner
if __name__ == '__main__' :
   glob=Parameters.WorkSpace(configFile=None)
   z=Runner.Runner(glob,operator_name=glob.param['input_OPERATOR'],CHUNK=glob.param['input_CHUNK'], liveData=glob.param['input_LIVEDATA'])
   z.lazyRunner(glob.param['input_FILE'], glob.param['input_OUTFILES'].split(","), queryspecies=glob.param['input_QUERYSPECIES'], colnames=glob.param['input_COLNAMES'], block_column_name=glob.param['input_BLOCK_
COLUMN_NAME'], input_format=glob.param['input_FORMAT'])

The customized script below uses hardcoded parameters to execute the Pannzer method of function prediction starting from a set of sequences in FASTA format. On line 4, we make sure that the SANSparallel server and DictServer will be accessed remotely. The query species is defined using the -s option on the command line.

import Parameters, Runner
if __name__ == '__main__' :
   glob=Parameters.WorkSpace()
   glob.param['CONN_REMOTE']=True
   z=Runner.Runner(glob,operator_name='Pannzer')
   z.lazyRunner("--",[None, "DE.out", "GO.out", "anno.out"],queryspecies=glob.param['input_QUERYSPECIES'])

The lazyRunner method has a number of qualifiers that can be used to pass on information about the structure of the input data. As said above, the defult input format is FASTA formatted sequences (input_format="FASTA"). The other alternative of input format is tabular data (input_format="tab").

Raw sequences are automatically sent to the SANSparallel server to get the sequence neighbourhood for each query and converted to a tabular format.

Tabular data is assumed to have column names on the first row (comment lines starting with '#' are ignored). The column separator is a tab. If the input data does not have a header row, you can define column names using the colnames qualifier. It is a string of concatenated column names, separated by spaces. For example, GO predictions could have three columns defined by colnames="qpid goid score". Column order is free, as the operators identify data columns by the header string.

Depending on the nature of the operator, the input data stream can be grouped by the variable defined by block_column_name. The default is 'qpid' (query protein identifier) when the data is a list of sequence neighbours of the query protein.

Finally, lazyRunner can be given the queryspecies qualifier if the query species is not defined in the input data. The species information is used by Pannzer.

Worked example

This command applies the taxonassignment operator to user inputs, which are FASTA-formatted protein sequences.

python /data/liisa/PANNZER2/runsanspanz.py -m taxonassignment -i example_input.fasta -o ',--' 2> /dev/null

This is the input:

>scaffold100_0M.1052_C SIZE=0 RANGE=448350-449212 (+) RNA-directed DNA polymerase
YFKRIIHHNQVGFIPGLQGWFNILKSINVIQYINKRKNKNHMILSIDAEKAFDNVQHPFLIKTLQSVGIEGTYLNIIKAIYEKPTANIILNGEKLRAFPLRSGTWQGCPLSPLLFNIVLEVLVTAIRQQKIKGIIGKEEVKLSLFADDMILYVENPKDSTPKLELIQEFSKVAGYKFNAQKSIAFIYTNNKTEEREIKESIPFTIAPKTIR
YLGINLTKEAKNLYCENYKILMKAIEEDTKKWKNVPCLWIGRTNIVRMSMLPRAIYTFNEILSKYHQHFSKKWN
>scaffold100_0M.1321_A SIZE=0 RANGE=497083-497991 (+) Endonuclease (Fragment)
KNIHIDQWHRIESPDMDPQLYGQLIFSKAGKNIQWKKDSLFNKWCWENWTSIYRRMKLNHSLTPYTKINSKWMKDLNVRQEAIKVLEENIDSNLLDIGHSNFFQDMSPKARETKVKMNFWDFIKIKSFCTAKETVNKTKRQPMEWEKIFANDTTDKGLVSKIYKELLKLNTQKTNNQVKKWAEDMNRHFSEEDIQMANRHMKKCSSSLAIR
EIQIKTTLRYHLTPVRMAETDKARNNKCWRGCGERGTLLHCWWECKLVPLWKTVWSFLKKLKIELLYDPAIALLGIYPKDSDVVKRRAICT

This is the output (the numbers may change with database updates):

qpid    taxon   count   total   frequency       rank
scaffold100_0M.1052_C   Eukaryota; Metazoa      186     195     0.953846153846  1
scaffold100_0M.1321_A   Eukaryota; Metazoa      208     220     0.945454545455  1

Implementation

The taxonassigment operator (operators/taxonassigment.py) is implemented as a composite operator which calls two simpler operators. Each operator works on one or more spreadsheets. The rows of the spreadsheet represent data associated with a primary key. If you compute a summary over a group of rows, you can output the summary data to a new spreadsheet which uses the group as the new primary key. Below, we will dissect the simpler operator classes first. The first operator illustrates simple data manipulation within a spreadsheet. The second operator illustrates how you can import data from online dictionaries. Finally, the third operator computes a summary over groups of rows in the first spreadsheet and outputs the summary to a second spreadsheet.

Taxon: a simple operator

Operators create new columns in spreadsheets. This is a simple operator (operators/taxon.py) which does some string processing and demonstrates the essential elements of an operator class.

 1:from myoperator import RowOperator
 2:import sys
 3:
 4:# Clone a rowwise operator. Class name and file name (operators/classname.py) must be identical.
 5:class taxon(RowOperator) :
 6:    """Extract taxon from lineage to taxon_depth."""
 7:    def __init__(self,glob):
 8:        # define nicknames for column indices
 9:        [self.lineage_col, self.taxon_col]=glob.use_sheet("data").use_columns(["lineage","taxon"])
10:        self.taxon_depth=int(glob.param['TAXON_DEPTH'])
11:
12:    def process(self, row):
13:     x=row[self.lineage_col]
15:     tmp=x.split(";")
16:     if len(tmp) > self.taxon_depth:
17:             row[self.taxon_col]=";".join(tmp[0:self.taxon_depth])
18:     else:
19:             row[self.taxon_col]=";".join(tmp)

Line 5: this operator is a subclass of RowOperator. The other possibility is BlockOperator. Select RowOperator if each line of the input stream is processed independently. Select BlockOperator if the input stream is chopped into blocks that belong to one unit (such as a query sequence and its homologs). Note that the name of the class (taxon) must be identical to the name of the file (operators/taxon.py) so that it can be found by operator name.
Line 7: all operators are passed the glob object handle. This makes parameters, spreadsheets and dictionaries visible to the class. (Note that glob is the only argument passed to an operator when it is instantiated. Therefore, if you wanted to change the value of taxon_depth, it should be moved to glob.param.)
Line 9: Spreadsheets and columns are recognized by their names. By convention, all data from the input stream is imported into a spreadsheet called "data". This operator reads from a column named "lineage" and writes to a column named "taxon". glob.use_sheet("data") returns a handle to the spreadsheet object. The spreadsheet method use_columns returns the column indices of the named columns, and creates a new column in the spreadsheet if it did not exist before.
Line 12: every operator class must have a method called process.
Lines 13,17: this operator extracts the first few levels of the taxonomic lineage from the column named "lineage" and writes them to the column named "taxon"

Lineage operator: using dictionaries

operators/lineage.py is another simple operator with much the same anatomy as the taxon operator except that it uses a dictionary.

 1: import sys
 2: from myoperator import RowOperator
 3: 
 4: # Clone a rowwise operator. Class name and file name (operators/classname.py) must be identical.
 5: class lineage(RowOperator) :
 6:    """Do a dictionary lookup for the taxonomic lineage of a species."""
 7:     def __init__(self,glob):
 8:         self.glob=glob
 9:         # define nicknames for column indices
10:         [self.species_col, self.lineage_col]=self.glob.use_sheet("data").use_columns(["species","lineage"])
11:         # use online dictionary. Object handles in glob are hardcoded
12:         self.glob.use_online_dictionaries(["LINEAGE"])
13: 
14:     def process(self,row):
15:         species=row[self.species_col]
16:         row[self.lineage_col]=self.get_lineage(species)
17: 
18:     def get_lineage(self,species):
19:         ucspecies=species.upper() # lineage keys are UPPERCASE species
20:         if ucspecies in self.glob.lineage: return(self.glob.lineage[ucspecies])
21:         if species != "unknown": sys.stderr.write("# Warning: unknown species ||%s||\n" %species)
22:         return("unclassified")

Line 12 subscribes to the LINEAGE dictionary.

Line 20 does a lookup in the LINEAGE dictionary.

Dictionaries can be online or local. We use online dictionaries to save memory and loading time. The DictServer.lookup_key_values() method queries an online dictionary with a (table, key) tuple and it returns a (table, key, value) tuple.

Runner.lazyRunner caches data for all dictionaries below, because the same keys are likely to recur frequently in lists of homologous proteins. For LINEAGE, keys are taken from column 'species'. For GODICT, keys are taken from column 'spid' or 'qpid'. For DESCCOUNT and WORDCOUNT, keys are taken from column 'desc'. GOIDELIC downloads a copy of the entire GO structure (about 44,000 GO terms). Object handles to the cached dictionaries are defined in glob and fixed.

The available online dictionaries are the following. All keys must be uppercase.

None

Subscribe to Table name Key Value Object handle to cached data

LINEAGE LINEAGE species scientific name Taxonomic lineage of species glob.lineage

TAXID TAXID species scientific name NCBI numeric identifier of taxon glob.taxid

WORDCOUNT WORDCOUNT word Number of descriptions in uniprot that contain word glob.wordcounts

WORDCOUNT NWORDTOTAL None Total number of unique words (cleaned) in uniprot DE-lines glob.nwordtotal (constant)

DESCCOUNT DESCCOUNT uniprot description Number of identical descriptions in uniprot glob.desccounts

DESCCOUNT NPROT None Total number of protein entries in uniprot glob.nprot (constant)

GODICT GODICT uniprot accession GO identifiers assigned to uniprot entry by GOA glob.GOdict

GOIDELIC GOCOUNTS GO identifier Total number of uniprot entries assigned to GO term (propagated to parents) glob.GOcounts

GOIDELIC GOPARENTS GO identifier GO identifiers of parental line glob.GOparents

GOIDELIC GODESC GO identifier Short description of GO term glob.godesc

GOIDELIC ONTOLOGY GO identifier BP, MF or CC glob.ontology

GOIDELIC IC GO identifier Information content of GO term glob.IC

GOIDELIC EC GO identifier Inverse of ec2go glob.EC

GOIDELIC KEGG GO identifier Inverse of kegg2go glob.KEGG

You can naturally also load your own dictionary from a file using, for example, Read_and_Print.read_dict_data(). The data should be tab-separated (key,value) pairs. Attach your dictionary to the workspace object (here: glob) if it needs to be visible to other operators.

Taxonassigment: a composite operator using multiple spreadsheets

This operator calls the above simple operators to add lineage and taxon columns to the inputs. It then proceeds to compute a summary for each query sequence. In this case, it outputs the support of the most frequent taxon.

1:from myoperator import BlockOperator
2:import operator
3:
4:# Clone a blockwise operator. Class name and file name (operators/classname.py) must be identical.
5:class taxonassignment(BlockOperator) :
6:    """Outputs most frequent taxon per block."""
7:    def __init__(self,glob, maxrank=1):
8:      # parameters used by self.process()
9:      self.glob=glob
10:     self.maxrank=maxrank
11:     # define private object handle to spreadsheets (glob maps handles by name)
12:     [self.data,self.summary_data]=glob.use_sheets(["data","summary"])
13:     # define nicknames for column indices
14:     [self.qpid_col1,self.isquery_col,self.taxon_col1]=self.data.use_columns(["qpid","isquery","taxon"])
15:     self.summary_data.use_columns(["qpid","taxon","count","total","frequency","rank"])
16:     # this is a composite operator
17:     [self.a,self.b]=glob.use_operators(['lineage','taxon'])
18:
19:    def process(self, block):
20:     # call preprocessing operators
21:     for row in block:
22:                self.a.process(row) # add lineage
23:                self.b.process(row) # add taxon
24:     # get query
25:     qpid='unknown'
26:     for row in block:
27:                if row[self.isquery_col]=="1": qpid=row[self.qpid_col1] # string comparison
28:     # compute sum statistics
29:     n=0
30:     cnt={}
31:     for row in block:
32:             n+=1
33:             taxon=row[self.taxon_col1]
34:             if not taxon in cnt: cnt[taxon]=0
35:             cnt[taxon]+=1
36:     # reverse sort by counts
37:     sorted_cnt = sorted(cnt.items(), key=operator.itemgetter(1), reverse=True)
38:     # save top ranks in summary table
39:     rank=0
40:     for (taxon,cnt) in sorted_cnt:
41:             rank+=1
42:             freq=0
43:             if n>0: freq=cnt/float(n)
44:             # assume fixed column order
45:             datarow=[qpid,taxon,str(cnt),str(n),str(freq),str(rank)]
46:             self.summary_data.append_row(datarow)
47:             if rank>=self.maxrank: break

Line 12: glob.use_sheets returns object handles to spreadsheets. By convention, input data is imported to "data". Here, we create a second spreadsheet with the name "summary". If another operator class were to modify the second spreadsheet, it would be recognized by the shared name.
Lines 14-15: the column indices refer to a specific spreadsheet.
Line 17: this instantiates the lineage and taxon operators, which are part of our composite work flow.
Lines 21-23: lineage and taxon are RowOperators that are applied row-wise
Lines 26-27: this captures information about the query sequence from SANSparallel output which has its own format conventions
Line 45: spreadsheets contain only strings. Therefore you must make the appropriate format conversions after performing arithmetics.
Line 46: create a row in the second table for this query protein

Implementing and evaluating novel scoring functions for function prediction

Scoring function

The Pannzer method involves homology search, filtering and clustering of the sequence neighborhood, and scoring GO classes found in the sequence neighborhood. The scoring is done by an operator called RM3 and different variants of scoring functions are implemented as functions within the RM3 class. This way they can share basic statistical variables that are computed by the RM3.process() function. The function below implements a simple baseline function called 'SANS':

        def SANS_score(self,goid):
                "Frequency of GO classes in sequence neighborhood. Returns (raw score, PPV)."
                obs=float(self.goclasscount[goid])/self.sampleSize
                ppv=obs
                return(obs,ppv)

Scoring functions can be specified by name from the command line (--PANZ_PREDICTOR, --eval_SCOREFUNCTIONS), and the RM3.process method will then execute the appropriate xxx_score function.

Evaluation

Let's assume we have test set of protein sequences called testset.fasta and a reference of truth in GAF format (qpid, goid) called testset_truth.gaf. The reference of truth is preprocessed by propagating the direct GO term assignments to all parents in the GO hierarchy:

python runsanspanz.py -m gaf2propagated -i testset_truth.gaf -f tab -c "qpid goid" -o ",testset_truth_propagated"

Below, we save the result from homology search for reuse later with different prediction parameters. Scoring functions for prediction and evaluation are selected with the --PANZ_PREDICTOR and --eval_SCOREFUNCTIONS arguments.

python runsanspanz.py -m SANS -f FASTA -i testset.fasta -o testset.fasta.sans
python runsanspanz.py -m Pannzer -f tab -i testset.fasta.sans --PANZ_PREDICTOR "RM3,ARGOT,JAC,HYGE,SANS" -o ',,testset.fasta.sans.GO,'
python runsanspanz.py -m GOevaluation -f tab -i testset.fasta.sans.GO --eval_TRUTH testset_truth_propagated --eval_SCOREFUNCTIONS "RM3_PPV ARGOT_PPV JAC_PPV HYGE_PPV SANS_PPV" -o ',,testset.fasta.sans.GO.eval'

Summary: how to write your own SANSPANZ application

The main application script

The main application script is very short. It must instantiate a Parameters.WorkSpace object (glob) which is passed to the Runner object. Runner.lazyRunner must be informed about the format and source of input data and about the destinations of output data. Below is a minimal example.

import Parameters, Runner
if __name__ == '__main__' :
    glob=Parameters.WorkSpace()
    z=Runner.Runner(glob, operator_name='myfavourite')
    z.lazyRunner("--", ["--"],input_format="tab",colnames="col1 col2")

Operator classes

You can implement a new functionality by writing a new operator in the SANSPANZ/operators directory. The operator class is subclassed from BlockOperator or RowOperator, for example:

from myoperator import BlockOperator
class myfavourite(BlockOperator):

from myoperator import RowOperator
class myfavourite(RowOperator):

(The type TextOperator was created as an afterthought, it bypasses the main loop and passes each line of the input stream directly to the operator's process() function. operators/obo.py is a TextOperator that parses the GO hierarchy out of an OBO file.)

The __init__ method is called with a Parameters.WorkSpace() object handle, for example:

	def __init__(self,glob):

The __init__ method should define:

which spreadsheets are used, for example

	 [self.data,self.summary_data]=glob.use_sheets(["data","summary"])

	self.data=glob.use_sheet("data")

which columns are going to be processed, for example

	[self.qpid_col1,self.isquery_col,self.taxon_col1]=self.data.use_columns(["qpid","isquery","taxon"])
	[self.qpid_col2,self.taxon_col2,self.count_col,self.total_col,self.freq_col,self.rank_col]=self.summary_data.use_columns(["qpid","taxon","count","total","frequency","rank"])

	[self.lineage_col, self.taxon_col]=glob.use_sheet("data").use_columns(["lineage","taxon"])

which dictionaries are going to be used, for example
```
	glob.use_online_dictionaries(["LINEAGE","WORDCOUNT","DESCCOUNT","GOIDELIC","GODICT"])
```
or
```
	glob.load_dictionaryname()
```
where dictionaryname is one of lineage, wordcounts, desccounts, GOdict, GOdict_noIEA, goidelic, nprot or nwordtotal. The respective data file names are defined by parameters in glob.param. Small private dictionaries can be loaded from a file using the function Read_and_Print.read_dict_data()

which other operators are going to be used, for example

	[self.a,self.b]=glob.use_operators(['lineage','taxon'])

Every operator class must have a method called process, for example:

	def process(self, block):

	def process(self, row):

The process() method writes new values to the cells of the spreadsheet objects, and this is the part that is entirely up to you. The Runner object (see main application script) takes automatically care of all input and output.

The operator class can also have a finalise() method which is called after the main loop has finished. This can for example print out the total number of queries that have been processed.

Novel function prediction scores

Create a new function NEWFUNC in the RM3 operator. Its argument is a GO identifier and it must return a raw score and a normalized score between zero and one. This is the only piece of code that needs to be written. You can use the new function in prediction and evaluation by giving its name as a command line argument.
To generate predictions with the new function specify
```
-m Pannzer --PANZ_PREDICTOR NEWFUNC
```
on the command line. You can include several scoring functions in one Pannzer run.
To generate a reference of truth from a GAF file specify
```
-m gaf2propagated -i mytruth.gaf -o ",mytruth"
```
on the command line.
To evaluate predictions specify
```
-m GOevaluation --eval_SCOREFUNCTIONS NEWFUNC --eval_TRUTH mytruth
```
on the command line. You can include several scoring functions in one GOevaluation run.

Subscribe to	Table name	Key	Value	Object handle to cached data
LINEAGE	LINEAGE	species scientific name	Taxonomic lineage of species	glob.lineage
TAXID	TAXID	species scientific name	NCBI numeric identifier of taxon	glob.taxid
WORDCOUNT	WORDCOUNT	word	Number of descriptions in uniprot that contain word	glob.wordcounts
WORDCOUNT	NWORDTOTAL	None	Total number of unique words (cleaned) in uniprot DE-lines	glob.nwordtotal (constant)
DESCCOUNT	DESCCOUNT	uniprot description	Number of identical descriptions in uniprot	glob.desccounts
DESCCOUNT	NPROT	None	Total number of protein entries in uniprot	glob.nprot (constant)
GODICT	GODICT	uniprot accession	GO identifiers assigned to uniprot entry by GOA	glob.GOdict
GOIDELIC	GOCOUNTS	GO identifier	Total number of uniprot entries assigned to GO term (propagated to parents)	glob.GOcounts
GOIDELIC	GOPARENTS	GO identifier	GO identifiers of parental line	glob.GOparents
GOIDELIC	GODESC	GO identifier	Short description of GO term	glob.godesc
GOIDELIC	ONTOLOGY	GO identifier	BP, MF or CC	glob.ontology
GOIDELIC	IC	GO identifier	Information content of GO term	glob.IC
GOIDELIC	EC	GO identifier	Inverse of ec2go	glob.EC
GOIDELIC	KEGG	GO identifier	Inverse of kegg2go	glob.KEGG