- About
- Download
- User Manual
- Example Pipeline Runs
- Example Goscores Runs
- Test other evaluation metrics with ADS
- References
Abstract
GO classifiers and other methods for automatic annotation of novel sequences play an important role in modern biosciences, so it is important to assess the quality of different GO classifiers. Evaluation of GO classifiers depends heavily on the evaluation metrics used. Still, there has been little research on how the choice of metric affects the resulting method ranking. Indeed, most evaluation metrics are simply borrowed from machine learning without any testing of their applicability to GO classification.
We propose a novel, simple comparison of metrics, called Artificial Dilution Series (ADS). We start by selecting a set of annotations that are known a priori to be correct (referred to here as the Truth Set). We create multiple copies of the Truth Set and introduce a different amount of error into each copy. This creates a "series" of annotation sets in which the percentage of original correct annotations (the "signal") decreases from one end of the series to the other. Next we test metrics to see which of them are good at separating annotation sets at different signal levels. In addition, we test metrics with various false positive annotation sets and show where these rank within the generated signal range.
Altogether these test datasets allow us to test various evaluation metrics, specifically on the selected Truth Set. Note that real-life datasets often pose challenges to evaluation metrics, and Automated Function Prediction (AFP) is no exception to this rule. Test datasets can, for example, be too small, or the annotation matrix can be too sparse. Using ADS allows one to test which evaluation metrics perform well with the selected Truth Set.
We compared a large set of evaluation metrics with ADS on three different datasets, revealing drastic differences between them. In particular, we show how some metrics rank false positive datasets as highly as 100% correct datasets, and how some metrics perform poorly at separating the different error levels. This work A) shows that evaluation metrics should be tested for their performance; B) presents software that can be used to test different metrics on real-life datasets; C) gives guidelines on which evaluation metrics perform well with the Gene Ontology structure; and D) proposes improved versions of some well-known evaluation metrics. The presented methods are also applicable to other areas of science where evaluation of prediction results is non-trivial.
Potential Usage
Repeating the work from the manuscript
Our pipeline.pl with default parameters will replicate the work shown in the manuscript. This will also generate most of the figures and result tables (see discussion in ADS results). This, however, consumes substantial resources (see discussion in ADS pipeline). We recommend a lighter run first to test that all functions work.
For detailed examples see "Example Pipeline Runs" tab.
Finding the best Evaluation Metrics for your dataset
The ADS pipeline can be run using in-house datasets as input. This lets the user compare all Evaluation Metrics described in the manuscript. The code currently assumes that the dataset contains Gene Ontology annotations. Other dataset types could be used simply by manually defining the classes with child-ancestor relationships.
For detailed examples see "Example Pipeline Runs" tab.
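As a rough sketch of the expected input format (the gene names and GO identifiers below are invented for illustration; check the bundled example files such as data/uniprot.1000.genego for the exact column layout), an annotation set is a tab-delimited file with one gene-to-GO-class association per row, in the [gene\tGO\tscore] style used by the C++ tools described in the User Manual:
GeneA	GO:0006355	1
GeneA	GO:0003677	1
GeneB	GO:0016020	1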
Evaluating Classifiers with improved Evaluation Metrics
Some Evaluation Metrics outperform others in our analysis. Our C++ program, goscores, can be used to evaluate classifiers with any of the discussed Evaluation Metrics. The resulting ranking of classifiers should be more reliable with these better-performing Evaluation Metrics.
For detailed examples see the "Example Goscores Runs" tab.
Evaluating in-house developed Evaluation Metrics
One main goal of this project is to ease the comparison and development of new Evaluation Metrics. Here one would run the newly developed Evaluation Metric on the datasets created with ADS.
For detailed examples see "Test other evaluation metrics with ADS" tab.
ADS pipeline
Note that repeating our manuscript run means generating 11*1000 versions of each input dataset and evaluating each of these with ~30 Evaluation Metrics. This sums to generating and evaluating 990,000 test datasets. Furthermore, the parallel part of the code currently generates 30 separate runs (one for each Evaluation Metric). We therefore recommend starting your analysis with a lighter test run. For detailed examples see the "Example Pipeline Runs" tab.
To run ADS pipeline:
- clone ADS pipeline project from Bitbucket repository: ADS (Bitbucket)
- alternatively download the latest version as a zip file: ADS pipeline/master
- ensure that you are running Unix with Perl (v5.0.0 or later) and R (v3.5.0 or later) installed
- open pipeline.pl and set your input parameters (line 42):
- @data_list: list of GO annotation sets that are used as the starting point of ADS analysis (i.e. "True sets", T).
- $obo_file: OBO file specifying directed acyclic graph for the GO.
You can use the default obo file (data/go-2016-03.obo) supplied with the ADS repository or you can download the latest version from the Gene Ontology Consortium. Please note that your input datasets should not include GO classes marked as obsolete in the input ontology.
- $goa_file: tab-delimited file of associations between gene products and GO terms and associated annotation information.
You can use the default GOA association file (data/gene_association.goa_ref_uniprot.flt) or you can download GOA file that is more suitable for your analysis from Gene Ontology Annotation Database. ADS pipeline uses GOA file to estimate the information content of GO classes. For best performance ensure that your GOA file includes annotations for the majority of GO terms included in the input true sets T. Information content for terms not included in the GOA file will be extrapolated from the neighboring terms.
- type "perl pipeline.pl --resdir my_results &> my_log &" to start the run.
This will run all pipeline steps for all evaluation metrics analyzed in the original article. The analyzed metrics are listed in the code (perl/run_elab_parallel.pl) and can be modified for your purposes. For a more thorough description of the evaluated metrics please refer to the Supplementary text of the original article: ADS Supplementary. Note that the ADS pipeline is a versatile environment that allows you to evaluate a large spectrum of metrics, many of which are not included in the default analysis (for details refer to the ADS User Manual tab). You can control which pipeline steps are performed and which metrics are evaluated with the --pipev and --pipeh arguments. For details please refer to the ADS User Manual tab.
ADS results
ADS results will be printed to the my_results directory:
- summary.xlsx
This table contains several scores for each evaluated metric. These include the Rank Correlation (RC) and the False Positive Score (FPS) for each input dataset T.
Keep in mind that your results will depend on the input datasets T. For these to give a realistic assessment of metric quality, the input sets should reflect the properties of the datasets to which the selected metric will be applied in the future. Thus, for a general-purpose metric, specify several datasets with varying properties (e.g. density, annotation depth). For a more specific evaluation, select a dataset that corresponds to the intended application.
RC values monitor how well the tested metric can discriminate annotation sets with different amounts of introduced noise while being robust against other sources of variance. Thus we are looking for metrics that have high RC scores. We recommend metrics with RC scores above 90-95%.
FPS values monitor how easily the tested metric can be deceived into ranking False Positive annotation Sets among Artificial Prediction Sets (APS) with high signal. False positive datasets do not contain any real information and are expected to be ranked among APSs with zero signal. Thus we are looking for metrics that have low FPS scores. We recommend avoiding metrics with FPS scores above 20%.
- Summary figures
boxplot1: Visual analysis of ADS results for six evaluation metrics: US AUC, Fmax, Smin, Resnik A, Resnik D and Lin D. Scores for AP (Artificial Prediction) sets at each ADS signal level are plotted as boxplots and scores for FP (False Positive) sets as horizontal lines. RC and FPS scores are also displayed for each metric in the lower left corner of each plot. This format is retained in all summary boxplots.
boxplot2: Boxplots comparing standard AUC and AUC-PR (AUC under Precision-Recall curve) metrics. For both metrics we test unstructured (US), Gene Centric (GC) and Term Centric (TC) versions.
boxplot3: Boxplots for every combination between three semantics (Resnik, Lin, AJacc), in columns, and six semantic summation methods (A, B, C, D, E, F), in rows.
boxplot3b: Similar to boxplot3 with summation method F excluded in order to save space.
boxplot3c: Similar to boxplot3 with summation methods B and C excluded. Methods B and C have inherent weaknesses and are only included as negative controls.
boxplot4: Boxplots comparing the performance of Smin, SimGIC and Jaccard metrics. For each of these metrics we compare an unstructured (Smin1, SimGIC2 and US Jacc) and a gene-centric (Smin2, SimGIC and GC Jacc) version.
scattered labels: Scatterplot of FPS scores plotted against RC scores for all tested metrics. Quality metrics appear at the upper left corner of the plot (high Rank Correlation and low False Positive Signal). This part is zoomed in the lower panel. Metrics are color coded: US and GC Jacc are plotted in black; Fmax, AUC and AUC-PR in red; Smin and SimGIC in green; semantic similarity metrics in blue.
- Metric scores for AP sets
Scores assigned by each tested evaluation metric to each generated Artificial Prediction Set (APS) are printed to files with the *.elab extension. This extension refers to the elab C++ executable, which forms the core of the ADS pipeline. Elab files have a simple format. The first few lines start with the comment character "#" and list the values of the input parameters. Next follows a line of numbers giving the iterated ADS error levels. The remaining lines form the result matrix: rows represent permutations and columns the error levels.
The ADS pipeline creates a separate subdirectory in my_results for each input dataset T. Metric scores for the AP sets generated from input dataset T are then printed to the subdirectory my_results/T/. For each metric, scores are printed to a separate file whose name corresponds to the label used for that metric (metrics and metric labels are listed in the ADS Supplementary text).
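Purely for illustration (the parameter names in the comment line and the score values below are invented), a small elab file could look like this: comment lines with input parameters, one line of ADS error levels, and then one row of metric scores per permutation:
# k=3 e=0,1,by=0.2 r=2 s=1
0	0.2	0.4	0.6	0.8	1
0.91	0.84	0.71	0.62	0.55	0.43
0.90	0.82	0.73	0.60	0.52	0.45
0.93	0.85	0.70	0.63	0.54	0.41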
- Metric scores for FP sets
Scores assigned by the tested evaluation metrics to each of the three False Positive Sets (FP sets) are printed in tab-delimited format to files T.fps.scores.tab, one table for each input dataset T.
We do not use the metric scores for FP sets directly in our evaluation. Instead, as described in the ADS article, these values are converted to the corresponding ADS signal levels. The maximum of these signal levels over the tested FP sets is then designated as the False Positive Signal, the FPS score. In our results we also give FPS scores separately for each tested FP set in tab-delimited format. These are printed to files T.fps.signal.tab, one table for each input dataset T.
- Jaccard index between input T and permuted datasets
This output allows one to monitor the amount of change introduced to the true dataset T at different ADS signal levels. One file named T.jacc is printed to the my_results directory for each input dataset T. These are tab-delimited tables with five columns:
- index of ADS error level
- index of generated AP set
- predicted GO class (from AP dataset)
- ADS error level
- Jaccard index to the closest true GO class for the same gene (from T dataset)
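Purely as an illustration (values invented), a single row of a T.jacc table could read:
3	12	GO:0006355	0.3	0.67
i.e. at error level index 3 (error level 0.3), AP set number 12 contains the predicted class GO:0006355, whose closest true class for the same gene has a Jaccard index of 0.67.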
Contact
If you have any questions about the ADS pipeline, please do not hesitate to ask Petri Toronen or Ilja Pljusnin: firstname.lastname(at)helsinki.fi
ADS source code and data files
- All source code and data files: ADS repository
- Master branch as a zipped archive: master.zip
- Clone and checkout Master branch with git:
git clone https://plyusnin@bitbucket.org/plyusnin/ads.git
cd ads
git fetch && git checkout master
ADS User Manual
In this section you will find user manuals for the core parts of ADS pipeline.
Note that most programs in ADS have a short manual that can be printed with the "-h|help" option.
pipeline.pl
This script is the main entry point to the ADS pipeline.
USAGE:
pipeline.pl --resdir dir -r|rmode int -s|srand int --data str --obo file --goa file --pipeh str --pipev str [-h]
Option | argument | default | description |
---|---|---|---|
--resdir | directory | res | Results are printed to this directory |
--datadir | directory | data | Generated data files are printed to this directory |
-r|rmode | int | 2 | Specifies the number of parent GO classes for the "shifting to semantic neighbors" step. This is one of the key steps in ADS generation, aiming at generating variation within ADS signal levels. The larger rmode value is the larger the variation introduced. Experimenting with this parameter suggests that values between 2 and 4 are sufficient to separate robust metrics from excessively sensitive metrics. We argue that a quality metric should not excessively fluctuate when GO classes are interchanged for semantically similar classes. Instead, a quality metric should only monitor for truly erroneous annotations: i.e. annotations that have a certain semantic distance to the correct annotations. |
-s|srand | int | 1 | Seed for the random number generator |
--data | file(list) | data/uniprot.1000.genego,data/cafa2012.genego,data/mouseFunc.genego | Comma-separated list of input annotation sets (i.e. the "true sets" T).
By default these are: CAFAI dataset, MouseFunc dataset and a sample from UniProt GOA, all supplied with ADS repository. Alternatively, specify these on line 42. |
--obo | file | data/go-2016-03.obo | OBO file specifying directed acyclic graph for the GO.
You can use the default obo file supplied with the ADS repository or you can download the latest version from Gene Ontology Consortium. Please note that your input annotation sets should not include GO classes marked as obsolete in the input ontology. |
--goa | file | data/gene_association.goa_ref_uniprot.flt | Tab-delimited file of associations between gene products and GO terms and associated annotation information.
You can use the default GOA association file or you can download GOA file that is more suitable for your analysis from Gene Ontology Annotation Database. ADS pipeline uses GOA file to estimate the information content of GO classes. For best performance ensure that your GOA file includes annotations for the majority of GO terms included in the input true sets T. Information content for terms not included in the GOA file will be extrapolated from the neighboring terms. |
--pipeh | str | All | With pipeh argument you can control which metrics are evaluated. For example to evaluate only AUC-related metrics type "--pipeh AUC", to evaluate all Smin-related metrics type "--pipeh Smin". Metrics are grouped by their properties (for details refer to ADS Supplementary). Metrics are evaluated in parallel, thus we call these steps parallel or "horizontal" steps.
Accepted values are:
|
--pipev | str | 1:2,4:10 | With pihev you can control which non-parallel or "vertical" steps are executed. ADS pipeline is divided into 10 vertical steps allowing for execution/re-execution of specific steps as needed:
|
-h|help | | false | print user manual |
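For example, the following call (using the bundled CAFA dataset) would run only the data-preparation steps 1, 2 and 4 plus the elab step 5, restricted to the Smin-related metrics; see the Example Pipeline Runs tab for complete walk-throughs:
perl pipeline.pl --resdir test --data data/cafa2012.genego --pipev 1,2,4,5 --pipeh Smin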
run_elab_parallel.pl
This script executes step 5 in the ADS pipeline: it generates and scores Artificial Prediction (AP) sets. In essence this is the core step, where we generate the Artificial Dilution Series (ADS) and score the AP sets with the specified evaluation metrics. AP sets are generated and scored separately for each metric (with the exception of Smin-related metrics, which are scored as a group). This script is thus a user interface to the core C++ library that does the analysis.
USAGE:
perl/run_elab_parallel.pl -k|perm int -e|emode str -r|rmode int -s|srand int --goc_db int --goc_on int -p|pos file -n|neg file -b|obo file --ic file --ic2 file --resdir dir [--pipeh str] [--verbal] [--help]
Option | argument | description |
---|---|---|
--resdir | directory | Results are printed to this directory |
-k|perm | int | Number of permutations, i.e. number of AP sets generated at each ADS signal level |
-e|emode | str | String in "num,num,by=num" format specifying ADS error levels, e.g. "0,1,by=0.1" |
-r|rmode | int | Number of terms in the semantic neighborhood used for the "shifting to semantic neighbors" step |
-s|srand | int | Seed for the random number generator |
--goc_db | int | Number of unique GO terms in the input dataset T |
--goc_on | int | Number of unique GO terms in the input GO ontology |
-p|pos | file | GO annotation set that is used as the starting point of ADS analysis (i.e. the "True set" T) |
-n|neg | file | Negative annotations generated from T in pipeline step 4 |
-b|obo | file | OBO file specifying directed acyclic graph for the input GO ontology |
--ic | file | File containing information content values in "GO_class\tIC_value\n" format. This is generated at pipeline step 2 from the GOA file, but you can also specify your own ic file. |
--ic2 | file | File containing information content values as defined by Clark & Radivojac (2013). Smin metrics are run with both --ic and --ic2 files. ic2 file is generated from ic file at pipeline step 3. |
--pipeh | str | Horizontal part of the pipeline, i.e. subset(s) of metrics to evaluate. Accepted values are (case insensitive):
|
-v|verbal | | verbal mode |
-h|help | | print user manual |
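run_elab_parallel.pl is normally invoked by pipeline.pl, but a stand-alone call could look roughly like the sketch below. All file names and the --goc_db/--goc_on counts here are hypothetical placeholders that would come from your own pipeline run (steps 1-4):
perl perl/run_elab_parallel.pl -k 50 -e 0,1,by=0.1 -r 2 -s 1 --goc_db 3000 --goc_on 40000 -p data/uniprot.1000.genego -n MyData/uniprot.1000.neg -b data/go-2016-03.obo --ic MyData/ic.tab --ic2 MyData/ic2.tab --resdir test --pipeh Smin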
ADS C++: library
Computation-intensive steps in the ADS pipeline are implemented as a collection of C++ methods. The library's main functionalities are listed in the following table. All code and binaries are available for download.
header file | binary | pipeline step | function |
---|---|---|---|
golib.h | | | |
gograph.h | | | parsing GO OBO files, manipulating GO graph, extracting neighborhoods from GO graph |
gorelatives.cpp | gorelatives | step 1 | extracting neighbors in the GO graph for query GO node(s) |
main.cpp | elab | step 5 | generating AP sets and scoring these with selected evaluation metric |
goscores.cpp | goscores | step 7 | scoring any GO annotation set P against a true set T with selected evaluation metric |
ADS C++: gorelatives
gorelatives is a user interface to the ADS library for extracting neighbors in the GO graph for a given GO class (or a list of GO classes).
USAGE:
bin/gorelatives -b obo_file -q qid|qfile|all -r isa[,partof,regulates,altid,consider,replacedby] -d parents|children|relatives [-k max_num] [-l] [-v]
Option | argument | description |
---|---|---|
-b | file | GO OBO 1.2 file |
-q | str|file | query id(s) (without GO-qualifier). Valid options:
|
-r | str[,str[,str..]] | list of literals that specify relations/edges to search through. Valid literals:
|
-d | str | literal that specifies the direction of the search. Valid literals:
|
-k | int | maximum number of neighbors to retrieve |
-l | | print namespace and name for each query node |
-v | | verbal mode: e.g. print warning if no relatives found for a given query id |
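A rough example call is shown below. The query identifier format (with or without the GO: prefix) should be checked against the -q description above; the identifier used here is only a placeholder:
bin/gorelatives -b data/go-2016-03.obo -q 0008150 -r isa,partof -d children -k 20 -l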
ADS C++: elab
elab generates AP sets, scores them with specified metric and prints results to a tab-delimited ASCII table.
USAGE:
bin/elab -k int -e str -r int -s int -g -p goa_true -n goa_neg -b obo_file -i ic_file -o output -j output2 -vh -m LIST=str,TH=str,SUMF1=str,SUMF2=str,SF=str,SIMF=str
Option | argument | default | description |
---|---|---|---|
-k | int | 100 | Number of permutations, i.e. number of AP sets generated at each ADS error level. |
-e | str | 0,1,by=0.2 | string literal specifying ADS error levels. Format: min,max,by=num[,length=num] |
-r | int | 0 | the size of semantic neighborhood for the "shifting" step |
-s | int | 1 | seed for the random number generator |
-g | | off | propagate annotations to ancestral nodes |
-p | file | | true GO annotation set T [gene\tGO\tscore] |
-n | file | | negative GO annotation set Pneg [gene\tGO\tscore] |
-b | file | | OBO 1.2 file |
-i | file | | information content file [GO\tIC] |
-o | file | | Output file for metric scores (elab file). Format:
|
-j | file | off | output file for Jaccard distance between generated APS and T. Format [errorind\tpermind\tGO\terror\tJacc] |
-v | | off | verbal mode |
-h | | | print user manual |
-m | str | data | output mode/evaluation method, valid options:
|
Specifying an evaluation metric (the "-m" argument)
|
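As a sketch of a stand-alone elab call (the negative-set and output file names are hypothetical; in the pipeline they are produced in steps 4 and 5), scoring AP sets with the Term Centric AUC-PR metric used elsewhere on this page:
bin/elab -k 50 -e 0,1,by=0.1 -r 2 -s 1 -g -p data/uniprot.1000.genego -n MyData/uniprot.1000.neg -b data/go-2016-03.obo -i MyData/ic.tab -o test/uniprot.1000.elab -m LIST=gene,TH=all,SUMF1=mean,SF=AUCPR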
ADS C++: goscores
goscores is designed to score a given GO annotation set against a true set.
USAGE:
bin/goscores -g -p goa_pred -t goa_true -b obo_file -i ic_file -o output -vh -m LIST=str,TH=str,SUMF1=str,SUMF2=str,SF=str,SIMF=str,MINCS=int,GOC=int
Option | argument | default | description |
---|---|---|---|
-g | | off | propagate annotations to ancestral nodes |
-p | file | | predicted GO annotation set P [gene\tGO\tscore] |
-t | file | | true GO annotation set T [gene\tGO] |
-b | file | | OBO 1.2 file |
-i | file | | information content file [GO\tIC] |
-o | file | | output file |
-m | str | | A list of attribute-value pairs defining the evaluation metric. Format: LIST=str,TH=str,SUMF1=str,SUMF2=str,SF=str,SIMF=str,MINCS=int,GOC=int
For the list of accepted attributes and values please refer to the elab manual. |
-v | | off | verbal mode |
-h | | | print user manual |
Example runs for pipeline.pl
The main user interface is the pipeline.pl script. It can be used to repeat the analysis described in the ADS manuscript and/or to run the same analysis with other dataset(s).
Next we will introduce pipeline calls in the order of increasing complexity and resource consumption.
Check the manual page for more information on commands.
1. Simplest run to check first steps
Simple run that checks preliminary processing steps before elab runs, excluding IC2 calculation
perl pipeline.pl --resdir test --data data/uniprot.1000.genego --pipev 1,2,4
Notes:
- This does not perform any elab runs. Only steps 1, 2 and 4 are performed.
- Resulting files are actually directed to the data directory. We consider these files to be input data files.
- So nothing goes to the defined result directory yet.
- The input will be our uniprot dataset, located in the data/uniprot.1000.genego file.
1.B. Check preliminary steps with your own data
Simple run that checks the preliminary processing steps before the elab runs, using your own data
perl pipeline.pl --datadir MyData/ --data path/to/data/xxx --obo path/to/data/xx.obo --goa path/to/data/xx.goa --pipev 1,2,3,4
Notes:
- Here the processed dataset will be your own data (path/to/data/xxx)
- This will be the set of correct annotations, used as a starting point for elab runs
- OBO structure will be taken from file path/to/data/xx.obo
- GO annotations will be collected from file path/to/data/xx.goa
- This is used to calculate the GO class sizes and IC scores for GO classes
- Step 3 will be very time consuming.
- Resulting files are directed to the specified data directory. We set it to MyData here.
- This way we are not overwriting files in the data directory
2. Simple run to check elab function
Simple run that performs just one small elab run, after the first test. This can be run only if run 1 has been completed.
perl pipeline.pl --resdir test --data data/uniprot.1000.genego --perms 50 --pipev 5 --pipeh Smin
Notes:
- IMPORTANT: This will work only if example 1 has been completed
- Results are directed to test directory
- Input files will be ones derived from uniprot dataset (data/uniprot.1000.genego file)
- Only the elab step will be performed (--pipev 5)
- Number of generated datasets at each signal level (--perms) is set to 50. Use this only for testing. For real analyses use values between 200 and 1000
- This run only tests the Smin-related Evaluation Metrics (--pipeh Smin)
- Therefore this only creates two parallel runs at the parallelized step.
3. Check preliminary steps and do a small elab run
Simple run that checks preliminary steps (excluding IC2 calculation) and performs small elab run
perl pipeline.pl --resdir test --data data/cafa2012.genego --perms 40 --pipev 1,2,4,5 --pipeh Jacc
Notes:
- Results are again directed to test directory
- Input files will be ones derived from CAFA dataset (data/cafa2012.genego file)
- Preliminary steps (1,2,4) and elab step (5) will be performed.
- Number of generated datasets at each signal level (--perms) is set to 40. Use this only for testing. For real analyses use values between 200 and 1000
- This run only tests the Jaccard-related Evaluation Metrics (--pipeh Jacc)
- Therefore this also creates only two parallel runs at the parallelized step.
4. Test whole pipeline with one dataset
Simple run that performs preliminary steps (excluding IC2 calculation), does all analysis steps and creates result tables but excludes the figures
perl pipeline.pl --resdir test --data data/cafa2012.genego --perms 100 --pipev 1,2,4,5,6,7,9 --pipeh AUC
Notes:
- This is the first example of the whole analysis task, with preliminary steps (1,2,4), analysis steps (5,6,7) and summaries (9)
- Input files will be ones derived from CAFA dataset (data/cafa2012.genego file) in steps 1,2 and 4
- You could skip steps 1, 2 and 4 if you have already performed them in previous runs for the processed dataset.
- Again we skip the calculation of IC2 (step 3), as it is significantly more time consuming in our pipeline.
- Analysis steps include:
- Elab runs (step 5)
- Generation of False Positive Sets (step 6)
- Scoring False Positive Sets (step 7)
- Summaries of the results are generated with steps 7 and 9.
- We skip here step 10, as it requires results from all the compared evaluation metrics.
- Here we limited the analysis only to AUC evaluation metrics
- Results are again directed to test directory. This will overwrite previous result tables, if those exist
- Number of generated datasets at each signal level (--perms) is set to 100.
- This run only tests Area Under Curve related Evaluation Metrics (--pipeh AUC)
- This generates 7 parallel runs
5. Whole manuscript pipeline with less repeats
Repeat manuscript work, but with a significantly smaller number of repeats (=100)
perl pipeline.pl --resdir test_p_100 --perms 100
Notes:
- Here results are directed to folder test_p_100
- In principle steps 1,2 and 4 could be omitted, if they were already done with all datasets
- Perms parameter again sets the number of repeats (here 100)
- This code processes all three datasets (discussed in manuscript)
- Therefore a large part of the code is repeated three times
- Now parallel part will generate 30 parallel runs
6. Repeat the manuscript work
Repeat manuscript work with default number of repeats (1000 repeats !)
perl pipeline.pl --resdir repeat_manuscript
Notes:
- Here we direct the results to folder repeat_manuscript
- This code processes all three datasets.
- This run will generate 30 parallel runs. Each run will process 11*1000 generated datasets.
- These are further repeated three times (=three input datasets)
- Therefore, make sure you have enough computing resources for this run.
Example runs for the goscores program
Here we demonstrate how to calculate different evaluation metrics using the goscores executable. Notice that using goscores requires other data files besides the predicted set of GO classes and the correct set of GO classes. These can be obtained by running our pipeline.pl script with the parameter --pipev 1,2,3 in the following manner:
perl pipeline.pl --datadir MyData/ --obo path/to/data/xx.obo --goa path/to/data/xx.goa --pipev 1,2,3
NOTES
- All the final and temporary data files are now directed to folder MyData
- Remember to use these files in your call to the goscores executable.
- OBO structure will be taken from file path/to/data/xx.obo
- GO annotations will be taken from file path/to/data/xx.goa
Notice that although these steps, and especially step 3, will take time, you need to run them only once for each set of GO annotations (GOA file). Below is a demo where we use the obo and goa files that we distribute with the ADS repository.
perl pipeline.pl --datadir MyData/ --obo data/go-2016-03.obo --goa data/gene_association.goa_ref_uniprot.flt --pipev 1,2,3
1. Prototype call for goscores program
Here is the prototype call:
bin/goscores -p path/to/data/pred.set -t path/to/data/corr.set -b path/to/data/obo.file -i path/to/data/ic_file -m $mode 1> res.file
NOTES:
- pred.set Evaluated set of GO predictions
- corr.set Correct GO predictions for a set of genes
- obo.file Obo file containing go structure
- ic_file Information Content (IC) scores for each GO class
- res.file Result file
- $mode: the string that defines the evaluation metric (discussed below)
Here is a demo call that uses the uniprot.1000 dataset as the correct set. All the IC scores and parent lists are predefined in the first pipeline run. Here the prediction set is left open (xxx). This would be the result from a classifier that performs predictions for the genes in the correct set (data/uniprot.1000.genego).
bin/goscores -p xxx -t data/uniprot.1000.genego -b data/go-2016-03.obo -g -i MyData/ic.tab -m LIST=gene,TH=all,SUMF1=mean,SF=AUCPR 1> res.file
NOTES:
- correct result is data/uniprot.1000.genego
- predicted set is xxx
- obo file is data/go-2016-03.obo
- IC file is MyData/ic.tab
- mode is LIST=gene,TH=all,SUMF1=mean,SF=AUCPR (Term Centric AUC-PR)
2. Mode parameter in goscores
The goscores program allows you to calculate many evaluation metrics with different settings. These metrics and their variations are defined in goscores with the mode parameter. Below we show in detail how to define different evaluation metrics using the mode parameter. An overview of the evaluation metrics tested in the manuscript is found here. More information on these evaluation metrics can be found in our preprint and in our supplementary text.
Defining mode parameter for AUC evaluation metrics
Here is the demo call for the AUC and Fmax metrics. These require propagation to ancestor classes (-g). These commands do not require IC values.
bin/goscores -p xxx -t data/uniprot.1000.genego -g -b data/go-2016-03.obo -m $mode
Here we explain how the mode parameter is used to call different Area Under Curve (AUC) evaluation metrics. So this is the string that should be given after "-m" in the command line.
abbreviation | description | mode string |
---|---|---|
US AUC | Unstructured AUC | LIST=gene_go,SF=AUC3,GOC=$goc_on |
GC AUC | Gene Centric AUC | LIST=go,TH=all,SUMF1=mean,SF=AUC3,GOC=$goc_on |
TC AUC | Term Centric AUC | LIST=gene,TH=all,SUMF1=mean,SF=AUC3 |
US AUCPR | Unstructured AUC-PR | LIST=gene_go,SF=AUCPR |
GC AUCPR | Gene Centric AUC-PR | LIST=go,TH=all,SUMF1=mean,SF=AUCPR |
TC AUCPR | Term Centric AUC-PR | LIST=gene,TH=all,SUMF1=mean,SF=AUCPR |
Defining mode parameter for Fmax evaluation metrics
Here we explain how the mode parameter is used to call Fmax evaluation metric. Our manuscript only tested the Gene Centric Fmax.
abbreviation | description | mode string |
---|---|---|
Fmax | Gene Centric Fmax | LIST=go,SF=Fmax |
Defining mode parameter for group semantic measures
Here is the demo call for the group semantic measures. These are Smin, SimGIC, plain Jaccard and their variations. These require propagation to ancestor classes (-g). They also require IC values (-i filename). With the Smin and SimGIC commands below you can use either IC or IC2 values.
bin/goscores -p xxx -t data/uniprot.1000.genego -g -b data/go-2016-03.obo -i MyData/ic.tab MyData/goparents.tab -m $mode
Here we explain how mode parameter is used to call group semantic measures
abbreviation | description | mode string |
---|---|---|
Smin1 | Unstructured Smin | SF=Smin1 |
Smin2 | Gene Centric Smin | SF=Smin2 |
Smin3 | Variation on Smin1 | SF=Smin3 |
SimGIC | Gene Centric wJacc | SF=SimGIC |
SimGIC2 | Unstructured wJacc | SF=SimGIC2 |
SimAll | All 5 scores above | SF=SminAll |
GC Jacc | Gene Centric Jacc | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=Jacc |
US Jacc | Unstructured Jacc | LIST=gene_go,TH=th,SUMF=max,SF=Jacc |
Defining mode parameter for semantic similarities
Here is the demo call for semantic similarities. You should NOT use propagation to parents with these metrics. You must use IC values with these metrics. However, you should NOT use IC2 values with these metrics.
bin/goscores -p xxx -t data/uniprot.1000.genego -b data/go-2016-03.obo -i MyData/ic.tab MyData/goparents.tab -m $mode
Here we explain how the mode parameter is used to call different versions of the Lin semantic similarity. Here A, B, C, D, E and F represent different ways in which semantic similarities can be combined (see explanations here).
abbreviation | description | mode string |
---|---|---|
Lin A | Lin score with A | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_A,SIMF=Lin |
Lin B | Lin score with B | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_B,SIMF=Lin |
Lin C | Lin score with C | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_C,SIMF=Lin |
Lin D | Lin score with D | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_D,SIMF=Lin |
Lin E | Lin score with E | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_E,SIMF=Lin |
Lin F | Lin score with F | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_F,SIMF=Lin |
Here we show Resnik semantic similarity and parental Jaccard (Ancestral Jaccard) with semantic similarity summary method A. Naturally you can run these semantics with other similarity summary methods.
abbreviation | description | mode string |
---|---|---|
Resnik A | Resnik score with A | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_A,SIMF=Resnik |
AJacc A | Ancestral Jacc with A | LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_A,SIMF=pJacc |
Using ADS datasets to test other evaluation metrics
Here we demonstrate how one can use the ADS method to test other evaluation metrics. This could also be an in-house developed evaluation metric whose performance you want to test.
1. Create your own ADS datasets
Here is an example pipeline run that generates 20 ADS test datasets at each noise level:
perl pipeline.pl --resdir test/ADS_data/ --data data/uniprot.1000.genego --perms 20 --pipev 5 --pipeh data
NOTES
- This assumes that you have already done steps 1, 2, 3 and 4 for the dataset used.
- The Gold Standard data here is data/uniprot.1000
- The resulting files are under test/ADS_data/uniprot.1000
- Each file has comment rows at the beginning (starting with the # character)
- All 20 datasets from the same noise level are within the same file
- Each dataset is identified by the running number at the beginning of each row
- So all the rows that belong, say, to the first dataset will start with the number 0
- Subsequent datasets are identified with consecutive integers (1, 2, 3, 4 ...)
For real analyses we recommend setting the perms parameter between 200 and 1000.
2. Extract one generated dataset from one result file
Next one has to extract each individual dataset from the ADS file.
You can extract, for example, the first dataset from each file using the following type of command:
grep "^0[^0-9]" uniprot.1000.genego.sc.elab.c0 > subset.c0
Here the first number in the grep pattern defines which dataset you select
This grep command retains the running number at the beginning of each row
You can also extract the rows so that the identifying integer and the first tab are removed
sed -n -e 's/^0\t//p' uniprot.1000.genego.sc.elab.c0 > subset.c0
Again the first number in the command defines which dataset you will select
Now you can use subset.c0 as an input to your evaluation metric programs
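If you want to extract all 20 generated datasets from one file in a single pass, a simple shell loop over the running numbers works as well (a sketch following the sed example above; input and output file names are the same placeholders):
for i in $(seq 0 19); do
  sed -n -e "s/^${i}\t//p" uniprot.1000.genego.sc.elab.c0 > subset.${i}
done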
3. Create False Positive Sets
False Positive Sets (FP sets) are another important test for evaluation metrics. Our manuscript shows that they reveal weaknesses that the ADS test alone does not. We have a separate function for generating these:
COMMANDS COMING SOON ...
Citing ADS project
To cite the ADS project, please cite this article:
Plyusnin, Ilya, Liisa Holm, and Petri Törönen. "Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences." PLoS computational biology 15.11 (2019). [PMID: 31682632]
Erratum for the main article
Table 1 contains errors in the column marking the recommended metrics. Our strongest recommendation, ic SimGIC2, is erroneously marked as only a "potential recommendation". Two potential recommendations are missing: ic SimGIC2 and ic2 Smin1. And one erroneous recommendation is included: AJACC E.
The corrected Table 1 is available here.
Supplementary files
Most relevant supplementary files are listed here:
Code and data
All code and data files are available at the main ADS repository