Abstract

GO classifiers and other methods for automatic annotation of novel sequences play an important role in modern biosciences. It is thus important to assess the quality of different GO classifiers. Evaluation of GO classifiers depends heavily on the evaluation metrics used. Still, there has been little research on the effect of different metrics on the produced method ranking. Indeed, most evaluation metrics are simply borrowed from machine learning without any testing of their applicability to GO classification.

We propose a novel, simple comparison of metrics, called the Artificial Dilution Series (ADS). We start by selecting a set of annotations that are known a priori to be correct (referred to here as the Truth Set). We create multiple copies of the Truth Set and introduce a different amount of error into each copy. This creates a "series" of annotation sets in which the percentage of original correct annotations ("signal") decreases from one end of the series to the other. Next we test metrics to see which of them are good at separating annotation sets at different signal levels. In addition, we test metrics with various false positive annotation sets, and show where these rank in the generated signal range.

Altogether these test datasets allow us to test various evaluation metrics, specifically on the selected Truth Set. Note that real-life datasets often present challenges to evaluation metrics, and Automated Function Prediction (AFP) is no exception to this rule. Test datasets can be, for example, too small, or the annotation matrix can be too sparse. Using ADS allows one to test which evaluation metrics perform well with the selected Truth Set.

We compared a large set of evaluation metrics with ADS on three different datasets, revealing drastic differences between them. In particular, we show how some metrics score false positive datasets as highly as 100% correct datasets, and how some metrics perform poorly at separating the different error levels. This work A) shows that evaluation metrics should be tested for their performance; B) presents software that can be used to test different metrics on real-life datasets; C) gives guidelines on which evaluation metrics perform well with the Gene Ontology structure; and D) proposes improved versions of some well-known evaluation metrics. The presented methods are also applicable to other areas of science where evaluation of prediction results is non-trivial.

Potential Usage

Repeating the work from the manuscript

Running our pipeline.pl with default parameters will replicate the work shown in the manuscript. This will also generate most of the figures and the result tables (see discussion in ADS results). This, however, consumes considerable resources (see discussion in ADS pipeline). We recommend a lighter run first to test that all functions work.

For detailed examples see "Example Pipeline Runs" tab.

Finding the best Evaluation Metrics for your dataset

The ADS pipeline can be run using in-house datasets as input. This lets the user compare all Evaluation Metrics described in the manuscript. The code currently assumes that the dataset contains Gene Ontology annotations. Other dataset types could be used simply by manually defining the classes with child-ancestor relationships.

For detailed examples see "Example Pipeline Runs" tab.

Evaluating Classifiers with improved Evaluation Metrics

Some Evaluation Metrics outperform others in our analysis. Our C++ program, goscores, can be used to evaluate classifiers with any of the discussed Evaluation Metrics. The generated ranking of classifiers should be more reliable with these better performing Evaluation Metrics.

For detailed examples see "Example Goscores runs"

Evaluating in-house developed Evaluation Metrics

One main goal of this project is to ease the comparison and the development of new Evaluation Metrics. Here one would run the newly developed Evaluation Metric on the datasets created with ADS.

For detailed examples see "Test other evaluation metrics with ADS" tab.

ADS pipeline

Note that repeating our manuscript run means generating 11*1000 versions of each of the input datasets and evaluating each of these with ~30 Evaluation Metrics. This sums to generating and evaluating 990 000 test datasets. Furthermore, the parallel part of the code currently generates 30 separate runs (one for each Evaluation Metric). Thus we recommend starting your analysis with a lighter test run. For detailed examples see "Example Pipeline Runs" tab.

To run ADS pipeline:

  1. clone the ADS pipeline project from the Bitbucket repository: ADS (Bitbucket)
  2. alternatively, download the latest version as a zip file: ADS pipeline/master
  3. ensure that you are running Unix with Perl (v5.0.0 or later) and R (v3.5.0 or later) installed
  4. open pipeline.pl and set your input parameters (line 42):
    • @data_list: list of GO annotation sets that are used as the starting point of ADS analysis (i.e. "True sets", T).
    • $obo_file: OBO file specifying directed acyclic graph for the GO.
      You can use the default obo file (data/go-2016-03.obo) supplied with the ADS repository or you can download the latest version from Gene Ontology Consortium. Please note that your input datasets should not include GO classes marked as obsolete in the input ontology.
    • $goa_file: tab-delimited file of associations between gene products and GO terms and associated annotation information.
      You can use the default GOA association file (data/gene_association.goa_ref_uniprot.flt) or you can download a GOA file that is more suitable for your analysis from the Gene Ontology Annotation Database. The ADS pipeline uses the GOA file to estimate the information content of GO classes. For best performance, ensure that your GOA file includes annotations for the majority of GO terms included in the input true sets T. Information content for terms not included in the GOA file will be extrapolated from the neighboring terms.
  5. type "perl pipeline.pl --resdir my_results &> my_log &" to start the run.
    This will run all pipeline steps for all evaluation metrics analyzed in the original article. The analyzed metrics are listed in the code, perl/run_elab_parallel.pl, and can be modified according to your purpose. For a more thorough description of the evaluated metrics please refer to the Supplementary text of the original article: ADS Supplementary. Note that the ADS pipeline is a versatile environment that allows a large spectrum of metrics to be evaluated, many of which are not included in the default analysis (for details refer to the ADS User Manual tab).

    You can control which pipeline steps are performed and which metrics are evaluated with --pipev and --pipeh arguments. For details please refer to the ADS user manual tab.
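    For example, the following call (a sketch that reuses the default UniProt sample shipped with the repository) would run only the preliminary steps and the elab step, restricted to the AUC and Smin metric groups:

    perl pipeline.pl --resdir my_results --data data/uniprot.1000.genego --pipev 1,2,4,5 --pipeh AUC,Smin &> my_log &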

ADS results

ADS results will be printed to my_results directory:
  1. summary.xlsx

    This table contains several scores for each evaluated metric. These include the Rank Correlation (RC) and the False Positive Signal (FPS) for each input dataset T.

    Keep in mind that your results will depend on the input datasets T. For these to be a realistic assessment of metric quality, the input sets should reflect the properties of the datasets to which the selected metric will be applied in the future. Thus, for a general-purpose metric specify several datasets with varying properties (e.g. density, annotation depth). For a more specific evaluation select a dataset that corresponds to the intended application.

    RC values monitor how well the tested metric is able to discriminate annotation sets with different amounts of introduced noise while being robust against other sources of variance. Thus we are looking for metrics that have high RC scores. We recommend metrics that have RC scores above 90-95%.

    FPS values monitor how easily the tested metric can be deceived into ranking False Positive annotation Sets among Artificial Prediction Sets (APS) with high signal. False positive datasets do not contain any real information and are expected to be ranked among APSs with zero signal. Thus we are looking for metrics that have low FPS scores. We recommend avoiding metrics that have FPS scores above 20%.

  2. Summary figures
    1. boxplot1
      Visual analysis of ADS results for six evaluation metrics: US AUC, Fmax, Smin, Resnik A, Resnik D and Lin D. Scores for AP (Artificial Prediction) sets at each ADS signal level are plotted as boxplots and scores for FP (False Positive) sets as horizontal lines. RC and FPS scores are also displayed for each metric in the lower left corner of each plot. This format is retained in all summary boxplots.

    2. boxplot2
      Boxplots comparing standard AUC and AUC-PR (AUC under Precision Recall curve) metrics. For both metrics we test unstructured (US), Gene Centric (GC) and Term Centric (TC) versions.

    3. boxplot3
      Boxplots for every combination between three semantics (Resnik, Lin, AJacc), in columns, and six semantic summation methods (A, B, C, D, E, F), in rows.

    4. boxplot3b
      Similar to boxplot3 with summation method F excluded in order to save space.

    5. boxplot3c
      Similar to boxplot3 with summation methods B and C excluded. Methods B and C have inherent weaknesses and are only included as negative controls.

    6. boxplot4
      Boxplots comparing performance of Smin, SimGIC and Jaccard metrics. For each of these metrics we compare an unstructured (Smin1, SimGIC2 and US Jacc) and gene-centric (Smin2, SimGIC and GC Jacc) versions.

    7. scattered labels
      Scatterplot of FPS scores plotted against RC scores for all tested metrics. Quality metrics appear in the upper left corner of the plot (high Rank Correlation and low False Positive Signal). This region is zoomed in on in the lower panel. Metrics are color coded: US and GC Jacc are plotted in black, Fmax, AUC and AUC-PR in red, Smin and SimGIC in green, and semantic similarity metrics in blue.

  3. Metric scores for AP sets

    Scores assigned by each tested evaluation metric to each of the generated Artificial Prediction Sets (APS) are printed to files with the *.elab extension. The extension refers to the elab C++ executable, which forms the core of the ADS pipeline. Elab files have a simple format. The first few lines start with the comment character "#" and contain the values of the input parameters. Next follows a line of numbers designating the iterated ADS error levels. The rest of the lines form the result matrix: rows represent permutations and columns the error levels.
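    A minimal, hypothetical sketch of this layout (all values below are illustrative only, not real output):

    # -p data/uniprot.1000.genego -k 3 -e 0,1,by=0.5 -r 2 -s 1
    0	0.5	1
    0.97	0.55	0.12
    0.95	0.58	0.10
    0.98	0.52	0.14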

    The ADS pipeline creates a separate subdirectory in my_results for each input dataset T. Metric scores for the AP sets generated from input dataset T are then printed to the subdirectory my_results/T/. For each metric, scores are printed to a separate file with a filename corresponding to the label used for that metric (metrics and metric labels are listed in the ADS Supplementary text).

  4. Metric scores for FP sets

    Scores assigned by the tested evaluation metrics to each of the three False Positive Sets (FP sets) are printed in tab-delimited format to files T.fps.scores.tab, one table for each input dataset T.

    We do not use metric scores for FP sets directly in our evaluation. Instead, as described in the ADS article, these values are converted to the corresponding ADS signal levels. The maximum of these signal levels over the tested FP sets is then designated the False Positive Signal (FPS score). In our results we also give FPS scores separately for each of the tested FP sets in tab-delimited format. These are printed to files T.fps.signal.tab, one table for each input dataset T.

  5. Jaccard index between input T and permuted datasets

    This output allows one to monitor the amount of change introduced to the true dataset T at different ADS signal levels. One file named T.jacc is printed to the my_results directory for each input dataset T. These are tab-delimited tables with five columns (an example line is shown after the list below):

    1. index of ADS error level
    2. index of generated AP set
    3. predicted GO class (from AP dataset)
    4. ADS error level
    5. Jaccard index to the closest true GO class for the same gene (from T dataset)
    One line of values is printed for each annotation line in all generated AP sets.
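    A hypothetical example line (tab-separated, all values illustrative only):

    2	15	GO:0006915	0.3	0.67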

Contact

If you have any questions about the ADS pipeline, please do not hesitate to ask Petri Toronen or Ilja Pljusnin: firstname.lastname(at)helsinki.fi





ADS source code and data files

  • All source code and data files: ADS repository

  • Master branch as a zipped archive: master.zip

  • Clone and checkout Master branch with git:
    git clone https://plyusnin@bitbucket.org/plyusnin/ads.git
    cd ads
    git fetch && git checkout master




ADS User Manual

In this section you will find user manuals for the core parts of the ADS pipeline.
Note that most programs in the ADS pipeline have a short manual that can be printed with the "-h|help" option.

pipeline.pl

This script is the main entry point to the ADS pipeline.

USAGE:
pipeline.pl --resdir dir -r|rmode int -s|srand int --data str --obo file --goa file --pipeh str --pipev str [-h]


Option argument default description
--resdir directory res Results are printed to this directory
--datadir directory data Generated data files are printed to this directory
-r|rmode int 2 Specifies the number of parent GO classes for the "shifting to semantic neighbors" step. This is one of the key steps in ADS generation, aiming to generate variation within ADS signal levels. The larger the rmode value, the larger the variation introduced. Experimenting with this parameter suggests that values between 2 and 4 are sufficient to separate robust metrics from excessively sensitive ones. We argue that a quality metric should not fluctuate excessively when GO classes are interchanged with semantically similar classes. Instead, a quality metric should only monitor truly erroneous annotations, i.e. annotations that have a certain semantic distance to the correct annotations. An example call appears after this table.
-s|srand int 1 Seed for the random number generator
--data file(list) data/uniprot.1000.genego,data/cafa2012.genego,data/mouseFunc.genego Comma-separated list of input annotations sets (i.e. the "true sets" T).
By default these are: the CAFAI dataset, the MouseFunc dataset and a sample from UniProt GOA, all supplied with the ADS repository. Alternatively, specify these on line 42.
--obo file data/go-2016-03.obo OBO file specifying directed acyclic graph for the GO.
You can use the default obo file supplied with the ADS repository or you can download the latest version from Gene Ontology Consortium. Please note that your input annotation sets should not include GO classes marked as obsolete in the input ontology.
--goa file data/gene_association.goa_ref_uniprot.flt Tab-delimited file of associations between gene products and GO terms and associated annotation information.
You can use the default GOA association file or you can download GOA file that is more suitable for your analysis from Gene Ontology Annotation Database. ADS pipeline uses GOA file to estimate the information content of GO classes. For best performance ensure that your GOA file includes annotations for the majority of GO terms included in the input true sets T. Information content for terms not included in the GOA file will be extrapolated from the neighboring terms.
--pipeh str All With the pipeh argument you can control which metrics are evaluated. For example, to evaluate only AUC-related metrics type "--pipeh AUC"; to evaluate all Smin-related metrics type "--pipeh Smin". Metrics are grouped by their properties (for details refer to the ADS Supplementary). Metrics are evaluated in parallel, and thus we call these steps parallel or "horizontal" steps.

Accepted values are:

  • AUC : evaluate AUC/AUC-PR metrics + Fmax
  • Jacc : evaluate metrics based on Jaccard index
  • Smin : evaluate Smin-related metrics
  • Sem : evaluate Semantic similarity metrics
  • a combination of above, e.g. --pipeh AUC,Jacc,Smin
  • All : evaluate all metrics evaluated in the ADS article
  • data : generate just the datasets

--pipev str 1:2,4:10 With pipev you can control which non-parallel or "vertical" steps are executed. The ADS pipeline is divided into 10 vertical steps, allowing for execution/re-execution of specific steps as needed:
  1. extract sets of ancestral classes for all GO classes in ontology, save to data/goparents.tab
  2. estimate information content values from GO class frequencies in GOA file, save to data/ic.tab and data/ic.ext.tab
  3. estimate information content values as defined by Clark & Radivojac (2013), save to data/ic2.tab and data/ic2.ext.tab
  4. sample negative annotation set from T + assign prediction scores
  5. generate AP sets and score these with selected evaluation metrics (elab runs)
  6. generate FP sets
  7. score FP sets against AP sets
  8. calc Jaccard index for a sample of AP sets
  9. print summary tables
  10. print summary figures
-h|help false print user manual
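As an illustration, the following sketch uses a larger semantic neighborhood (-r) and a different random seed (-s) while restricting the run to the Jaccard-based metrics (all file paths are the repository defaults):

perl pipeline.pl --resdir test_rmode4 --data data/uniprot.1000.genego -r 4 -s 7 --pipev 1,2,4,5 --pipeh Jacc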

run_elab_parallel.pl

This script executes step 5 in the ADS pipeline: it generates and scores Artificial Prediction (AP) sets. In essence this is the core step, where we generate the Artificial Dilution Series (ADS) and score the AP sets with the specified evaluation metrics. AP sets are generated and scored separately for each metric (with the exception of Smin-related metrics, which are scored as a group). This script is thus a user interface to the core C++ library doing the analysis.

USAGE:
perl/run_elab_parallel.pl -k|perm int -e|emode str -r|rmode int -s|srand int --goc_db int --goc_on int -p|pos file -n|neg file -b|obo file --ic file --ic2 file --resdir dir [--pipeh str] [--verbal] [--help]


Option argument description
--resdir directory Results are printed to this directory
-k|perm int Number of permutations, i.e. number of AP sets generated at each ADS signal level
-e|emode str String in "num,num,by=num" format specifying ADS error levels, e.g. "0,1,by=0.1"
-r|rmode int Number of terms in the semantic neighborhood used for the "shifting to semantic neighbors" step
-s|srand int Seed for the random number generator
--goc_db int Number of unique GO terms in the input dataset T
--goc_on int Number of unique GO terms in the input GO ontology
-p|pos file GO annotation set that is used as the starting point of ADS analysis (i.e. the "True set" T)
-n|neg file Negative annotations generated from T in pipeline step 4
-b|obo file OBO file specifying directed acyclic graph for the input GO ontology
--ic file File containing information content values in "GO_class\tIC_value\n" format. This is generated at pipeline step 2 from the GOA file, but you can also specify your own ic file.
--ic2 file File containing information content values as defined by Clark & Radivojac (2013). Smin metrics are run with both --ic and --ic2 files. ic2 file is generated from ic file at pipeline step 3.
--pipeh str Horizontal part of the pipeline, i.e. subset(s) of metrics to evaluate. Accepted values are (case insensitive):
  • AUC : evaluate AUC/AUC-PR metrics + Fmax
  • Jacc : evaluate metrics based on Jaccard index
  • Smin : evaluate Smin-related metrics
  • Sem : evaluate Semantic similarity metrics
  • a combination of above, e.g. --pipeh AUC,Jacc,Smin
  • All : evaluate all metrics evaluated in the ADS article
  • data : generate just the ADS datasets
-v|verbal verbal mode
-h|help print user manual
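A hedged sketch of a direct call (the negative-annotation file and the two GO-term counts below are placeholders: step 4 of the pipeline produces the negative set, and the counts depend on your dataset and ontology):

perl perl/run_elab_parallel.pl -k 50 -e 0,1,by=0.1 -r 2 -s 1 --goc_db 3000 --goc_on 44000 -p data/uniprot.1000.genego -n path/to/neg.set -b data/go-2016-03.obo --ic data/ic.tab --ic2 data/ic2.tab --resdir test --pipeh Smin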

ADS C++: library

Computation-intensive steps in the ADS pipeline are implemented as a collection of C++ methods. The library provides many functionalities, which are listed in the following table. All code and binaries are available for download.


header file binary pipeline step function
golib.h
  • reading, writing and manipulating GO annotation datasets
  • calculating a variety of evaluation metrics:
    1. ROC AUC metrics
    2. AUC-PR metrics
    3. metrics based on Jaccard index
    4. semantic similarity metrics including Resnik, Lin, AJacc and various combinations
    5. Smin, SimGIC and similar metrics
gograph.h parsing GO OBO files, manipulating GO graph, extracting neighborhoods from GO graph
gorelatives.cpp gorelatives step 1 extracting neighbors in the GO graph for query GO node(s)
main.cpp elab step 5 generating AP sets and scoring these with selected evaluation metric
goscores.cpp goscores step 7 scoring any GO annotation set P against a true set T with selected evaluation metric

ADS C++: gorelatives

gorelatives is a user interface to the ADS library for extracting neighbors in the GO graph for a given GO class (or a list of GO classes).

USAGE:
bin/gorelatives -b obo_file -q qid|qfile|all -r isa[,partof,regulates,altid,consider,replacedby] -d parents|children|relatives [-k max_num] [-l] [-v]


Option argument description
-b file GO OBO 1.2 file
-q str|file query id(s) (without GO-qualifier). Valid options:
  • a single query id
  • a list of query ids in a file (one per line)
  • literal "all" to print neighbors for all GO classes in ontology
-r str[,str[,str..]] list of literals that specify relations/edges to search through. Valid literals:
  • isa
  • partof
  • regulates
  • altid
  • consider
  • replacedby
-d str literal that specifies the direction of the search. Valid literals:
  • parents : iterate nodes towards the root of the graph
  • children : iterate nodes towards the leaves of the graph
  • relatives : iterate nodes in both directions
-k int maximum number of neighbors to retrieve
-l print namespace and name for each query node
-v verbal mode: e.g. print warning if no relatives found for a given query id
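For example, a sketch that lists up to 20 ancestors of a single GO class through is-a and part-of edges (the query id below is only an illustration; whether the "GO:" prefix should be included may depend on how ids appear in your obo file):

bin/gorelatives -b data/go-2016-03.obo -q GO:0006915 -r isa,partof -d parents -k 20 -l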

ADS C++: elab

elab generates AP sets, scores them with the specified metric and prints the results to a tab-delimited ASCII table.

USAGE:
bin/elab -k int -e str -r int -s int -g -p goa_true -n goa_neg -b obo_file -i ic_file -o output -j output2 -vh -m LIST=str,TH=str,SUMF1=str,SUMF2=str,SF=str,SIMF=str


Option argument default description
-k int 100 Number of permutations, i.e. number of AP sets generated at each ADS error level.
-e str 0,1,by=0.2 string literal specifying ADS error levels. Format: min,max,by=num[,length=num]
-r int 0 the size of semantic neighborhood for the "shifting" step
-s int 1 seed for the random number generator
-g off propagate annotations to ancestral nodes
-p file true GO annotation set T [gene\tGO\tscore]
-n file negative GO annotation set Pneg [gene\tGO\tscore]
-b file OBO 1.2 file
-i file information content file [GO\tIC]
-o file Output file for metric scores (elab file). Format:
  1. First few lines start with a comment character "#". These lines contain input arguments.
  2. Next line contains ADS error levels
  3. The rest of the file contains metric scores in a matrix: rows represent permutations and columns error levels.
-j file off output file for Jaccard distance between generated APS and T. Format [errorind\tpermind\tGO\terror\tJacc]
-v off verbal mode
-h print user manual
-m str data output mode/evaluation method, valid options:
  • data : print generated AP sets
  • LIST=str,TH=str,SUMF1=str,SUMF2=str,SF=str,SIMF=str,MINCS=int,GOC=int : score AP sets with a given evaluation metric

Specifying an evaluation metric (the "-m" argument)

attribute options default description
LIST gene|go|gene_go go
  1. gene : term-centric (TC) metric
  2. go : gene-centric (GC) metric
  3. gene_go : unstructured (US) metric
TH all|geneth|th th
  1. all : for each gene, evaluate ALL predictions against the true annotations (no threshold); summarize scores across genes with SUMF1
  2. geneth : for each gene, evaluate predictions of all possible lengths (1st, 1st & 2nd, .., all); summarize scores first across thresholds (SUMF2), then across genes (SUMF1)
  3. th : as in CAFAI, iterate from the min score to the max score in N steps; summarize scores first across genes (SUMF1), then across thresholds (SUMF2)
SUMF1 max|mean|median mean function for summarizing scores across genes (GC metrics) or GO terms (TC metrics)
SUMF2 max|mean|median max function for summarizing scores across thresholds
SF metric name Jacc select one of the following core functions. In the following definitions, P = predicted annotations and T = true annotations
  1. U : Mann-Whitney U-statistic
  2. AUC3 : Area Under ROC curve (ROC AUC), calculated as U/(|P|+|T|)
  3. AUCPR : Area Under the Precision-Recall curve, calculated by the trapezoidal rule
  4. Jacc : Jaccard index between P and T
  5. wJacc : Jaccard index weighted by information content
  6. Fmax : maximum harmonic mean of precision and recall, maximized over thresholds
  7. Smin : calculates a collection of metrics related to Smin
    1. Smin1 : Smin metric (CAFAII)
    2. Smin2 : gene-centric version of Smin1
    3. Smin3 : Smin1 x genen
    4. SimUI : unweighted Jaccard index, identical to Jacc
    5. SimGIC : ic weighted Jaccard index (gene-centric)
    6. SimGIC2 : unstructured version of SimGIC
  8. score_A : mean value of SIM*(P,T)
  9. score_B : mean value of column maxima of SIM(P,T)
  10. score_C : mean value of row maxima of SIM(P,T)
  11. score_D : mean value of methods B and C
  12. score_E : min value of methods B and C
  13. score_F : mean value of pooled column and row maxima of SIM(P,T)
    *SIM, similarity matrix with rows representing predicted GO classes and columns true GO classes
SIMF semantic similarity compute similarity matrix with one of the following core functions
  1. pJacc : Jaccard index between the ancestral nodes of P and T, also referred to as AJacc
  2. Resnik : Resnik (1995)
  3. Lin : Lin (1998)
  4. PathLin : Koskinen et al (2014)
  5. JC : Jiang & Conrath (1997)
  6. id : identity function (baseline control)
MINCS int 10 Minimum accepted size for GO classes in P dataset
GOC int Number of unique GO classes in input ontology (applied to AUC metrics)
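Putting these attributes together, a sketch of an elab call that scores AP sets with the Gene Centric Jaccard metric could look like the following (the negative set is a placeholder produced by pipeline step 4; the other paths are repository defaults):

bin/elab -k 20 -e 0,1,by=0.2 -r 2 -s 1 -g -p data/uniprot.1000.genego -n path/to/neg.set -b data/go-2016-03.obo -i MyData/ic.tab -o gc_jacc.elab -m LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=Jacc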

ADS C++: goscores

goscores is designed to score a given GO annotation set against a true set.

USAGE:
bin/goscores -g -p goa_pred -t goa_true -b obo_file -i ic_file -o output -vh -m LIST=str,TH=str,SUMF1=str,SUMF2=str,SF=str,SIMF=str,MINCS=int,GOC=int


Option argument default description
-g off propagate annotations to ancestral nodes
-p file predicted GO annotation set P [gene\tGO\tscore]
-t file true GO annotation set T [gene\tGO]
-b file OBO 1.2 file
-i file information content file [GO\tIC]
-o file output file
-m str A list of attribute-value-pairs defining evaluation metric. Format: LIST=str,TH=str,SUMF1=str,SUMF2=str,SF=str,SIMF=str,MINCS=int,GOC=int
For the list of accepted attributes and values please refer to the elab manual.
-v off verbal mode
-h print user manual




Example runs for pipeline.pl

The main user interface is the pipeline.pl script. It can be used to repeat the analysis described in the ADS manuscript and/or to run the same analysis on other dataset(s).

Below we introduce pipeline calls in order of increasing complexity and resource consumption.

Check the manual page for more information on commands.

1. Simplest run to check first steps

Simple run that checks preliminary processing steps before elab runs, excluding IC2 calculation

perl pipeline.pl --resdir test --data data/uniprot.1000.genego --pipev 1,2,4

Notes:

  • This does not perform any elab runs. Only steps 1,2 and 4 are being performed.
  • Resulting files are actually directed to the data directory. We consider these files to be input data files.
  • So nothing goes to the defined result directory yet.
  • Input will be our uniprot dataset, located in the data/uniprot.1000.genego file.

1.B. Simple run with your own data

Simple run that checks preliminary processing steps before elab runs, with your own data

perl pipeline.pl --datadir MyData/ --data path/to/data/xxx --obo path/to/data/xx.obo --goa path/to/data/xx.goa --pipev 1,2,3,4

Notes:

  • Here the processed dataset will be your own data (path/to/data/xxx)
    • This will be the set of correct annotations, used as a starting point for elab runs
  • OBO structure will be taken from file path/to/data/xx.obo
  • GO annotations will be collected from file path/to/data/xx.goa
    • This is used to calculate the GO class sizes and IC scores for GO classes
  • Step 3 will be very time consuming.
  • Resulting files are directed to the specified data directory. We set it to MyData here.
    • This way we are not overwriting files in the data directory

2. Simple run to check elab function

Simple run that performs just one small elab run, after the first test. This can be run only if example 1 was completed.

perl pipeline.pl --resdir test --data data/uniprot.1000.genego --perms 50 --pipev 5 --pipeh Smin

Notes:

  • IMPORTANT: This will work only if example 1 was run first
  • Results are directed to the test directory
  • Input files will be the ones derived from the uniprot dataset (data/uniprot.1000.genego file)
  • Only the elab step will be performed (--pipev 5)
  • The number of generated datasets at each signal level (--perms) is set to 50. Use this only for testing; for real analyses use values between 200 and 1000
  • This run only tests the Smin-related Evaluation Metrics (--pipeh Smin)
    • Therefore this only creates two parallel runs at the parallelized step.

3. Check preliminary steps and do a small elab run

Simple run that checks preliminary steps (excluding IC2 calculation) and performs small elab run

perl pipeline.pl --resdir test --data data/cafa2012.genego --perms 40 --pipev 1,2,4,5 --pipeh Jacc

Notes:

  • Results are again directed to the test directory
  • Input files will be the ones derived from the CAFA dataset (data/cafa2012.genego file)
  • The preliminary steps (1,2,4) and the elab step (5) will be performed.
  • The number of generated datasets at each signal level (--perms) is set to 40. Use this only for testing; for real analyses use values between 200 and 1000
  • This run only tests the Jaccard-related Evaluation Metrics (--pipeh Jacc)
    • Therefore this also creates only two parallel runs at the parallelized step.

4. Test whole pipeline with one dataset

Simple run that performs the preliminary steps (excluding the IC2 calculation), runs all analysis steps and creates the result tables, but skips the figures

perl pipeline.pl --resdir test --data data/cafa2012.genego --perms 100 --pipev 1,2,4,5,6,7,9 --pipeh AUC

Notes:

  • This is the first example of the whole analysis task, with preliminary steps (1,2,4), analysis steps (5,6,7) and summaries (9)
  • Input files will be ones derived from CAFA dataset (data/cafa2012.genego file) in steps 1,2 and 4
  • You could skip steps 1, 2 and 4 if you have already performed them for the processed dataset in previous runs.
  • Again we skip the calculation of IC2 (step 3), as it is significantly more time consuming in our pipeline.
  • Analysis steps include:
    • Elab runs (step 5)
    • Generation of False Positive Sets (step 6)
    • Scoring False Positive Sets (step 7)
  • Summaries of the results are generated with steps 7 and 9.
    • We skip step 10 here, as it requires results from all the compared evaluation metrics.
    • Here we limited the analysis to the AUC evaluation metrics only
  • Results are again directed to the test directory. This will overwrite previous result tables, if they exist
  • Number of generated datasets at each signal level (--perms) is set to 100.
  • This run only tests the Area Under Curve related Evaluation Metrics (--pipeh AUC)
    • This generates 7 parallel runs

5. Whole manuscript pipeline with less repeats

Repeat manuscript work, but with a significantly smaller number of repeats (=100)

perl pipeline.pl --resdir test_p_100 --perms 100

Notes:

  • Here results are directed to folder test_p_100
  • In principle, steps 1, 2 and 4 could be omitted if they were already done with all datasets
  • The perms parameter again sets the number of repeats (here 100)
  • This run processes all three datasets (discussed in the manuscript)
    • Therefore a large part of the code is repeated three times
  • Now the parallel part will generate 30 parallel runs

6. Repeat the manuscript work

Repeat manuscript work with default number of repeats (1000 repeats !)

perl pipeline.pl --resdir repeat_manuscript

Notes:

  • Here we direct the results to folder repeat_manuscript
  • This code processes all three datasets.
  • This run will generate 30 parallel runs. Each run will process 11*1000 generated datasets.
    • These are further repeated three times (=three input datasets)
  • Therefore, make sure you have enough computing resources for this run.



Example runs for goscore program

Here we demonstrate how to calculate different evaluation metrics using the goscores executable. Notice that using goscores requires additional data files besides the predicted set of GO classes and the correct set of GO classes. These can be obtained by running our pipeline.pl script with parameter --pipev 1,2,3 in the following manner:

perl pipeline.pl --datadir MyData/ --obo path/to/data/xx.obo --goa path/to/data/xx.goa --pipev 1,2,3

NOTES

  • All the final and temporary data files are now directed to folder MyData
  • Remember to use these files in your call to the goscores executable.
  • OBO structure will be taken from file path/to/data/xx.obo
  • GO annotations will be taken from file path/to/data/xx.goa

Notice that although these steps, and especially step 3, will take time, you need to run them only once for each set of GO annotations (GOA file). Below is a demo where we use the obo and goa files that we distribute with the ADS repository.

perl pipeline.pl --datadir MyData/ --obo data/go-2016-03.obo --goa data/gene_association.goa_ref_uniprot.flt --pipev 1,2,3

1. Prototype call for goscores program

Here is the prototype call:

bin/goscores -p path/to/data/pred.set -t path/to/data/corr.set -b path/to/data/obo.file -i path/to/data/ic_file -m $mode 1> res.file

NOTES:

  • pred.set Evaluated set of GO predictions
  • corr.set Correct GO predictions for a set of genes
  • obo.file Obo file containing go structure
  • ic_file Information Content (IC) scores for each GO class
  • res.file Result file
  • $mode the string that defines the evaluation metric (discussed below)

Here is a demo call that uses the uniprot.1000 dataset as the correct set. All the IC scores and parent lists are predefined in the first pipeline run. Here the prediction set is left open (xxx). This would be the result from a classifier that makes predictions for the genes in the correct set (data/uniprot.1000.genego).

bin/goscores -p xxx -t data/uniprot.1000.genego -b data/go-2016-03.obo -g -i MyData/ic.tab -m LIST=gene,TH=all,SUMF1=mean,SF=AUCPR 1> res.file

NOTES:

  • correct result is data/uniprot.1000.genego
  • predicted set is xxx
  • obo file is data/go-2016-03.obo
  • IC file is MyData/ic.tab
  • mode is LIST=gene,TH=all,SUMF1=mean,SF=AUCPR (Term Centric AUC-PR)

2. Mode parameter in goscores

The goscores program allows you to calculate many evaluation metrics with different settings. These metrics and their variations are defined in goscores with the mode parameter. Below we show in detail how to define different evaluation metrics using the mode parameter. An overview of the evaluation metrics tested in the manuscript is found here. More information on these various evaluation metrics can be found in our preprint and in our supplementary text.

Defining mode parameter for AUC evaluation metrics

Here is the demo call for the AUC and Fmax metrics. These require propagation to ancestor classes (-g). These commands do not require IC values.

bin/goscores -p xxx -t data/uniprot.1000.genego -g -b data/go-2016-03.obo -m $mode

Here we explain how the mode parameter is used to call different Area Under Curve (AUC) evaluation metrics. This is the string that should be given after "-m" on the command line.

abbreviation description mode string
US AUC Unstructured AUC LIST=gene_go,SF=AUC3,GOC=$goc_on
GC AUC Gene Centric AUC LIST=go,TH=all,SUMF1=mean,SF=AUC3,GOC=$goc_on
TC AUC Term Centric AUC LIST=gene,TH=all,SUMF1=mean,SF=AUC3
US AUCPR Unstructured AUC-PR LIST=gene_go,SF=AUCPR
GC AUCPR Gene Centric AUC-PR LIST=go,TH=all,SUMF1=mean,SF=AUCPR
TC AUCPR Term Centric AUC-PR LIST=gene,TH=all,SUMF1=mean,SF=AUCPR
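For example, substituting the US AUCPR row into the demo call above gives (the result file name is arbitrary):

bin/goscores -p xxx -t data/uniprot.1000.genego -g -b data/go-2016-03.obo -m LIST=gene_go,SF=AUCPR 1> res.us_aucpr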

Defining mode parameter for Fmax evaluation metrics

Here we explain how the mode parameter is used to call Fmax evaluation metric. Our manuscript only tested the Gene Centric Fmax.

abbreviation description mode string
Fmax Gene Centric Fmax LIST=go,SF=Fmax

Defining mode parameter for group semantic measures

Here is the demo call for group semantic measures. These are Smin, SimGIC, plain Jaccard and their variations. These require propagation to ancestor classes (-g). They also require IC values (-i filename). With the Smin and SimGIC commands below you can use either IC or IC2 values.

bin/goscores -p xxx -t data/uniprot.1000.genego -g -b data/go-2016-03.obo -i MyData/ic.tab MyData/goparents.tab -m $mode

Here we explain how the mode parameter is used to call group semantic measures

abbreviation description mode string
Smin1 Unstructured Smin SF=Smin1
Smin2 Gene Centric Smin SF=Smin2
Smin3 Variation on Smin1 SF=Smin3
SimGIC Gene Centric wJacc SF=SimGIC
SimGIC2 Unstructured wJacc SF=SimGIC2
SimAll All 5 scores above SF=SminAll
GC Jacc Gene Centric Jacc LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=Jacc
US Jacc Unstructured Jacc LIST=gene_go,TH=th,SUMF=max,SF=Jacc
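For example, substituting the Gene Centric SimGIC row into the demo call above gives:

bin/goscores -p xxx -t data/uniprot.1000.genego -g -b data/go-2016-03.obo -i MyData/ic.tab MyData/goparents.tab -m SF=SimGIC 1> res.simgic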

Defining mode parameter for semantic similarities

Here is the demo call for semantic similarities. You should NOT use propagation to parents with these metrics. You must use IC values with these metrics; however, you should NOT use IC2 values.

bin/goscores -p xxx -t data/uniprot.1000.genego -b data/go-2016-03.obo -i MyData/ic.tab MyData/goparents.tab -m $mode

Here we explain how the mode parameter is used to call different versions of the Lin semantic similarity. Here A, B, C, D, E and F represent different ways in which semantic similarities can be combined (see explanations here).

Lin A Lin score with A LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_A,SIMF=Lin
Lin B Lin score with B LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_B,SIMF=Lin
Lin C Lin score with C LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_C,SIMF=Lin
Lin D Lin score with D LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_D,SIMF=Lin
Lin E Lin score with E LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_E,SIMF=Lin
Lin F Lin score with F LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_F,SIMF=Lin

Here we show the Resnik semantic similarity and the parental Jaccard (Ancestral Jaccard) with semantic similarity summary method A. Naturally, you can run these semantics with the other similarity summary methods.

Resnik A Resnik score with A LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_A,SIMF=Resnik
AJacc A Ancestral Jacc with A LIST=go,TH=th,SUMF1=mean,SUMF2=max,SF=score_A,SIMF=pJacc



Using ADS datasets to test other evaluation metrics

Here we show how one can use the ADS method to test other evaluation metrics. This could also be an in-house developed evaluation metric whose performance you want to test.

1. Create your own ADS datasets

Here is an example pipeline run that generates 20 ADS test datasets at each noise level:

perl pipeline.pl --resdir test/ADS_data/ --data data/uniprot.1000.genego --perms 20 --pipev 5 --pipeh data

NOTES

  • This code assumes that you have already done steps 1, 2, 3 and 4 for the used dataset.
  • The Gold Standard data here is data/uniprot.1000.genego
  • The resulting files are in test/ADS_data/uniprot.1000
  • Each file has comment rows at the beginning (starting with the # character)
  • All 20 datasets from the same noise level are within the same file
  • Each dataset is identified by the running number at the beginning of each row
    • So all the rows that belong, say, to the first dataset will start with the number 0
    • Subsequent datasets are identified by consecutive integers (1, 2, 3, 4, ...)
  • For a real analysis we recommend setting the perms parameter between 200 and 1000.

2. Extract one generated dataset from one result file

Next, one has to extract each individual dataset from an ADS data file.

You can extract, for example, the first dataset from each file using the following type of command:

grep "^0[^0-9]" uniprot.1000.genego.sc.elab.c0 > subset.c0

Here the first number in the grep command defines which dataset you will select

This grep command will include the running number at the beginning of the row

You can also extract the rows so that the identifying integer and the first tab are removed

sed -n -e 's/^0\t//p' uniprot.1000.genego.sc.elab.c0 > subset.c0

Again the first number in the command defines which dataset you will select

Now you can use subset.c0 as an input to your evaluation metric programs
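As a sanity check, the extracted set can also be scored with one of the built-in metrics. A sketch, assuming the remaining columns of subset.c0 follow the gene\tGO\tscore layout that goscores expects for -p (check the header comments of your elab data file):

bin/goscores -p subset.c0 -t data/uniprot.1000.genego -g -b data/go-2016-03.obo -m LIST=go,SF=Fmax 1> subset.c0.fmax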

3. Create False Positive Sets

False Positive Sets (FP sets) are another important test for evaluation metrics. Our manuscript shows that this test reveals weaknesses that the ADS test does not. We have a separate function for generating these:

COMMANDS COMING SOON ...

Citing ADS project

To cite the ADS project, please cite this article:

Plyusnin, Ilya, Liisa Holm, and Petri Törönen. "Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences." PLoS computational biology 15.11 (2019). [PMID: 31682632]

Erratum for the main article

Table 1 contains errors in the column marking the recommended metrics. Our strongest recommendation, ic SimGIC2, is erroneously marked as only a "potential recommendation". Two potential recommendations are missing: ic SimGIC2 and ic2 Smin1. And one erroneous recommendation is included: AJACC E.

The corrected Table 1 is available here.

Supplementary files

Most relevant supplementary files are listed here:

Code and data

All code and data files are available at the main ADS repository