AFP-stacking

This page contains supplementary material related to the article

Optimizing InterProScan feature processing generates a surprisingly good Protein Function Prediction method

Abstract

Automated protein Function Prediction (AFP) is an intensively studied topic. Most of this research focuses on methods that combine multiple data sources, while fewer articles look for the most efficient ways to use a single data source. Therefore, we wanted to test how different preprocessing methods and classifiers would perform in the AFP task when we process the output from InterProScan (IPS). In particular, we present novel preprocessing methods, less commonly used classifiers, and the inclusion of species taxonomy. We also test classifier stacking for combining the results of the tested classifiers. Methods are tested with in-house data and CAFA3 competition evaluation data. We show that adding IPS localisation and taxonomy to the data improves results. Stacking also improves performance. Surprisingly, our best performing methods outperformed all international CAFA3 competition participants in most tests. The results show how preprocessing and classifier combination are beneficial in the AFP task.

Preprint

bioRxiv link

manuscript.pdf

Supplementary text

supplement.pdf

Also available at the end of the manuscript

Data

zip archive

Links to our other projects

Our function prediction web server, PANNZER

Testing and comparing different evaluation metrics for the Automated Function Prediction task (ADS project)

Comparing and improving stratified cross validation for multi-class datasets

Visualization of sequence-level taxonomic similarities at genome level with AAI server