User manual for ISS_ProtSci

ISS_ProtSci [1] is a Python package designed for structural similarity search in the AlphaFold Database v2 (AFDB2). It uses DaliLite to identify database structures that are geometrically similar to a given Query structure. The search uses a transitive closure algorithm to explore neighboring shells based on a precomputed all-against-all comparison, which was generated using Foldseek for its speed, despite some imprecision. The results are annotated by integrating metadata from the UniprotKB sequence database and the Pfam protein family classification, with protein domains identified using hmmsearch. A worked example shows how to run the search, annotation and analysis steps with ISS_ProtSci.

A tutorial is available.

Citation:

1. Liu et al. (2025) 3-D substructure search by transitive closure in AlphaFold Database. Protein Science 34:e70169

Table of contents

Installation via Docker
Manual installation
	System requirements
	Downloads
	Configuration
Worked example
	Easy workflow using shell scripts
	Running the steps manually
	Outputs of analysis script
	Plots
	Superimposed coordinates
Benchmark results
Troubleshooting
Appendix A: Output format of {cd1}.AFDB2.tsv and {cd1}.AFDB2.pf.tsv
Appendix B: Data server

Installation via Docker

This guide will help you set up and run the hao1999/iss_protsci Docker image, which contains ISS_ProtSci. The Docker container bundles all necessary dependencies and tools.

Prerequisites

Ensure Docker is installed on your system. Docker provides a platform to run ISS_ProtSci in an isolated environment with all its dependencies. You can download and install Docker from the official Docker website.

Instructions

1. Pull the Docker Image

docker pull hao1999/iss_protsci

2. Run the Container and download necessary databases

# Create and start a container named iss_protsci, mounting a Docker volume for data persistence. During the initial run, the container will download necessary databases. This includes downloading the AFDB2 database for DaliLite (9.4GB compressed).

docker run -it --name iss_protsci -v iss_data:/data hao1999/iss_protsci:latest

3. Activate the Conda Environment

# Inside the container, activate the iss_env Conda environment using Miniconda.

source /opt/miniconda/bin/activate iss_env

4. Run a Test Case

# Execute the test script (hcase_1 is the name of the output directory).

/root/ISS_ProtSci-1/0_test.csh -o hcase_1

Note on Parallel Processing

The software uses Dali for parallel computations. You can adjust the number of parallel processes to match your CPU resources by modifying the DALI_NPARA value in the configuration file /root/ISS_ProtSci-1/scripts/ISS.ini. The default value is 8.

Verifying the Results

The test script reports success or failure. In the case of failure, run troubleshooting checks or run steps manually.

Manual Installation

If you are unable to use Docker, you can follow the steps below to manually set up the environment. Please be aware that setting up the environment manually requires attention to the versions of the dependencies, and this process can be time-consuming.

System Requirements

Notes on DaliLite Installation

Downloads


# Download scripts, Foldseek and Pfam data, and benchmark results (compressed size: 3 GB)
wget http://ekhidna2.biocenter.helsinki.fi/ISS_ProtSci/ISS_ProtSci.tar.gz
tar -zxvf ISS_ProtSci.tar.gz

# Navigate to the installation directory
cd ISS_ProtSci-1

# Check installation
python3 scripts/ISS_closure.py

# Expected output:
# default_config =  /home/luholm/ISS_ProtSci-1/scripts/ISS.ini
# class Query: None None None
# dowo, dofs False True

# Download AFDB2 database for DaliLite
wget http://ekhidna2.biocenter.helsinki.fi/dali/AF-Digest.tar.gz
tar -zxvf AF-Digest.tar.gz

Site-specific configuration is necessary.

Configuration

Edit scripts/ISS.ini. Find the following lines and update the paths according to your local installation:


# ISS installation directory
ISS_SCRIPT_HOME=/home/luholm/ISS_ProtSci-1/

# Dali installation directory
DALI_HOME=/home/luholm/DaliLite.v5/

# Location of AFDB2 database for DaliLite
DALI_AFDB_DAT_UNCROPPED=/data/liisa/alphafold/DAT/

# Foldseek executable
FOLDSEEK_EXE=/usr/local/bin/foldseek

Configuring Dali Execution

If you have OpenMPI installed, Dali will run in parallel by default. If not, configure Dali to run serially by modifying the following setting:


# Run Dali in parallel
DALI_NPARA=40

Change to:


DALI_NPARA=1

Worked example

Easy workflow using shell scripts

Linux shell scripts are provided which (1) do database search, (2) add metadata to the results, (3) analyze the results using Pfam clans as reference.

To run the shell scripts, we first set a shorthand environment variable ISShome:

ISShome=~/ISS_ProtSci-1/

ISShome is where the distribution package was installed.

Here, we assume that user inputs are set as environment variables (replace values as appropriate):

pdbfile=$ISShome/testset/pdbmr59.ent.gz
cd1=testA
minlali=180
zcut=2.0
clan=CL0192

Setting user-defined values for minlali and zcut is optional, as the scripts have reasonable defaults.

Before you start running the shell scripts, prepare your workspace, creating and navigating to a working directory:

mkdir -p mywork
cd mywork

The shell scripts can be executed as follows (using the environment variables set above):

$ISShome/scripts/1_search.csh -pdbfile $pdbfile -cd1 $cd1 -minlali $minlali -zcut $zcut > $cd1.log
$ISShome/scripts/2_annotate.csh -pdbfile $pdbfile -cd1 $cd1 > $cd1.fragment.stat
$ISShome/scripts/3_analyze.csh -clan $clan -cd1 $cd1 -minlali $minlali -zcut $zcut > $cd1.stat

Precomputed outputs for our test case can be found in $ISShome/testset/case_8/.

Running the steps manually

There is usually no reason to run each step manually. You may want to do so if one of the shell scripts crashes unexpectedly: check which output files are missing and continue from the failed step.

Here, we assume that user inputs are set as environment variables (replace values as appropriate):

pdbfile=$ISShome/testset/pdbmr59.ent.gz
cd1=testA
minlali=180
zcut=2.0
clan=CL0192

Step 1: Transitive closure

python3 $ISShome/scripts/ISS_closure.py --pdbfile $pdbfile --pdbid $cd1 --target AFDB2 --minlali $minlali

This creates $cd1.AFDB2.dali.tsv

Step 2.1: Annotate matched proteins with Uniprot accession, taxonomy id, gene name, species, description, amino acid and DSSP (3-state secondary structure) sequences

python3 $ISShome/scripts/ISS_annotate.py --cd1 $cd1 --pdbfile $pdbfile --target AFDB2

This creates $cd1.AFDB2.tsv and $cd1.AFDB2.seriated.tsv

Step 2.2: Assign common core fragments to Pfam families

hmmlib=$ISShome/pfam-data/Pfam-A.hmm
CLAN_TABLE=$ISShome/pfam-data/clan_pfam.tsv
tmpfasta=$$.fasta
tmpout1=$$.out1
tmpout2=$$.out2
grep AFDB $cd1.AFDB2.tsv | cut -f 2,29 | perl -pe 's/^/>/' | perl -pe 's/\t/\n/' > $tmpfasta
hmmsearch -o $tmpout1 --acc --noali --notextw --cut_tc $hmmlib $tmpfasta
perl $ISShome/scripts/parse_hmmer.pl < $tmpout1 | sort -gk3 | perl $ISShome/scripts/pickfirst.pl > $tmpout2
$ISShome/scripts/tbljoin -l $cd1.AFDB2.tsv $tmpout2 > $tmpout1
$ISShome/scripts/tbljoin -l $tmpout1 $CLAN_TABLE > $cd1.AFDB2.pf.tsv
rm -f $tmpfasta $tmpout1 $tmpout2

This creates $cd1.AFDB2.pf.tsv

Step 2.3: Count matched common core fragments per Pfam family

cut -f 33,35 $cd1.AFDB2.pf.tsv | sort | uniq -c | sort -n

Step 2.4: Do direct Foldseek search for comparison


fs1=$cd1.fsdirect.tsv
fs2=$cd1.fsdirect.dali.tsv
t1=$$.1
# Foldseek run with --max-seqs 50000
python3 $ISShome/scripts/ISS_foldseek.py --pdbfile $pdbfile > $fs1
# validate Foldseek 1st shell (e<1)  by Dali
gawk ' $3 < 1 ' $cd1.fsdirect.tsv | cut -f 2 > $t1
# --pdbfile signals that it is a local file
python3 $ISShome/scripts/ISS_rundali.py --cd1 $cd1 --pdbfile dummy --target $t1 --tsvfile $fs2
# clean up
rm -f $t1

This creates $cd1.fsdirect.tsv (results of Foldseek search) and $cd1.fsdirect.dali.tsv (Dali results on direct Foldseek hits). These files are used for comparing the performance of Foldseek to the transitive closure search.

Step 3.1: Filter clan members that are structurally similar to Query (sensu Dali)

tmplist=$$.list
grep $clan $ISShome/pfam-data/clan_hmmer_tc.tsv | cut -f 2 | sort | uniq > $tmplist

## --pdbfile signals that it is local
python3 $ISShome/scripts/ISS_rundali.py --cd1 $cd1 --pdbfile dummy --target $tmplist --tsvfile $cd1.$clan.TRUE.dali.tsv
rm -f $tmplist

This creates $cd1.$clan.TRUE.dali.tsv

Step 3.2: Output statistics

When all input files exist, this step is best done by calling the shell script:

$ISShome/scripts/3_analyze.csh -clan $clan -cd1 $cd1 -minlali $minlali

Outputs of analysis script

The output is written to a file named $cd1.stat. The command

grep EVAL $cd1.stat

gives a summary table of (TRUE_0, TP_0, P_0, TRUE_1, TP_1, P_1, TRUE_2, TP_2, P_2), where suffix _0 refers to Pfam assignments based on hmmsearch with trusted cutoff, sets _1 are filtered by geometrical similarity (Dali Z-score > 2), and sets _2 are further filtered by common core size. TRUE is the count of AFDB2 proteins in the Pfam family, TP the count of true positives, and P the count of positives. Each query structure has one row for transitive closure, and several rows for Foldseek at varying e-value cutoffs. Our example gives the following data:

        method          TRUE_0  TP_0   P_0     TRUE_1  TP_1   P_1     TRUE_2  TP_2    P_2
EVAL    closure         10300      0   35165   9339    9295   11318   7544    7543    8091
EVAL    foldseek(10)    10300   5673    8969   9339    3412    3700   7544    2846    3009
EVAL    foldseek(1)     10300   3597    3946   9339    3412    3700   7544    2846    3009
EVAL    foldseek(0.1)   10300    661     684   9339     634     654   7544     543     557
EVAL    foldseek(0.01)  10300     66      67   9339      66      67   7544      63      64
EVAL    foldseek(0.001) 10300     59      60   9339      59      60   7544      58      59

TP_0 (no filtering) of transitive closure is undefined, because it only reports Dali-similar matches. Transitive closure clearly reports more true positives than Foldseek. The numbers in the table enable the calculation of precision (pr=TP/P), recall (rc=TP/TRUE) and their harmonic mean, the F1-score (F1=2*pr*rc/(pr+rc)). The F1-score can be used to compare the performance of binary classifiers. In our example, transitive closure achieves an F1-score of 0.96 (set _2), which is clearly above the best Foldseek value of 0.59 (set _0 at e-value cutoff 10).
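
The calculation can be reproduced with a few lines of Python. This is a minimal sketch, not part of ISS_ProtSci: the function name f1_from_counts is hypothetical, and the counts are copied by hand from the EVAL table above.

```python
def f1_from_counts(true_count, tp, p):
    """Return (precision, recall, F1) from TRUE, TP and P counts."""
    if tp == 0 or p == 0 or true_count == 0:
        return 0.0, 0.0, 0.0
    precision = tp / p
    recall = tp / true_count
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the EVAL table above.
closure_2 = f1_from_counts(7544, 7543, 8091)       # closure, set _2
foldseek10_0 = f1_from_counts(10300, 5673, 8969)   # foldseek(10), set _0

print("closure _2      F1 = %.2f" % closure_2[2])     # 0.96
print("foldseek(10) _0 F1 = %.2f" % foldseek10_0[2])  # 0.59
```

The same function can be applied to any other (TRUE, TP, P) triple in the table, for example to compare Foldseek at different e-value cutoffs.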

Counting protein matches can be biased to large families. The following fields report how many member families of the clan have at least one match (transitive closure search reached all but one family):

Finally, there are Z-score histograms that allow the comparison of transitive closure and Foldseek. Only matches that pass the filter on common core size are included. Both methods find the same proteins with the highest Z-scores, but Foldseek starts dropping geometrically good matches below Z-scores of 18, which implies very strong structural resemblance.

Below, the histograms for our example case are shown side by side:

Z-score   closure_count   closure_cum_count       foldseek_count   foldseek_cum_count
42      1       1       1       1
36      4       5       4       5
35      4       9       4       9
34      2       11      2       11
33      3       14      3       14
31      2       16      2       16
30      15      31      15      31
29      8       39      8       39
28      7       46      7       46
27      6       52      6       52
26      2       54      2       54
25      1       55      1       55
24      3       58      3       58
23      1       59      1       59
17      2       61      0       59
16      38      99      10      69
15      43      142     8       77
14      69      211     20      97
13      116     327     25      122
12      297     624     68      190
11      799     1423    218     408
10      1800    3223    658     1066
9       2835    6058    1244    2310
8       1743    7801    621     2931
7       233     8034    71      3002
6       47      8081    6       3008
5       8       8089    1       3009
4       1       8090    0       3009
3       1       8091    0       3009

Plots

Histograms of Z-scores

The histograms are found at the end of $cd1.stat. In Linux, the data blocks can be copy-pasted to files and merged:

gawk -v minlali=$minlali ' $5 > minlali ' $cd1.AFDB2.dali.tsv | cut -f 3 | python3 $ISShome/scripts/histcol.py > x
gawk -v minlali=$minlali ' $5 > minlali ' $cd1.fsdirect.dali.tsv | cut -f 3 | python3 $ISShome/scripts/histcol.py > y
# convert to tsv, select columns to plot
paste x y| perl -pe 's/ +/\t/g' | cut -f 1,2,6

The first column is the Z-score bin, the second is the count for transitive closure search, and the third is the count for direct Foldseek search. Counts are of matched proteins with more than $minlali structurally equivalent positions with the Query structure. The difference between the closure and Foldseek curves in a scatterplot shows the gain by transitive search.
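
For readers who prefer to build such a histogram in Python, the sketch below shows one way to obtain counts and cumulative counts per Z-score bin. It is not the histcol.py script, whose implementation is not described here; the function name zscore_histogram is hypothetical, and binning by truncation to an integer is an assumption.

```python
from collections import Counter

def zscore_histogram(zscores):
    """Return rows (bin, count, cumulative_count), highest bin first.

    Assumption: each Z-score is binned by truncation to an integer.
    """
    counts = Counter(int(z) for z in zscores)
    rows = []
    cumulative = 0
    for zbin in sorted(counts, reverse=True):
        cumulative += counts[zbin]
        rows.append((zbin, counts[zbin], cumulative))
    return rows

# Made-up Z-scores, for illustration only.
example = [42.1, 36.5, 36.0, 17.9, 17.2, 16.4]
for zbin, count, cum in zscore_histogram(example):
    print(zbin, count, cum, sep="\t")
```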

Scatterplot of alignment length vs. Dali Z-score

This Linux command extracts the Z-score, alignment length and Pfam labels:

cut -f 4,6,33,35 $cd1.AFDB2.pf.tsv

Use the labels to group the data into data series representing different Pfam families or clans.
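
The grouping can also be done in Python before plotting. The following is a minimal sketch under stated assumptions: the function name group_by_label is hypothetical, column numbers follow Appendix A, and the demo rows and their Pfam labels are made up.

```python
import csv
import io
from collections import defaultdict

def group_by_label(tsv_handle, label_col=35):
    """Group (ali_length, z_score) points by the label in column label_col.

    Column numbers are 1-based as in Appendix A: 4 = Z-score,
    6 = Ali-length, 33 = pfam, 35 = clan.
    """
    series = defaultdict(list)
    reader = csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) < label_col:
            continue  # skip short or malformed rows
        try:
            point = (int(row[5]), float(row[3]))
        except ValueError:
            continue  # skip rows without numeric values
        series[row[label_col - 1] or "NA"].append(point)
    return dict(series)

def _demo_row(z, lali, pfam, clan):
    """Build a 35-column row with only the four relevant fields filled in."""
    row = [""] * 35
    row[3], row[5], row[32], row[34] = str(z), str(lali), pfam, clan
    return "\t".join(row)

demo = "\n".join([
    "\t".join("col%d" % i for i in range(1, 36)),   # stand-in header
    _demo_row(30.1, 250, "PF_demo1", "CL0192"),
    _demo_row(12.4, 190, "PF_demo2", "CL0192"),
    _demo_row(8.7, 185, "PF_demo3", "NA"),
])
series = group_by_label(io.StringIO(demo))
for label, points in series.items():
    print(label, points)
```

To group by Pfam family instead of clan, call group_by_label with label_col=33. For real data, pass an open file handle on $cd1.AFDB2.pf.tsv instead of the in-memory demo.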

Stacked alignments

Stacked alignments can be extracted from $cd1.AFDB2.tsv (or $cd1.AFDB2.pf.tsv) and displayed as text or converted to Fasta format for display in a multiple sequence alignment viewer. Consult Appendix A for filtering options to select subsets of data.

# Secondary structure strings in stacked alignment
gawk ' $2 == "AFDB2" ' $cd1.AFDB2.tsv | cut -f 30
# The same as above filtered on alignment length
gawk -v minlali=$minlali ' $2 == "AFDB2" && $6 > minlali ' $cd1.AFDB2.tsv | cut -f 30
# Seriated data looks nicer
gawk ' $2 == "AFDB2" ' $cd1.AFDB2.seriated.tsv | cut -f 30
# Fasta formatted sequences in stacked alignment
gawk ' $2 == "AFDB2" ' $cd1.AFDB2.tsv | cut -f 2,29 | perl -pe 's/^/>/' | perl -pe 's/\t/\n/'

Superimposed coordinates

Running the transitive closure program only requires a locally installed Dali Digest of AFDB2, not the original PDB files in the AlphaFold Database. If you want to work with an AFDB2 model found as a match, you can do the following steps:
  1. Download a target model from AlphaFold Database: get the Uniprot accession number from column 9 of $cd1.AFDB2.tsv, go to https://www.uniprot.org/uniprotkb/, paste the accession number in the search box, navigate to Structure section of the entry page, click on AlphaFold download icon. This gives you a PDB formatted file.
  2. Get the rotation matrix U and translation vector T from columns 25 and 26 of $cd1.AFDB2.tsv, respectively.
  3. Apply the transformation UX+T to target model coordinates X. This superimposes the target model onto the Query structure in a least squares fit of the structurally equivalent CA atoms. In our example, let's say n5nvA is an interesting match:

    cd1=testA
    cd2=n5nvA
    # replace file name by your downloaded file
    afdbfile=$ISShome/n5nv.pdb
    gawk -v cd2=$cd2 ' $2 == cd2 ' $cd1.AFDB2.tsv | cut -f 25,26 | perl $ISShome/scripts/ISS_applymatrix.pl $afdbfile > $cd2.transformed.pdb

Now you can load the Query structure and $cd2.transformed.pdb in your favourite molecular graphics viewer. In our example, column 5 of $cd1.AFDB2.tsv gives an RMSD of 1.8 Å, which will look good. Note that Dali does not optimize RMSD, and some structural alignments yield high RMSD values (values above 5 Å definitely look poor in superimposition).
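
For illustration, the transformation UX+T itself can be sketched in plain Python. This is not the ISS_applymatrix.pl script: the function names are hypothetical, and parsing of columns 25 and 26 is left out because their text layout is not described in this manual. The coordinate positions used in transform_pdb_line are the fixed columns 31-54 of PDB ATOM/HETATM records.

```python
def apply_transform(coords, U, T):
    """Return U*x + T for every point x in coords.

    U is a 3x3 rotation matrix given as three rows, T a 3-vector.
    """
    out = []
    for x in coords:
        out.append(tuple(
            sum(U[i][j] * x[j] for j in range(3)) + T[i]
            for i in range(3)
        ))
    return out

def transform_pdb_line(line, U, T):
    """Transform one ATOM/HETATM record; other lines pass through unchanged."""
    if not line.startswith(("ATOM", "HETATM")):
        return line
    xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    x, y, z = apply_transform([xyz], U, T)[0]
    return "%s%8.3f%8.3f%8.3f%s" % (line[:30], x, y, z, line[54:])

# Example: rotate 90 degrees about the z axis, then shift by 1 along x.
U = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
T = [1.0, 0.0, 0.0]
print(apply_transform([(1.0, 0.0, 0.0)], U, T))   # [(1.0, 1.0, 0.0)]
```

Applying transform_pdb_line to every line of the downloaded model, with U and T taken from the match of interest, would produce a superimposed copy of the file.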

Benchmark results

Results for twelve test cases are found under $ISShome/testset/. The results in subdirectories $ISShome/testset/case_*/ were generated using the script run_testset.csh. There is no need to rerun this script, as benchmark results are pre-computed.

For each case, lines of the file testA.stat starting with "EVAL" give a summary of (TRUE_0, TP_0, P_0, TRUE_1, TP_1, P_1, TRUE_2, TP_2, P_2), where suffix _0 refers to Pfam assignments based on hmmsearch with trusted cutoff, sets _1 are filtered by geometrical similarity (Dali Z-score > 2), and sets _2 are further filtered by common core size. TRUE means ground truth, TP means true positive and P means positive. Each query structure has one row for transitive closure, and several rows for Foldseek at varying e-value cutoffs.

Troubleshooting

Check configuration file ISS.ini

The Docker image should work as is. However, if ISS_ProtSci was installed manually, there are hardcoded paths in the configuration file ISS.ini, which should have been edited to correspond to your local setup: ISS_SCRIPT_HOME (where ISS_ProtSci is installed), DALI_HOME (where DaliLite is installed), and DALI_AFDB_DAT_UNCROPPED (location of Dali Digest data set of AlphaFold.v2 models imported to DaliLite).

The search step requires three external programs, which must be installed on your system: DaliLite, Foldseek, and a DSSP program called from within DaliLite (see note on DSSP versions on download page). The executables of DaliLite and Foldseek are defined in ISS.ini as DALI_EXE and FOLDSEEK_EXE.
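
A quick way to spot unedited paths is to read the configuration file and test whether each configured location exists. The sketch below is hypothetical and not part of ISS_ProtSci. It assumes that ISS.ini consists of plain KEY=VALUE lines with '#' comments, as in the excerpt shown in the Configuration section.

```python
import os

REQUIRED_KEYS = ("ISS_SCRIPT_HOME", "DALI_HOME",
                 "DALI_AFDB_DAT_UNCROPPED", "FOLDSEEK_EXE")

def read_ini(lines):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def missing_paths(settings, keys=REQUIRED_KEYS):
    """Return the keys that are unset or point to a non-existent path."""
    return [k for k in keys
            if k not in settings or not os.path.exists(settings[k])]

# In-memory example; the result of missing_paths depends on your file system.
example = """
# ISS installation directory
ISS_SCRIPT_HOME=/home/luholm/ISS_ProtSci-1/
DALI_NPARA=1
""".splitlines()
settings = read_ini(example)
print(settings["DALI_NPARA"])
print(missing_paths(settings))
```

For a real installation, open scripts/ISS.ini and pass the file handle to read_ini. Any key reported by missing_paths needs to be edited.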

Check connection to data server

The search requires a working Internet connection to access the remote data server. Test the connection:

python3 /path/to/ISS_ProtSci-1/scripts/ISS_remote_sql.py

Output should be similar to:

# default_config =  /data/liisa/ISS_ProtSci-1/scripts/ISS.ini
get_afdb2_cropaln ["1,26,56,62", "1,30,64,74", "25,30,6,94", 155]
afdb_repset ( 100 ) ['fbuzA', 'gyskA'] ... ['d9zaA', 'dpixA']
get_repset ['fbuzA', 'gyskA', 'c402A', 'fquwA', 'no7sA']
get_repset_straight_list ["fbuzA", "gyskA", "c402A", "fquwA", "no7sA"]
get_repset_straight_list_inf ["fbuzA", "gyskA", "c402A", "fquwA", "no7sA"]
annotate_afdb_resultlist[0]:
         fbuzA
         Q96N28
         PLD3A_HUMAN
         172
         PRELI domain containing protein 3A
         9606
         PRELID3A
         Homo sapiens
         MKIWSSEHVFGHPWDTVIQAAMRKYPNPMNPSVLGVDVLQRRVDGRGRLHSLRLLSTEWGLPSLVRAILGTSRTLTYIREHSVVDPVEKKMELCSTNITLTNLVSVNERLVYTPHPENPEMTVLTQEAIITVKGISLGSYLESLMANTISSNAKKGWAAIEWIIEHSESAVS
         LEEEEEEEEELLLHHHHHHHHHLLLLLLLLLLEEEEEEEEEEELLLLLEEEEEEEEEELLLLHHHHHHHLLLLLEEEEEEEEEEELLLLEEEEEEEELLLLLLEEEEEEEEEEELLLLLLLEEEEEEEEEEELLLLLHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHLLL
         1,26,56,62
         1,30,64,74
         25,30,6,94
         155
         1
shell_reps ( 1236 ) [['no7sA', 'no7sA', 0.0, 1, 476, 1, 476]] ... [['gyskA', 'iktvA', 9.237e-06, 3, 156, 219, 370]]
list_to_evalues [["fquwA", 1.282e-05], ["gyskA", 1.02e-05], ["fbuzA", 2.045e-29], ["c402A", 1.08e-05], ["no7sA", 2.025e-05]]
filter_list_for_targets ["fquwA", "gyskA", "fbuzA", "c402A", "no7sA"]

Check Foldseek

Pre-indexed databases used by Foldseek are included in the ISS_ProtSci distribution package. The location of the Foldseek executable is defined by FOLDSEEK_EXE in the ISS.ini file.

Check hmmer3

The annotation and analysis steps require hmmsearch of the hmmer3 package, and associated data files.

Check hmmsearch is in your path:

which hmmsearch

/usr/local/bin/hmmsearch

hmmsearch

Incorrect number of command line arguments.
Usage: hmmsearch [options] <hmmfile> <seqdb>

where most common options are:
  -h : show brief help on version and usage

To see more help on available options, do hmmsearch -h

Check that required data files are present. These are included in the ISS_ProtSci distribution package.

cd /path/to/ISS_ProtSci-1/pfam-data/

ls -s clan_pfam.tsv Pfam-A.hmm clan_hmmer_tc.tsv

  38900 clan_hmmer_tc.tsv      144 clan_pfam.tsv  1535348 Pfam-A.hmm

Check DaliLite components

The location of DaliLite is defined by DALI_HOME in the ISS.ini file. DaliLite can be tested by running the script test.csh.

The most common problem with DaliLite installations is an incompatible DSSP version (see section 2.3 of DaliLite web page).

Appendix A: Output format of {cd1}.AFDB2.tsv and {cd1}.AFDB2.pf.tsv

All *.tsv outputs contain tab-separated values with a header line. {cd1}.AFDB2.tsv includes columns 1-32 and {cd1}.AFDB2.pf.tsv includes columns 1-35. Blank rows indicate columns reserved for other use in the future.
Column	Heading	                Description
1       Rank                    Rank of match in descending order of Dali Z-scores
2       Sbjct                   Dali identifier of matched structure (sbjct) in AFDB2
3       Database                Query/AFDB. Flags special data for query protein which is needed for html output.
4       Z-score                 Dali Z-score. Higher values are better.
5       RMSD                    Root-mean-square deviation of superimposed CA atoms. It is not optimized by Dali. High values don't look good in 3D-viewer.
6       Ali-length              Number of structurally equivalent CA atoms in the Dali alignment.
7       
8       Seq-identity            Percentage of identical amino acids of structurally aligned residues.
9       Accession               Uniprot accession number of AlphaFold model
10      Protein-identifier      Uniprot protein identifier of AlphaFold model
11      Sbjct-length            Number of amino acids of sbjct sequence in Uniprot
12      Description             Short description of sbjct protein from Uniprot
13      Taxid                   NCBI taxonomy id of sbjct protein
14      Gene                    Sbjct protein's gene name from Uniprot
15      Species                 Sbjct protein's species from Uniprot
16      Query                   Query identifier
17      Query-length            Number of amino acids in query structure
18      Query-coverage          Ali-length divided by Query-length
19      Sbjct-coverage          Ali-length divided by Sbjct-length-cropped
20      
21      Foldseek-evalue         E-value of direct Foldseek search with max-seqs 50000
22      qstarts                 Start positions of ungapped aligned segments in query
23      sstarts                 Start positions of ungapped aligned segments in sbjct
24      lengths                 Lengths of ungapped aligned segments
25      rotation                3x3 rotation matrix U
26      translation             translation vector T. UX+T superimposes sbjct coordinates X onto the query structure.
27      Sbjct-sequence          Amino acid sequence of sbjct
28      Sbjct-dssp              DSSP assignments for sbjct. H = alpha helix, E = beta strand, L = coil
29      sequ-pileup             How sbjct-sequence appears in stacked alignment, ignoring insertions
30      DSSP-pileup             How sbjct-dssp appears in stacked alignment, ignoring insertions
31      dssp-order              Rank of match in seriated view, which maximizes similarity of secondary structure between adjacent rows
32      
33      pfam                    Pfam identifier of profile that gave the best evalue for sequ-pileup in hmmsearch against Pfam-A.hmm
34      hmmer_evalue            evalue from hmmsearch
35      clan                    Clan of protein family (pfam, col. 33) as assigned in Pfam-C. NA if unassigned.
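
The tables can be read in Python with the standard csv module. The sketch below is hypothetical and assumes that the header line of the file uses the names in the Heading column above. It recomputes Query-coverage from Ali-length and Query-length on made-up demo rows.

```python
import csv
import io

def read_matches(tsv_handle):
    """Yield one dict per data row, keyed by the header line."""
    yield from csv.DictReader(tsv_handle, delimiter="\t",
                              quoting=csv.QUOTE_NONE)

def query_coverage(row):
    """Ali-length divided by Query-length (column 18 of the table)."""
    return int(row["Ali-length"]) / int(row["Query-length"])

# Made-up rows with only the columns needed for this example.
demo = (
    "Sbjct\tZ-score\tAli-length\tQuery-length\n"
    "xxxxA\t30.1\t240\t300\n"
    "yyyyA\t9.5\t150\t300\n"
)
long_hits = [r["Sbjct"] for r in read_matches(io.StringIO(demo))
             if query_coverage(r) >= 0.6]
print(long_hits)   # ['xxxxA']
```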

Appendix B: Data server

The standalone Python script uses a remote data server in Helsinki, which is set up for searching the AFDB2 database using Dali identifiers as keys. The identifiers are the same as in the preprocessed AlphaFold database for DaliLite. If you should want to set up your own data server, this section explains the schema of the data tables. It is up to you to populate the tables with data, as you would be using your own keys. DaliLite requires that structure identifiers (used as keys) are five characters long, composed of a four-letter unique identifier and the chain identifier.
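
If you generate your own keys, it may help to check them against this rule before loading any data. The sketch below is hypothetical. It takes "letter" loosely as an alphanumeric character, because identifiers such as c402A and no7sA in this manual contain digits. That reading is an assumption.

```python
import re

# Assumption: four alphanumeric characters followed by one chain character.
DALI_ID = re.compile(r"[A-Za-z0-9]{4}[A-Za-z0-9]")

def is_dali_identifier(name):
    """Return True if name looks like a five-character Dali identifier."""
    return DALI_ID.fullmatch(name) is not None

for name in ("fbuzA", "c402A", "fbuz", "toolongA"):
    print(name, is_dali_identifier(name))
```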

The server runs scripts/ISSserver.py as a daemon process. To use the local data server, import ISS_sql.py (found in scripts/) instead of ISS_remote_sql.py in all ISS_* scripts.

The data server uses two sqlite3 databases. Their location is hardcoded in the select_action function of ISSserver.py:

p['sqlite3_database']='/data/liisa/ISS/ISS.db'
p['bigdb_database']='/data/liisa/alphafold/TESS/test.db'

Replace the paths according to your local configuration.

Data to set up a lightweight version of the Helsinki data server using Dali Digest keys is available:

  1. mini_afdb_afdb.tsv.gz (20 GB): Foldseek all-against-all comparison data used to look up neighbor shells (table foldseek_afdb_afdb), restricted to pairs with Foldseek e-values below 0.01. Propagation during the search only explores neighbor shells with e-values below 0.01, so this reduced table is sufficient for the shell script 1_search.csh, which takes a PDB formatted file as query. In this case, entry points are generated by running Foldseek on the query structure with the e-value threshold set to 1. Warning: the Python script ISS_closure.py also accepts Dali identifiers as query. Their entry points are looked up in the precomputed database using an e-value threshold of 1, but pairs with e-values between 0.01 and 1 are missing from the mini-table.
  2. afdb2_metadata.tsv.gz: data for table afdb2_metadata
  3. dbref.tsv.gz: data for table dbref
  4. dssp.tsv.gz: data for table afdb2_dssp
  5. sequ.tsv.gz: data for table afdb2_sequ
  6. species_taxid.tsv.gz: data for table species_taxid
(Open an sqlite3 session with the appropriate database, create tables as defined below, import data from the decompressed tsv files, and create indices.)

The standalone script expects the following table names with content as described.

Tables in bigdb_database

foldseek_afdb_to_afdb

This table contains the all-against-all comparison data used to look up neighbor shells. We used Foldseek for AFDB2, but other programs could be used.

field    type               comment
cd1      char(5), indexed   Dali structure identifier of 'query' structure
cd2      char(5)            Dali structure identifier of 'sbjct' structure
qfrom    integer            start of 'query' alignment
qto      integer            end of 'query' alignment
sfrom    integer            start of 'sbjct' alignment
sto      integer            end of 'sbjct' alignment
evalue   float              a score used to rank neighbors; smaller scores have higher priority
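
To illustrate how such a neighbor table supports a transitive closure search, here is a simplified breadth-first sketch. It is not the ISS_closure.py implementation. The function and argument names are hypothetical, the neighbor table is replaced by an in-memory dict with made-up identifiers, and the rule that rejected structures are not expanded further is an assumption. The default cutoff of 0.01 is the propagation threshold mentioned above.

```python
from collections import deque

def transitive_closure(entry_points, neighbors, is_similar, max_evalue=0.01):
    """Breadth-first exploration of neighbor shells.

    entry_points: identifiers that seed the search (first shell).
    neighbors:    dict mapping cd1 -> list of (cd2, evalue) pairs,
                  standing in for the all-against-all table.
    is_similar:   callable deciding whether a structure is accepted
                  (in ISS_ProtSci this role is played by Dali).
    """
    accepted = []
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        cd1 = queue.popleft()
        if not is_similar(cd1):
            continue  # assumption: rejected structures are not expanded
        accepted.append(cd1)
        for cd2, evalue in neighbors.get(cd1, []):
            if evalue < max_evalue and cd2 not in seen:
                seen.add(cd2)
                queue.append(cd2)
    return accepted

# Made-up neighbor lists and acceptance rule, for illustration only.
demo_neighbors = {
    "aaaaA": [("bbbbA", 1e-5), ("ccccA", 0.5)],
    "bbbbA": [("ddddA", 1e-3)],
    "ddddA": [("eeeeA", 1e-4)],
}
hits = transitive_closure(["aaaaA"], demo_neighbors,
                          is_similar=lambda cd: cd != "ddddA")
print(hits)   # ['aaaaA', 'bbbbA']
```

In the demo, ccccA is never visited because its e-value exceeds the cutoff, and eeeeA is never reached because its only link passes through the rejected ddddA.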

The all-against-all data was generated using the Foldseek command:

foldseek easy-search pdb-directory afdb2 result.m8 tmp -e 1 --max-seqs 50000

The AFDB2 run took 5 days with 40 CPUs.

Tables in sqlite3_database

This database is accessed by the function ISS_annotate.annotate_afdb_resultlist(), and the data are added as annotations to the 'sbjct' structure.

dbref

AlphaFold Database version 2 models are generated from Uniprot sequences. This table maps Dali identifiers to Uniprot identifiers.

field    type                   comment
cd1      char(5), primary key   Dali identifier
nres     integer                length of the protein
acc      varchar(20)            Uniprot's accession number
pid      varchar(20)            Uniprot's protein identifier
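
As an example of populating one of these tables, the sketch below creates dbref with Python's sqlite3 module and loads tab-separated rows. It is a hypothetical helper, not part of ISS_ProtSci. It assumes that the TSV columns are in schema order (cd1, nres, acc, pid) and that the file has no header line. The demo row uses values from the Troubleshooting output.

```python
import csv
import io
import sqlite3

def load_dbref(conn, tsv_handle):
    """Create table dbref and fill it from tab-separated rows.

    Assumption: the TSV columns are in schema order (cd1, nres, acc, pid)
    and there is no header line.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dbref ("
        "cd1 CHAR(5) PRIMARY KEY, nres INTEGER, "
        "acc VARCHAR(20), pid VARCHAR(20))"
    )
    rows = csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.executemany("INSERT INTO dbref VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")   # use a file path for a real database
load_dbref(conn, io.StringIO("fbuzA\t172\tQ96N28\tPLD3A_HUMAN\n"))
print(conn.execute("SELECT acc FROM dbref WHERE cd1 = ?",
                   ("fbuzA",)).fetchone())   # ('Q96N28',)
```

The other tables can be loaded the same way by adapting the CREATE TABLE statement and the number of placeholders to their schemas.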

afdb2_sequ

This table is used to generate stacked alignments of 'sbjct' proteins against the 'query' protein. It was generated using data from DaliLite's internal protein structure representation (DAT files).

field      type                   comment
cd1        char(5), primary key   Dali identifier
sequence   blob                   amino acid sequence

afdb2_dssp

This table is used to generate stacked alignments of 'sbjct' proteins against the 'query' protein. It was generated using data from DaliLite's internal protein structure representation (DAT files).

field        type                   comment
cd1          char(5), primary key   Dali identifier
dsspstring   blob                   sequence of three-state assignments of secondary structure

afdb2_metadata

This table contains tags parsed from the header line of UniprotKB's sequence files in Fasta format.

field    type                   comment
cd1      char(5), primary key   Dali identifier
acc      varchar(20)            Uniprot's unique accession number
pid      varchar(20)            Uniprot's protein identifier
desc     blob                   Uniprot's short description of the protein
taxid    integer                NCBI taxonomy identifier
gene     varchar(20)            Uniprot's gene symbol

species_taxid

field     type                   comment
species   blob                   scientific name
taxid     integer, primary key   NCBI taxonomy identifier