User manual for ISS_ProtSci

ISS_ProtSci [1] is a Python package designed for structural similarity search in the AlphaFold Database v2. It uses DaliLite to identify database structures that are geometrically similar to a given Query structure. The search utilizes a transitive closure algorithm to explore neighboring shells based on a precomputed all-against-all comparison, which was generated using Foldseek for its speed, despite some imprecision. The results are annotated by integrating metadata from the UniprotKB sequence database and the Pfam protein family classification, with protein domains identified using hmmsearch. A worked example explains how to do search, annotation and analysis with ISS_ProtSci.

A tutorial is available.

Citation:

1. Liu et al. (2025) 3-D substructure search by transitive closure in AlphaFold Database. Protein Science 34:e70169

Installation via Docker
Manual installation
	System requirements
	Downloads
	Configuration
Worked example
	Easy workflow using shell scripts
	Running the steps manually
	Outputs of analysis script
	Plots
	Superimposed coordinates
Benchmark results
Troubleshooting
Appendix A: Output format of {cd1}.AFDB2.tsv and {cd1}.AFDB2.pf.tsv
Appendix B: Data server

Installation via Docker

This guide will help you set up and run the hao1999/iss_protsci Docker image, which contains the ISS_ProtSci. This environment provides all necessary dependencies and tools bundled within the Docker container for seamless operation.

Prerequisites

Ensure Docker is installed on your system. Docker provides a platform to run the ISS_ProtSci in an isolated environment with all its dependencies. You can download and install Docker from the official Docker website.

Instructions

1.Pull the Docker Image

docker pull hao1999/iss_protsci

2.Run the Container and download necessary databases

# Create and start a container named iss_protsci, mounting a Docker volume for data persistence. During the initial run, the container will download necessary databases. This includes downloading the AFDB2 database for DaliLite (9.4GB compressed).

docker run -it --name iss_protsci -v iss_data:/data hao1999/iss_protsci:latest

3.Activate the Conda Environment

# Inside the container, activate the iss_env Conda environment using Miniconda.

source /opt/miniconda/bin/activate iss_env

4. Run a Test Case

# Execute the test script.(hcase_1: output directory name.)

/root/ISS_ProtSci-1/0_test.csh -o hcase_1

Note on Parallel Processing

The software uses Dali for parallel computations. You can adjust the number of parallel processes based on your CPU resources by modifying the DALI_NPAR value in the ISS.ini configuration file located at /root/ISS_ProtSci-1/scripts/ISS.ini. The default value is 8.

Verifying the Results

The test script reports success or failure. In the case of failure, run troubleshooting checks or run steps manually.

Manual Installation

If you are unable to use Docker, you can follow the steps below to manually set up the environment. Please be aware that setting up the environment manually requires attention to the versions of the dependencies, and this process can be time-consuming.

System Requirements

Internet connection
Linux operating system
DaliLite.v5
Foldseek executable
hmmsearch executable
Python 3.8+ (Python 3.6 should work but is not thoroughly tested)
Python modules: numpy, scipy, argparse, configparser, heapdict, bitstring
Perl
At least 36 GB disk space

Notes on DaliLite Installation

BLAST is not required for ISS_ProtSci as it only performs pairwise Dali comparisons.
Dali uses the third-party program dsspcmbi, which is included in the DaliLite package but may not compile correctly with modern C compilers. It can be replaced by mkdssp as insructed on the DaliLite download page.
Pay attention to version compatibility: DaliLite has been tested on Red Hat Linux with gcc 4.8.5 and mpirun (Open MPI) 1.10.7.
Use DaliLite.v5 version updated 11 April 2025 or later.

Downloads


# Download scripts, Foldseek and Pfam data, and benchmark results (compressed size: 3 GB)
wget http://ekhidna2.biocenter.helsinki.fi/ISS_ProtSci/ISS_ProtSci.tar.gz
tar -zxvf ISS_ProtSci.tar.gz

# Navigate to the installation directory
cd ISS_ProtSci-1

# Check installation
python3 scripts/ISS_closure.py

# Expected output:
# default_config =  /home/luholm/ISS_ProtSci-1/scripts/ISS.ini
# class Query: None None None
# dowo, dofs False True

# Download AFDB2 database for DaliLite
wget http://ekhidna2.biocenter.helsinki.fi/dali/AF-Digest.tar.gz
tar -zxvf AF-Digest.tar.gz

Site-specific configuration is necessary.

Configuration

Edit scripts/ISS.ini. Find the following lines and update the paths according to your local installation:


# ISS installation directory
ISS_SCRIPT_HOME=/home/luholm/ISS_ProtSci-1/

# Dali installation directory
DALI_HOME=/home/luholm/DaliLite.v5/

# Location of AFDB2 database for DaliLite
DALI_AFDB_DAT_UNCROPPED=/data/liisa/alphafold/DAT/

# Foldseek executable
FOLDSEEK_EXE=/usr/local/bin/foldseek

Configuring Dali Execution

If you have OpenMPI installed, Dali will run in parallel by default. If not, configure Dali to run serially by modifying the following setting:


# Run Dali in parallel
DALI_NPARA=40

Change to:


DALI_NPARA=1

Worked example

Easy workflow using shell scripts

Linux shell scripts are provided which (1) do database search, (2) add metadata to the results, (3) analyze the results using Pfam clans as reference.

To run the shell scripts, we here set a shorthand environment variable ISShome:

ISShome=~/ISS_ProtSci-1/

ISShome is where the distribution package was installed.

Here, we assume that user inputs are set as environment variables (replace values as appropriate):

pdbfile=$ISShome/testset/pdbmr59.ent.gz cd1=testA minlali=180 zcut=2.0 clan=CL0192

Setting user-defined values for minlali and zcut is optional, as the scripts have reasonable defaults.

pdbfile is the input structure (Query). The file name must be less than 80 charactes long. This is hard-coded in DaliLite.
cd1 must be four characters plus a PDB chain identifier.
minlali is the minimum size of the common core, i.e., how many residues must be structurally equivalent with the query structure. The default is equal to the number of residues assigned to helix or sheet by DSSP. minlali affects the depth of the search.
zcut is another parameter affecting the depth of the search. The default is 2.0, the lowest value reported by Dali.
clan is a Pfam identifier that will be used as positive control in the analysis step. Pfam families (PFddddd) and Pfam clans (CLdddd) are accepted.

Before you start running the shell scripts, prepare your workspace, creating and navigating to a working directory:

mkdir -p mywork cd mywork

The shell scripts can be executed as follows (using the environment variables set above):

$ISShome/scripts/1_search.csh -pdbfile $pdbfile -cd1 $cd1 -minlali $minlali -zcut $zcut > $cd1.log $ISShome/scripts/2_annotate.csh -pdbfile $pdbfile -cd1 $cd1 > $cd1.fragment.stat $ISShome/scripts/3_analyze.csh -clan $clan -cd1 $cd1 -minlali $minlali -zcut $zcut > $cd1.stat

The -minlali and -zcut arguments are optional.
1_search.csh creates the output file $cd1.AFDB2.dali.tsv, which contains the structural alignments by Dali.
2_annotate.csh creates a number of output files:
1. $cd1.AFDB2.tsv adds sequence metadata to the structural alignment data
2. $cd1.AFDB2.seriated.tsv reorders the data to maximize the similarity of secondary structure strings between adjacent rows.
3. $cd1.AFDB2.pf.tsv further adds Pfam annotations.
4. $cd1.fsdirect.tsv are the matches in the first shell by Foldseek. These are used for comparison in the analysis step.
5. $cd1.fsdirect.dali.tsv contains the structural alignments by Dali of the direct Foldseek matches. These are used for comparison in the analysis step.
At the end, this shell script outputs counts of matches found to different Pfam families and clans. Here, the Pfam family assignments annotations are based on hmmsearch of the common core fragments (aligned to the Query in stacked alignment, ignoring insertions), using the trusted cutoff and selecting the Pfam profile that gave the best e-value. The Pfam-A.hmm profile library is used. These counts can be useful in choosing a clan for the analysis step.
3_analyze.csh evaluates the effectiveness of transitive closure in identifying superfamilies. Pfam clans are used as the ground truth, but you need to select a specific clan to use as the reference for the Query. Pfam clans are expert-curated sets of Pfam families that share similar folds and are believed to be evolutionarily related, though multiple sequence profile models are required to represent the diversity among member families. The script outputs data on the intersections within AFDB2 between clan members identified by hmmsearch (using the trusted cutoff), proteins that Dali finds similar to the Query structure (Z > 2), and matches detected by transitive closure or Foldseek. True positive (TP) counts for clan membership are based on the full protein sequence and hmmsearch (using the trusted cutoff) and stored in a pre-computed table (clan_pfam.tsv). This method often results in more Pfam assignments than those found by hmmsearch with sequence fragments in $cd1.AFDB2.pf.tsv.

Precomputed outputs for our test case can be found in $ISShome/testset/case_8/.

Running the steps manually

There is usually no reason to run each step manually. You may want to do this, if one of the shell scripts unexpectedly crashes. Check which output files are missing and continue from the broken step.

Here, we assume that user inputs are set as environment variables (replace values as appropriate):

pdbfile=$ISShome/testset/pdbmr59.ent.gz
cd1=testA
minlali=180
zcut=2.0
clan=CL0192

Step 1: Transitive closure

python3 $ISShome/scripts/ISS_closure.py --pdbfile $pdbfile --pdbid $pdbid --target AFDB2 --minlali $minlali

This creates $cd1.AFDB2.dali.tsv

Step 2.1: Annotate matched proteins with Uniprot accession, taxonomy id, gene name, species, description, amino acid and DSSP (3-state secondary structure) sequences

python3 $ISShome/scripts/ISS_annotate.py --cd1 $cd1 --pdbfile $pdbfile --target AFDB2

This creates $cd1.AFDB2.tsv and $cd1.AFDB2.seriated.tsv

Step 2.2: Assign common core fragments to Pfam families

hmmlib=$ISShome/pfam-data/Pfam-A.hmm CLAN_TABLE=$ISShome/pfam-data/clan_pfam.tsv tmpfasta=$$.fasta tmpout1=$$.out1 tmpout2=$$.out2 grep AFDB $cd1.AFDB2.tsv | cut -f 2,29 | perl -pe 's/^/>/' | perl -pe 's/\t/\n/' > $tmpfasta hmmsearch -o $tmpout1 --acc --noali --notextw --cut_tc $hmmlib $tmpfasta perl $ISShome/scripts/parse_hmmer.pl < $tmpout1 | sort -gk3 | perl $ISShome/scripts/pickfirst.pl > $tmpout2 $ISShome/scripts/tbljoin -l $cd1.AFDB2.tsv $tmpout2 > $tmpout1 $ISShome/scripts/tbljoin -l $tmpout1 $CLAN_TABLE > $cd1.AFDB2.pf.tsv rm -f $tmpfasta $tmpout1 $tmpout2

This creates $cd1.AFDB2.pf.tsv

Step 2.3: Count matched common core fragments per Pfam family

cut -f 33,35 $cd1.AFDB2.pf.tsv | sort | uniq -c | sort  -n

Step 2.4: Do direct Foldseek search for comparison


fs1=$cd1.fsdirect.tsv
fs2=$cd1.fsdirect.dali.tsv
t1=$$.1
# Foldseek run with --max-seqs 50000
python3 $ISShome/scripts/ISS_foldseek.py --pdbfile $pdbfile > $fs1
# validate Foldseek 1st shell (e<1)  by Dali
gawk ' $3 < 1 ' $cd1.fsdirect.tsv | cut -f 2 > $t1
# --pdbfile signals that it is a local file
python3 $ISShome/scripts/ISS_rundali.py --cd1 $cd1 --pdbfile dummy --target $t1 --tsvfile $fs2
# clean up
rm -f $t1

This creates $cd1.fsdirect.tsv (results of Foldseek search) and $cd1.fsdirect.dali.tsv (Dali results on direct Foldseek hits). These files are used for comparing the performance of Foldseek to the transitive closure search.

Step 3.1: Filter clan members that are structurally similar to Query (sensu Dali)

tmplist=$$.list grep $clan clan_hmmer_tc.tsv | cut -f 2 | sort | uniq > $tmplist ## --pdbfile signals that it is local python3 $ISShome/scripts/ISS_rundali.py --cd1 $cd1 --pdbfile dummy --target $tmplist --tsvfile $cd1.$clan.TRUE.dali.tsv rm -f $tmplist

This creates $cd1.clan.TRUE.dali.tsv

Step 3.2: Output statistics

When all input files are in existence, this step is best done calling the shell script:

$ISShome/scripts/3_analyze.csh -clan $clan -cd1 $cd1 -minlali $minlali

Outputs of analysis script

The output is written to a file named $cd1.stat. The command

grep EVAL $cd1.stat

gives a summary table of (TRUE_0, TP_0, P_0, TRUE_1, TP_1, P_1, TRUE_2, TP_2, P_2), where suffix _0 refers to Pfam assignments based on hmmsearch with trusted cutoff, sets _1 are filtered by geometrical siilarity (Dali Z-score > 2), and sets _2 are further filtered by common core size. TRUE is the count of AFDB2 proteins in the Pfam family, TP the count of true positives, and P the count of positives. Each query structure has one row for transitive closure, and several rows for Foldseek at varying e-value cutoffs. Our example gives the following data:

        method          TRUE_0  TP_0   P_0     TRUE_1  TP_1   P_1     TRUE_2  TP_2    P_2
EVAL    closure         10300      0   35165   9339    9295   11318   7544    7543    8091
EVAL    foldseek(10)    10300   5673    8969   9339    3412    3700   7544    2846    3009
EVAL    foldseek(1)     10300   3597    3946   9339    3412    3700   7544    2846    3009
EVAL    foldseek(0.1)   10300    661     684   9339     634     654   7544     543     557
EVAL    foldseek(0.01)  10300     66      67   9339      66      67   7544      63      64
EVAL    foldseek(0.001) 10300     59      60   9339      59      60   7544      58      59

TP_0 (no filtering) of transitive closure is undefined, because it only reports Dali-similar matches. Transitive closure clearly reports more true positives than Foldseek. The numbers in the table enable the caclucation of precision (pr=TP/P), recall (rc=TP/TRUE) and their harmonic mean, the F1-score (F1=2*pr*rc/(pr+rc). The F1-score can be used to compare the performance of binary classifiers. In our example, transitive closure achieves an F1-score of 0.96, which is clearly above Foldseek's 0.59.

Counting protein matches can be biased to large families. The following fields report how many member families of the clan have at least one match (transitive closure search reached all but one family):

Pfam families in clan CL0192: 50
Pfam families in hmmer+dali: 49
Pfam families in closure: 49

Finally, there are Z-score histograms that allow the comparison of transitive closure and Foldseek. Only matches that pass the filter on common core size are included. Both methods find the same proteins with the highest Z-scores, but Foldseek starts dropping geometrically good matches below Z-scores of 18, which implies very strong structural resemblance.

Below, the histograms for our example case are pasted next to each other

Z-score   closure_count   closure_cum_count       foldseek_count   foldseek_cum_count
42      1       1       1       1
36      4       5       4       5
35      4       9       4       9
34      2       11      2       11
33      3       14      3       14
31      2       16      2       16
30      15      31      15      31
29      8       39      8       39
28      7       46      7       46
27      6       52      6       52
26      2       54      2       54
25      1       55      1       55
24      3       58      3       58
23      1       59      1       59
17      2       61      0       59
16      38      99      10      69
15      43      142     8       77
14      69      211     20      97
13      116     327     25      122
12      297     624     68      190
11      799     1423    218     408
10      1800    3223    658     1066
9       2835    6058    1244    2310
8       1743    7801    621     2931
7       233     8034    71      3002
6       47      8081    6       3008
5       8       8089    1       3009
4       1       8090    0       3009
3       1       8091    0       3009

Plots

Histograms of Z-scores

The histograms are found at the end of $cd1.stat. In Linux, the data blocks can be copy-pasted to files and merged:

gawk -v minlali=$minlali ' $5 > minlali ' $cd1.$target.dali.tsv | cut -f 3 | python3 $ISShome/scripts/histcol.py > x gawk -v minlali=$minlali ' $5 > minlali ' $cd1.fsdirect.dali.tsv | cut -f 3 | python3 $ISShome/scripts/histcol.py > y # convert to tsv, select columns to plot paste x y| perl -pe 's/ +/\t/g' | cut -f 1,2,6

The first column is the Z-score bin, the second is counts for transitive closure search, the third is counts for direct Foldseek search. Counts are of matched proteins with at least $minlen structurally equivalent positions with the Query structure. The difference between the closure and Foldseek curves in a scatterplot shows the gain by transitive search.

Scatterplot of alignment length vs. Dali Z-score

This Linux command extracts the Z-score, alignment length and Pfam labels:

cut -f 4,6,33,35 $cd1.AFDB2.pf.tsv

Use the labels to group the data into data series representing different Pfam families or clans.

Stacked alignments

Stacked alignments can be extracted from $cd1.AFDB2.tsv (or $cd1.AFDB2.pf.tsv) and displayed as text or converted to Fasta format for display in a multiplse sequence alignent viewer. Consult Appendix A for filtering options to select subsets of data.

# Secondary structure strings in stacked alignment gawk ' $2 == "AFDB2" ' $cd1.AFDB2.tsv | cut -f 30 # The same as above filtered on alignment length gawk -v minlali=$minlali ' $2 == "AFDB2" && $6 > $minlali ' $cd1.AFDB2.tsv | cut -f 30 # Seriated data looks nicer gawk ' $2 == "AFDB2" ' $cd1.AFDB2.seriated.tsv | cut -f 30 # Fasta formatted sequences in stacked alignment gawk ' $2 == "AFDB2" ' $cd1.AFDB2.tsv | cut -f 2,31 | perl -pe 's/^/>/' | perl -pe 's/^t/\n/'

Superimposed coordinates

Running the transitive closure program only requires a locally installed Dali Digest of AFDB2, not the original PDB files in the AlphaFold Database. If you want to work with a AFDB2 model found as a match, you can do the following steps:

Download a target model from AlphaFold Database: get the Uniprot accession number from column 9 of $cd1.AFDB2.tsv, go to https://www.uniprot.org/uniprotkb/, paste the accession number in the search box, navigate to Structure section of the entry page, click on AlphaFold download icon. This gives you a PDB formatted file.
Get the rotation matrix U and translation vector T from columns 25 and 26 of $cd1.AFDB2.tsv, respectively.
Apply the transformation UX+T to target model coordinates X. This superimposes the target model onto the Query structure in a least squares fit of the structurally equivalent CA atoms. In our example, let's say n5nvA is an interesting match:
cd1=testA cd2=n5nvA # replace file name by your downloaded file afdbfile=$ISShome/n5nv.pdb gawk -v cd2=$cd2 ' $2 == $cd2 ' $cd1.AFDB2.tsv | cut -f 25,26 | perl $ISShome/scripts/ISS_applymatrix.pl $afdbfile > $cd2.transformed.pdb

Now you can load the Query structure and $cd2.transformed.pdb in your favourite molecular graphics viewer. In our example, column 5 in $cd1.AFDB2.tsv tells that the Rmsd is 1.8 A, which will look nice. Note that Dali does not optimize Rmsd, and some structural alignments yield high Rmsd values (rmsd >5 A definitely look ugly in superimposition).

Benchmark results

Results for twelve test cases are found under $ISShome/testset/. The results in subdirectories $ISShome/testset/case_*/ were generated using the script run_testset.csh. There is no need to rerun this script, as benchmark results are pre-computed.

For each case, lines of the file testA.stat starting with "EVAL" give a summary of (TRUE_0, TP_0, P_0, TRUE_1, TP_1, P_1, TRUE_2, TP_2, P_2), where suffix _0 refers to Pfam assignments based on hmmsearch with trusted cutoff, sets _1 are filtered by geometrical siilarity (Dali Z-score > 2), and sets _2 are further filtered by common core size. TRUE means ground truth, TP means true positive and P means positive. Each query structure has one row for transitive closure, and several rows for Foldseek at varying e-value cutoffs.

Troubleshooting

Check configuration file ISS.ini

The Docker image should work as is. However, if ISS_ProtSci was installed manually, there are hardcoded paths in the configuration file ISS.ini, which should have been edited to correspond to your local setup: ISS_SCRIPT_HOME (where ISS_ProtSci is installed), DALI_HOME (where DaliLite is installed), and DALI_AFDB_DAT_UNCROPPED (location of Dali Digest data set of AlphaFold.v2 models imported to DaliLite).

The search step requires three external programs, which must be installed on your system: DaliLite, Foldseek, and a DSSP program called from within DaliLite (see note on DSSP versions on download page). The executables of DaliLite and Foldseek are defined in ISS.ini as DALI_EXE and FOLDSEEK_EXE.

Check connection to data server

The search requires a working Internet connection to access the remote data server. Test the connection:

python3 /path/to/ISS_ProtSci-1/scripts/ISS_remote_sql.py

Output should be similar to:

# default_config =  /data/liisa/ISS_ProtSci-1/scripts/ISS.ini
get_afdb2_cropaln ["1,26,56,62", "1,30,64,74", "25,30,6,94", 155]
afdb_repset ( 100 ) ['fbuzA', 'gyskA'] ... ['d9zaA', 'dpixA']
get_repset ['fbuzA', 'gyskA', 'c402A', 'fquwA', 'no7sA']
get_repset_straight_list ["fbuzA", "gyskA", "c402A", "fquwA", "no7sA"]
get_repset_straight_list_inf ["fbuzA", "gyskA", "c402A", "fquwA", "no7sA"]
annotate_afdb_resultlist[0]:
         fbuzA
         Q96N28
         PLD3A_HUMAN
         172
         PRELI domain containing protein 3A
         9606
         PRELID3A
         Homo sapiens
         MKIWSSEHVFGHPWDTVIQAAMRKYPNPMNPSVLGVDVLQRRVDGRGRLHSLRLLSTEWGLPSLVRAILGTSRTLTYIREHSVVDPVEKKMELCSTNITLTNLVSVNERLVYTPHPENPEMTVLTQEAIITVKGISLGSYLESLMANTISSNAKKGWAAIEWIIEHSESAVS
         LEEEEEEEEELLLHHHHHHHHHLLLLLLLLLLEEEEEEEEEEELLLLLEEEEEEEEEELLLLHHHHHHHLLLLLEEEEEEEEEEELLLLEEEEEEEELLLLLLEEEEEEEEEEELLLLLLLEEEEEEEEEEELLLLLHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHLLL
         1,26,56,62
         1,30,64,74
         25,30,6,94
         155
         1
shell_reps ( 1236 ) [['no7sA', 'no7sA', 0.0, 1, 476, 1, 476]] ... [['gyskA', 'iktvA', 9.237e-06, 3, 156, 219, 370]]
list_to_evalues [["fquwA", 1.282e-05], ["gyskA", 1.02e-05], ["fbuzA", 2.045e-29], ["c402A", 1.08e-05], ["no7sA", 2.025e-05]]
filter_list_for_targets ["fquwA", "gyskA", "fbuzA", "c402A", "no7sA"]

Check Foldseek

Pre-indexed databases used by Foldseek are included in the ISS_ProtSci distribution package. The location of the Foldseek executable is defined by FOLDSEEK_EXE in the ISS.ini file.

Check hmmer3

The annotation and analysis steps require hmmsearch of the hmmer3 package, and associated data files.

Check hmmsearch is in your path:

which hmmsearch

/usr/local/bin/hmmsearch

hmmsearch

Incorrect number of command line arguments.
Usage: hmmsearch [options]  

where most common options are:
  -h : show brief help on version and usage

To see more help on available options, do hmmsearch -h

Check that required data files are present. These are included in the ISS_ProtSci distribution package.

cd /path/to/ISS_ProtSci-1/pfam-data/

ls -s clan_pfam.tsv Pfam-A.hmm clan_hmmer_tc.tsv

  38900 clan_hmmer_tc.tsv      144 clan_pfam.tsv  1535348 Pfam-A.hmm

Check DaliLite components

The location of DaliLite is defined by DALI_HOME in the ISS.ini file. DaliLite can be tested by running the script test.csh.

The most common problem with DaliLite installations is an incompatible DSSP version (see section 2.3 of DaliLite web page).

Appendix A: Output format of {cd1}.AFDB2.tsv and {cd1}.AFDB2.pf.tsv

All *.tsv outputs contain tab-separated values with a header line. {cd1}.AFDB2.tsv includes columns 1-32 and {cd1}.AFDB2.pf.tsv includes columns 1-35. Blank rows indicate columns reserved for other use in the future.

Column Heading Description
1 Rank Rank of match in descending order of Dali Z-scores
2 Sbjct Dali identifier of matched structure (sbjct) in AFDB2
3 Database Query/AFDB. Flags special data for query protein which is needed for html output.
4 Z-score Dali Z-score. Higher values are better.
5 RMSD Root-mean-square deviation of superimposed CA atoms. It is not optimized by Dali. High values don't look good in 3D-viewer.
6 Ali-length Number of structurally equivalent CA atoms in the Dali alignment.
7
8 Seq-identity Percentage of identical amino acids of structurally aligned residues.
9 Accession Uniprot accession number of AlphaFold model
10 Protein-identifier Uniprot protein identifier of AlphaFold model
11 Sbjct-length Number of amino acids of sbjct sequence in Uniprot
12 Description Short description of sbjct protein from Uniprot
13 Taxid NCBI taxonomy id of sbjct protein
14 Gene Sbjct protein's gene name from Uniprot
15 Species Sbjct protein's species from Uniprot
16 Query Query identifier
17 Query-length Number of amino acids in query structure
18 Query-coverage Ali-length divided by Query-length
19 Sbjct-coverage Ali-length divided by Sbjct-length-cropped
20
21 Foldseek-evalue E-value of direct Foldseek search with max-seqs 50000
22 qstarts Start positions of ungapped aligned segments in query
23 sstarts Start positions of ungapped aligned segments in sbjct
24 lengths Lengths of ungapped aligned segments
25 rotation 3x3 translation matrix U
26 translation translation vector T. UX+T superimposes sbjct coordinates X onto the query structure.
27 Sbjct-sequence Amino acid sequence of sbjct
28 Sbjct-dssp DSSP assignments for sbjct. H = alpha helix, E = beta strand, L = coil
29 sequ-pileup How sbjct-sequence appears in stacked alignment, ignoring insertions
30 DSSP-pileup How sbjct-dssp appears in stacked alignment, ignoring insertions
31 dssp-order Rank of match in seriated view, which maximizes similarity of secondary structure between adjacent rows
32
33 pfam Pfam identifier of profile that gave the best evalue for sequ-pileup in hmmsearch against Pfam-A.hmm
34 hmmer_evalue evalue from hmmsearch
35 clan Clan of protein family (pfam, col. 33) as assigned in Pfam-C. NA if unassigned.

Appendix B: Data server

The standalone Python script uses a remote data server in Helsinki, which is set up for searching the AFDB2 database using Dali identifiers as keys. The identifiers are the same as in the preprocessed AlphaFold database for DaliLite. If you should want to set up your own data server, this section explains the schema of the data tables. It is up to you to populate the tables with data, as you would be using your own keys. DaliLite requires that structure identifiers (used as keys) are five characters long, composed of a four-letter unique identifier and the chain identifier.

The server runs scripts/ISSserver.py as a daemon process. To use the local data server, import ISS_sql.py (found in scripts/) instead of ISS_remote_sql.py in all ISS_* scripts.

The data server uses two sqlite3 databases. Their location is hardcoded in the select_action function of ISSserver.py:

p['sqlite3_database']='/data/liisa/ISS/ISS.db'
p['bigdb_database']='/data/liisa/alphafold/TESS/test.db'

Replace the paths according to your local configuration.

Data to set up a lightweigt version of the Helsinki data server using Dali Digest keys is available:

mini_afdb_afdb.tsv.gz (20 GB): Foldseek all-against-all comparison data used to look up neighbor shells (table foldseek_afdb_afdb), restricted to pairs with Foldseek e-values below 0.01. Propagation during the search explores neighbor shells with e-values less than 0.01. Thus, this reduced data table can be used with the shell script 1_search.csh which uses PDB formatted files as query - in this case, entry points are generated by running Foldseek with the query structure with the e-value threshold set to 1. Warning: The Python script ISS_closure.py also accepts Dali-identifiers as query, but their entry points are looked up in the precomputed database using an e-value threshold of 1, but data with e-values between 0.01 and 1 are missing from the mini-table.
afdb2_metadata.tsv.gz: data for table afdb2_metadata
dbref.tsv.gz: data for table dbref
dssp.tsv.gz: data for table afdb2_dssp
sequ.tsv.gz: data for table afdb2_sequ
species_taxid.tsv.gz: data for table species_taxid

(Open an sqlite3 session with appropriate database, create tables as defined below,import data from the decompressed tsv files, create indices. )

The standalone script expects the following table names with content as described.

Tables in bigdb_database

foldseek_afdb_to_afdb

This table contains the all-against-all comparison data used to look up neighbor shells. We used Foldseek for AFDB2, but other programs could be used.

field type comment

cd1 char(5), indexed Dali structure identifier of 'query' structure

cd2 char(5) Dali structure identifier of 'sbjct' structure

qfrom integer start of 'query' alignment

qto integer end of 'query' alignment

sfrom integer start of 'sbjct' alignment

sto integer end of 'sbjct' alignment

evalue float a score used to rank neighbors; smaller scores have higher priority

field	type	comment
cd1	char(5), indexed	Dali structure identifier of 'query' structure
cd2	char(5)	Dali structure identifier of 'sbjct' structure
qfrom	integer	start of 'query' alignment
qto	integer	end of 'query' alignment
sfrom	integer	start of 'sbjct' alignment
sto	integer	end of 'sbjct' alignment
evalue	float	a score used to rank neighbors; smaller scores have higher priority

The all-against-all data was generated using Foldseek command:

foldseek easy-search pdb-directory afdb2 result.m8 tmp -e 1 --max-seqs 50000

The AFDB2 run took 5 days with 40 CPUs.

Tables in sqlite3_database

This database is accessed by the function ISS_annotate.annotate_afdb_resultlist() and the data are added as annotations to the 'sbjct' structure

dbref

AlphaFold Database version 2 models are generated from Uniprot sequences. This table maps Dali identifiers to Uniprot identifiers.

field type comment

cd1 char(5) primary key Dali identifier

nres integer length of the protein

acc varchar(20) Uniprot's accession number

pid varchar(20) Uniprot's protein identifier

field	type	comment
cd1	char(5) primary key	Dali identifier
nres	integer	length of the protein
acc	varchar(20)	Uniprot's accession number
pid	varchar(20)	Uniprot's protein identifier

afdb2_sequ

This table is used to generate stacked alignments of 'sbjct' proteins against the 'query' protein. It was generated using data from DaliLite's internal protein structure representation (DAT files).

field type comment

cd1 char(5), primary key Dali identifier

sequence blob amino acid sequence

field	type	comment
cd1	char(5), primary key	Dali identifier
sequence	blob	amino acid sequence

afdb2_dssp

This table is used to generate stacked alignments of 'sbjct' proteins against the 'query' protein. It was generated using data from DaliLite's internal protein structure representation (DAT files).

field type comment

cd1 char(5), primary key Dali identifier

dsspstring blob sequence of three-state assignments of secondary structure

field	type	comment
cd1	char(5), primary key	Dali identifier
dsspstring	blob	sequence of three-state assignments of secondary structure

afdb2_metadata

This table parsed tags on the header line of UniprotKB's sequence files in Fasta format.

field type comment

cd1 char(5), primary key Dali identifier

acc varchar(20) Uniprot's unique accession number

pid varchar(20) Uniprot's protein identifier

desc blob Uniprot's short description of the protein

taxid integer NCBI taxonomy identifier

gene varchar(20) Uniprot's gene symbol

field	type	comment
cd1	char(5), primary key	Dali identifier
acc	varchar(20)	Uniprot's unique accession number
pid	varchar(20)	Uniprot's protein identifier
desc	blob	Uniprot's short description of the protein
taxid	integer	NCBI taxonomy identifier
gene	varchar(20)	Uniprot's gene symbol

species_taxid

field	type	comment
species	blob	scientific name
taxid	integer, primary key	NCBI taxonomy identifier

User manual for ISS_ProtSci

Table of contents

Installation via Docker

Prerequisites

Instructions

1.Pull the Docker Image

2.Run the Container and download necessary databases

3.Activate the Conda Environment

4. Run a Test Case

Note on Parallel Processing

Verifying the Results

Manual Installation

System Requirements

Notes on DaliLite Installation

Downloads

Configuration

Configuring Dali Execution

Worked example

Easy workflow using shell scripts

Running the steps manually

Step 1: Transitive closure

Step 2.1: Annotate matched proteins with Uniprot accession, taxonomy id, gene name, species, description, amino acid and DSSP (3-state secondary structure) sequences

Step 2.2: Assign common core fragments to Pfam families

Step 2.3: Count matched common core fragments per Pfam family

Step 2.4: Do direct Foldseek search for comparison

Step 3.1: Filter clan members that are structurally similar to Query (sensu Dali)

Step 3.2: Output statistics

Outputs of analysis script

Plots

Histograms of Z-scores

Scatterplot of alignment length vs. Dali Z-score

Stacked alignments

Superimposed coordinates

Benchmark results

Troubleshooting

Check configuration file ISS.ini

Check connection to data server

Check Foldseek

Check hmmer3

Check DaliLite components

Appendix A: Output format of {cd1}.AFDB2.tsv and {cd1}.AFDB2.pf.tsv

Appendix B: Data server

Tables in bigdb_database

foldseek_afdb_to_afdb

Tables in sqlite3_database

dbref

afdb2_sequ

afdb2_dssp

afdb2_metadata

species_taxid