A tutorial is available.
Citation:
1. Liu et al. (2025) 3-D substructure search by transitive closure in AlphaFold Database. Protein Science 34:e70169
Installation via Docker Manual installation System requirements Downloads Configuration Worked example Easy workflow using shell scripts Running the steps manually Outputs of analysis script Plots Superimposed coordinates Benchmark results Troubleshooting Appendix A: Output format of {cd1}.AFDB2.tsv and {cd1}.AFDB2.pf.tsv Appendix B: Data server
docker pull hao1999/iss_protsci
docker run -it --name iss_protsci -v iss_data:/data hao1999/iss_protsci:latest
source /opt/miniconda/bin/activate iss_env
/root/ISS_ProtSci-1/0_test.csh -o hcase_1
If you are unable to use Docker, you can follow the steps below to manually set up the environment. Please be aware that setting up the environment manually requires attention to the versions of the dependencies, and this process can be time-consuming.
numpy
, scipy
, argparse
, configparser
, heapdict
, bitstring
dsspcmbi
, which is included in the DaliLite package but may not compile correctly with modern C compilers. It can be replaced by mkdssp as insructed on the DaliLite download page.gcc 4.8.5
and mpirun (Open MPI) 1.10.7
.
# Download scripts, Foldseek and Pfam data, and benchmark results (compressed size: 3 GB)
wget http://ekhidna2.biocenter.helsinki.fi/ISS_ProtSci/ISS_ProtSci.tar.gz
tar -zxvf ISS_ProtSci.tar.gz
# Navigate to the installation directory
cd ISS_ProtSci-1
# Check installation
python3 scripts/ISS_closure.py
# Expected output:
# default_config = /home/luholm/ISS_ProtSci-1/scripts/ISS.ini
# class Query: None None None
# dowo, dofs False True
# Download AFDB2 database for DaliLite
wget http://ekhidna2.biocenter.helsinki.fi/dali/AF-Digest.tar.gz
tar -zxvf AF-Digest.tar.gz
Site-specific configuration is necessary.
Edit scripts/ISS.ini
. Find the following lines and update the paths according to your local installation:
# ISS installation directory
ISS_SCRIPT_HOME=/home/luholm/ISS_ProtSci-1/
# Dali installation directory
DALI_HOME=/home/luholm/DaliLite.v5/
# Location of AFDB2 database for DaliLite
DALI_AFDB_DAT_UNCROPPED=/data/liisa/alphafold/DAT/
# Foldseek executable
FOLDSEEK_EXE=/usr/local/bin/foldseek
If you have OpenMPI installed, Dali will run in parallel by default. If not, configure Dali to run serially by modifying the following setting:
# Run Dali in parallel
DALI_NPARA=40
Change to:
DALI_NPARA=1
To run the shell scripts, we here set a shorthand environment variable ISShome:
ISShome=~/ISS_ProtSci-1/
ISShome is where the distribution package was installed.
Here, we assume that user inputs are set as environment variables (replace values as appropriate):
pdbfile=$ISShome/testset/pdbmr59.ent.gz
cd1=testA
minlali=180
zcut=2.0
clan=CL0192
Setting user-defined values for minlali and zcut is optional, as the scripts have reasonable defaults.
Before you start running the shell scripts, prepare your workspace, creating and navigating to a working directory:
mkdir -p mywork
cd mywork
The shell scripts can be executed as follows (using the environment variables set above):
$ISShome/scripts/1_search.csh -pdbfile $pdbfile -cd1 $cd1 -minlali $minlali -zcut $zcut > $cd1.log
$ISShome/scripts/2_annotate.csh -pdbfile $pdbfile -cd1 $cd1 > $cd1.fragment.stat
$ISShome/scripts/3_analyze.csh -clan $clan -cd1 $cd1 -minlali $minlali -zcut $zcut > $cd1.stat
Precomputed outputs for our test case can be found in $ISShome/testset/case_8/.
Here, we assume that user inputs are set as environment variables (replace values as appropriate):
pdbfile=$ISShome/testset/pdbmr59.ent.gz cd1=testA minlali=180 zcut=2.0 clan=CL0192
python3 $ISShome/scripts/ISS_closure.py --pdbfile $pdbfile --pdbid $pdbid --target AFDB2 --minlali $minlali
This creates $cd1.AFDB2.dali.tsv
python3 $ISShome/scripts/ISS_annotate.py --cd1 $cd1 --pdbfile $pdbfile --target AFDB2
This creates $cd1.AFDB2.tsv and $cd1.AFDB2.seriated.tsv
hmmlib=$ISShome/pfam-data/Pfam-A.hmm
CLAN_TABLE=$ISShome/pfam-data/clan_pfam.tsv
tmpfasta=$$.fasta
tmpout1=$$.out1
tmpout2=$$.out2
grep AFDB $cd1.AFDB2.tsv | cut -f 2,29 | perl -pe 's/^/>/' | perl -pe 's/\t/\n/' > $tmpfasta
hmmsearch -o $tmpout1 --acc --noali --notextw --cut_tc $hmmlib $tmpfasta
perl $ISShome/scripts/parse_hmmer.pl < $tmpout1 | sort -gk3 | perl $ISShome/scripts/pickfirst.pl > $tmpout2
$ISShome/scripts/tbljoin -l $cd1.AFDB2.tsv $tmpout2 > $tmpout1
$ISShome/scripts/tbljoin -l $tmpout1 $CLAN_TABLE > $cd1.AFDB2.pf.tsv
rm -f $tmpfasta $tmpout1 $tmpout2
This creates $cd1.AFDB2.pf.tsv
cut -f 33,35 $cd1.AFDB2.pf.tsv | sort | uniq -c | sort -n
fs1=$cd1.fsdirect.tsv
fs2=$cd1.fsdirect.dali.tsv
t1=$$.1
# Foldseek run with --max-seqs 50000
python3 $ISShome/scripts/ISS_foldseek.py --pdbfile $pdbfile > $fs1
# validate Foldseek 1st shell (e<1) by Dali
gawk ' $3 < 1 ' $cd1.fsdirect.tsv | cut -f 2 > $t1
# --pdbfile signals that it is a local file
python3 $ISShome/scripts/ISS_rundali.py --cd1 $cd1 --pdbfile dummy --target $t1 --tsvfile $fs2
# clean up
rm -f $t1
This creates $cd1.fsdirect.tsv (results of Foldseek search) and $cd1.fsdirect.dali.tsv (Dali results on direct Foldseek hits). These files are used for comparing the performance of Foldseek to the transitive closure search.
tmplist=$$.list
grep $clan clan_hmmer_tc.tsv | cut -f 2 | sort | uniq > $tmplist
## --pdbfile signals that it is local
python3 $ISShome/scripts/ISS_rundali.py --cd1 $cd1 --pdbfile dummy --target $tmplist --tsvfile $cd1.$clan.TRUE.dali.tsv
rm -f $tmplist
This creates $cd1.clan.TRUE.dali.tsv
$ISShome/scripts/3_analyze.csh -clan $clan -cd1 $cd1 -minlali $minlali
grep EVAL $cd1.stat
gives a summary table of (TRUE_0, TP_0, P_0, TRUE_1, TP_1, P_1, TRUE_2, TP_2, P_2), where suffix _0 refers to Pfam assignments based on hmmsearch with trusted cutoff, sets _1 are filtered by geometrical siilarity (Dali Z-score > 2), and sets _2 are further filtered by common core size. TRUE is the count of AFDB2 proteins in the Pfam family, TP the count of true positives, and P the count of positives. Each query structure has one row for transitive closure, and several rows for Foldseek at varying e-value cutoffs. Our example gives the following data:
method TRUE_0 TP_0 P_0 TRUE_1 TP_1 P_1 TRUE_2 TP_2 P_2 EVAL closure 10300 0 35165 9339 9295 11318 7544 7543 8091 EVAL foldseek(10) 10300 5673 8969 9339 3412 3700 7544 2846 3009 EVAL foldseek(1) 10300 3597 3946 9339 3412 3700 7544 2846 3009 EVAL foldseek(0.1) 10300 661 684 9339 634 654 7544 543 557 EVAL foldseek(0.01) 10300 66 67 9339 66 67 7544 63 64 EVAL foldseek(0.001) 10300 59 60 9339 59 60 7544 58 59TP_0 (no filtering) of transitive closure is undefined, because it only reports Dali-similar matches. Transitive closure clearly reports more true positives than Foldseek. The numbers in the table enable the caclucation of precision (pr=TP/P), recall (rc=TP/TRUE) and their harmonic mean, the F1-score (F1=2*pr*rc/(pr+rc). The F1-score can be used to compare the performance of binary classifiers. In our example, transitive closure achieves an F1-score of 0.96, which is clearly above Foldseek's 0.59.
Counting protein matches can be biased to large families. The following fields report how many member families of the clan have at least one match (transitive closure search reached all but one family):
Below, the histograms for our example case are pasted next to each other
Z-score closure_count closure_cum_count foldseek_count foldseek_cum_count 42 1 1 1 1 36 4 5 4 5 35 4 9 4 9 34 2 11 2 11 33 3 14 3 14 31 2 16 2 16 30 15 31 15 31 29 8 39 8 39 28 7 46 7 46 27 6 52 6 52 26 2 54 2 54 25 1 55 1 55 24 3 58 3 58 23 1 59 1 59 17 2 61 0 59 16 38 99 10 69 15 43 142 8 77 14 69 211 20 97 13 116 327 25 122 12 297 624 68 190 11 799 1423 218 408 10 1800 3223 658 1066 9 2835 6058 1244 2310 8 1743 7801 621 2931 7 233 8034 71 3002 6 47 8081 6 3008 5 8 8089 1 3009 4 1 8090 0 3009 3 1 8091 0 3009
gawk -v minlali=$minlali ' $5 > minlali ' $cd1.$target.dali.tsv | cut -f 3 | python3 $ISShome/scripts/histcol.py > x
gawk -v minlali=$minlali ' $5 > minlali ' $cd1.fsdirect.dali.tsv | cut -f 3 | python3 $ISShome/scripts/histcol.py > y
# convert to tsv, select columns to plot
paste x y| perl -pe 's/ +/\t/g' | cut -f 1,2,6
The first column is the Z-score bin, the second is counts for transitive closure search, the third is counts for direct Foldseek search. Counts are of matched proteins with at least $minlen structurally equivalent positions with the Query structure. The difference between the closure and Foldseek curves in a scatterplot shows the gain by transitive search.
cut -f 4,6,33,35 $cd1.AFDB2.pf.tsv
Use the labels to group the data into data series representing different Pfam families or clans.
# Secondary structure strings in stacked alignment
gawk ' $2 == "AFDB2" ' $cd1.AFDB2.tsv | cut -f 30
# The same as above filtered on alignment length
gawk -v minlali=$minlali ' $2 == "AFDB2" && $6 > $minlali ' $cd1.AFDB2.tsv | cut -f 30
# Seriated data looks nicer
gawk ' $2 == "AFDB2" ' $cd1.AFDB2.seriated.tsv | cut -f 30
# Fasta formatted sequences in stacked alignment
gawk ' $2 == "AFDB2" ' $cd1.AFDB2.tsv | cut -f 2,31 | perl -pe 's/^/>/' | perl -pe 's/^t/\n/'
cd1=testA
cd2=n5nvA
# replace file name by your downloaded file
afdbfile=$ISShome/n5nv.pdb
gawk -v cd2=$cd2 ' $2 == $cd2 ' $cd1.AFDB2.tsv | cut -f 25,26 | perl $ISShome/scripts/ISS_applymatrix.pl $afdbfile > $cd2.transformed.pdb
For each case, lines of the file testA.stat starting with "EVAL" give a summary of (TRUE_0, TP_0, P_0, TRUE_1, TP_1, P_1, TRUE_2, TP_2, P_2), where suffix _0 refers to Pfam assignments based on hmmsearch with trusted cutoff, sets _1 are filtered by geometrical siilarity (Dali Z-score > 2), and sets _2 are further filtered by common core size. TRUE means ground truth, TP means true positive and P means positive. Each query structure has one row for transitive closure, and several rows for Foldseek at varying e-value cutoffs.
The search step requires three external programs, which must be installed on your system: DaliLite, Foldseek, and a DSSP program called from within DaliLite (see note on DSSP versions on download page). The executables of DaliLite and Foldseek are defined in ISS.ini as DALI_EXE and FOLDSEEK_EXE.
python3 /path/to/ISS_ProtSci-1/scripts/ISS_remote_sql.py
Output should be similar to:
# default_config = /data/liisa/ISS_ProtSci-1/scripts/ISS.ini get_afdb2_cropaln ["1,26,56,62", "1,30,64,74", "25,30,6,94", 155] afdb_repset ( 100 ) ['fbuzA', 'gyskA'] ... ['d9zaA', 'dpixA'] get_repset ['fbuzA', 'gyskA', 'c402A', 'fquwA', 'no7sA'] get_repset_straight_list ["fbuzA", "gyskA", "c402A", "fquwA", "no7sA"] get_repset_straight_list_inf ["fbuzA", "gyskA", "c402A", "fquwA", "no7sA"] annotate_afdb_resultlist[0]: fbuzA Q96N28 PLD3A_HUMAN 172 PRELI domain containing protein 3A 9606 PRELID3A Homo sapiens MKIWSSEHVFGHPWDTVIQAAMRKYPNPMNPSVLGVDVLQRRVDGRGRLHSLRLLSTEWGLPSLVRAILGTSRTLTYIREHSVVDPVEKKMELCSTNITLTNLVSVNERLVYTPHPENPEMTVLTQEAIITVKGISLGSYLESLMANTISSNAKKGWAAIEWIIEHSESAVS LEEEEEEEEELLLHHHHHHHHHLLLLLLLLLLEEEEEEEEEEELLLLLEEEEEEEEEELLLLHHHHHHHLLLLLEEEEEEEEEEELLLLEEEEEEEELLLLLLEEEEEEEEEEELLLLLLLEEEEEEEEEEELLLLLHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHLLL 1,26,56,62 1,30,64,74 25,30,6,94 155 1 shell_reps ( 1236 ) [['no7sA', 'no7sA', 0.0, 1, 476, 1, 476]] ... [['gyskA', 'iktvA', 9.237e-06, 3, 156, 219, 370]] list_to_evalues [["fquwA", 1.282e-05], ["gyskA", 1.02e-05], ["fbuzA", 2.045e-29], ["c402A", 1.08e-05], ["no7sA", 2.025e-05]] filter_list_for_targets ["fquwA", "gyskA", "fbuzA", "c402A", "no7sA"]
Check hmmsearch is in your path:
which hmmsearch
/usr/local/bin/hmmsearch
hmmsearch
Incorrect number of command line arguments. Usage: hmmsearch [options]where most common options are: -h : show brief help on version and usage To see more help on available options, do hmmsearch -h
Check that required data files are present. These are included in the ISS_ProtSci distribution package.
ls -s clan_pfam.tsv Pfam-A.hmm clan_hmmer_tc.tsv
cd /path/to/ISS_ProtSci-1/pfam-data/
38900 clan_hmmer_tc.tsv 144 clan_pfam.tsv 1535348 Pfam-A.hmm
The most common problem with DaliLite installations is an incompatible DSSP version (see section 2.3 of DaliLite web page).
Column Heading Description 1 Rank Rank of match in descending order of Dali Z-scores 2 Sbjct Dali identifier of matched structure (sbjct) in AFDB2 3 Database Query/AFDB. Flags special data for query protein which is needed for html output. 4 Z-score Dali Z-score. Higher values are better. 5 RMSD Root-mean-square deviation of superimposed CA atoms. It is not optimized by Dali. High values don't look good in 3D-viewer. 6 Ali-length Number of structurally equivalent CA atoms in the Dali alignment. 7 8 Seq-identity Percentage of identical amino acids of structurally aligned residues. 9 Accession Uniprot accession number of AlphaFold model 10 Protein-identifier Uniprot protein identifier of AlphaFold model 11 Sbjct-length Number of amino acids of sbjct sequence in Uniprot 12 Description Short description of sbjct protein from Uniprot 13 Taxid NCBI taxonomy id of sbjct protein 14 Gene Sbjct protein's gene name from Uniprot 15 Species Sbjct protein's species from Uniprot 16 Query Query identifier 17 Query-length Number of amino acids in query structure 18 Query-coverage Ali-length divided by Query-length 19 Sbjct-coverage Ali-length divided by Sbjct-length-cropped 20 21 Foldseek-evalue E-value of direct Foldseek search with max-seqs 50000 22 qstarts Start positions of ungapped aligned segments in query 23 sstarts Start positions of ungapped aligned segments in sbjct 24 lengths Lengths of ungapped aligned segments 25 rotation 3x3 translation matrix U 26 translation translation vector T. UX+T superimposes sbjct coordinates X onto the query structure. 27 Sbjct-sequence Amino acid sequence of sbjct 28 Sbjct-dssp DSSP assignments for sbjct. H = alpha helix, E = beta strand, L = coil 29 sequ-pileup How sbjct-sequence appears in stacked alignment, ignoring insertions 30 DSSP-pileup How sbjct-dssp appears in stacked alignment, ignoring insertions 31 dssp-order Rank of match in seriated view, which maximizes similarity of secondary structure between adjacent rows 32 33 pfam Pfam identifier of profile that gave the best evalue for sequ-pileup in hmmsearch against Pfam-A.hmm 34 hmmer_evalue evalue from hmmsearch 35 clan Clan of protein family (pfam, col. 33) as assigned in Pfam-C. NA if unassigned.
The server runs scripts/ISSserver.py as a daemon process. To use the local data server, import ISS_sql.py (found in scripts/) instead of ISS_remote_sql.py in all ISS_* scripts.
The data server uses two sqlite3 databases. Their location is hardcoded in the select_action function of ISSserver.py:
p['sqlite3_database']='/data/liisa/ISS/ISS.db'
p['bigdb_database']='/data/liisa/alphafold/TESS/test.db'
Replace the paths according to your local configuration.
Data to set up a lightweigt version of the Helsinki data server using Dali Digest keys is available:
The standalone script expects the following table names with content as described.
field | type | comment |
---|---|---|
cd1 | char(5), indexed | Dali structure identifier of 'query' structure |
cd2 | char(5) | Dali structure identifier of 'sbjct' structure |
qfrom | integer | start of 'query' alignment |
qto | integer | end of 'query' alignment |
sfrom | integer | start of 'sbjct' alignment |
sto | integer | end of 'sbjct' alignment |
evalue | float | a score used to rank neighbors; smaller scores have higher priority |
The all-against-all data was generated using Foldseek command:
foldseek easy-search pdb-directory afdb2 result.m8 tmp -e 1 --max-seqs 50000
The AFDB2 run took 5 days with 40 CPUs.
field | type | comment |
---|---|---|
cd1 | char(5) primary key | Dali identifier |
nres | integer | length of the protein |
acc | varchar(20) | Uniprot's accession number |
pid | varchar(20) | Uniprot's protein identifier |
field | type | comment |
---|---|---|
cd1 | char(5), primary key | Dali identifier |
sequence | blob | amino acid sequence |
field | type | comment |
---|---|---|
cd1 | char(5), primary key | Dali identifier |
dsspstring | blob | sequence of three-state assignments of secondary structure |
field | type | comment |
---|---|---|
cd1 | char(5), primary key | Dali identifier |
acc | varchar(20) | Uniprot's unique accession number |
pid | varchar(20) | Uniprot's protein identifier |
desc | blob | Uniprot's short description of the protein |
taxid | integer | NCBI taxonomy identifier |
gene | varchar(20) | Uniprot's gene symbol |
field | type | comment |
---|---|---|
species | blob | scientific name |
taxid | integer, primary key | NCBI taxonomy identifier |