Description:

Domains play an essential role in our understanding of protein evolution and function, either because they appear as substructures of a protein or because they correspond to individual 3D structures in their own right. They currently form the basis of the CATH and SCOP protein classifications. Using the VAST algorithm (Gibrat et al. 1, Madej et al. 2), we observed that protein domains could be assigned from the recurrence of small common 3D substructures found in proteins of the PDB (Tai et al. 3).

1 Determination of the domain boundaries


In the VAST algorithm, proteins are represented by their secondary structure elements (SSEs), more specifically by the endpoints of vectors going through these SSEs. The basic task of VAST is to find the best 3D substructure common to a query and a target. A common 3D substructure is formally defined as a one-to-one correspondence between a subset of the SSE vectors in the query and a subset of the SSE vectors in the target. This correspondence respects the type and direction of the SSEs (i.e., helices are paired only with helices and strands only with strands) and the topology (i.e., the order of the SSEs within the query and the target). Such a set of paired SSEs is called a clique or a Locally Similar Structural Piece (LSSP); an example is given in Fig. 1. For further details about VAST, see Methods and Appendix in Sam et al. (4). The secondary structures of the query and the target are determined from the atomic coordinates with the program KAKSI (Martin et al. 5).
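The constraints in this definition can be illustrated with a minimal sketch. It is not VAST itself: the function name, the string encoding of SSE types ("H" for helix, "E" for strand) and the index pairs are assumptions made for illustration, and the geometric part (vector direction and superposition) is left out. The sketch only checks that a proposed pairing is one-to-one, pairs like with like, and preserves the order of the SSEs.

```python
from typing import List, Tuple


def is_valid_correspondence(query_types: str, target_types: str,
                            pairs: List[Tuple[int, int]]) -> bool:
    """Check the combinatorial constraints on a candidate SSE pairing.

    query_types / target_types: one character per SSE, in chain order
    (hypothetical encoding: "H" = helix, "E" = strand).
    pairs: (query SSE index, target SSE index), 0-based.
    """
    q_idx = [q for q, _ in pairs]
    t_idx = [t for _, t in pairs]
    # One-to-one: no SSE may be used twice on either side.
    if len(set(q_idx)) != len(pairs) or len(set(t_idx)) != len(pairs):
        return False
    # Type: helices pair only with helices, strands only with strands.
    if any(query_types[q] != target_types[t] for q, t in pairs):
        return False
    # Topology: reading the pairs in query order, target indices must increase.
    ordered = sorted(pairs)
    return all(a[1] < b[1] for a, b in zip(ordered, ordered[1:]))


# Usage: a query with SSEs H-E-E-H against a target with SSEs E-H-E-E-H.
print(is_valid_correspondence("HEEH", "EHEEH", [(0, 1), (1, 2), (3, 4)]))
print(is_valid_correspondence("HEEH", "EHEEH", [(0, 1), (1, 0)]))
```

The first pairing satisfies all three constraints; the second pairs matching types but reverses the order of the SSEs, so it is rejected.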

Fig. 1
Example of one of the 12,282 cliques for the query protein 1jjcB used to assign domains:

A and N matrices:

To obtain the domain boundaries, the server collects all the cliques that have Pcli > -10 (a threshold that essentially disregards the estimated likelihood that the clique arises by chance rather than from evolutionary relatedness) and an rmsd < 5 Å, found by comparing the query protein with a representative set of protein chains of the PDB. These cliques are listed in the file *.mathlab (see HELP page). The cliques are then extended by including the residues lying between two secondary structure elements of the clique when there are fewer than 40 of them (see Fig. 1).
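The selection and extension steps can be sketched as follows. This is an illustration under stated assumptions, not the server's code: a clique is assumed to be given as a list of inclusive (start, end) residue intervals on the query, one per SSE, and the function names are hypothetical. The thresholds are the ones quoted in the text.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) residue numbers on the query


def keep_clique(pcli: float, rmsd: float,
                pcli_min: float = -10.0, rmsd_max: float = 5.0) -> bool:
    """Apply the thresholds quoted in the text: Pcli > -10 and rmsd < 5 Å."""
    return pcli > pcli_min and rmsd < rmsd_max


def extend_clique(sses: List[Segment], max_gap: int = 40) -> List[Segment]:
    """Fill the gap between consecutive SSEs when it holds fewer than max_gap residues."""
    merged: List[Segment] = []
    for start, end in sorted(sses):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end - 1  # residues strictly between the two SSEs
            if gap < max_gap:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


# Usage: the 7-residue gap is filled, the 59-residue gap is not.
print(extend_clique([(5, 12), (20, 30), (90, 100)]))  # [(5, 30), (90, 100)]
```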

The extended cliques are assembled into a binary matrix A along the query length. This matrix A is transformed into a co-occurrence matrix N, presented as a heat map or a contour map (files *_Nmatrix.png, *_Nmatrix_contour.png), from which the domains are parsed. These maps give a global view of the recurrent structures (Fig. 2) and often a visual impression of the domain structure of the query protein.

Fig. 2
N matrix of the phenylalanyl-tRNA synthetase (1jjc) chain B (contour map), where Nij is the number of cliques, gaps included, in which residues i and j of the query protein are found together. This chain consists of 6 domains. For instance, the SMF algorithm finds the following domains: D1: [1-41, 151-196], D2: [42-150], D3: [197-398], D4: [399-484], D5: [485-676], D6: [677-785]. Note that D1 is a segmented domain as defined in CATH and SCOP. This matrix is built from several tens of thousands of cliques (gaps included), and it gives a global view of the recurrence and of the domain organization.
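The relation between the two matrices can be made concrete with a small sketch. It assumes that A has one row per extended clique and one column per query residue, with a 1 wherever the clique covers the residue; under that assumption N is the product of the transpose of A with A. The function names and the toy cliques are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) residue numbers on the query


def clique_row(segments: List[Segment], length: int) -> List[int]:
    """Binary row of A: 1 for every query residue covered by the extended clique."""
    row = [0] * length
    for start, end in segments:
        for pos in range(start, end + 1):
            row[pos - 1] = 1  # residues are numbered from 1
    return row


def cooccurrence(a: List[List[int]]) -> List[List[int]]:
    """N[i][j] = number of cliques (rows of A) that contain both residue i and residue j."""
    length = len(a[0])
    n = [[0] * length for _ in range(length)]
    for row in a:
        covered = [i for i, value in enumerate(row) if value]
        for i in covered:
            for j in covered:
                n[i][j] += 1
    return n


# Usage: three toy cliques on a 6-residue query.
a_matrix = [clique_row(c, 6) for c in ([(1, 3)], [(2, 4)], [(5, 6)])]
n_matrix = cooccurrence(a_matrix)
print(n_matrix[1][2])  # residues 2 and 3 share two cliques -> 2
print(n_matrix[2][4])  # residues 3 and 5 never co-occur -> 0
```

Residues that belong to the same domain tend to be covered by the same cliques, so they form blocks of high values along the diagonal of N.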

Domain parsing

The A or N matrices are parsed into domains by three different methods: PCM, SMF and SVD (Tai et al., 3):

PCM for Pairwise Correlation Method

The rows of the clique-matrix A are treated as a multivariate statistical process with outcomes 0 and 1. The usual correlation matrix of this process is computed, and hierarchical clustering is used to obtain clusters of residues corresponding to domains in the query sequence. The clustering process is terminated when the total correlation for residue pairs within a cluster is maximized.
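The two quantities involved can be sketched in a few lines. This is not the PCM implementation: it assumes that each clique (row of A) is one observation and each query residue (column of A) is one variable, uses the Pearson correlation, and leaves out the hierarchical clustering itself. The second function only evaluates the stopping criterion for a given assignment of residues to clusters. All names are hypothetical.

```python
from math import sqrt
from typing import List


def column_correlation(a: List[List[int]]) -> List[List[float]]:
    """Pearson correlation between residue columns of the binary matrix A."""
    n_obs, n_res = len(a), len(a[0])
    means = [sum(row[j] for row in a) / n_obs for j in range(n_res)]
    centered = [[row[j] - means[j] for j in range(n_res)] for row in a]
    norms = [sqrt(sum(c[j] ** 2 for c in centered)) for j in range(n_res)]
    corr = [[0.0] * n_res for _ in range(n_res)]
    for i in range(n_res):
        for j in range(n_res):
            # Constant columns have no defined correlation; they are left at 0.
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(c[i] * c[j] for c in centered)
                corr[i][j] = dot / (norms[i] * norms[j])
    return corr


def within_cluster_correlation(corr: List[List[float]], labels: List[int]) -> float:
    """Sum of correlations over residue pairs assigned to the same cluster."""
    total = 0.0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j]:
                total += corr[i][j]
    return total


# Usage: residues 1-2 and residues 3-4 are covered by disjoint sets of cliques.
a_matrix = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
corr = column_correlation(a_matrix)
print(within_cluster_correlation(corr, [0, 0, 1, 1]))  # two clusters
print(within_cluster_correlation(corr, [0, 0, 0, 0]))  # one cluster
```

In this toy case the two-cluster assignment gives the higher total, which is the situation in which the clustering would stop at two domains.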

SMF for Symmetric Matrix Factorization

The rows of the N-matrix are hierarchically clustered into nd clusters for nd from 1 to 12. These clusters define domains. The optimal value of nd is chosen by dividing the N-matrix into nd-by-nd diagonal and off-diagonal blocks and then choosing the nd which minimizes the ratio of the maximum off-diagonal block density to the minimum diagonal block density.
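The selection criterion for nd can be illustrated with a short sketch. It evaluates the ratio for one candidate clustering; in the method described above this would be repeated for each nd and the smallest ratio retained. Two assumptions are made here: the "density" of a block is taken to be the mean value of N over the block, and the case nd = 1, which has no off-diagonal block, is not handled. The function name and the toy matrix are hypothetical.

```python
from typing import Dict, List, Tuple


def block_density_ratio(n: List[List[float]], labels: List[int]) -> float:
    """Max off-diagonal block density divided by min diagonal block density.

    labels[i] is the cluster (candidate domain) of residue i.
    Density is assumed to be the mean of N over the block.
    """
    if len(set(labels)) < 2:
        raise ValueError("at least two clusters are needed to form off-diagonal blocks")
    sums: Dict[Tuple[int, int], float] = {}
    counts: Dict[Tuple[int, int], int] = {}
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            key = (label_i, label_j)
            sums[key] = sums.get(key, 0.0) + n[i][j]
            counts[key] = counts.get(key, 0) + 1
    density = {key: sums[key] / counts[key] for key in sums}
    max_off = max(d for (a, b), d in density.items() if a != b)
    min_diag = min(d for (a, b), d in density.items() if a == b)
    return max_off / min_diag if min_diag > 0 else float("inf")


# Usage: a toy N matrix with two obvious blocks of two residues each.
n_matrix = [[4, 4, 1, 1],
            [4, 4, 1, 1],
            [1, 1, 2, 2],
            [1, 1, 2, 2]]
print(block_density_ratio(n_matrix, [0, 0, 1, 1]))  # 0.5
print(block_density_ratio(n_matrix, [0, 1, 1, 1]))  # larger: a worse partition
```

A low ratio means that every candidate domain is densely connected internally while no pair of domains shares many cliques.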

SVD for Singular Value Decomposition

The N-matrix is binarized using multiple thresholds. A domain structure is obtained by applying the SVD procedure to each binarized matrix. The resulting series of domain assignments is evaluated using a scoring scheme that combines the degree of agreement between the original and calculated binarized matrices with the degree of physical separation between domains in the three-dimensional structure.

2 Structural neighbours

For a given query, the server also provides a list of targets selected from the file *.mathlab that meet two criteria: the length of the corresponding target clique, gaps included, amounts to more than 80% of the target length, and more than 40% of the target length is aligned by VAST. These targets are what we call structural neighbours (file *SN.txt, see HELP page). In other words, the corresponding part of the query protein exists as a similar individual 3D structure in the PDB. Some of them are homologues of the query or of its domains (to be published). They are far fewer than the cliques used for the domain assignments. Their alignments on the query are displayed as a graph (*SN.png, see HELP page) in series of 20, with a link to PDBsum (see Fig. 3). Each PDB code corresponds to a cluster of protein chains of the Protein Data Bank sharing 40% or more identical residues (PDB40). We consider that all the members of such a cluster align similarly with the query protein and are structural neighbours.
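The two selection criteria amount to a simple filter, sketched below. The function and parameter names are hypothetical, and all lengths are assumed to be counted in residues.

```python
def is_structural_neighbour(clique_span: int, aligned: int, target_length: int) -> bool:
    """Apply the two criteria quoted in the text.

    clique_span: length of the target clique, gaps included.
    aligned: number of target residues aligned by VAST.
    """
    covers_target = clique_span > 0.8 * target_length
    well_aligned = aligned > 0.4 * target_length
    return covers_target and well_aligned


# Usage: a 100-residue target whose clique spans 90 residues, 50 of them aligned.
print(is_structural_neighbour(clique_span=90, aligned=50, target_length=100))  # True
```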

Fig. 3
A snapshot of the aligned structural neighbours for 1jjc chain B, sorted by decreasing percentage of aligned Cα atoms; shown here is a sample of 20 out of a few hundred, with their PDB names linked to PDBsum and their VAST structural alignments.


1. Gibrat JF, Madej T, Bryant SH:
	Surprising similarities in structure comparison. 
	Curr Opin Struct Biol 1996, 6(3):377-385. 
2. Madej T, Gibrat JF, Bryant SH: 
	Threading a database of protein cores. 
	Proteins 1995, 23(3):356-369.
3. Tai CH, Sam V, Gibrat JF, Garnier J, Munson PJ, Lee BK:
	Protein domain assignment from the recurrence of locally similar structures.
	Proteins 2011, 79:853-866.
4. Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ: 
	ROC and confusion analysis of structure comparison methods identify the main causes of 
	divergence from manual protein classification. 
	BMC Bioinformatics 2006, 7:206.
5. Martin J, Letellier G, Marin A, Taly JF, de Brevern A, Gibrat JF:
	Protein secondary structure assignment revisited: a detailed analysis of different assignment methods.
	BMC Struct Biol 2005, 5:17.