Max-Planck-Institut für Informatik
 

FAQ


Below are answers to typical questions about using StructMAn. If this page does not answer your question(s), do not hesitate to contact us.

What is the VCF file?


The variant call format (VCF) is the standard output format of most polymorphism-calling algorithms. The exact specifications can be found here. A VCF file always refers to a reference genome. We provide compatibility with the standard reference genomes of all species contained in the UCSC and NCBI-refseq databases. It is not possible to use VCF files that are not mapped to one of the standard reference genomes. In that case, you first have to map the polymorphisms to their genes in order to produce an SMLF file.

What is the SMLF file?


The idea of the simple mutation list format (SMLF) is to have a file format that provides the information on mutations and their respective proteins while being as simple as possible. An SMLF file is a tab-separated values file with only two columns. The first column contains the UniProt entry name of the protein (warning: do not confuse it with the UniProt accession number). The second column contains the amino acid variant in the following format: [one-letter code of the wildtype amino acid][position of the mutation in the amino acid sequence of the protein][one-letter code of the new amino acid]. For example: V302D

Example SMLF file:

[Image: SMLF file example]
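The format described above can be checked line by line before uploading. The following is a minimal sketch, not the parser used by the server: the function name `parse_smlf_line` is made up for this example, and restricting the one-letter codes to the 20 standard amino acids is an assumption. The entry name `P53_HUMAN` is used only to show the shape of a line; its pairing with V302D is purely illustrative.

```python
import re

# One-letter codes of the 20 standard amino acids (assumption: the server
# may accept a different alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"^([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])$")


def parse_smlf_line(line):
    """Split one SMLF line into (entry_name, wildtype, position, mutant)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated columns, got {len(fields)}")
    entry_name, mutation = fields
    match = MUTATION_RE.match(mutation.strip())
    if match is None:
        raise ValueError(f"malformed mutation: {mutation!r}")
    wildtype, position, mutant = match.groups()
    return entry_name.strip(), wildtype, int(position), mutant


# Illustrative line only: entry name, a tab, then the mutation.
print(parse_smlf_line("P53_HUMAN\tV302D"))
```

A line separated by spaces instead of a tab, or a mutation such as `302D` without the wildtype residue, raises a `ValueError` in this sketch.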

What do the different output options do?


When a job is uploaded, all given mutation datasets are analysed once by the pipeline, and the results are stored for at least one week in the MySQL database that connects all information produced by the pipeline. The initial analysis may take some time (typically 40 minutes for 1000 nsSNPs), so it is recommended to provide an email address to access the results later. The different outputs are generated from these stored data, so choosing additional output options will not increase the running time significantly. We therefore recommend always choosing at least the Annotation Output.

Annotation output

The annotation output lists all produced structural annotations and contacts, and allows you to visually inspect the position of each nsSNP in the corresponding 3D structure in a separate window. For each position of an nsSNP, the following information is displayed:
  • Protein
    displays the Uniprot-ID of the protein harbouring the nsSNP.
  • Structure
    displays the PDB-ID of the 3D structure used for the structural annotation of the nsSNP.
  • Mutations
    displays all amino acid variants for the position provided in the input dataset.
  • Score
    The interaction score is the product of the structure quality score and the annotation candidate score. The interaction score displays the potential impact of the substitution corresponding to the nsSNP on the protein interactions. By default, the output list is sorted by this score.
  • 3D-Viewer
    This button opens a new tab in your browser and loads the 3D structure of the protein or the protein homolog. In the structure, the residue corresponding to the nsSNP is shown in ball-and-stick style, while the surrounding protein chains are displayed as cartoons. The distances to all interaction partners are displayed.

Proteinsort output

This option sorts the output by the cumulative interaction score of all nsSNPs in a specific protein. This makes it possible to select proteins with many high-impact mutations.
  • Protein
    displays the Uniprot-ID of the protein harbouring the nsSNP.
  • Score
    This protein specific score is computed by finding the homolog structure with the highest cumulative sum of interaction scores, which are assigned to the nsSNPs of the protein.
    [Image: protein score formulas]
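The description of the protein score above can be read as "sum the interaction scores per homolog structure, then take the best structure". The following sketch follows that reading only; it is not the exact formula of the server, and the function name and the structure labels are hypothetical.

```python
def protein_score(scores_by_structure):
    """Return (structure, score) for the structure with the highest summed score.

    scores_by_structure maps a structure label to the interaction scores of
    the nsSNPs of one protein that were annotated with that structure.
    """
    totals = {
        structure: sum(scores)
        for structure, scores in scores_by_structure.items()
    }
    best = max(totals, key=totals.get)
    return best, totals[best]


# Hypothetical interaction scores of the nsSNPs of a single protein.
example = {
    "structure_A": [0.75, 0.25],
    "structure_B": [0.5, 0.5, 0.25],
}
print(protein_score(example))
```

In this example `structure_B` is selected, because its three nsSNPs add up to a higher cumulative score than the two nsSNPs annotated with `structure_A`.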

GO term analysis

In the GO term analysis, all proteins from the input are grouped according to their GO terms. The GO term specific groups are then scored by the sum of their protein scores (see Proteinsort output), normalized by the total number of proteins in the input set. This analysis reflects the overrepresentation of critical mutations in proteins with a certain biological function, process or localization. Given that there might be a natural bias in the input dataset, one might prefer, for a clearer picture, to perform a differential GO term analysis of a given input set versus a reference data set.
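The grouping and normalization described above can be sketched as follows. This is an illustration under assumptions, not the implementation of the server: the function name, the protein labels and the term labels are hypothetical, and a protein is assumed to contribute its full protein score to every GO term it is annotated with.

```python
from collections import defaultdict


def go_term_scores(protein_scores, go_terms_by_protein):
    """Sum protein scores per GO term and divide by the number of input proteins."""
    n_proteins = len(protein_scores)
    if n_proteins == 0:
        return {}
    totals = defaultdict(float)
    for protein, score in protein_scores.items():
        for term in go_terms_by_protein.get(protein, ()):
            totals[term] += score
    return {term: total / n_proteins for term, total in totals.items()}


# Hypothetical protein scores and GO annotations.
protein_scores = {"prot_A": 1.0, "prot_B": 0.5, "prot_C": 0.0, "prot_D": 0.5}
go_terms_by_protein = {
    "prot_A": ["term_1"],
    "prot_B": ["term_1"],
    "prot_D": ["term_2"],
}
print(go_term_scores(protein_scores, go_terms_by_protein))
```

With four input proteins, `term_1` receives (1.0 + 0.5) / 4 and `term_2` receives 0.5 / 4; `prot_C` has no annotation here but still counts towards the normalization.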

Differential GO term analysis

For this analysis, the user has to upload exactly two input data sets. The server performs the simple GO term analysis on both sets and then compares the results with each other. The output is sorted by the difference between the GO term scores of the GO terms that appear in both sets. The absolute value of the GO term score is not displayed, which makes it possible to study the relative overrepresentation of certain GO terms corresponding to the mutations. If the difference is positive, the corresponding GO term is overrepresented in the first dataset, and vice versa.
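The comparison step can be sketched in a few lines. The function name and the term labels below are hypothetical, and sorting in descending order (terms overrepresented in the first dataset first) is an assumption, since the sort direction is not specified above.

```python
def differential_go_scores(scores_first, scores_second):
    """Return (term, difference) pairs for GO terms present in both datasets."""
    shared = scores_first.keys() & scores_second.keys()
    diffs = {term: scores_first[term] - scores_second[term] for term in shared}
    # Assumption: largest positive difference first.
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical GO term scores of two input datasets.
first = {"term_1": 0.375, "term_2": 0.125, "term_3": 0.5}
second = {"term_1": 0.125, "term_2": 0.25}
print(differential_go_scores(first, second))
```

Here `term_1` has a positive difference (overrepresented in the first dataset), `term_2` has a negative one, and `term_3` is dropped because it appears in only one dataset.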

Pathway analysis & Differential pathway analysis

The (differential) pathway analysis is performed in the same way as the (differential) GO term analysis, but the proteins are grouped according to the pathways of the Reactome Database.


How are the scores computed?


Each score is a linear combination of a weight vector and a value vector, normalized by the sum of all weight factors. The weight factors give the user the opportunity to balance the scoring according to their own priorities.

Example 1

For the structure quality scoring, you want to consider only the sequence identity and the coverage, and you decide that the sequence identity is three times more important than the coverage. In this case you would set:
  • Sequence Identity Weight Factor = 3
  • Coverage Weight Factor = 1
  • Resolution Weight Factor = 0
  • R-Value Weight Factor = 0
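To illustrate how such weight factors act, the following sketch computes a normalized weighted sum with the weights of Example 1. The sub-score values are hypothetical and are assumed to be already scaled to the range 0 to 1; how the server derives each sub-score is not covered here.

```python
def weighted_score(values, weights):
    """Weighted sum of sub-scores, normalized by the sum of all weight factors."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight factor must be greater than zero")
    return sum(weights[name] * values[name] for name in weights) / total_weight


# Weight factors from Example 1.
weights = {"sequence_identity": 3, "coverage": 1, "resolution": 0, "r_value": 0}
# Hypothetical sub-score values, assumed to be scaled to [0, 1].
values = {"sequence_identity": 0.5, "coverage": 1.0, "resolution": 0.25, "r_value": 0.75}

print(weighted_score(values, weights))  # (3 * 0.5 + 1 * 1.0) / 4 = 0.625
```

Sub-scores with a weight factor of 0 drop out of the sum entirely, which is why the resolution and R-value entries have no influence on the result.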

Example 2

For the annotation candidate score you only want to consider contacts between the substituted residue and low molecular weight molecules. In this case you would set:
  • Ligand Distance Weight Factor > 0
  • Chain Distance Weight Factor = 0

Structure quality score


[Image: structure quality score formulas]

Annotation candidate score


[Image: annotation candidate score formulas]

Interaction Score

The interaction score is the product of the structure quality score and the annotation candidate score. It combines the structural information about the residue position with the reliability given by the structure quality.
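As a small illustration of the product and of the default sorting by this score, consider the following sketch. The position labels and the score values are hypothetical.

```python
def interaction_score(structure_quality, annotation_candidate):
    """Product of the structure quality score and the annotation candidate score."""
    return structure_quality * annotation_candidate


# Hypothetical (position, structure quality score, annotation candidate score).
rows = [
    ("pos_1", 0.5, 0.5),
    ("pos_2", 1.0, 0.75),
    ("pos_3", 0.25, 1.0),
]

# Sort by interaction score, highest first.
ranked = sorted(
    ((name, interaction_score(quality, candidate)) for name, quality, candidate in rows),
    key=lambda row: row[1],
    reverse=True,
)
print(ranked)
```

A strong contact found in a low-quality structure (`pos_3`) ends up with the same score as a moderate contact in a moderately reliable structure (`pos_1`), which is the intended effect of multiplying the two scores.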


What is a specified ligand analysis?


The specified ligand analysis enables the user to identify mutations that are in contact with certain ligand molecules or with a certain class of ligand molecules. For the specified ligand analysis, the user has to provide a file with a list of low molecular weight ligands. The file can be in any format readable by the OpenBabel toolkit. The extension of the uploaded ligand file must correspond to the file format, as required by OpenBabel. Popular formats are: *.smi (SMILES), *.sdf (Structure Data File), *.mol2 (MOL2) or *.inchi (IUPAC InChI format). Additionally, the user has to provide a Tanimoto distance threshold. The pipeline finds all ligand molecules in the HETATM entries of all PDB files that are similar enough, according to the chosen threshold, to the given molecules, and checks their distances to the mutations given in the mutation input dataset. The output is sorted according to the similarity and the distance.
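For readers unfamiliar with the Tanimoto measure, the following sketch shows the Tanimoto coefficient on binary fingerprints, represented here as sets of "on" bit positions. It does not use OpenBabel, and the fingerprints and the cutoff of 0.5 are hypothetical. Whether the server applies its threshold to the similarity or to the corresponding distance (1 minus the similarity) is not specified above; this sketch assumes a similarity cutoff.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits of both sets."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


# Hypothetical fingerprints as sets of on-bit positions.
query = {1, 2, 3, 4}
pdb_ligands = {
    "ligand_X": {1, 2, 3},
    "ligand_Y": {3, 4, 5, 6},
}

threshold = 0.5  # assumed similarity cutoff
similar = {
    name: tanimoto_similarity(query, bits)
    for name, bits in pdb_ligands.items()
    if tanimoto_similarity(query, bits) >= threshold
}
print(similar)
```

In this example only `ligand_X` passes: it shares three of four on-bits with the query, whereas `ligand_Y` shares two of six.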

Options:


The options can be set after pressing the Show Options button on the upload page.
The options of StructMAn affect either the selection of homologous proteins with resolved 3D structures by the pipeline or the scoring function (and thereby the sorting order of the produced outputs). In general, the more the template structure selection is restricted, the fewer nsSNPs will be structurally annotated and the fewer structures will be considered for the annotation of a single nsSNP. In return, the runtime gets shorter and the output potentially more reliable. We do not recommend restricting the template structure selection beyond the default parameters, but relaxing these parameters may allow the identification of more distant homologs with experimentally resolved 3D structures.

Input size dependent template filtering

This option reduces the number of templates used for the analyses, depending on the number of given mutations. The reduction starts at an input size of one hundred mutations, and the share of templates that is kept halves with every order of magnitude: an input of 1,000 mutations is reduced by 50%, an input of 10,000 mutations by 75%, and so on. The number of templates never drops below ten per target protein.
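The rule can be sketched as a small function. This is only one reading of the description above: the function name is hypothetical, and it is assumed that the reduction proceeds in steps (all input sizes from 1,000 to 9,999 are treated alike) and that the number of kept templates is rounded up. The server may interpolate or round differently.

```python
import math


def templates_to_keep(n_templates, n_mutations, floor=10):
    """Number of templates kept per target protein under the assumed step rule."""
    if n_mutations < 100:
        return n_templates  # no reduction below one hundred mutations
    # Completed orders of magnitude above 100: 100-999 -> 0, 1,000-9,999 -> 1, ...
    orders = len(str(n_mutations)) - 3
    kept = math.ceil(n_templates * 0.5 ** orders)
    # Never go below the floor, but never keep more templates than exist.
    return min(n_templates, max(floor, kept))


for n_mutations in (50, 1_000, 10_000, 100_000):
    print(n_mutations, templates_to_keep(40, n_mutations))
```

For a protein with 40 candidate templates, this keeps 40 templates at 50 mutations, 20 at 1,000 mutations and 10 at 10,000 mutations. At 100,000 mutations the halving would leave 5, so the lower limit of ten applies.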

Sequence identity threshold:

The sequence identity threshold restricts the template structure selection of the pipeline to structures whose sequence identity with the given protein reaches the threshold. The sequence identity is taken from the local pairwise amino acid sequence alignment of the template structure with the given protein.

Coverage threshold:

The coverage threshold is another filter based on the alignment between the homologous protein and the given protein. The coverage is the ratio between the length of the local pairwise sequence alignment and the length of the amino acid sequence of the protein.

Resolution threshold:

The resolution of the 3D structure of the homolog is a good measure of the overall quality of the experimental 3D structure. The value is given in ångström (Å); lower values correspond to a better resolution.
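The three thresholds above can be pictured as one combined filter on candidate templates. The following is a sketch under assumptions: the class, the function names, the template labels and all threshold values are hypothetical (they are not the defaults of the server), and the identity and coverage thresholds are assumed to be lower bounds while the resolution threshold is an upper bound.

```python
from dataclasses import dataclass


@dataclass
class Template:
    label: str
    sequence_identity: float  # percent identity of the local alignment
    alignment_length: int     # length of the local pairwise alignment
    resolution: float         # in angstrom, lower is better


def coverage(alignment_length, protein_length):
    """Ratio between alignment length and length of the protein sequence."""
    return alignment_length / protein_length


def passes_thresholds(template, protein_length, min_identity, min_coverage, max_resolution):
    """Check one template against all three thresholds."""
    return (
        template.sequence_identity >= min_identity
        and coverage(template.alignment_length, protein_length) >= min_coverage
        and template.resolution <= max_resolution
    )


# Hypothetical templates for a protein of 400 residues.
templates = [
    Template("tmpl_1", 80.0, 300, 2.0),
    Template("tmpl_2", 30.0, 390, 1.5),
    Template("tmpl_3", 90.0, 100, 1.8),
]
selected = [
    t.label
    for t in templates
    if passes_thresholds(t, 400, min_identity=35.0, min_coverage=0.5, max_resolution=3.0)
]
print(selected)
```

With these hypothetical thresholds only `tmpl_1` is kept: `tmpl_2` fails the sequence identity threshold and `tmpl_3` covers only a quarter of the protein.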

Weight factors:

The weight factors balance the different sub-scores described in the scoring section.