Help for JST
Sequence listings
- The displayed sequence is the result of combining information from the ATOM/HETATM fields and the SEQRES fields in the pdb file.
- The sequence of each chain is listed separately. Hetero groups are included in their chain, if this is assigned, or else go in a separate listing (labeled '()' since they have no chain id).
Letters
- Residues are listed in uppercase one-letter code. Nucleotides are converted to single-letter even if deoxy (avoiding the DG etc. used by PDBv3).
- Non-standard residues are listed as lowercase 'x'.
- Inserted residues are included in the listing, as superscript one-letter codes.
- When there is sequence microheterogeneity (residues in ATOM records that are absent in SEQRES), the alternate residues at the same sequence position will be enclosed in square brackets, e.g. [SL].
- Water and solvent are omitted by default (but may optionally be included in the listing). When displayed, water groups will have a pale cyan background in the sequence listing.
- Hetero groups are displayed as [x] (since they are both non-standard residues and present in ATOM/HETATM but not in SEQRES).
- When there is a physical gap in the 3D model (that is, there are residues in SEQRES absent in ATOM, typically due to crystallographic disorder), the residues with no coordinates are listed in lowercase.
- When there is a numbering gap (due to numbering according to a reference sequence) but no residues are missing in the 3D structure (SEQRES and ATOM records match), the position of the gap is indicated by two tildes surrounding a number indicating the size of the gap, e.g. ~1~, ~15~.
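The first two letter rules and the gap marker can be sketched as below. This is only an illustration: the names `ONE_LETTER`, `toOneLetter` and `gapMarker` are hypothetical, and the lookup table is a minimal assumption, not JST's actual table.

```javascript
// Illustrative lookup table (an assumption; JST's own table may differ).
// Deoxy nucleotide names (DA, DC, DG, DT) map to the same single letters.
const ONE_LETTER = new Map(Object.entries({
  ALA: "A", ARG: "R", ASN: "N", ASP: "D", CYS: "C",
  GLN: "Q", GLU: "E", GLY: "G", HIS: "H", ILE: "I",
  LEU: "L", LYS: "K", MET: "M", PHE: "F", PRO: "P",
  SER: "S", THR: "T", TRP: "W", TYR: "Y", VAL: "V",
  A: "A", C: "C", G: "G", U: "U",
  DA: "A", DC: "C", DG: "G", DT: "T",
}));

// Standard residues become an uppercase letter; anything else becomes 'x'.
function toOneLetter(residueName) {
  const name = residueName.trim().toUpperCase();
  return ONE_LETTER.get(name) || "x";
}

// A numbering gap of a given size is written between two tildes.
function gapMarker(size) {
  return "~" + size + "~";
}

console.log(toOneLetter("GLY"), toOneLetter("DG"), toOneLetter("MSE"), gapMarker(15));
```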
Numbering
- Sequence numbers are taken from the ATOM records.
- When there is a physical gap in the 3D model (that is, there are residues in SEQRES absent in ATOM, typically due to crystallographic disorder), the residues with no coordinates receive numbers increasing by 1 from the previous residue (this may lead both to several residues having the same number and to numbering gaps).
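The numbering rule for residues without coordinates can be sketched as follows. The input shape (an array of objects whose `resno` is `null` when the residue has no ATOM record) and the function name are assumptions made for this example. Leading residues with no predecessor are left unnumbered here, because the text above does not say how JST treats them.

```javascript
// Residues with coordinates keep their ATOM number; residues without
// coordinates get the previous residue's number plus 1.
// (Hypothetical data shape, for illustration only.)
function numberResidues(residues) {
  let previous = null;
  return residues.map((r) => {
    const hasCoords = r.resno != null;
    let resno = r.resno;
    if (!hasCoords) {
      resno = previous != null ? previous + 1 : null;
    }
    previous = resno;
    return { ...r, resno, hasCoords };
  });
}

const numbered = numberResidues([
  { code: "M", resno: 10 },
  { code: "K", resno: 11 },
  { code: "v", resno: null },
  { code: "l", resno: null },
  { code: "a", resno: null },
  { code: "G", resno: 13 },
]);
console.log(numbered.map((r) => r.resno).join(","));
```

In this made-up chain the three residues without coordinates receive 12, 13 and 14, so the number 13 appears twice, which is the duplication the rule warns about.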
Interaction with the sequence
- Mousing over the code for a residue in the sequence listing (be it standard or not) displays, in a slot, its chain name, residue number, insertion code if applicable (in lowercase) and the residue name. Amino acids use the standard Uppercase-lowercase-lowercase format; nucleotide names are displayed as they are in the file (e.g. G, DG, 5MC, PSU...). For nonstandard residues, a tooltip will be displayed with the full name of the residue or heterogen group (as defined in the HETNAM record).
- If the residue has coordinates, clicking on its code in the sequence highlights all atoms of that residue in the 3D model and also displays its identification inside the Jmol panel.
- Clicking on the residue code while holding any of the Shift, Ctrl, Alt keys focuses the 3D model on that residue, by zooming in on it and making it the center of rotation. (You can Shift+doubleClick on the background of the Jmol panel to reset zoom, orientation and center).
- When there is a physical gap in the 3D model, mousing over the code of the residues involved produces a tooltip, and clicking produces an alert box, both explaining that the residue lacks coordinates.
Search for sequence patterns
- Searches are invoked from a dedicated text slot, and can be applied to a single chosen chain or to all chains at once. Matches are highlighted in the sequence listing as bold, overlined and underlined, and also highlighted in the 3D model.
- Entering a standard residue name in the search slot (as one-letter code) highlights all the locations of that residue in the sequence listing and also in the 3D model.
- Entering a sequence fragment highlights any matches in the sequence listing and also in the 3D model. (It may fail if gaps or microheterogeneity are involved).
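A plain substring search is enough to illustrate both kinds of query (a single residue letter or a fragment). The sketch below assumes each chain is available as a plain string of one-letter codes; `findMatches`, `searchChains` and the example chains are hypothetical, and the handling of gaps and microheterogeneity is deliberately left out.

```javascript
// Return the start/end indices (0-based, inclusive) of every occurrence
// of `pattern` in `sequence`, including overlapping occurrences.
function findMatches(sequence, pattern) {
  const hits = [];
  if (!pattern) return hits;
  let from = sequence.indexOf(pattern);
  while (from !== -1) {
    hits.push({ start: from, end: from + pattern.length - 1 });
    from = sequence.indexOf(pattern, from + 1);
  }
  return hits;
}

// Search one chosen chain, or all chains when no chain id is given.
function searchChains(chains, pattern, chainId = null) {
  const ids = chainId === null ? Object.keys(chains) : [chainId];
  const result = {};
  for (const id of ids) {
    result[id] = findMatches(chains[id] || "", pattern);
  }
  return result;
}

const exampleChains = { A: "GIVEQCCTSICSLYQLENYCN", B: "FVNQHLCGSHLVE" };
console.log(searchChains(exampleChains, "C"));
console.log(searchChains(exampleChains, "HL", "B"));
```

The returned index ranges are what a caller would then translate into highlighted letters in the listing and highlighted residues in the 3D model.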
3D model
- Clicking any atom in the 3D view highlights its residue in the sequence listing, applying a yellow background to its code, and also displays its identity in the information slot. (Clicking that code in the sequence will then highlight all the atoms of that residue in the 3D model).
- Highlights in the 3D model, invoked from user interaction with the sequence listing or from pattern search, use cyan halos around the atoms and also display the residue id at the top of the Jmol panel. (Other kinds of highlight could easily be programmed instead).
Other
- Residues with alternate sidechain conformations (rotamers; multiple sets of coordinates for sidechain atoms) can be highlighted in the 3D model using a link.
Design details
Combination of sequences from SEQRES and ATOM
In pdb-formatted files, SEQRES records contain the sequence of the macromolecule chains, as reported by the authors. ATOM records, on the other hand, provide the coordinates of each atom together with the residue name and number, so they effectively provide a sequence of the residues in each chain.
Although the PDB specification says that both sets of records should match, it is common that they do not. This may be due to residues whose coordinates have not been resolved in the X-ray or NMR experiment (crystallographic disorder, physical gap) or to alternate residues found at the same sequence position (sequence microheterogeneity).
JST computes a combination sequence from both sources of information, and highlights the discrepancies using text formatting.
CAVEAT: the algorithm used by JST works reasonably well, but it is not fully reliable, particularly around gaps.
More research into it would be worthwhile. On the other hand, since we are not aligning two related proteins (as in a generic alignment task) but two sequences of the same protein, failures are unlikely to occur.
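As a rough illustration of the combination step, the sketch below merges two already aligned sequences into one listing. It is not JST's code: the function name, the use of "-" as the gap character and the equal-length input strings are assumptions. It also does not group alternate residues at the same position inside a single pair of brackets (as in [SL]); each residue found only in ATOM gets its own brackets.

```javascript
// Merge two aligned one-letter sequences (same length, "-" marks a gap).
//  - residue only in SEQRES (no coordinates): lowercase letter
//  - residue only in ATOM/HETATM:             letter in square brackets
//  - residue in both:                         letter as given in ATOM
function mergeAligned(seqres, atom) {
  if (seqres.length !== atom.length) {
    throw new Error("aligned sequences must have the same length");
  }
  let out = "";
  for (let k = 0; k < seqres.length; k++) {
    if (atom[k] === "-") {
      out += seqres[k].toLowerCase();
    } else if (seqres[k] === "-") {
      out += "[" + atom[k] + "]";
    } else {
      out += atom[k];
    }
  }
  return out;
}

console.log(mergeAligned("MKVLA", "MK-LA"));
console.log(mergeAligned("MK-LA", "MKxLA"));
```

The first call models a physical gap (Val has no coordinates) and the second a hetero group present only in the coordinate records.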
A few refs. for future or prospective work:
- “Note that the process of making an alignment between SEQRES and ATOM is actually quite complicated.
You might want to look at the code in pdb2cif for quite robust code to do this.”
(pdb2cif is a Perl script to filter a PDB entry and produce a CIF file. From its source, it's not immediately obvious what the alignment algorithm is.)
- Check out the JenaLib “alignment view” tool.
Alignment algorithm
The alignment is a rather simplistic implementation based on Needleman-Wunsch techniques.
It was written following these guidelines (consulted in May 2009):
- “Dynamic Programming” - Eric C. Rouchka, BL5495 Course in Computational Molecular Biology, Washington University in St. Louis, USA.
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
- “Sequence Alignment” - Serafim Batzoglou, CS262 Computational Genomics, Lecture 3, Win06, Stanford Artificial Intelligence Laboratory, USA.
http://ai.stanford.edu/~serafim/CS262_2006/Slides/CS262_2006_Lecture3.ppt
- “Time speedup, General gap penalty function” - Saad Mneimneh, Computational Biology, Lecture 5, Hunter College, City University of New York, USA.
http://www.cs.hunter.cuny.edu/~saad/courses/compbio/lectures/lecture5.pdf
A simple scoring scheme is assumed, using three values:
- S(i,j) = match score, if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2;
- S(i,j) = mismatch score, otherwise;
- w = gap penalty.
First attempt: match, mismatch and gap values of 1, 0, 0. This sometimes produces mismatches.
Second attempt: values of 1, -1, 0. This avoids mismatches, but is a bit slow for long sequences. (Later optimization of the code reduced the slowdown.)
Unlike a generic alignment method, our problem prefers gaps over mismatches: it makes little sense for SEQRES to indicate Val and ATOM to indicate Leu at the same position, for example, while in a generic alignment of two (related) proteins a mismatch would be more expected than a gap.
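The scoring scheme above can be tried out with a textbook Needleman-Wunsch global alignment. The sketch below is a generic version written for this help page, not JST's optimized code; the function name, the returned object and the "-" gap character are assumptions. The default values are the second set quoted above (match 1, mismatch -1, gap 0).

```javascript
// Textbook Needleman-Wunsch global alignment of two one-letter strings.
function alignSequences(seqres, atom, match = 1, mismatch = -1, gap = 0) {
  const n = seqres.length;
  const m = atom.length;
  const sub = (i, j) => (seqres[i - 1] === atom[j - 1] ? match : mismatch);

  // score[i][j] = best score aligning the first i residues of seqres
  // with the first j residues of atom.
  const score = [];
  for (let i = 0; i <= n; i++) {
    score.push(new Array(m + 1).fill(0));
    score[i][0] = i * gap;
  }
  for (let j = 0; j <= m; j++) score[0][j] = j * gap;

  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      score[i][j] = Math.max(
        score[i - 1][j - 1] + sub(i, j), // align the two residues
        score[i - 1][j] + gap,           // gap in atom
        score[i][j - 1] + gap            // gap in seqres
      );
    }
  }

  // Traceback from the bottom-right corner, trying matches first,
  // then gaps, and a mismatch column only as the last resort.
  let a = "";
  let b = "";
  let i = n;
  let j = m;
  while (i > 0 || j > 0) {
    const isMatch = i > 0 && j > 0 && seqres[i - 1] === atom[j - 1];
    if (isMatch && score[i][j] === score[i - 1][j - 1] + match) {
      a = seqres[--i] + a;
      b = atom[--j] + b;
    } else if (i > 0 && score[i][j] === score[i - 1][j] + gap) {
      a = seqres[--i] + a;
      b = "-" + b;
    } else if (j > 0 && score[i][j] === score[i][j - 1] + gap) {
      a = "-" + a;
      b = atom[--j] + b;
    } else {
      a = seqres[--i] + a;
      b = atom[--j] + b;
    }
  }
  return { seqres: a, atom: b, score: score[n][m] };
}

console.log(alignSequences("MKVLA", "MKLA"));
console.log(alignSequences("MKS", "MKL"));
```

With these values a Ser/Leu disagreement comes out as two gap columns rather than one mismatch column, which is the behavior preferred for the SEQRES versus ATOM comparison.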
Extraction of sequence from SEQRES
The SEQRES fields are read from the header section in the pdb file (obtained using Jmol built-in capabilities) and parsed with JavaScript to compile the sequence of each chain.
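Parsing the SEQRES text might look like the sketch below. It relies on the fixed columns of the PDB format (chain identifier in column 12, residue names from column 20 onwards) and assumes the header has already been obtained as one string; how JST actually gets it from Jmol is not shown. `parseSeqres` is a hypothetical name.

```javascript
// Collect the residue names of each chain from SEQRES records.
// Returns an object such as { A: ["GLY", "ILE", ...], B: [...] }.
function parseSeqres(headerText) {
  const chains = {};
  for (const line of headerText.split(/\r?\n/)) {
    if (!line.startsWith("SEQRES")) continue;
    const chainId = line.charAt(11); // column 12
    const names = line
      .substring(19, 70)             // columns 20-70
      .trim()
      .split(/\s+/)
      .filter(Boolean);
    if (!chains[chainId]) chains[chainId] = [];
    chains[chainId].push(...names);
  }
  return chains;
}

const header = [
  "SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU",
  "SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN",
  "SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU",
].join("\n");
console.log(parseSeqres(header));
```

The three-letter names collected here would still need converting to the one-letter codes used in the listing.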
Extraction of sequence from ATOM
Rather than parsing the text content of the pdb file, the sequence is compiled from the same information used by Jmol in building the 3D model (that is, Jmol's internal representation of the data in the loaded file).
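Once per-atom information is available, compiling a sequence amounts to keeping one entry per residue in each chain. The sketch below assumes the atom data has already been exported into plain objects with `chain`, `resno`, `insCode` and `group` fields; that shape and the function name are hypothetical, and no Jmol call is shown.

```javascript
// Build, for each chain, the ordered list of residues seen in the atoms.
// Two different groups at the same position (microheterogeneity) are
// kept as separate entries because the group name is part of the key.
function sequenceFromAtoms(atoms) {
  const chains = {};
  const seen = new Set();
  for (const atom of atoms) {
    const insCode = atom.insCode || "";
    const key = [atom.chain, atom.resno, insCode, atom.group].join("|");
    if (seen.has(key)) continue;
    seen.add(key);
    if (!chains[atom.chain]) chains[atom.chain] = [];
    chains[atom.chain].push({ group: atom.group, resno: atom.resno, insCode });
  }
  return chains;
}

const exampleAtoms = [
  { chain: "A", resno: 1, group: "GLY" },
  { chain: "A", resno: 1, group: "GLY" },
  { chain: "A", resno: 2, group: "SER" },
  { chain: "A", resno: 2, group: "LEU" },
  { chain: "A", resno: 2, insCode: "A", group: "VAL" },
  { chain: "B", resno: 1, group: "PHE" },
];
console.log(sequenceFromAtoms(exampleAtoms));
```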