Help for JST

Brief instructions

Load a structure (protein, nucleic acid or a complex of both) either from a local disk or online from the PDB database. You can also load one of the examples provided.
Once the model has finished loading into the Jmol pane, click on the 'prepare' button to have the sequence analysed.
Seq→3D: Hover the mouse pointer over the sequence to see the residues identified. Click on a residue to see it highlighted in the 3D model. Shift+click to focus on the residue in the 3D model.
3D→Seq: If enabled, click on an atom in the 3D model to have it located in the sequence.
Find: To search for a single residue or a sequence of residues in the sequence listing, type the sequence, choose the chain to search on (or none to search on all chains), and click on the button to run the search.

Sequence listings

The displayed sequence is the result of combining information from the ATOM/HETATM fields and the SEQRES fields in the pdb file.
The sequence of each chain is listed separately. Hetero groups are included in their chain, if this is assigned, or else go in a separate listing (labeled '()' since they have no chain id).

Letters

Residues are listed in uppercase one-letter code. Nucleotides are converted to single-letter even if deoxy (avoiding the DG etc. used by PDBv3).
Non-standard residues are listed as lowercase 'x'.
Inserted residues are included in the listing, as superscript one-letter codes.
When there is sequence microheterogeneity (residues in ATOM records that are absent in SEQRES), the alternate residues at the same sequence position will be enclosed in square brackets, e.g. [SL].
Water and solvent are omitted by default (but may optionally be included in the listing). When displayed, water groups will have a pale cyan background in the sequence listing.
Hetero groups are displayed as [x] (since they are both non-standard residues and present in ATOM/HETATM but not in SEQRES).
When there is a physical gap in the 3D model (that is, there are residues in SEQRES absent in ATOM, typically due to crystallographic disorder), the residues with no coordinates are listed in lowercase.
When there is a numbering gap (due to numbering according to a reference sequence) but no residues are missing in the 3D structure (SEQRES and ATOM records match), the position of the gap is indicated by two tildes surrounding a number indicating the size of the gap, e.g. ~1~, ~15~.

                                      data in SEQRES
                                      yes          no
data in ATOM or HETATM  yes, std      uppercase    [ ]
                        yes, non-std  x            [x]
                        yes, hetero                [x]
                        no            lowercase
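
The table above can be read as a small decision function. The following is only an illustrative sketch, not JST's actual code; the function name displayCode and the fields letter, inSeqres, inAtom and standard are hypothetical. It handles one residue at a time, so grouping several alternates into one bracket (e.g. [SL]) is left out.

```javascript
// Hypothetical helper: choose the text shown in the sequence listing for a
// single residue, following the SEQRES / ATOM table above.
function displayCode(residue) {
  const { letter, inSeqres, inAtom, standard } = residue;
  // Non-standard residues and hetero groups are shown as 'x'.
  const base = standard ? letter.toUpperCase() : "x";
  if (inSeqres && inAtom) {
    return base; // present in both sources
  }
  if (inSeqres) {
    return base.toLowerCase(); // SEQRES only: no coordinates (physical gap)
  }
  if (inAtom) {
    return "[" + base + "]"; // ATOM/HETATM only: bracketed
  }
  return ""; // in neither source: nothing to show
}
```

For instance, a serine present in both record types gives "S", the same serine without coordinates gives "s", and a hetero group found only in HETATM gives "[x]".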

Numbering

Sequence numbers are taken from the ATOM records.
When there is a physical gap in the 3D model (that is, there are residues in SEQRES absent in ATOM, typically due to crystallographic disorder), the residues with no coordinates receive numbers increasing by 1 from the previous residue (this may lead both to several residues having the same number and to numbering gaps).
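
A minimal sketch of this numbering rule follows. It assumes the combined sequence is held as an array of objects whose number field is the ATOM residue number, or null when the residue has no coordinates. The name numberResidues and this data layout are hypothetical, and the choice to leave leading residues unnumbered when nothing precedes them is an assumption, not documented behaviour.

```javascript
// Hypothetical helper: give residues without coordinates a number that
// increases by 1 from the previous residue in the listing.
function numberResidues(residues) {
  let previous = null;
  return residues.map((residue) => {
    let number = residue.number;
    if (number === null && previous !== null) {
      number = previous + 1; // missing residue: continue from the previous one
    }
    if (number !== null) {
      previous = number;
    }
    return { ...residue, number };
  });
}

// Example: two residues are missing between ATOM residues 10 and 11.
const numbered = numberResidues([
  { code: "A", number: 10 },
  { code: "g", number: null },
  { code: "s", number: null },
  { code: "K", number: 11 },
  { code: "L", number: 15 },
]);
console.log(numbered.map((r) => r.number).join(" ")); // 10 11 12 11 15
```

The example shows both side effects mentioned above: the number 11 appears twice, and there is a jump from 11 to 15.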

Interaction with the sequence

Mousing over the code for a residue in the sequence listing (be it standard or not) displays, in a slot, its chain name, residue number, insertion code if applicable (in lowercase) and the residue name. Amino acids use the standard Uppercase-lowercase-lowercase format; nucleotide names are displayed as they are in the file (e.g. G, DG, 5MC, PSU...). For nonstandard residues, a tooltip will be displayed with the full name of the residue or heterogen group (as defined in the HETNAM record).

If the residue has coordinates, clicking on its code in the sequence highlights all atoms of that residue in the 3D model and also displays its identification inside the Jmol panel.
Clicking on the residue code while holding any of the Shift, Ctrl, Alt keys focuses the 3D model on that residue, by zooming in on it and making it the center of rotation. (You can Shift+doubleClick on the background of the Jmol panel to reset zoom, orientation and center).
When there is a physical gap in the 3D model, mousing over the code of the residues involved produces a tooltip, and clicking produces an alert box, both explaining that the residue lacks coordinates.

Search for sequence patterns

Searches are invoked from a dedicated text slot, and can be applied to a single chosen chain or to all chains at once. Matches are highlighted in the sequence listing as bold, overlined and underlined, and also highlighted in the 3D model.
Entering a standard residue name in the search slot (as one-letter code) highlights all the locations of that residue in the sequence listing and also in the 3D model.
Entering a sequence fragment highlights any matches in the sequence listing and also in the 3D model. (It may fail if gaps or microheterogeneity are involved).
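
As an illustration only, a search of this kind could be a plain substring scan over each chain. The sketch below is not JST's code: the function name findMatches is hypothetical, and it assumes each chain is available as a plain uppercase one-letter string, keyed by chain id. A scan like this would miss fragments that span lowercase (no coordinates) or bracketed positions, which is one plausible reason for the caveat above.

```javascript
// Hypothetical helper: find every occurrence of a residue or fragment.
// `chains` maps chain id -> one-letter sequence; `chainId` is optional
// (when omitted, all chains are searched). Positions are 0-based.
function findMatches(chains, query, chainId) {
  const pattern = query.toUpperCase();
  const matches = [];
  if (pattern.length === 0) {
    return matches;
  }
  for (const [id, sequence] of Object.entries(chains)) {
    if (chainId && id !== chainId) {
      continue; // search restricted to another chain
    }
    let start = sequence.indexOf(pattern);
    while (start !== -1) {
      matches.push({ chain: id, start, end: start + pattern.length - 1 });
      start = sequence.indexOf(pattern, start + 1); // overlapping hits allowed
    }
  }
  return matches;
}

const exampleChains = { A: "MKVLAAGKV", B: "GGKVG" };
console.log(findMatches(exampleChains, "KV").length); // 3
console.log(findMatches(exampleChains, "kv", "B"));
```

Each match carries the chain and the first and last position, which is enough to drive highlighting in both the listing and the 3D model.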

3D model

Clicking any atom in the 3D view highlights its residue in the sequence listing, applying a yellow background to its code, and also displays its identity in the information slot. (Clicking then on the sequence will highlight all the atoms of that residue in the 3D model).
Highlights in the 3D model, invoked from user interaction with the sequence listing or from pattern search, are done using cyan halos around the atoms, and also displaying the residue id at the top of the Jmol panel. (Other kinds of highlight could easily be programmed instead).

Other

Residues with alternate sidechain conformations (rotamers; multiple sets of coordinates for sidechain atoms) can be highlighted in the 3D model using a link.

Browser compatibility

JST makes heavy use of JavaScript, the DOM (document object model) and CSS styling. As a consequence, some old browsers do not work well with this tool.

Design details

Combination of sequences from SEQRES and ATOM

In pdb-formatted files, SEQRES records contain the sequence of the macromolecule chains, as reported by the authors. ATOM records, on the other hand, provide the coordinates of each atom together with the residue name and number, thus effectively providing a sequence of the residues in the chains.

Although the PDB specification says that both sets of records should match, it is common that they do not. This may be due to residues whose coordinates have not been resolved in the X-ray or NMR experiment (crystallographic disorder, physical gap) or to alternate residues found at the same sequence position (sequence microheterogeneity).

JST computes a combination sequence from both sources of information, and highlights the discrepancies using text formatting.

CAVEAT: the algorithm used by JST works reasonably well, but it is not completely reliable, particularly on gaps.
More research into it would be worthwhile. On the other hand, since we don't need to align two related proteins (as in a generic alignment task), but two sequences for the same one, failures are unlikely to occur.

A few refs. for future or prospective work:

“Note that the process of making an alignment between SEQRES and ATOM is actually quite complicated. You might want to look at the code in pdb2cif for quite robust code to do this.”
(pdb2cif is a Perl script to filter a PDB entry and produce a CIF file. From its source, it's not immediately obvious what the alignment algorithm is.)
Check out the JenaLib “alignment view” tool.

Alignment algorithm

JST uses a rather simplistic implementation of alignment based on Needleman/Wunsch techniques.

Created according to guidelines (as of May 2009) in:

“Dynamic Programming” - Eric C. Rouchka, BL5495 Course in Computational Molecular Biology, Washington University in St. Louis, USA.
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
“Sequence Alignment” - Serafim Batzoglou, CS262 Computational Genomics, Lecture 3, Win06, Stanford Artificial Intelligence Laboratory, USA.
http://ai.stanford.edu/~serafim/CS262_2006/Slides/CS262_2006_Lecture3.ppt
“Time speedup, General gap penalty function” - Saad Mneimneh, Computational Biology, Lecture 5, Hunter College, City University of New York, USA.
http://www.cs.hunter.cuny.edu/~saad/courses/compbio/lectures/lecture5.pdf

A simple scoring scheme is assumed, using three values:

S(i,j) = match score, if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2;
S(i,j) = mismatch score, otherwise;
w = gap penalty.

First attempt: values (match, mismatch, gap) were 1, 0, 0; this sometimes produces mismatches.
Second attempt: 1, -1, 0; this avoids mismatches but is a bit slow for long sequences. (Later optimization of the code reduced this.)

Particular to our problem, in contrast to a generic alignment method, is that we prefer gaps over mismatches: it makes little sense that SEQRES indicates Val and ATOM indicates Leu, for example, while in a generic alignment of two (related) proteins, that would be more expected than a gap.
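
For readers who want to see the idea in code, here is a textbook Needleman/Wunsch global alignment with the scoring values discussed above as defaults (match 1, mismatch -1, gap 0). It is a sketch under those assumptions and not JST's implementation; the function name align, the use of plain strings and the '-' gap character are all hypothetical choices.

```javascript
// Hypothetical sketch: global alignment of two one-letter sequences.
function align(seqA, seqB, match = 1, mismatch = -1, gap = 0) {
  const n = seqA.length;
  const m = seqB.length;
  // score[i][j]: best score for the first i residues of seqA
  // aligned with the first j residues of seqB.
  const score = [];
  for (let i = 0; i <= n; i++) {
    score.push(new Array(m + 1).fill(0));
    score[i][0] = i * gap;
  }
  for (let j = 0; j <= m; j++) {
    score[0][j] = j * gap;
  }
  const pair = (i, j) => (seqA[i - 1] === seqB[j - 1] ? match : mismatch);
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      score[i][j] = Math.max(
        score[i - 1][j - 1] + pair(i, j), // align the two residues
        score[i - 1][j] + gap,            // gap in seqB
        score[i][j - 1] + gap             // gap in seqA
      );
    }
  }
  // Trace back from the bottom-right corner to rebuild the alignment.
  let a = "";
  let b = "";
  let i = n;
  let j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && score[i][j] === score[i - 1][j - 1] + pair(i, j)) {
      a = seqA[i - 1] + a; b = seqB[j - 1] + b; i--; j--;
    } else if (i > 0 && score[i][j] === score[i - 1][j] + gap) {
      a = seqA[i - 1] + a; b = "-" + b; i--;
    } else {
      a = "-" + a; b = seqB[j - 1] + b; j--;
    }
  }
  return { a, b, score: score[n][m] };
}

// SEQRES-like sequence vs. ATOM-like sequence lacking two residues.
console.log(align("MKVLAAG", "MKAAG")); // a: "MKVLAAG", b: "MK--AAG", score: 5
```

In this sketch, calling align("AVG", "ALG", 1, 0, 0) pairs V with L, whereas the default values place each of them opposite a gap, which mirrors the difference between the first and second attempts described above.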

Extraction of sequence from SEQRES

The SEQRES fields are read from the header section in the pdb file (obtained using Jmol built-in capabilities) and parsed with JavaScript to compile the sequence of each chain.
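
The parsing step could look roughly like the sketch below. It assumes the header is available as plain text and relies on the fixed-column SEQRES layout of the PDB format (chain id in column 12, residue names from column 20 onwards). The function name parseSeqres and the abbreviated ONE_LETTER table are hypothetical, and how the header text is obtained from Jmol is not shown.

```javascript
// Hypothetical, abbreviated lookup table: three-letter (or nucleotide)
// residue name -> one-letter code. Anything else is shown as 'x'.
const ONE_LETTER = {
  ALA: "A", ARG: "R", ASN: "N", ASP: "D", CYS: "C", GLN: "Q", GLU: "E",
  GLY: "G", HIS: "H", ILE: "I", LEU: "L", LYS: "K", MET: "M", PHE: "F",
  PRO: "P", SER: "S", THR: "T", TRP: "W", TYR: "Y", VAL: "V",
  A: "A", C: "C", G: "G", U: "U", DA: "A", DC: "C", DG: "G", DT: "T",
};

// Compile the sequence of each chain from the SEQRES records in `pdbText`.
function parseSeqres(pdbText) {
  const chains = {};
  for (const line of pdbText.split(/\r?\n/)) {
    if (!line.startsWith("SEQRES")) {
      continue;
    }
    const chainId = line.charAt(11); // column 12
    const names = line.substring(19).trim().split(/\s+/).filter(Boolean);
    const letters = names.map((name) => ONE_LETTER[name] || "x");
    chains[chainId] = (chains[chainId] || "") + letters.join("");
  }
  return chains;
}

const sampleHeader = [
  "HEADER    EXAMPLE",
  "SEQRES   1 A    5  GLY ILE MSE GLU GLN",
  "SEQRES   1 B    3   DG  DC   U",
].join("\n");
console.log(parseSeqres(sampleHeader)); // { A: 'GIxEQ', B: 'GCU' }
```

Records for the same chain that continue over several SEQRES lines are simply concatenated in file order.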

Extraction of sequence from ATOM

Rather than parsing the text content of the pdb file, the sequence is compiled from the same information used by Jmol in building the 3D model (that is, Jmol's internal representation of the data in the loaded file).