Superimposing proteins

Imagine we have executed extensive molecular dynamics or Monte Carlo simulations and have generated a trajectory of protein structures. This trajectory shows the evolution of a protein as its structure fluctuates with time. Or imagine a slightly different scenario, where we have two trajectories and must compare them. Ultimately, we want to compute the root-mean-square deviation (RMSD) between structures to understand their similarities but are unable to do this directly. Why not? Well, we can't do this because, with time, proteins rotate and translate, and these different orientations prohibit easy RMSD analysis.

This problem is similar to the one that pops up on all sorts of cognitive and spatio-awareness tests. Given the figure below, are structures A and B the same? What about C and D?


Hopefully the answer we have all arrived at is that, yes, A and B are the same (after a rotation), but C and D are not (they're mirror images of one another, so, no matter what rotation we do, they won't match up). The same types of observations can be conceptually made for protein or other biomolecular structures. (Well, maybe not the bit about mirror images; proteins prefer a particular side of the looking glass). Before we can analyze the similarity between biomolecules, we must ensure that they are superimposed and oriented in such a way that they overlap to the best of their ability.

The algorithm to perform the appropriate rotation (and translation) to which we are specifically referring is the one by Kabsch. Strictly speaking, this algorithm works for any two collections of x, y, and z coordinates as long as the number of coordinates between those two collections is the same. This means that, if we want, we could apply superimposition to only parts of a protein or between two different proteins entirely. In the text below, we are going to provide a glimpse into this algorithm ignoring many of the leviathon mathematical principles, particularly related to matrices, lurking underneath. By the end, we will have a practical understand of superimposition and understand the steps needed to apply our algorithm to simulation structures.

The first step of the algorithm, which is the easy part, is to translate our two atom collections so they're at the origin. This means that we subtract the geometrical centers of mass from all x, y, and z coordinates for both collections. In mathematical notation, this means, for all i of N coordinates, we compute,


The second step of rotating our atom collections is a somewhat more involved task. The basic idea is that we will compute a matrix that tells us how we must rotate our x, y, and z coordinates to get alignment. This matrix is effectively just a measure of the correlation of all coordinates in our two collections. Let's break this down into a series of smaller tasks. First, we have to determine this correlation of all coordinates between our two collections. This means computing the cross-covariance matrix A. If we call our two sets of coordinates P and Q, we can compute,


Here, PT is the transpose of P. Both P and Q must contain the same dimensions. In three-dimensional space, this means that there will be N-by-3 values. If the P (or Q) typically takes on the shape,


then the transpose PT will be in the form,


It suffices to say that the matrix product of PT and Q where therefore be,


In this equation, the subscripts P and Q are used to indicate to which coordinate sets the values of x, y, and z belong. This matrix A is ultimately very close to what we want. We want a matrix that we can apply to one set of our coordinates to transform them so they're overlayed on top of the other coordinate set. A is close but not exact because it doesn't necessarily apply the handedness of the coordinate system we want. We need to slightly modify A to behave like we want.

This moves us to the second task, which is to take this new matrix A and run singular value decomposition. The technical details of such a calculation are beyond this post. However, the end result of this process is a decomposition of our matrix A into the product of three matrices: U, S, and VT. Related to Eigenvectors, these matrix decompositions are a property of our matrix A. We can take the matrices U and VT directly but need to toy with S so it agrees with a right-handed coordinate system. To do this, we turn S into a 3-by-3 identity matrix with the final element on the bottom right set to the determinant of the matrix product of U and VT. Taken together, this means that we have a new matrix called A' that, when applied to our x, y, and z coordinates, will optimally rotate them. We define A' as,


And this is mostly it. With A', we now have a matrix that tells us how we can rotate one coordinate set on top of another. For instance, if we wanted to rotate P (importantly, the matrix we transposed) on top of Q, we simply perform the matrix multiplication of P by A'.

These steps have been implemented within LLTK distributed by Lockhart Lab. The full source code is visible on Github. There are actually a couple of ways to apply this code. Directly, by using the function, we can perform a fit on two sets of N-by-3 coordinates. We can also apply a fit to Structure objects within LLTK using a wrapper function in the Structure class. We offer several parallel methods of fitting because of the permanence of such a change. The function is applied to NumPy arrays and is more malleable. The Structure class function permanently transforms Structure coordinates. Examples of both methods of application are below.

# Import LLTK
import LLTK.transform

# Load in structures"my_protein1.pdb")"my_protein2.pdb")

# Obtain structure coordinates

# Option 1: Fit coord2 to coord1
# This only applies to coord2 and does not permanently affect structure2,coord2)

# Option 2: Fit structure2 to structure1
# This permanently changes structure2 coordinates

Now that we have superimposed our two sets of coordinates, we can effectively compute RMSD and measure the similarity of our structures. This is important to understand the evolution of our trajectory and also the relatedness between independent trajectories. Looking down the road, we can also use RMSD calculations as a driving force for the clustering of structures. But we'll discuss this another day.