# Clustering with Daura et al.

In data science and machine learning, one of the ways of performing unsupervised learning is through clustering. This means that, given a set of data, we are devising a way to separate its elements into distinct pools, where stuff within each pool is typically more similar than with stuff outside the pool. There are many, many different ways of clustering, and not all clustering algorithms are created equal.

Let's focus our problem here on a specific set of data: proteins. If we have a trajectory of protein structures, or an ensemble of protein structures from, say, replica-exchange simulations, can we pick out which structures are similar? Ultimately, we want an algorithm that is able to meaningfully separate protein structures, so each pool of structures exhibits unique properties from the rest. We also want an algorithm that gives us an idea of the center of our pool. The center is close to the average, meaning the protein structure is most descriptive of that pool. This last point can be taken two ways. One is that the center is a real structure that exists in the trajectory. Another is that center is not guaranteed to be real but rather some fictitious average. It depends on the algorithm for what we'll get.

The algorithm we propose to look at here is the one from Daura et al. This is a relatively straightforward algorithm that allows us to establish distinct clusters (our pools of structures) with a center that is an extant structure. Perhaps in argument for this algorithm over others, it is implemented almost ubiquitously over molecular dynamics analysis packages. For instance, VMD and Gromacs have versions of this algorithm. As a boon, this algorithm is also deterministic, so every time you run it you should get the same results.

The piece of the puzzle that has gone unmentioned is that we need a reliable way to tell how far apart two protein structures are; we need some measure of distance. For proteins, the simplest measure of distance to imagine is the fitted root-mean-square deviation (RMSD). We take every pair of protein snapshots from our trajectory, fit them atop each other, and then compute the RMSD of Cartesian coordinates. (Typically, this type of analysis is restricted to only Cα atoms.) Once we have the distance computed between all pairs of proteins, we're ready to go!

But how does the algorithm work? How do we determine which structures are in one pool or another? The Daura et al. algorithm requires us to specify a distance cutoff. This is our measure for which structures might (and only might) be connected. Here are our steps:

1. We start off by presuming that each structure in set of N structures might be a cluster center.
2. For each center, we count the number of protein structures within the cutoff to that center.
3. We select the center that has the largest number of connections and designate this as a cluster. If there are M structures in this cluster, we now remove them from our set of structures available to form new clusters. In other words, N=N-M.
4. We go back and repeat steps 1, 2, and 3 until all structures have been assigned a cluster (or we start generating clusters of size 1, meaning there are no connections).

Let's think about some implications from this algorithm. We had said that structures in a cluster only might be connected. This is because the algorithm only really specifies that each structure in the cluster is connected to the cluster center. Structures in a cluster that are not the center are not guaranteed to be within the RMSD cutoff from one another. The effect can be quite dramatic; structures in a cluster's periphery might be connected to structures in another cluster.

Generally, this is where things start to become murky with the algorithm from Daura et al. It's simple to understand and easy to implement, but the selection of the RMSD cutoff is arbitrary. We need to play around with several values to understand how changing the cutoff affects the distribution of structures in clusters.

This clustering algorithm has been implemented in LLTK provided by Lockhart Lab. Like most of our code, we try to make it as easy to use as possible. The implementation in VMD, for instance, works by computing the RMSD between all structures and then immediately performing clustering. We must provide the RMSD cutoff upfront. This is computationally expensive if we want to change the cutoff value because we must therefore compute the RMSD between all structures with every cutoff update. For comparison, our algorithm allows us to supply either a group of structures (or coordinates) for clustering as VMD does or a matrix of distance values between all N structures. Separating clustering into these two distinct tasks saves time, especially when we're not sure what the correct cutoff value should be.

The LLKT.analysis.cluster function exists within the LTLK.analysis subpackage. This function takes three arguments: the first is a distance matrix describing the connectivity between N structures, the second is the RMSD cutoff, and the third is an optional argument min_cluster_size that allows us to limit the output. This third argument is useful because we might have a circumstance where we have very sparse connectivity and all structures effectively are their own clusters. Rather than suffer through too much output, we can restrict it so we have a minimum permissible cluster size. By default this value is 1. Let's take a look at how to use this code.

# Import LLTK
import LLTK.analysis
import LLTK.io
import LLTK.structure

# Load in trajectory
if not isinstance(LLTK.structure.StructureSet):
raise TypeError("expecting StructureSet object")

# Cluster Method 1
distance_matrix=trajectory.rmsd(fit=True)
cluster1=LLTK.analysis.cluster(distance_matrix,cutoff=4.,min_cluster_size=1)

# Cluster Method 2
cluster2=trajectory.cluster(cutoff=4.,fit=True,min_cluster_size=1)

As demonstrated above, there's actually two ways to use the LLTK.analysis.cluster(...) method. We can call this method directly and supply a distance matrix, which allows us to perform clustering in two parts. As wrote above, this can be beneficial when you are unsure of the right cutoff and want to play around with the parameters. If you don't want to bother with computing the distance matrix yourself (for instance, by calling the RMSD function), you can also use the cluster function that exists within the StructureSet class. This function is just a wrapper; it still eventually calls the LLTK.analysis.cluster(...) function. However, this second approach is easier to use because it handles some of the calculations for you. Finally, if we use the class function, we can also supply an argument called index, that lists the indices of structures we want to cluster. This would allow us to skip every other structure, or every third structure, whatever.

The clustering code presented above also performs clustering on all atoms within a trajectory PDB file. What if we wanted to use only a subset of those atoms in a trajectory? We need to select those atoms out of a trajectory. The most straightforward way, but perhaps the way that requires the most code, is to handle this manually. We can go through every Structure in our StructureSet, get the coordinates of those Cα atoms, and then use that information to compute the distance matrix. See below.

# Figure out the indices of our Cα atoms
index=[i for i in trajectory.structure.atom if i.atomname == "CA"]

# Get the coordinates for all atoms
coord=[i.get_coord(index) for i in trajectory.structure]

# Finally, cluster according to method 1
distance_matrix=LLTK.analysis.rmsd(coord,fit=True)
cluster=LLTK.analysis.cluster(distance_matrix,cutoff=4.,min_cluster_size=1)

This is the first cluster algorithm implemented in LLTK, but hopefully not the last. There are several other functions I plan on implementing, so stay tuned!

Christopher LockhartComment