Shortest Path and Molecular Hashed Fingerprints
Shortest Path (SP) has been used in many aspects of graph traversing. The idea is to minimise the cost (number of edges to be traversed or the cost on the edge) of traveling between a source and destination. This is one of the most optimal ways of finding a path in the graph where you can generate a combination of paths using random walk.
Interestingly, generation of path based hashed fingerprint is very common in the area of chemo-informatics. The basic idea is to find all paths of a certain length from the source atom (fragments) in a molecule and convert it into a hashed fingerprint. This works very well with smaller or sparsed graphs, although in a few cases the run time may increase exponentially with the size of the graph (connectivity). The Chemistry Development Kit (CDK) has one such effective path based hashed fingerprint generator (Fingerprinter.java). This module in the CDK has generated a lot of interest from the user community. Recently, Nina Nikolova – Jeliazkova posted an interesting set of molecules were the path search in the fingerprint was hit by combinatorial explosion!
Note: The behaviour of the path finding algorithm is compromised once the depth of the path search is more than 6 (the recommended depth is 8). Hence for these set of molecules one may not be able to find the fingerprints.
Here are the exemplar molecules where the present CDK hashed fingerprint is subdued.
I have modified the existing CDK fingerprinter to report the shortest path rather than all paths. This overcomes the problem of combinatory explosion and runtime is no longer exponential as compared to previous case.
Here is the runtime and the density of the FP (number of bit occupied) as calculated by the SP based FP. One can deduce the runtime and density by half if the FP is only based either on weighted or unweighted bonds. In the modified SP based FP, I have used both weighted and unweighted bonds to give better consensus FP (more in my next blog!).
Here is the code.
a) Presently the fingerprint accounts for only one shortest path between a source and the sink atom (discriminates between aromatic, ring and aliphatic paths). Hence, I had to canonicalize the atoms in the graph container such that if two molecules are similar then the returned SP path is same. A natural extension would be to report k-shortest path but this maybe as good as CDK default fingerprinter (in terms of the runtime).
b) For spare graph and smaller graphs it might be as fast as the previous implementation, and it will perform better on complex graphs.
Edited: 6th August 2012
Here is a test case based on the ring systems (aromatic and non aromatic) and aliphatic molecules.
Updated: 29th Aug 2012
Thanks to Egon for his suggestions to use SP2 hybridization instead of aromaticity checker. In my case I have to use CDK aromaticity detection as SP2 concept may not work.
I have clustered 11 molecules based on their fingerprint similarity scores using the
a) CDK default finprinter,
b) SP based Fingerprinter and
c) CDK Hybridization Fingerprinter
The clustered results are as shown below.
The Hybridization based fingerprinter is the fastest one (in the non-complex cases), followed by the SP fingerprinter and improved CDK fingerprinter. In terms of the sensitivity and specificity, SP fingerprinter is the best and in complex cases its by far the fast one!
I will leave it to the readers to choose their favorite fingerprinter.
Kindly leave your comments and suggestions!