### Shortest Path and Molecular Hashed Fingerprints

#### by chembioinfo

Shortest Path (SP) has been used in many aspects of graph traversing. The idea is to minimise the cost (number of edges to be traversed or the cost on the edge) of traveling between a source and destination. This is one of the most optimal ways of finding a path in the graph where you can generate a combination of paths using random walk.

Interestingly, generation of path based hashed fingerprint is very common in the area of chemo-informatics. The basic idea is to find all paths of a certain length from the source atom (fragments) in a molecule and convert it into a hashed fingerprint. This works very well with smaller or sparsed graphs, although in a few cases the run time may increase exponentially with the size of the graph (connectivity). The Chemistry Development Kit (CDK) has one such effective path based hashed fingerprint generator (Fingerprinter.java). This module in the CDK has generated a lot of interest from the user community. Recently, Nina Nikolova – Jeliazkova posted an interesting set of molecules were the path search in the fingerprint was hit by combinatorial explosion!

Note: The behaviour of the path finding algorithm is compromised once the depth of the path search is more than 6 (the recommended depth is 8). Hence for these set of molecules one may not be able to find the fingerprints.

Here are the exemplar molecules where the present CDK hashed fingerprint is subdued.

git://gist.github.com/3163031.git

I have modified the existing CDK fingerprinter to report the **shortest path rather than all paths**. This overcomes the problem of combinatory explosion and runtime is no longer exponential as compared to previous case.

Here is the runtime and the density of the FP (number of bit occupied) as calculated by the SP based FP. One can deduce the runtime and density by half if the FP is only based either on weighted or unweighted bonds. In the modified SP based FP, I have used both weighted and unweighted bonds to give better consensus FP (more in my next blog!).

Here is the code.

https://github.com/asad/ShortestPathMoleculeFingerprinter

Challenges:

a) Presently the fingerprint accounts for only one shortest path between a source and the sink atom (discriminates between aromatic, ring and aliphatic paths). Hence, I had to canonicalize the atoms in the graph container such that if two molecules are similar then the returned SP path is same. A natural extension would be to report k-shortest path but this maybe as good as CDK default fingerprinter (in terms of the runtime).

b) For spare graph and smaller graphs it might be as fast as the previous implementation, and it will perform better on complex graphs.

c) My present implementation is an extension of my previous work on the CDK fingerprinter where rings search and other optimizations has been done.

**Edited: 6th August 2012**

Here is a test case based on the ring systems (aromatic and non aromatic) and aliphatic molecules.

**Updated: 29th Aug 2012**

Thanks to Egon for his suggestions to use SP2 hybridization instead of aromaticity checker. In my case I have to use CDK aromaticity detection as SP2 concept may not work.

I have clustered 11 molecules based on their fingerprint similarity scores using the

a) CDK default finprinter,

b) SP based Fingerprinter and

c) CDK Hybridization Fingerprinter

The clustered results are as shown below.

The Hybridization based fingerprinter is the fastest one (in the non-complex cases), followed by the SP fingerprinter and improved CDK fingerprinter. In terms of the sensitivity and specificity, SP fingerprinter is the best and in complex cases its by far the fast one!

I will leave it to the readers to choose their favorite fingerprinter.

Kindly leave your comments and suggestions!

Thanks a lot, Asad. Very interesting indeed.

A few comments:

1. You might want to mention in your post which of the 14 Fingerprinter classes in the CDk might be affected.

2. The post would be truly helpful if you generated pictures. There is no clear indication what, if any, the bug in the FingerPrinter is and looking at the implementation, I cannot see why there should be a combinatorial explosion. I have not been able so far to depict the molecules that Nina gave. As John May points out in the reply to the bugs, they are “special cases”, like carbons with 14 bonds, and I would love to see a table with those molecules before making any judgement.

3. The CDK fingerprinter, in my opinion, does a regular breadth-first traversal, starting from every atom, to walk the graph up to depth 8 (or a given depth). After the given depth, it stops. So it does not compute “all pathes”, but only the given subset. When I studied chemistry, the highest coordination number known was 9 in rhenium hydrides :-) Now we seem to be up to 12 (http://en.wikipedia.org/wiki/Coordination_number). That gives us an upper bound for the worst case scenario. In any case, 99% of the relevant molecules are below maximum node degree 7, with 99.99% at node degree 4 (just my estimate – would love to be corrected :-)). Assuming a graph with n nodes where all nodes are of maximum degree X < 12, the worst case cost for computing all graphs a traversal until depth Y (<=8) is O((12^8)n), so we have a linear time algorithm with regard to the number of atoms. In most cases, this should be O((4^8)n). Hope that is not complete rubbish :-)

4. I guess you've already realised that the canonicalization needs to go since it adds a step of higher complexity than the actual graph traversal up to depth X.

5. The value of path-based fingerprints as given by the daylight algorithm is that they implicitely encode the substructures in a molecules through exhaustive generation of pathes up lenght X. Do we loose anything if we only use the shortest pathes? Maybe not. To lazy to think this through now :-)

Hi Chris,

Thanks for your feedback!

1. Yes I have updated my blog to show that its the default CDK hashed fingerprinter which is in question :-)

2. I have generated a cluster based on the molecule similarity of the various mentioned fingerprints, which will make my post much clearer in terms of chemical relevance.

3. Indeed, I too agree that BFS is the right way to go ahead but I would not traverse all the paths, rather only the shortest paths for each destination node else I would end up revisiting the same node at different depths…closer to DFS! This, I think is where most of the time is spend.

4. This canonicalization is very simple sorting based on atom symbols and their vertex degree (hybridization), so clearly its not a bottleneck! :-)

5. Well, I can’t say for sure that it might loose a path (without an exhaustive test case) but mathematically k-shortest path should be able to capture the topology of the graph.

Hope I am on track with my answers!

:-)

Asad, why did you use aromaticity in the HashedSPFingerprinter? Did you find that important, or is it just because the default Fingerprinter did that too? That is, have you tried basing it on the HybridizationFingerprinter instead?

Do you plan to convert this work into a patch? It is quite interesting…

a) While finding a path, I do discriminate between aromatic path and non-aromatic path (in ring systems). In the default fingerprinter its not used while finding paths whereas I have used it.

b) I did a quick test run on the HybridizationFingerprinter and I think it performs very well except that it doesn’t discriminate between open and closed ring system (wondering how it will perform if it has elements which doesn’t match with the CDK Atomtype module).

Having said that I think it’s a very one of the best bet amongst the present CDK fingerprinters.

Here is the clustered image:

c) Thanks, good point…hmmm.. well ..if you think it will help the tool, I would be glad to do so in the coming weeks :-)

Ad b): the HybridizationFingerprinter is in fact the same code as the Fingerprinter, exact that is replaces ‘aromaticity’ with ‘sp2 hybridized’.

thanks for the tip..I have replaced aromaticity by SP2.

Thats an interesting point Egon.

So logically that there is hardly any difference between the default fingerprinter and Hybridization fingerprinter except the aromaticity is perceived as “sp2″ (which is chemically valid too!).

Although the similarity clusters from Hybridization fingerprinted has an edge over the native default fingerprinter!

Also this would mean that SP based fingerprinter has a room in the present CDK…