Tags

, , ,

The design principle of the SMSD code was always to find Maximum Common Subgraph (MCS) between two molecules. Of course, it can be optimised to find substructures between two molecules. I was lazy in doing so but a recent post on the CDK mailing list about the SMSD based substructure search performance was a wakeup call for me.

I have made some changes to the SMSD code and now the substructure search is not that bad. Nonetheless, I have to redesign the substructure search part of the SMSD as substructure search ideally shouldn’t perform an exhaustive comparison between all possible combinations of the graph. It’s usually a boolean question i.e. is A a subgraph/substructure of B, unlike in the case of MCS where it’s about reporting the commonality between A and B.

So now one can query substructure search in the SMSD using the following code.

private static int getSMSDSolutionCount(IMolecule query, IMolecule target) throws CDKException {
SubStructure substructure = new SubStructure();

substructure.set(query, target);
if (substructure.isSubgraph(true)) {

return 1;

} else {

return 0;

}

}

The underlying substructure code can be found here

https://github.com/asad/SMSD-GUI/blob/master/src/org/openscience/smsd/SubStructure.java

New link has optimised substructure search code

https://github.com/asad/SMSD/blob/master/src/org/openscience/smsd/Substructure.java

Results: Here is a small substructure search benchmark test between UIT and SMSD. I have taken 30 query molecules and fired each of them against a decoy of 15030 molecules. At least a hit is expected from each query run.

Substructure benchmark between SMSD vs UIT

Substructure benchmark between SMSD vs UIT

In this case UIT seems to have an edge over SMSD as, in a few cases it takes more time. I have to check if the real difference is between implementation of the code or the actual algorithm itself.

You can find the results and the benchmark code here https://github.com/asad/Benchmark

The change(s) which is required in the SMSD code will involve heavy refactoring of the VF lib code. The present code works fine for the MCS but it’s time to optimise it for the substructure search. Any suggestions in optimising the VF code would be more than welcome. Please refer to the following class if you have any suggestions 🙂

Links updated: 1st Oct 2011

https://github.com/asad/SMSD/blob/master/src/org/openscience/smsd/algorithm/vflib/map/VFMapper.java

https://github.com/asad/SMSD/blob/master/src/org/openscience/smsd/algorithm/vflib/map/VFState.java

I hope someone out there will have a cracking solution!