Ultra Fast substructure search based on the VF2 outperforms CDK UIT

Updated (17 March 2010): Gilleain and myself were able to code an ultrafast VF2 algorithm for Isomorphism substructure search (permutation bug is resolved). The code is ported or re-factored from Chemkit and I have further optimised the code at our end. The previous VF2 code in the SMSD is more MCS oriented which was a big compromise in terms of speed while performing a substructure search. I have mentioned these issues in a previous blog. The whole idea of a substructure search is to answer a boolean question i.e. is A a subgraph/substructure of B ? If the answer is true then matching atoms are reported. On the other hand MCS where it’s about reporting the commonality between A and B.

Below is the benchmark graph which speaks about the performance (30 queries were subjected to 15030 targets, same as my previous benchmark test cases) of the VF2.

Optimised Chemkit VF2 vs CDK's UIT
Substructure benchmark results between VF and CDK’s UIT
X-axis: 30 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

As is apparent from the benchmark graph, VF2 takes less than 10-20% of the total time (in most cases) for performing substructure searches (results are sorted in ascending order of the query atom count). So as for now we have an implementation which at last outperforms UIT (CDK code) in terms of speed for automorphic graphs. The performance of the VF2 improves as the query atom count (graph size) increases. We might need a few changes in the CDK AtomContainer class itself to further improve the speed/performance but this will involve further discussions with CDK developers.

11 thoughts on “Ultra Fast substructure search based on the VF2 outperforms CDK UIT

  1. *very* impressive. Nice work! Is there anything weird about structure 12, that makes it an outlier?

    BTW, I think you swapped the description of the X & Y axes

  2. Actually, can you also create a plot showing how they scale with respect to substructure size, with the lines in a plot of absolute time versus substructure size? Or perhaps some complexity measure, if you think that is more important…

  3. The advantage of a blog is that you can ‘publish’ results immediately.

    The DISadvantage of a blog is that you can ‘publish’ results immediately.

    The VFLib code is not working, except for the trivial case of the identity isomorphism. In other words, it seems to be checking if two molecules have identical bond orderings. Naturally enough, this is extremely fast.

    I’m not saying that the code is not fast – it certainly should be – but getting the answer right is as important as getting it quickly :)

    • Yes, you are right.. test results with permutations are falling off. I guess we need to check the backtracking part. Shouldn’t be a rocket science as the VF2 lib version in the SMSD works perfectly :-)

    • Please test the latest code from the github (Isomorphism and permutations are working)… all fixed!
      AND the updated graph on the above blog is as good as the old!

      …its still cruising :-).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s