Improved CDK Hashed Fingerprinter
Edited: 4th Nov, 10:20 AM
In my previous post, I discussed the impact of the hash code and random number generators on a hashed fingerprint. They play a major role in how uniformly the bits are distributed across a fixed-length array, and therefore in how often bit clashes occur. To test this, I prepared a test case of 1200 molecules and performed a substructure search using the default CDK Fingerprinter class and an improved version of it (using the Apache Commons HashCodeBuilder class and the Mersenne Twister random number generator).
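To make the role of the hash code and the random number generator concrete, here is a minimal sketch of how a path-based hashed fingerprint might map a path to a bit. It is not the CDK implementation: the class and method names are hypothetical, the paths are plain strings, and String.hashCode() and java.util.Random stand in for HashCodeBuilder and the Mersenne Twister so that the example runs with the standard library alone.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the CDK Fingerprinter.
final class HashedFingerprintSketch {

    // Map one path to a bit position: the hash code seeds the generator,
    // and the first bounded draw picks the bit. A poor hash or generator
    // sends unrelated paths to the same position (a bit clash).
    static int bitPosition(String path, int size) {
        long seed = path.hashCode();           // stand-in for HashCodeBuilder
        return new Random(seed).nextInt(size); // stand-in for Mersenne Twister
    }

    // Set one bit per path in a fixed-length bit array.
    static BitSet fingerprint(List<String> paths, int size) {
        BitSet bits = new BitSet(size);
        for (String path : paths) {
            bits.set(bitPosition(path, size));
        }
        return bits;
    }

    // Screening step: the query can only be a substructure of the target
    // if every bit set in the query is also set in the target.
    static boolean mayContain(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target);
        return missing.isEmpty();
    }
}
```

A true result from mayContain only means the pair passes the screen. Bit clashes can let through pairs that a full substructure match would reject, and those are the false positives.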
Each molecule was searched against every molecule in the dataset, including itself. The results were recorded at intervals of 200 data points. The gold standard was the set of substructure search results from SMSD.
As expected, the improved version of the Fingerprinter class outperformed the present CDK Fingerprinter class. The number of false positives (FP) was reduced by 35-40% (due to fewer bit clashes), thereby increasing the accuracy of the results, while the number of true positives remained unchanged. This also had an overall positive impact on the speed of the search!
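For readers who want to reproduce this kind of comparison, the counting can be sketched as follows. This is an assumed bookkeeping scheme, not the code from my repository: for each query/target pair, fingerprintHit says whether the pair passed the fingerprint screen and goldHit says whether the gold standard reports a real substructure match. The class name, method name and array layout are all hypothetical.

```java
// Hypothetical illustration only; names and layout are assumptions.
final class ScreeningStats {

    // Returns {truePositives, falsePositives} over all pairs.
    static int[] count(boolean[] fingerprintHit, boolean[] goldHit) {
        if (fingerprintHit.length != goldHit.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int truePositives = 0;
        int falsePositives = 0;
        for (int i = 0; i < fingerprintHit.length; i++) {
            if (!fingerprintHit[i]) {
                continue; // pair rejected by the screen
            }
            if (goldHit[i]) {
                truePositives++;   // screen and gold standard agree
            } else {
                falsePositives++;  // passed the screen, but no real match
            }
        }
        return new int[] {truePositives, falsePositives};
    }
}
```

Running this once per fingerprinter over the same pairs gives two false positive counts that can be compared directly, since the gold standard is the same in both runs.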
The raw results and the Fingerprinter code are available via my GitHub account: https://github.com/asad/CDKHashFingerPrint.
The present code can be optimised further to lower the number of false positives.
Thus, a better hash code and random number generator lead to an improved hashed fingerprint.