Revisiting Molecular Hashed Fingerprints
Fingerprints have been widely used in various fields to find similar features. Those of you who are following your detective instincts and hoping for DNA or biological fingerprints may be disappointed by the rest of this post. Fingerprints are typically used to avoid cumbersome data comparisons by working with a shorter “bit” string instead. My focus will be on molecular fingerprints, which chemo- and bioinformaticians use to find similar molecular structures, i.e. to find a needle in a haystack!
Theoretically, if you know which features the target molecules should and should not have, you can use a set of predefined keys to generate fingerprints. For example, the PubChem fingerprint and the MACCS keys are based on substructure/SMARTS keys that are expected to be present in, or absent from, your target. On the other hand, when both the query and the target are unknowns, one of the fastest ways to go for the kill is a hashed fingerprint.
Typically, in a hashed fingerprint a set of patterns is generated by gathering atom environment information, subgraph information, or both. The generated patterns are then transformed into hash codes (fixed-size message digests) using a hashing algorithm. Each hash code is then mapped to a bit position within a bit string of a defined length (the size of the fingerprint) using a random number generator. The presence or absence of a pattern is marked as “1” or “0” respectively.
- Hashed fingerprints are like a black box, with the assurance that identical patterns always set the same bits to “1”. In the language of information science, you accept that different patterns will clash on the same bit with a certain probability.
- The size of the generated fingerprint can be controlled by the user, as no predefined knowledge of the fingerprint patterns is required.
- The resolution of the fingerprint depends on the algorithms used for generating the hash codes and the random numbers.
- It’s challenging to find a perfectly sized fingerprint that strikes a balance between minimising bit clashes and wasting bit space.
Let’s play with some real examples to understand the depth of the above-mentioned statements. We need to generate some patterns from molecules and store them as fingerprints. To analyse the quality of the fingerprints, we will open the black box by keeping track of the generated pattern types. This will help us quantify the patterns involved in the bit clashes. Circular fingerprints or molecular signatures can be used to generate patterns of various diameters/heights for a molecule. By increasing the diameter/height, we enrich the patterns/information about the molecule. However, this also makes it harder to balance the fingerprint size against the number of bit clashes.
Stage 1: Generate patterns using molecular signatures of heights 0 to 3 for every atom in the molecule. An example is illustrated in the figure below.
Stage 2: Transform these patterns into SMARTS/SMILES/signature strings and generate a hash code for each pattern using your favourite algorithm.
Stage 3: Once we have the hash codes for these patterns, use a random number generator to convert each hash code into a bit position within a fixed range (e.g. 1024).
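The three stages can be sketched end to end in a few lines. This is a minimal Python stand-in, not the CDK/Java code used for this post: the patterns are hypothetical signature-like strings, MD5 is an arbitrary choice of hash function, and the function names `hash_code`, `bit_index` and `fingerprint` are made up for illustration. Python's `random.Random` is a Mersenne Twister, so it plays the role of the random number generator in Stage 3.

```python
import hashlib
import random
from collections import defaultdict


def hash_code(pattern: str) -> int:
    """Stage 2: deterministic 32-bit hash code of a pattern string (MD5 is an assumption)."""
    digest = hashlib.md5(pattern.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bit_index(code: int, size: int) -> int:
    """Stage 3: seed a generator with the hash code and draw one position in [0, size)."""
    return random.Random(code).randrange(size)


def fingerprint(patterns, size=1024):
    """Return the set bits plus a bit -> patterns map, so that clashes stay visible."""
    bit_to_patterns = defaultdict(set)
    for pattern in patterns:
        bit_to_patterns[bit_index(hash_code(pattern), size)].add(pattern)
    return set(bit_to_patterns), dict(bit_to_patterns)


# Stage 1 stand-in: hypothetical signature strings for one small molecule.
patterns = ["[C]", "[O]", "[C]([C])", "[C]([C][O])", "[O]([C])"]
bits, origin = fingerprint(patterns, size=1024)
clashes = {bit: pats for bit, pats in origin.items() if len(pats) > 1}
print(sorted(bits), clashes)
```

Keeping the bit-to-pattern map alongside the bit set is what "opening the black box" amounts to here: any bit that maps to more than one pattern is a clash.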
I have used the CDK to generate molecular signatures (σ) of various heights (0 to 3) for 5000 molecules. These signatures were transformed into canonical SMILES, and a hash code was generated for each one using the HashCodeBuilder class from Apache Commons Lang (more flexible than the default Java hashCode()). You could use any method you like, as long as equal objects produce the same hash code and unequal objects produce distinct hash codes as far as possible. Some of the most common hash code generation algorithms are MD5, SHA and PJW (Peter Weinberger’s hash). The choice is made on the basis of the data distribution (how evenly the codes are spread versus how much pattern remains in them) and the efficiency of the hashing function (it should be very quick, stable and deterministic).
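To make the choice of hash function concrete, here is a small Python sketch of two of the options named above: a truncated MD5 digest and the ELF variant of the PJW hash. Neither is the HashCodeBuilder code used for this post, and the helper `count_collisions` and the sample strings are hypothetical. Note that Python's built-in `hash()` for strings is randomised per process by default, so it is not a stable choice for fingerprints.

```python
import hashlib


def md5_hash_code(pattern: str) -> int:
    """First 32 bits of the MD5 digest, read as an unsigned integer."""
    return int.from_bytes(hashlib.md5(pattern.encode("utf-8")).digest()[:4], "big")


def elf_hash_code(pattern: str) -> int:
    """ELF variant of the PJW hash, using 32-bit arithmetic."""
    h = 0
    for byte in pattern.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000
        if high:
            h ^= high >> 24
        h &= ~high & 0xFFFFFFFF  # clear the top four bits again
    return h


def count_collisions(patterns, hash_fn):
    """Number of distinct patterns lost because they share a hash code with another one."""
    unique = set(patterns)
    return len(unique) - len({hash_fn(p) for p in unique})


# Hypothetical canonical SMILES strings standing in for signature patterns.
sample = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCO"]
for name, fn in (("md5", md5_hash_code), ("elf/pjw", elf_hash_code)):
    print(name, [fn(s) for s in sample], "collisions:", count_collisions(sample, fn))
```

Both functions are deterministic across runs, which is the property the fingerprint relies on; how evenly they spread real signature strings would have to be measured on the dataset itself.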
Now the tricky part is the conversion of hash codes into a fingerprint. I have used the famous Mersenne Twister random number generator. In my tests this yielded better results than Java's default Random class in terms of minimising bit clashes and maximising bit set resolution.
Here are a few statistical measures for the patterns generated and encoded into the fingerprint bitsets.
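A small harness along these lines can be used to repeat this kind of comparison. It is only a sketch under stated assumptions: the post compared the Mersenne Twister with Java's default generator, whereas the baseline below is plain modulo folding, swapped in because it is trivial to write. The function names and the input hash codes are hypothetical, and hash codes are assumed to be non-negative integers.

```python
import random


def mt_bit(code: int, size: int) -> int:
    """Seed a Mersenne Twister with the hash code and draw one bit position."""
    return random.Random(code).randrange(size)


def modulo_bit(code: int, size: int) -> int:
    """Baseline for comparison: fold the hash code directly into the bit range."""
    return code % size


def clash_report(hash_codes, size, to_bit):
    """Return (number of bits used, number of codes that landed on an occupied bit)."""
    used = set()
    clashes = 0
    for code in set(hash_codes):
        bit = to_bit(code, size)
        if bit in used:
            clashes += 1
        used.add(bit)
    return len(used), clashes


# Deliberately patterned, hypothetical hash codes: every code is a multiple of 1024.
codes = [k * 1024 for k in range(100)]
for name, mapper in (("mersenne twister", mt_bit), ("modulo", modulo_bit)):
    used, clashes = clash_report(codes, 1024, mapper)
    print(f"{name}: {used} bits used, {clashes} clashes")
```

With these patterned inputs the modulo baseline puts every code on bit 0, which is the kind of failure a well-mixed generator is meant to avoid. Real hash codes are rarely this regular, so the difference on actual data has to be measured.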
| Statistical measure (5000 mols) | Height 0 | Height 1 | Height 2 | Height 3 |
|---|---|---|---|---|
| Unique Pattern Count (UPC) | 53 | 426 | 4083 | 14448 |
| Average number of patterns/fingerprint | 3.09 +/- 1.04 | 10.34 +/- 5.82 | 15.16 +/- 10.01 | 17.01 +/- 13.07 |
| Median number of patterns/fingerprint | 3 | 9 | 13 | 13 |
| Max. number of patterns/fingerprint | 7 | 35 | 64 | 89 |
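For readers who want to compute the same measures on their own data, a minimal sketch follows. It assumes that each molecule is represented by a collection of pattern strings, that patterns are counted once per molecule, and that the spread is a sample standard deviation; the post does not state these details, and the function name `pattern_statistics` is hypothetical.

```python
from statistics import mean, median, stdev


def pattern_statistics(fingerprints):
    """fingerprints: one collection of pattern strings per molecule."""
    pattern_sets = [set(patterns) for patterns in fingerprints]
    counts = [len(s) for s in pattern_sets]
    return {
        # Unique Pattern Count: distinct patterns across the whole dataset.
        "unique_pattern_count": len(set().union(*pattern_sets)),
        "mean": mean(counts),
        # Sample standard deviation (an assumption); needs at least two molecules.
        "stdev": stdev(counts) if len(counts) > 1 else 0.0,
        "median": median(counts),
        "max": max(counts),
    }


# Hypothetical toy dataset of three molecules at signature height 0.
toy = [["C", "O", "C"], ["C", "N", "O", "S"], ["C"]]
print(pattern_statistics(toy))
```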
To understand the resolution of the fingerprints with respect to bit clashes and fingerprint size, I generated fingerprints of various sizes (ranging from 128 to 8192 bits). A fingerprint size of 1024 bits seems like a good bet for signatures of height up to 2 (as marked in the graph below), while 4096 bits works well for signatures of height 3 (more than 95% of the bits are used and a smaller percentage of bits clash).
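A back-of-the-envelope model helps when reasoning about fingerprint size. If n unique patterns are each placed uniformly and independently on one of m bits (an idealisation, not a property of any particular hash function or generator), the expected fraction of bits set at least once is 1 - (1 - 1/m)^n. The sketch below evaluates this for the unique pattern counts in the table; the function name is hypothetical.

```python
def expected_occupancy(unique_patterns: int, size: int) -> float:
    """Expected fraction of bits set at least once under uniform, independent placement."""
    return 1.0 - (1.0 - 1.0 / size) ** unique_patterns


# Unique pattern counts (UPC) per signature height, taken from the table above.
upc = {0: 53, 1: 426, 2: 4083, 3: 14448}
for height, n in upc.items():
    for size in (1024, 4096):
        share = expected_occupancy(n, size)
        print(f"height {height}, {size} bits: {share:.1%} of bits expected to be set")
```

Under these assumptions the model gives roughly 98% occupancy for height 2 at 1024 bits and roughly 97% for height 3 at 4096 bits, which is at least consistent with the "more than 95% of the bits are used" observation. It says nothing about how clashes were counted for the graph, so it should be read as a sanity check only.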
From the above figure, it is clear that one key improvement to hashed fingerprints is to divide them into sub-fingerprints. Each sub-fingerprint can then be populated with a particular chemical/subgraph property of the molecule. For a molecular fingerprint of 1024 bits, say, one can divide the fingerprint into two sub-fingerprints:
a) one of 256 bits for storing labelled atom types, and
b) a second of 768 bits for graph/topological information.
The hash code for the atom-typed section is derived from a concatenated label string for each atom in the molecule: the CDK atom type + whether the atom is in a ring system + stereo (you could choose your own physicochemical labelling scheme). The signature/graph section can be populated with signatures/circular fingerprints of height/diameter 2. Sub-fingerprints are easy to build and store with the above-mentioned process because hash codes can be mapped into any chosen range. The idea is to get the best of both worlds, i.e. physicochemical properties and subgraph patterns.
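The 256 + 768 split can be sketched by giving each section its own offset and width and drawing bit positions only within that range. As before, this is a Python illustration rather than the CDK code: the atom label format, the section constants and the function names are all hypothetical, and MD5 plus `random.Random` (a Mersenne Twister) are stand-ins for the hash function and generator.

```python
import hashlib
import random

# (offset, width) of each section in a 1024-bit fingerprint.
ATOM_SECTION = (0, 256)     # bits 0..255: labelled atom types
GRAPH_SECTION = (256, 768)  # bits 256..1023: signature/graph patterns


def section_bit(pattern: str, section) -> int:
    """Hash a pattern and map it to a bit position inside the given section only."""
    offset, width = section
    digest = hashlib.md5(pattern.encode("utf-8")).digest()
    code = int.from_bytes(digest[:4], "big")
    return offset + random.Random(code).randrange(width)


def sub_fingerprint(atom_labels, graph_patterns):
    """Combine both sections into one set of bit positions in [0, 1024)."""
    bits = {section_bit(label, ATOM_SECTION) for label in atom_labels}
    bits |= {section_bit(pattern, GRAPH_SECTION) for pattern in graph_patterns}
    return bits


# Hypothetical labels: atom type + ring membership + stereo, joined into one string.
atom_labels = ["C.sp3|ring=0|stereo=none", "O.sp3|ring=0|stereo=none"]
graph_patterns = ["[C]([C][O])", "[O]([C])"]
print(sorted(sub_fingerprint(atom_labels, graph_patterns)))
```

Because the two ranges do not overlap, an atom-type label can never clash with a graph pattern; clashes remain possible only within a section.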
The quality of a hashed fingerprint depends a lot on the patterns generated (UPC), the size of the bitset, the hashing function and the random number generator. The next step for me would be to cluster the similarity matrices or perform a leave-one-out test on the dataset to check the specificity and sensitivity of the model.
Further reading, and the references therein, will give you more insight into the story:
- jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints
- Hashed Fingerprints and RNG’s
- Molecular fingerprints, background
- Fingerprints – Screening and Similarity
- Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval
- Extended-Connectivity Fingerprints