• About Me
  • Brief CV

Asad's Blog

~ Man and his will to survive!

Asad's Blog

Category Archives: CDK

ChemBLAST: Old dog new tricks

31 Wednesday Dec 2014

Posted by chembioinfo in CDK, Java, Similarity

≈ 6 Comments

Tags

atoms, bitset, CDK, chemaxon, ChEMBL, circular fingerprints, EBI, EMBL, Similarity

Updated:07/01/2015

BLAST-Basic Local Alignment Tool was born in the 1990s (1,2) and has since been the bread and butter of homology searches (sequence similarity in large databases). Having said that, I would not like to discredit similar tools such as WU-BLAST, FASTA etc. I recently came across and read with great interest a wonderful paper from Baldi and Benz (3) highlighting the possibility of using this tool to fish out similar chemicals from databases. It touches upon a few statistical challenges and adaptations of Tanimoto scores for calculating the similarity of molecules. There are various ways for calculating chemical similarity i.e. graph based, fingerprint based etc. In today’s blog I will discuss using fingerprint based similarity methods to calculate similarity between molecules and how we can use this for BLASTing small molecules. The wiki has a rich page dedicate to the BLAST Tool.

Bringing Chemistry in the BLAST

  1. Calculate chemical fingerprints (binary: true, false) for a given molecule for size length L.
  2. Choose the indexes in the L which are set to true (for example, if bits 1, 10, 30,…. etc. are set as ‘true‘).
  3. Now transform the indexes to char terminated by ‘$’ sign ( i.e 1->B$, 10->BA$, 30->DA$ …)
  4. Concatenate the patterns to form a sequence (B$BA$DA$…).
  5. While searching transform them as byte array and match them (treating the byte until $ as one letter).
  6. Create HSP tuple patterns and align the patterns using the BLAST algorithm.

Here it’s a basic/crude step towards an un-gapped alignment search which is fast yet powerful enough to bring out significant pattern matches between fingerprints sequence arrays.

Performance

In order to test the performances of the method I have chosen ChEMBL-19 dataset containing ~1.4 million molecules.

a) Preparing the ChEMBL-19 database for BLAST search by calculating HSP regions from fingerprints.

Note: This step is only required if a new database is released.

I have used ECFP4 (CDK CircularFingerprint).

Parallel implementation of BLAST search achieves 4X speed improvement, allowing gapped alignment for better coverage of hits.

 >sh format.sh data/chembl_19.txt
 Formatting input database....
 Running Time: 197.04 sec
 Index file created chembl_19.idx
 DB formatted chembl_19.fmt

b) Performing chemical similarity search on ChEMBL database.

I have chosen the query example selected by Andrew Dalke to test his chemfp-1.2 tool.

“CHEMBL23456″= CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1

sh search.sh data/chembl_19.txt "CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1"
Running Time: 70.36 sec, 11808 hits found.
Top 10 ChEMBL 19 hits reported by ChemBLAST tool for query molecule CHEMBL23456.

Top 10 ChEMBL-19 hits reported by ChemBLAST tool for query molecule CHEMBL23456.

The top 10 Hits from ChemBLAST tool:

Query Description:
> Query: CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1
Length = 158

	                 Bit    E-
Subject Description	 Score  Value
1>  CHEMBL23456         120  4e-26
2>  CHEMBL79468        79.7  4e-14
3>  CHEMBL82471        77.5  2e-13
4>  CHEMBL82052        72.5  7e-12
5>  CHEMBL286378       71.7  1e-11
6>  CHEMBL24273        69.6  5e-11
7>  CHEMBL445364       64.5  2e-09
8>  CHEMBL180330       63.7  3e-09
9>  CHEMBL286759       62.3  8e-09
10> CHEMBL2323896      62.3  8e-09
For example top two hits are aligned as:


> CHEMBL23456
Length = 158

Score = 120 bits (316), Expect = 4e-26
Identities = 158/158 (100%), Positives = 158/158 (100%), Gaps = 0/158 (0%)

Query     1  BK-KF-BDD-CBG-CCD-CHH-CIB-DHD-DHE-DIG-DKA-ECA-ECC-EEF-EGH-EI    60
             BK-KF-BDD-CBG-CCD-CHH-CIB-DHD-DHE-DIG-DKA-ECA-ECC-EEF-EGH-EI
Sbjct     1  BK-KF-BDD-CBG-CCD-CHH-CIB-DHD-DHE-DIG-DKA-ECA-ECC-EEF-EGH-EI    60

Query    61  G-FAF-FAH-FBE-FBF-FCB-FEK-FGK-FHG-GFD-GGB-HBC-HCG-HDH-HEA-HE   120
             G-FAF-FAH-FBE-FBF-FCB-FEK-FGK-FHG-GFD-GGB-HBC-HCG-HDH-HEA-HE
Sbjct    61  G-FAF-FAH-FBE-FBF-FCB-FEK-FGK-FHG-GFD-GGB-HBC-HCG-HDH-HEA-HE   120

Query   121  G-HKG-IHA-IKI-KBE-KCG-KEH-KFA-KGH-KKH-   158
             G-HKG-IHA-IKI-KBE-KCG-KEH-KFA-KGH-KKH-
Sbjct   121  G-HKG-IHA-IKI-KBE-KCG-KEH-KFA-KGH-KKH-   158

> CHEMBL79468
Length = 154

Score = 79.7 bits (206), Expect = 4e-14
Identities = 129/149 (87%), Positives = 129/149 (87%), Gaps = 8/149 (5%)

Query     6  -BDD-CBG-CCD-CHH-CIB-DHD-DHE-DIG-DKA-ECA-ECC-EEF-EGH-EIG-FAF    65
             -B D-CBG-CCD-CHH-CIB-D  -DH -D  -DKA-E  -ECC-EEF-E  -EIG-FAF
Sbjct    10  -BHD-CBG-CCD-CHH-CIB-DGB-DHD-DHE-DKA-EAE-ECC-EEF-EIA-EIG-FAF    69

Query    66  -FAH-FBE-FBF-FCB-FEK-FGK-FHG-GFD-GGB-HBC-HCG-HDH-HEA-HE----G   121
             -FAH-FBE-FBF-FCB-FE    K-FHG-GFD-GGB-HBC-HCG-HDH-HEA-HE    G
Sbjct    70  -FAH-FBE-FBF-FCB-FE----K-FHG-GFD-GGB-HBC-HCG-HDH-HEA-HEG-HHG   125

Query   122  -HKG-IHA-IKI-KBE-KCG-KEH-KFA-   150
             -HKG-IHA-IKI-KBE-KCG-K H-K A-
Sbjct   126  -HKG-IHA-IKI-KBE-KCG-KGH-KHA-   154

Top 5 hits from Andrew Dalke‘s test case.

1 CHEMBL23456 1.00000
2 CHEMBL180330 0.94105
3 CHEMBL286759 0.93508
4 CHEMBL314618 0.93295
5 CHEMBL382002 0.92812

Thanks to John May, for providing the top 5 chemfp hits using the CDK ECFP4 (makes it an even playing ground as we both use the same fingerprints).

ChEMBL ID       Tanimoto Score
1 CHEMBL23456	1.00000
2 CHEMBL82471	0.77778
3 CHEMBL286378	0.74419
4 CHEMBL286759	0.71111
5 CHEMBL24273	0.70213

We all agree with the top hit CHEMBL23456, but we differ on the remaining top 4 hits. This could be due to the variation in choice of our fingerprints/methodology. Nonetheless, the top hits in both the cases looks interesting. I am sure there are many ways to reach the same goal and further optimisation of the code will make it even more attractive. Needs more playing around with!

Suggestions are welcome.

The ChemBLAST– tool is freely available on the GitHub for you to play with and its reasonably fast.

All things come to him who waits – provided he knows what he is waiting for. ~ Woodrow T. Wilson

—————————————————————————————————————————–

Basic principle of sequence BLAST

—————————————————————————————————————————–

To summarise a few of the important concepts:

a) It tries to compare gene sequences (amino acid or nucleotide) – query string against target string. BLAST will find conserved patterns in the database which are similar to sub-patterns in the query.

b) Its uses heuristics by calculating frequently occurring patterns called ‘high-scoring segment pair’ (HSP) in the database (using random walks). It then uses Gumbel extreme value distribution (EVD) to calculate the probability p of observing a score S equal to or greater than x. This is given by the equation:

In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation.

The statistical parameters \lambda and \mathrm{K} are estimated by fitting the distribution of the un-gapped local alignment scores of the query sequence and a lot of shuffled versions (global or local shuffling) of  database sequence patterns, to the Gumbel extreme value distribution.

c) W: Word/Pattern size – find W-mers (like n-grams) in target/query, T: Threshold – focus on pairs scoring >T, X: Drop-off – stop extending when loss > X, S: Score – the final score of segment pair.

Conditions: If W is too large, it will lead to too many patterns in L and vice versa – if too small, it will lead to too few patterns. If T is too large, then it’s very stringent (conserved blocks) and if too small, it will lead to too many extensions. Look for High-scoring Sequence Pairs (HSPs)-tuples and choose cut-off for relevant hits.

d) Find high scoring pattern of length W, and compile a list L of all W-mers that score >T with some pattern in query sequence. Then scan database for words in L and each positive hit is matched and extended. When score drops more than X below hitherto best score, stop extension. Now report all words with large score S.

References:

1) Altschul SF. et.al. Basic local alignment search tool: J Mol Biol. 1990 Oct 5;215(3):403-10. 

2) Altschul SF, et al. Gapped blast and psiblast: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. 

3) Baldi P. and Benz RW. BLASTing small molecules—statistics and extreme statistics of chemical similarity scores: Bioinformatics. Jul 1, 2008; 24(13): i357–i365.
52.205337 0.121817

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Atom Atom Mapping (AAM) and Challenges

18 Tuesday Mar 2014

Posted by chembioinfo in Atom Typing, CDK, EC-BLAST, Enzymes, Java, Similarity, SMSD, Substructure, Work

≈ Leave a comment

Tags

AAM, Atom Atom Mapping, Atom Mapping, CDK, EC-BLAST, Enzymes, Reaction, Rhea, RXN, SMILES, SMSD

We have just released our long awaited AAM tool in the public domain…this was long over due! You can download the tool from Github. This tool is based on the algorithm published in the SA Rahman et.al. (2014) Nature Methods paper. We have successfully mapped more than 6000 KEGG reactions in the EC-BLAST.

Atom Atom Mapping (AAM) Tool can be downloaded via SF – https://github.com/asad/ReactionDecoder

The key features of this tools are:

  • Maps reactions catalysed by enzymes.
  • Generates images of the mapped reactions where matching substructures are highlighted.
  • Generates reaction patterns and bond changes for input reactions.
  • The input format can be SMILES or RXN file.
  • This is build on the SMSD and CDK, hence its pure Java (7.0), thus platform independent.

Updated: 10/Nov/2016

For example KEGG reaction R03621 coded by EC 2.3.1.2 is:

a) Mapped reaction images as png.

Atom Atom Mapping by EC-BLAST

Atom Atom Mapping by EC-BLAST

b) Reaction AAM details, Bond changes and Reaction Patterns in this file.

I have a list of challenging cases which we have logged, which I think may be interesting to put in the public domain.

For example:

Case 1: 

case1

Expected Mapping: 

[O:10]=[C:9]([OH:11])[CH2:8][CH:6]([O:5][C:3](=[O:4])[CH2:2][CH:12]([OH:14])[CH3:13])[CH3:7].[H:16][OH:1]>>[H:16][O:5][CH:6]([CH3:7])[CH2:8][C:9](=[O:10])[OH:11].[O:4]=[C:3]([OH:1])[CH2:2][CH:12]([OH:14])[CH3:13]

Case 2: 

case2

Expected Mapping:

[O:10]=[C:8]([OH:7])[C:11]([O:12][CH:13]1[CH:9]=[CH:5][CH:14]=[C:15]([C:6](=[O:4])[OH:3])[CH:16]1[OH:2])=[CH2:1]>>[O:10]=[C:8]([OH:7])[C:11]([O:12][CH:13]1[CH:16]=[C:15]([CH:14]=[CH:5][CH:9]1[OH:2])[C:6](=[O:4])[OH:3])=[CH2:1]

Few test case can be found here https://github.com/asad/AAMTool.

I would like to hear from the developers and users….Challenges and Use cases!

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Shortest Path and Molecular Hashed Fingerprints

23 Monday Jul 2012

Posted by chembioinfo in CDK, Java, Similarity, Work

≈ 7 Comments

Tags

atoms, bonds, CDK, Fingerprints, graph, hash, hashed fingerprints, Java, molecule, network, shortest path, Small Molecule

Shortest Path (SP) has been used in many aspects of graph traversing. The idea is to minimise the cost (number of edges to be traversed or the cost on the edge) of traveling between a source and destination. This is one of the most optimal ways of finding a path in the graph where you can generate a combination of paths using random walk.

Interestingly, generation of path based hashed fingerprint is very common in the area of chemo-informatics. The basic idea is to find all paths of a certain length from the source atom (fragments) in a molecule and convert it into a hashed fingerprint. This works very well with smaller or sparsed graphs, although in a few cases the run time may increase exponentially with the size of the graph (connectivity). The Chemistry Development Kit (CDK) has one such effective path based hashed fingerprint generator (Fingerprinter.java). This module in the CDK has generated a lot of interest from the user community. Recently, Nina Nikolova – Jeliazkova posted an interesting set of molecules were the path search in the fingerprint was hit by combinatorial explosion!

Note: The behaviour of the path finding algorithm is compromised once the depth of the path search is more than 6 (the recommended depth is 8). Hence for these set of molecules one may not be able to find the fingerprints.

Here are the exemplar molecules where the present CDK hashed fingerprint is subdued.

git://gist.github.com/3163031.git

I have modified the existing CDK fingerprinter to report the shortest path rather than all paths. This overcomes the problem of combinatory explosion and runtime is no longer exponential as compared to previous case.

Here is the runtime and the density of the FP  (number of bit occupied) as calculated by the SP based FP. One can deduce the runtime and density by half if the FP is only based either on weighted or unweighted bonds. In the modified SP based FP, I have used both weighted and unweighted bonds to give better consensus FP (more in my next blog!).

Here is the code.

https://github.com/asad/ShortestPathMoleculeFingerprinter

Challenges:

a) Presently the fingerprint accounts for only one shortest path between a source and the sink atom (discriminates between aromatic, ring and aliphatic paths). Hence, I had to canonicalize the atoms in the graph container such that if two molecules are similar then the returned SP path is same. A natural extension would be to report k-shortest path but this maybe as good as CDK default fingerprinter (in terms of the runtime).

b) For spare graph and smaller graphs it might be as fast as the previous implementation, and it will perform better on complex graphs.

c) My present implementation is an extension of my previous work on the CDK fingerprinter where rings search and other optimizations has been done.

Edited: 6th August 2012

Here is a test case based on the ring systems (aromatic and non aromatic) and aliphatic molecules.

Updated: 29th Aug 2012

Thanks to Egon for his suggestions to use SP2 hybridization instead of aromaticity checker. In my case I have to use CDK aromaticity detection as SP2 concept may not work.

I have clustered 11 molecules based on their fingerprint similarity scores using the

a) CDK default finprinter,

b) SP based Fingerprinter and

c) CDK Hybridization Fingerprinter

The clustered results are as shown below.

The CDK default fingerprinter based similarity clusters

The CDK default fingerprinter based similarity clusters

The Shortest path fingerprinter based similarity clusters

The Shortest path fingerprinter based similarity clusters

Molecule similarity clusters based on the Hybridization Fingerprinter

Molecule similarity clusters based on the Hybridization Fingerprinter (doesn’t discriminate between open and close ring system)

The Hybridization based fingerprinter is the fastest one (in the non-complex cases), followed by the SP fingerprinter and improved CDK fingerprinter. In terms of the sensitivity and specificity, SP fingerprinter is the best and in complex cases its by far the fast one!

I will leave it to the readers to choose their favorite fingerprinter.

Kindly leave your comments and suggestions!

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Improved CDK Hashed Fingerprinter

04 Friday Nov 2011

Posted by chembioinfo in CDK, Java, Similarity, SMSD, Substructure

≈ 4 Comments

Tags

CDK, hashCode(), hashed fingerprints, Isomorphism, Java, KEGG, Mersenne Twister, Random(), Similarity, SMSD, substructure, Tanimoto, VF2

Edited: 4th Nov, 10:20 AM

In my previous post, I discussed the impact of the hashcode and random number generators on a hashed fingerprint. They play a major role in the uniform distribution of the bits in a fixed length array and the occurrence of the bit clashes. In order to prove the concept, I have prepared a test case of 1200 molecules and preformed a substructure search using the default CDK Fingerprinter class and its improved Fingerprinter class version (with the Apache math librarys HashCodeBuilder() method and Mersenne Twister random number generator).

Each molecule was searched against other molecules in the dataset including itself. This was done at an interval of 200 data points. The gold standard was the substructure search results from the SMSD.

Accuracy of the Fingerprints

New Fingerprinter has better accuracy (red line) than the CDK Fingerprinter (low FPR too!)

As expected the improved version of the Fingerprinter class outperformed the present CDK Fingerprinter class. The number of false positives (FP) were reduced by 35-40% (due to minimal bit clashes) thereby increasing the accuracy of the results, while the true positives remained unchanged. This also made an overall positive impact on the speed of the search results!

The raw results and the Fingerprinter code is available via my github account https://github.com/asad/CDKHashFingerPrint.

The present code can further be optimised for lowering the number of false positives.

Thus a better hashcode and random number generator leads to an improved hashed fingerprint.

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Revisiting Molecular Hashed Fingerprints

30 Sunday Oct 2011

Posted by chembioinfo in Atom Typing, CDK, Java, Similarity, Substructure, Work

≈ 1 Comment

Tags

Atom typing, bitset, circular fingerprints, Fingerprints, hashCode(), hashed fingerprints, hashing, Java, Mersenne Twister, molecular signatures, Random(), rings, RNG's, Similarity, SMARTS, stereo, subfingerprints, substructure

Introduction

Fingerprints have been widely used in various fields to find similar features. Now for those of you who are using their detective instincts and aiming for DNA fingerprint or biological fingerprints, I might disappoint you in the later half of my post. Fingerprints are typically used to avoid cumbersome data comparison by using shorter “bit” string. My focus will be on the molecular fingerprints which have been used by chemo/bio informatician for finding similar molecular structures i.e. finding a needle in a hay stack! Theoretically, if you know the prerequisite features of “should have and not have” in the target molecules, then you can use a set of predefined keys to generate fingerprints. For examples PubChem fingerprint, MACCS keys etc. are based on certain substructure/SMARTS keys which are expected to be found or skipped in your target. On the other hand when we play with unknowns both at the level of query and target then one of the fastest ways to go for the kill is hashed fingerprints. Typically, in a hashed fingerprint a set of patterns are generated by gathering atom environment information or subgraph information or both. The generated patterns are then transformed into hash codes (a fixed size message digest) using hashing algorithm in computer science. These hash codes can then transformed into bit strings using random number generation of a defined length (size of the fingerprint). The presence and the absence of a pattern is marked as “1” and “0” respectively.

Pros:

  • Hashed fingerprints are like a black box with an assurance that similar patterns will have similar bits set to “1”. In the language of information science you are allowing clashes of the similar bits with certain probability.
  • The size of the generated fingerprints can be controlled by the user as predefined knowledge of the fingerprint patterns are not required.
Cons:
  • The resolution of the fingerprints depends on algorithms used for generating the hash code and random numbers.
  • It’s challenging to find a perfectly sized fingerprints which can strike a balance between minimising the clashes of bitsets and wastage of the bit space.

Implementation

Let’s play with some real-time examples to understand the depth of the above mentioned statements. Now we need to generate some patterns from molecules and store them as fingerprints. In order to analyse the quality of the fingerprints we will open the black box by keeping track of the generated pattern types. This will help us to quantify the patterns involved in the bitset clashes. The circular fingerprint or molecular signatures can be used to generate patterns of various diameter/height for a molecule. By increasing the diameter/height, we can enrich the patterns/information about the molecules. However, this will also increase the overhead of balancing the fingerprint size and reducing the bit clashes.

Stage 1: Generate patterns using molecular signatures of heights 0 to 3 for every atom in the molecule. An example is illustrated in the figure below.

Circular / Signatures patterns encoded as fingerprints
Circular / Signatures patterns encoded as fingerprints

Stage 2: Transform these patterns as SMARTS/SMILES/Signatures and generate hash code for each pattern using your favourite algorithm.

Stage 3: Once we have the hash codes for these patterns then using random number generator, convert these hash codes into bit set bucket with a fixed range (eg. 1024).

I have used the CDK to generate molecular signatures (σ) of various heights (0 to 3) for 5000 mols. These signatures were transformed into canonical SMILES and hash code was generated using Java Apache math library HashCodeBuilder() method (better than default java hashCode() due to the flexibility). Well, you could use any method you like as long as equal objects produce same hash code and unequal objects produce distinct hash codes. Some of the most common hash code generation algorithms are MD5, SHA, PJW (Peter Weinberger’s hash) etc. The choice is made on the basis of data distribution (balance between random generation vs pattern in generation) and hashing function efficiency (should be very quick, stable and deterministic).

Now the tricky part is the conversion of hash codes into a fingerprint. I have used the famous Mersenne Twister random number generator. This yields better results than default java Random() method in terms of minimising the bit clashes and maximizing the bit set resolution.

Here are few statistical measure regarding the patterns generated and encoded into fingerprint bitsets.

Statistical Measure (5000 mols) Height 0 Height 1 Height 2 Height 3
Unique Pattern Count (UPC) 53 426 4083 14448
Average number of patterns/fingerprint 3.09 +/- 1.04 10.34 +/- 5.82 15.16 +/- 10.01 17.01 +/- 13.07
Median number of patterns/fingerprint 3 9 13 13
Max. number of patterns/fingerprint 7 35 64 89

In order to understand the resolution of the fingerprints with respect to the bit clash and size of the fingerprints, I generated fingerprints of various sizes (ranging from 128 to 8192 bits). The fingerprint size 1024 bits seems like a good bet for signatures of height up to 2 (as marked in the graph below), while 4096 stands good for signature of height 3 (more than 95% bitsets are used and lesser % of bits clash).

BitSet usage vs Bit Clash in the hashed fingerprints
BitSet usage vs Bit Clash in the hashed fingerprints

Analysis

From the above figure, it is clear that one of the key improvements which can be made in the hashed fingerprints is to divide it into sub-fingerprints. Then each sub-fingerprint can be populated with certain chemical/subgraph property of the molecule. Say in the case of molecular fingerprint of size 1024 bitset, one can divide the fingerprints into two sub-fingerprints –

a) One of 256 bits for storing labelled atom types and,

b) The second, of 768 bits for graph/topological information.

The hash code from the atom typed section is the depiction of concatenated labelled string of the CDK atom types + presence of atom in a ring system + stereo for each atom in a molecule (you could choose your own physiochemical labelling schema). The signatures/graph section can be populated with signatures/circular fingerprints of height/diameter 2. The Sub-fingerprints are easy to achieve and store with the above mentioned process due to the flexibility of generating hash codes within a range. The idea is to get the best of both the worlds i.e. physiochemical properties and subgraph patterns.

Conclusion

The quality of the hashed fingerprint depends a lot on the patterns generated (UPC), size of the bitsets, hashing function and random number generator. Next step for me would to cluster these similarity matrices or perform Leave One Out test on the dataset to check the specificity and sensitivity of the model.

References:

Further reading and reference therein will give you more insight into the story:

  1. jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints
  2. Hashed Fingerprints and RNG’s
  3. Molecular fingerprints, background
  4. Fingerprints – Screening and Similarity
  5. Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval
  6. Extended-Connectivity Fingerprints
  7. more…..
Please feel free to make suggestions and leave your thoughts/comments.

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Thread safe SMSD

14 Wednesday Sep 2011

Posted by chembioinfo in CDK, Java, Similarity, SMSD, Substructure, Work

≈ Leave a comment

Tags

CDK, Isomorphism, Java, join Structures, MCS, Similarity, SMSD, substructure, Tanimoto, VF2

How can I run SMSD using Java Thread….is SMSD thread safe?

The short answer is “YES” you can.

Here is a snippet to demonstrate that SMSD is thread safe.

https://github.com/asad/ThreadSMSD

The latest version of the SMSD is not also thread safe but it also comes with optimised matchers (bond and atom).

For, example you could demand it to respect the ring matching (restrict) while finding MCS.

Thus an MCS between a query SMILES “ONC1=CC=CC=C1” and target SMILES “O1C=CC=CN1C1=CC=CC=C1”.

a) without ring matcher is “ONC1=CC=CC=C1”

b) with ring matcher is “C1=CC=CC=C1”

This might be useful for few SMSD users.

Find MCS without ring matching restrictions

MCS with ring matcher OFF flag.

Respect rings while finding MCS

Respect rings while finding MCS (non ring to ring matches are skipped). MCS with ring matcher ON flag.

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Improved Atom Typing in the Chemistry Development Kit (CDK)

06 Wednesday Jul 2011

Posted by chembioinfo in Atom Typing, CDK, Java, Similarity, SMSD, Substructure, Work

≈ Leave a comment

Tags

Atom typing, CDK, chemaxon, explicit hydrogen, Hydrogen adder, Java, join Structures, KEGG, SMSD, substructure

Atom type test casesEnriched CDK Atom Type Model vs. ChemAxon vs. Present CDK Atom Type Model

Improved Atom typing leads to better hydrogen handling in the CDK

Updated on : 28th July 2011

A chemically valid atom typing leads to better chemistry and consistent outputs from any chemo-informatics toolkit. In my previous post, I had highlighted the performance of the CDK atom typing on the KEGG dataset and the pressing need to improve it. Mr. Nimish Gopal (from IIT Roorkee, India) has taken up this herculean task to fix the missing CDK atom types (reported in the KEGG molecules) as part of his summer internship in Prof. Thornton’s group at the EMBL-EBI. Since I am deeply involved with this project, I thought it would be fruitful for the community to know about the progress we have made in this direction.

Aim: The aim of this project was to enrich the atom typing model in the CDK.

Assumption: A valid atom typing will lead to an accurate explicit hydrogen count.

Conclusion: We have successfully added around 90 missing 124 missing/curated atom types in the CDK. They range from metals to salts, etc. You can find the atom type enriched CDK on my github CDK branch named as atomtype.

Model: We have performed cross validation using Chemaxon as gold standard. The KEGG molecules were used as test cases. Each KEGG mol file was read by the CDK; hydrogens were stripped and two cloned copies were generated. Explicit hydrogens were added using the CDK and Chemaxon on the respective copies of the cloned molecules. The explicit hydrogen count was recored and if they were empirically same then a subgraph Isomorphism search was performed on them (in order to make sure the hydrogens were placed correctly).

Result: 15499 KEGG molecules were tested and only 5 of them disagreed between the CDK and Chemaxon explicit hydrogen adder results. From the graphs its clear that the improved and enriched atom typing in the CDK outperforms the present CDK atom typing model. The new enriched atom typing model based CDK hydrogen adder also concurs with the Chemaxon hydrogen adder results.

The scatterplot and regression lines are linear as the resulting explicit hydrogen counts are same except few outliers

The failed cases are of ambiguous nature (C11065, C13932, C18368, C18380, C18384) and both softwares have different approach to handle such cases. The Chemaxon adds hydrogens to each atom in a molecules which is perceived correctly and skips ones (sets an error flag) which are not defined correctly. Whereas, CDK adds hydrogens to each atom in a molecule but exits (throws exception) as soon as it finds an untyped atom. Theoretically, they should end up giving same results but technically they differ.

The good news is that now CDK is able to atom type all the valid molecules from the KEGG database (June 2011 release). I am sure that there are few missing atom types which might crop up with some other small molecule databases ( e.g. ChEBI or PubChem etc.).

Acknowledgement: 

  • Prof. Thornton for her support and guidance.
  • I must thank Gilleain who helped Nimish to get well versed with JAVA code hierarchy in the CDK.
  • As a CDK starter, Nimish also found the Groovy book on the CDK by Egon very helpful.
  • Egon’s blog post for reporting missing CDK Atom types.
  • The Chemaxon software for granting us the license to use its hydrogen adder.
  • The SMSD for performing the isomorphism between molecules with explicit hydrogens generated by the CDK and Chemaxon.
  • The EMBL for funding this project.
We are glad to learn about the strong interest shown by the CDK community to have this work integrated back into the CDK. Thank you all for your support, we (Nimish, Gilleain and myself) have already submitted a CDK patch and it contains the following atom types.

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Atom Typing with the CDK

09 Thursday Jun 2011

Posted by chembioinfo in Atom Typing, CDK, Java, Work

≈ 4 Comments

Tags

Atom typing, CDK, Hydrogens, Java, KEGG

Atom typing is an important and integral part of any chemoinformatics software. Most of the calculations, and more importantly the assignment of implicit/explicit hydrogen(s) depend on this. Thanks to Egon, Rajarshi and others, the CDK has improved a lot over the years.

I (with Nimish a summer intern) did a quick test on the KEGG molecules (before KEGG becomes a paid service on 30th June, which is a pity…..!).

The good news is that only 1.37% molecules failed the “Atom Typing” test, as compared to last year (the failure rate was approximately 10%, mostly phosphates). Most of the failed atoms are metals (Cl,N,S,Fe,Mn,Zn,Cu, etc).

Here are the raw results: https://gist.github.com/1016328

Wish list: I hope the CDK developers or chemists can fix these too!

Kindly comment and leave your suggestions.

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Join/Fuse two molecules

23 Saturday Apr 2011

Posted by chembioinfo in CDK, Java, Similarity, SMSD, Substructure, Work

≈ Leave a comment

Tags

CDK, Java, join Structures, MCS, Similarity, Small Molecule, SMSD, substructure, Union

Lately, an interesting question was posed to me.

How can I join two molecules?

Well for many of us this might be a regular exercise on a chemical editor but this is a little trickier to program. So mathematically, we are aiming for a union between two molecules. The union between two molecules can be defined as (A ∪ B) = n(A) + n(B) – (A ∩ B), where A is a query molecule and B is the target molecule. The intersection (A ∩ B) can be obtained by finding isomorphism between two molecules. The joining part is a little challenging as not all combination(s) might be chemically valid. So we need to find combination(s) which are unique and unsaturated!

Here is a little snippet to perform this task in Java using the SMSD and CDK.

You can download the java source UnionMolecule and java executable Union.jar.

Union between query and target molecule

Union between query and target molecules

Usage example:

java -jar Union.jar "OOC1=CC=CC=C1" "c1ccc(cc1)c2ccccc2"

Results:

------------- Combination 1--------------------
Query SMILES OOC1=CC=CC=C1, count 8
Target SMILES c1ccc(cc1)c2ccccc2, count 12
Union SMILES OOc1ccccc1c2ccccc2, count 14

------------- Combination 2--------------------
Query SMILES OOC1=CC=CC=C1, count 8
Target SMILES c1ccc(cc1)c2ccccc2, count 12
Union SMILES OOc1cccc(c1)c2ccccc2, count 14

------------- Combination 3--------------------
Query SMILES OOC1=CC=CC=C1, count 8
Target SMILES c1ccc(cc1)c2ccccc2, count 12
Union SMILES OOc1ccc(cc1)c2ccccc2, count 14


Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Benchmarking Substructure Search

15 Tuesday Mar 2011

Posted by chembioinfo in CDK, Java, Similarity, SMSD, Substructure, Work

≈ 3 Comments

Tags

CDK, Isomorphism, Java, Similarity, SMSD, substructure, VF2

Here are some interesting results as they highlight the run time between three flavours of substructure search algorithms. I have chosen, a) UIT (Universal Isomorphism Tester) as implemented in the CDK (Chemistry Development Kit); b) VF2 as implemented in the SMSD (I have further optimised it); and c) VF2 as ported from ChemKit. The former two implementations (i.e. a and b) are optimised for substructure search whereas the later implementation (i.e. c) is optimised for the graph automorphism search (further extension would be good/ideal).

While benchmarking the substructure search, I realised that the CDK AtomContainer takes ample amount of time for methods such as getConnectedAtomsList(atom) or getConnectedAtomsCount(atom). Hence, I ended up refactoring the AtomContainer to store the adjacency map of atoms and bonds. This will speed up the above mentioned methods and the benchmark results will bring out the actual differences in the run time. Please find the updated AtomContainer here (there is scope for further optimisation).

Now let’s focus on the results from the benchmark 🙂

Benchmark results from the substructure search
Benchmark results from the substructure search. X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

The comparative study/benchmark results between the three algorithms revealed some very interesting characteristics.

1) ChemKit version of the VF2 is very fast as it just looks for graph automorphisms (light green bars on the graph).

2) CDK UIT seems to perform better on graphs with atom count <= 19.

3) VF2 implementation in the SMSD performs better on graphs with atom count > 19 and yes this implementation is better off in the worst case (i.e. if no hits were reported, e.g.: first bar with 11 atoms).

Further permutation based tests will reveal more intricacies, artefacts and the innate nature of these algorithms and implementation. If you are interested in these implementations and test cases please feel free to refer to the benchmark code on the github.

Update (16 March 2011): VF2 implementation from the Chemkit now calculates substructure isomorphism. Thanks to Kyle Lutz for his feedback. The new runtime of the VF2 chemkit implementation looks as good as the one in the SMSD except that the rejection of a solution (in case of mismatch) is faster in the SMSD version of the VF2 whereas matching is faster in the former. Rest of the facts regarding size and runtime mentioned in the graph above still hold good. Below is the updated graph.

Update (17 March 2011): Further optimisation of the ChemKit ported VF2 implementation makes is fastest (reporting the matches and rejection of solution; refer to the benchmark code on the github).

Performance benchmark of substructure searches VF2 vs CDK's UIT
Performance benchmark of substructure searches VF2 vs CDK’s UIT
X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures. 

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...
← Older posts

Subscribe

  • Entries (RSS)
  • Comments (RSS)

Archives

  • August 2015
  • December 2014
  • March 2014
  • February 2014
  • July 2012
  • April 2012
  • November 2011
  • October 2011
  • September 2011
  • August 2011
  • July 2011
  • June 2011
  • April 2011
  • March 2011
  • April 2010
  • December 2009

Categories

  • EC-BLAST
  • General
    • Food
    • Life
  • Work
    • CDK
      • Atom Typing
    • Enzymes
      • Similarity
    • Java
    • SMSD
    • Substructure

Meta

  • Register
  • Log in

Top Posts & Pages

  • How are enzymes classified?
  • Ultra Fast substructure search based on the VF2 outperforms CDK UIT
  • EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes
Follow Asad's Blog on WordPress.com

Enter your email address to follow this blog and receive notifications of new posts by email.

Recent

  • A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function! August 15, 2015
  • ChemBLAST: Old dog new tricks December 31, 2014
  • Atom Atom Mapping (AAM) and Challenges March 18, 2014
  • EC-BLAST Tutorial for Hands-on Training February 4, 2014
  • Shortest Path and Molecular Hashed Fingerprints July 23, 2012

Archives

Tags

Atom typing bitset BRENDA CDK chemaxon circular fingerprints EBI EC Number Enzyme Classification Fingerprints hashCode() hashed fingerprints Indian Isomorphism IUBMB enzyme nomenclature Java join Structures KEGG MACiE MCS Mersenne Twister Random() Reaction Reaction Similarity Similarity Small Molecule SMSD substructure Tanimoto VF2

Blogroll

  • Andrew Dalke
  • ChEMBL
  • Christoph Steinbeck
  • Egon Willighagen
  • Gilleain Torrance
  • Joerg Kurt Wegner
  • Niyaz Ahmed
  • Noel O'Blog
  • Rajarshi Guha
  • Richard Apodaca

RSS Asad’s Blog

  • A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function! August 15, 2015
    Next Generation Sequencing (NGS) data is knocking at our door and simultaneously, our ability to design novel enzymes (rational design or directed …Continue reading →
    chembioinfo
  • ChemBLAST: Old dog new tricks December 31, 2014
    Updated:07/01/2015 BLAST-Basic Local Alignment Tool was born in the 1990s (1,2) and has since been the bread and butter of …Continue reading →
    chembioinfo
  • Atom Atom Mapping (AAM) and Challenges March 18, 2014
    We have just released our long awaited AAM tool in the public domain…this was long over due! You can download the tool from …Continue reading →
    chembioinfo
  • EC-BLAST Tutorial for Hands-on Training February 4, 2014
    EC-BLAST Tutorial for Hands-on Training Publication:  EC-BLAST: a tool to automatically search and compare enzyme reactions, SA Rahman, SM Cuesta, N …Continue reading →
    chembioinfo
  • Shortest Path and Molecular Hashed Fingerprints July 23, 2012
    Shortest Path (SP) has been used in many aspects of graph traversing. The idea is to minimise the cost (number …Continue reading →
    chembioinfo
  • EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes April 11, 2012
    Enzymes have been part of our evolutionary machinery and it’s importance is ever increasing in our life. An enzymatic hierarchal …Continue reading →
    chembioinfo
  • Improved CDK Hashed Fingerprinter November 4, 2011
    Edited: 4th Nov, 10:20 AM In my previous post, I discussed the impact of the hashcode and random number generators …Continue reading →
    chembioinfo
  • Revisiting Molecular Hashed Fingerprints October 30, 2011
    Introduction Fingerprints have been widely used in various fields to find similar features. Now for those of you who are …Continue reading →
    chembioinfo
  • Thread safe SMSD September 14, 2011
    How can I run SMSD using Java Thread….is SMSD thread safe? The short answer is “YES” you can. Here is …Continue reading →
    chembioinfo
  • Indian Style Ginger Tea August 20, 2011
    Ingredients: 3 tsp. of Tea Leaves (you can use the Indian tea bags if preferred…Assam/Darjeeling etc) ½” piece of Ginger crushed …Continue reading →
    chembioinfo
I Voted

Top Posts & Pages

  • How are enzymes classified?
  • Ultra Fast substructure search based on the VF2 outperforms CDK UIT
  • EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes
Follow Asad's Blog on WordPress.com

Enter your email address to follow this blog and receive notifications of new posts by email.

Recent

  • A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function! August 15, 2015
  • ChemBLAST: Old dog new tricks December 31, 2014
  • Atom Atom Mapping (AAM) and Challenges March 18, 2014
  • EC-BLAST Tutorial for Hands-on Training February 4, 2014
  • Shortest Path and Molecular Hashed Fingerprints July 23, 2012

Archives

Tags

Atom typing bitset BRENDA CDK chemaxon circular fingerprints EBI EC Number Enzyme Classification Fingerprints hashCode() hashed fingerprints Indian Isomorphism IUBMB enzyme nomenclature Java join Structures KEGG MACiE MCS Mersenne Twister Random() Reaction Reaction Similarity Similarity Small Molecule SMSD substructure Tanimoto VF2

Blogroll

  • Andrew Dalke
  • ChEMBL
  • Christoph Steinbeck
  • Egon Willighagen
  • Gilleain Torrance
  • Joerg Kurt Wegner
  • Niyaz Ahmed
  • Noel O'Blog
  • Rajarshi Guha
  • Richard Apodaca

RSS Asad’s Blog

  • A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function! August 15, 2015
    Next Generation Sequencing (NGS) data is knocking at our door and simultaneously, our ability to design novel enzymes (rational design or directed …Continue reading →
    chembioinfo
  • ChemBLAST: Old dog new tricks December 31, 2014
    Updated:07/01/2015 BLAST-Basic Local Alignment Tool was born in the 1990s (1,2) and has since been the bread and butter of …Continue reading →
    chembioinfo
  • Atom Atom Mapping (AAM) and Challenges March 18, 2014
    We have just released our long awaited AAM tool in the public domain…this was long over due! You can download the tool from …Continue reading →
    chembioinfo
  • EC-BLAST Tutorial for Hands-on Training February 4, 2014
    EC-BLAST Tutorial for Hands-on Training Publication:  EC-BLAST: a tool to automatically search and compare enzyme reactions, SA Rahman, SM Cuesta, N …Continue reading →
    chembioinfo
  • Shortest Path and Molecular Hashed Fingerprints July 23, 2012
    Shortest Path (SP) has been used in many aspects of graph traversing. The idea is to minimise the cost (number …Continue reading →
    chembioinfo
  • EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes April 11, 2012
    Enzymes have been part of our evolutionary machinery and it’s importance is ever increasing in our life. An enzymatic hierarchal …Continue reading →
    chembioinfo
  • Improved CDK Hashed Fingerprinter November 4, 2011
    Edited: 4th Nov, 10:20 AM In my previous post, I discussed the impact of the hashcode and random number generators …Continue reading →
    chembioinfo
  • Revisiting Molecular Hashed Fingerprints October 30, 2011
    Introduction Fingerprints have been widely used in various fields to find similar features. Now for those of you who are …Continue reading →
    chembioinfo
  • Thread safe SMSD September 14, 2011
    How can I run SMSD using Java Thread….is SMSD thread safe? The short answer is “YES” you can. Here is …Continue reading →
    chembioinfo
  • Indian Style Ginger Tea August 20, 2011
    Ingredients: 3 tsp. of Tea Leaves (you can use the Indian tea bags if preferred…Assam/Darjeeling etc) ½” piece of Ginger crushed …Continue reading →
    chembioinfo
I Voted

Blog at WordPress.com.

Cancel
loading Cancel
Post was not sent - check your email addresses!
Email check failed, please try again
Sorry, your blog cannot share posts by email.
%d bloggers like this: