• About Me
  • Brief CV

Asad's Blog

~ Man and his will to survive!

Asad's Blog

Monthly Archives: March 2011

Benchmarking Substructure Search

15 Tuesday Mar 2011

Posted by chembioinfo in CDK, Java, Similarity, SMSD, Substructure, Work

≈ 3 Comments

Tags

CDK, Isomorphism, Java, Similarity, SMSD, substructure, VF2

Here are some interesting results as they highlight the run time between three flavours of substructure search algorithms. I have chosen, a) UIT (Universal Isomorphism Tester) as implemented in the CDK (Chemistry Development Kit); b) VF2 as implemented in the SMSD (I have further optimised it); and c) VF2 as ported from ChemKit. The former two implementations (i.e. a and b) are optimised for substructure search whereas the later implementation (i.e. c) is optimised for the graph automorphism search (further extension would be good/ideal).

While benchmarking the substructure search, I realised that the CDK AtomContainer takes ample amount of time for methods such as getConnectedAtomsList(atom) or getConnectedAtomsCount(atom). Hence, I ended up refactoring the AtomContainer to store the adjacency map of atoms and bonds. This will speed up the above mentioned methods and the benchmark results will bring out the actual differences in the run time. Please find the updated AtomContainer here (there is scope for further optimisation).

Now let’s focus on the results from the benchmark 🙂

Benchmark results from the substructure search
Benchmark results from the substructure search. X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

The comparative study/benchmark results between the three algorithms revealed some very interesting characteristics.

1) ChemKit version of the VF2 is very fast as it just looks for graph automorphisms (light green bars on the graph).

2) CDK UIT seems to perform better on graphs with atom count <= 19.

3) VF2 implementation in the SMSD performs better on graphs with atom count > 19 and yes this implementation is better off in the worst case (i.e. if no hits were reported, e.g.: first bar with 11 atoms).

Further permutation based tests will reveal more intricacies, artefacts and the innate nature of these algorithms and implementation. If you are interested in these implementations and test cases please feel free to refer to the benchmark code on the github.

Update (16 March 2011): VF2 implementation from the Chemkit now calculates substructure isomorphism. Thanks to Kyle Lutz for his feedback. The new runtime of the VF2 chemkit implementation looks as good as the one in the SMSD except that the rejection of a solution (in case of mismatch) is faster in the SMSD version of the VF2 whereas matching is faster in the former. Rest of the facts regarding size and runtime mentioned in the graph above still hold good. Below is the updated graph.

Update (17 March 2011): Further optimisation of the ChemKit ported VF2 implementation makes is fastest (reporting the matches and rejection of solution; refer to the benchmark code on the github).

Performance benchmark of substructure searches VF2 vs CDK's UIT
Performance benchmark of substructure searches VF2 vs CDK’s UIT
X-axis: 31 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures. 

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Ultra Fast substructure search based on the VF2 outperforms CDK UIT

10 Thursday Mar 2011

Posted by chembioinfo in CDK, Java, SMSD, Substructure, Work

≈ 11 Comments

Tags

CDK, Isomorphism, Java, Similarity, SMSD, substructure, VF2

Updated (17 March 2010): Gilleain and myself were able to code an ultrafast VF2 algorithm for Isomorphism substructure search (permutation bug is resolved). The code is ported or re-factored from Chemkit and I have further optimised the code at our end. The previous VF2 code in the SMSD is more MCS oriented which was a big compromise in terms of speed while performing a substructure search. I have mentioned these issues in a previous blog. The whole idea of a substructure search is to answer a boolean question i.e. is A a subgraph/substructure of B ? If the answer is true then matching atoms are reported. On the other hand MCS where it’s about reporting the commonality between A and B.

Below is the benchmark graph which speaks about the performance (30 queries were subjected to 15030 targets, same as my previous benchmark test cases) of the VF2.

Optimised Chemkit VF2 vs CDK's UIT
Substructure benchmark results between VF and CDK’s UIT
X-axis: 30 test cases subjected to 15030 targets (results are sorted in ascending order of the query atom count). Y- axis: total percentage of time spent in calculating the substructures.

As is apparent from the benchmark graph, VF2 takes less than 10-20% of the total time (in most cases) for performing substructure searches (results are sorted in ascending order of the query atom count). So as for now we have an implementation which at last outperforms UIT (CDK code) in terms of speed for automorphic graphs. The performance of the VF2 improves as the query atom count (graph size) increases. We might need a few changes in the CDK AtomContainer class itself to further improve the speed/performance but this will involve further discussions with CDK developers.

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

How are enzymes classified?

03 Thursday Mar 2011

Posted by chembioinfo in Enzymes, Similarity, Work

≈ 8 Comments

Tags

BRENDA, EC Number, Enzyme, Enzyme Classification, IUBMB, IUBMB enzyme nomenclature, KEGG, MACiE, Reaction, Small Molecule, Tanimoto

How are enzymes classified?

Metabolism influences building or replacement of tissue, conversion of food to energy, disposal of waste materials, reproduction etc. “Catalysis” is defined as the acceleration of a chemical reaction by a substance which itself undergoes no permanent chemical change. Most biochemical reactions do not take place spontaneously and enzyme catalysis plays an important role in biochemical reactions necessary for all life processes. Without enzymes, these reactions would take place at a rate far too slow for effective metabolism.

Enzymes can be classified by the kind of chemical reaction they catalyze. One such scheme of enzyme classification is defined by IUBMB.

The IUBMB assigns a 4-digit code to each enzyme. Each enzyme is prefixed by EC, followed by the digits.

For example: oxidoreductases EC 1.1.1.1

1.     The first digit denotes “Class” of the enzyme

2.     The second digit indicates, “Sub-class” of the enzyme

3.     The third digit gives “Sub sub-class” of the enzyme

4.     The fourth digit in the code is “Serial number” of the enzyme

The classification is as follows:

Group Name Type of Reaction Catalysed Example
Oxidoreductases Oxidation-reduction reactions Alcohol oxidoreductase (EC 1.1)
Transferases Transfer of functional groups Methyltransferase (EC 2.1)
Hydrolases Hydrolysis reactions Lipase (EC 3.1)
Lyases Addition to double bonds or single bonds Decarboxylases (EC 4.1)
Isomerases Isomerization reactions Epimerases and Racemases (EC 5.1)
Ligases Formation of bonds with ATP cleavage Enzymes forming carbon-oxygen bonds (EC 6.1)

b) How can I find similar enzymes?

Any similarity search is based on the presence of similar patterns (similar bond changes and/or small molecules) shared between query and target reactions. A large number of shared patterns results in higher similarity score or lesser distance score. In Bioinformatics, the concept of similarity or distance is used to find similar sequences based on amino acid similarity, structural topology, etc. In Chemoinformatics similarity between small molecules/drug molecules (i.e. based on Tanimoto score) is based on the presence of similar bonds and atoms between query and target molecules.

c) Literature

  1. Automatic Assignment of EC Numbers.
  2. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions.
  3. Automatic Determination of Reaction Mappings and Reaction Center Information. 2. Validation on a Biochemical Reaction Database.
  4. Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps.
  5. Chemical similarity searching.
  6. Quantitative comparison of catalytic mechanisms and overall reactions in convergently evolved enzymes: implications for classification of enzyme function.
  7. Using Reaction Mechanism to Measure Enzyme Similarity
  8. etc.

I reckon in the near future we might see such concepts being adapted by IUBMB itself to annotate and classify enzymes.

This would be vital in the study of the interactions between the components of biological systems (metabolites, enzymes and metabolic pathways), and how these interactions give rise to the function and behavior of that system.

As always, thoughts/suggestions are welcome!

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Performing substructure searches?

03 Thursday Mar 2011

Posted by chembioinfo in CDK, SMSD, Substructure, Work

≈ 4 Comments

Tags

MCS, Similarity, substructure, Tanimoto

The design principle of the SMSD code was always to find Maximum Common Subgraph (MCS) between two molecules. Of course, it can be optimised to find substructures between two molecules. I was lazy in doing so but a recent post on the CDK mailing list about the SMSD based substructure search performance was a wakeup call for me.

I have made some changes to the SMSD code and now the substructure search is not that bad. Nonetheless, I have to redesign the substructure search part of the SMSD as substructure search ideally shouldn’t perform an exhaustive comparison between all possible combinations of the graph. It’s usually a boolean question i.e. is A a subgraph/substructure of B, unlike in the case of MCS where it’s about reporting the commonality between A and B.

So now one can query substructure search in the SMSD using the following code.

private static int getSMSDSolutionCount(IMolecule query, IMolecule target) throws CDKException {
SubStructure substructure = new SubStructure();

substructure.set(query, target);
if (substructure.isSubgraph(true)) {

return 1;

} else {

return 0;

}

}

The underlying substructure code can be found here

https://github.com/asad/SMSD-GUI/blob/master/src/org/openscience/smsd/SubStructure.java

New link has optimised substructure search code

https://github.com/asad/SMSD/blob/master/src/org/openscience/smsd/Substructure.java

Results: Here is a small substructure search benchmark test between UIT and SMSD. I have taken 30 query molecules and fired each of them against a decoy of 15030 molecules. At least a hit is expected from each query run.

Substructure benchmark between SMSD vs UIT

Substructure benchmark between SMSD vs UIT

In this case UIT seems to have an edge over SMSD as, in a few cases it takes more time. I have to check if the real difference is between implementation of the code or the actual algorithm itself.

You can find the results and the benchmark code here https://github.com/asad/Benchmark

The change(s) which is required in the SMSD code will involve heavy refactoring of the VF lib code. The present code works fine for the MCS but it’s time to optimise it for the substructure search. Any suggestions in optimising the VF code would be more than welcome. Please refer to the following class if you have any suggestions 🙂

Links updated: 1st Oct 2011

https://github.com/asad/SMSD/blob/master/src/org/openscience/smsd/algorithm/vflib/map/VFMapper.java

https://github.com/asad/SMSD/blob/master/src/org/openscience/smsd/algorithm/vflib/map/VFState.java

I hope someone out there will have a cracking solution!

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

MCS between N-Molecules

01 Tuesday Mar 2011

Posted by chembioinfo in SMSD, Substructure, Work

≈ 2 Comments

Tags

CDK, MCS, SMSD, substructure

Recently we have been working on extending and cleaning the SMSD command line tool. One of the priority requests from the SMSD users was to be able to obtain MCS between N-molecules. Thanks to Gilleain who has been instrumental in cleaning the code and providing a good visual output to the searches. He has implemented a circular layout for the MCS search between N-molecules. This looks cleaner and visually appealing to the users.

You can download SMSDcmd-3.0.jar from our website. Try the following command line option for obtaining MCS between N-molecules

> java -jar SMSDcmd-3.0.jar -T SDF -N Data/arom.sdf -g -b

You can download/clone the arom.sdf.

The result would look something like this

MCS between N-molecules
52.190191 0.187005

Share this:

  • Email
  • Print
  • Facebook
  • Twitter
  • LinkedIn
  • Google
  • Reddit

Like this:

Like Loading...

Subscribe

  • Entries (RSS)
  • Comments (RSS)

Archives

  • August 2015
  • December 2014
  • March 2014
  • February 2014
  • July 2012
  • April 2012
  • November 2011
  • October 2011
  • September 2011
  • August 2011
  • July 2011
  • June 2011
  • April 2011
  • March 2011
  • April 2010
  • December 2009

Categories

  • EC-BLAST
  • General
    • Food
    • Life
  • Work
    • CDK
      • Atom Typing
    • Enzymes
      • Similarity
    • Java
    • SMSD
    • Substructure

Meta

  • Register
  • Log in

Top Posts & Pages

  • How are enzymes classified?
  • Ultra Fast substructure search based on the VF2 outperforms CDK UIT
  • EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes
Follow Asad's Blog on WordPress.com

Enter your email address to follow this blog and receive notifications of new posts by email.

Recent

  • A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function! August 15, 2015
  • ChemBLAST: Old dog new tricks December 31, 2014
  • Atom Atom Mapping (AAM) and Challenges March 18, 2014
  • EC-BLAST Tutorial for Hands-on Training February 4, 2014
  • Shortest Path and Molecular Hashed Fingerprints July 23, 2012

Archives

Tags

Atom typing bitset BRENDA CDK chemaxon circular fingerprints EBI EC Number Enzyme Classification Fingerprints hashCode() hashed fingerprints Indian Isomorphism IUBMB enzyme nomenclature Java join Structures KEGG MACiE MCS Mersenne Twister Random() Reaction Reaction Similarity Similarity Small Molecule SMSD substructure Tanimoto VF2

Blogroll

  • Andrew Dalke
  • ChEMBL
  • Christoph Steinbeck
  • Egon Willighagen
  • Gilleain Torrance
  • Joerg Kurt Wegner
  • Niyaz Ahmed
  • Noel O'Blog
  • Rajarshi Guha
  • Richard Apodaca

RSS Asad’s Blog

  • A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function! August 15, 2015
    Next Generation Sequencing (NGS) data is knocking at our door and simultaneously, our ability to design novel enzymes (rational design or directed …Continue reading →
    chembioinfo
  • ChemBLAST: Old dog new tricks December 31, 2014
    Updated:07/01/2015 BLAST-Basic Local Alignment Tool was born in the 1990s (1,2) and has since been the bread and butter of …Continue reading →
    chembioinfo
  • Atom Atom Mapping (AAM) and Challenges March 18, 2014
    We have just released our long awaited AAM tool in the public domain…this was long over due! You can download the tool from …Continue reading →
    chembioinfo
  • EC-BLAST Tutorial for Hands-on Training February 4, 2014
    EC-BLAST Tutorial for Hands-on Training Publication:  EC-BLAST: a tool to automatically search and compare enzyme reactions, SA Rahman, SM Cuesta, N …Continue reading →
    chembioinfo
  • Shortest Path and Molecular Hashed Fingerprints July 23, 2012
    Shortest Path (SP) has been used in many aspects of graph traversing. The idea is to minimise the cost (number …Continue reading →
    chembioinfo
  • EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes April 11, 2012
    Enzymes have been part of our evolutionary machinery and it’s importance is ever increasing in our life. An enzymatic hierarchal …Continue reading →
    chembioinfo
  • Improved CDK Hashed Fingerprinter November 4, 2011
    Edited: 4th Nov, 10:20 AM In my previous post, I discussed the impact of the hashcode and random number generators …Continue reading →
    chembioinfo
  • Revisiting Molecular Hashed Fingerprints October 30, 2011
    Introduction Fingerprints have been widely used in various fields to find similar features. Now for those of you who are …Continue reading →
    chembioinfo
  • Thread safe SMSD September 14, 2011
    How can I run SMSD using Java Thread….is SMSD thread safe? The short answer is “YES” you can. Here is …Continue reading →
    chembioinfo
  • Indian Style Ginger Tea August 20, 2011
    Ingredients: 3 tsp. of Tea Leaves (you can use the Indian tea bags if preferred…Assam/Darjeeling etc) ½” piece of Ginger crushed …Continue reading →
    chembioinfo
I Voted

Top Posts & Pages

  • How are enzymes classified?
  • Ultra Fast substructure search based on the VF2 outperforms CDK UIT
  • EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes
Follow Asad's Blog on WordPress.com

Enter your email address to follow this blog and receive notifications of new posts by email.

Recent

  • A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function! August 15, 2015
  • ChemBLAST: Old dog new tricks December 31, 2014
  • Atom Atom Mapping (AAM) and Challenges March 18, 2014
  • EC-BLAST Tutorial for Hands-on Training February 4, 2014
  • Shortest Path and Molecular Hashed Fingerprints July 23, 2012

Archives

Tags

Atom typing bitset BRENDA CDK chemaxon circular fingerprints EBI EC Number Enzyme Classification Fingerprints hashCode() hashed fingerprints Indian Isomorphism IUBMB enzyme nomenclature Java join Structures KEGG MACiE MCS Mersenne Twister Random() Reaction Reaction Similarity Similarity Small Molecule SMSD substructure Tanimoto VF2

Blogroll

  • Andrew Dalke
  • ChEMBL
  • Christoph Steinbeck
  • Egon Willighagen
  • Gilleain Torrance
  • Joerg Kurt Wegner
  • Niyaz Ahmed
  • Noel O'Blog
  • Rajarshi Guha
  • Richard Apodaca

RSS Asad’s Blog

  • A Romance between Biology and Chemistry – Protein Sequences, Molecules and Enzyme function! August 15, 2015
    Next Generation Sequencing (NGS) data is knocking at our door and simultaneously, our ability to design novel enzymes (rational design or directed …Continue reading →
    chembioinfo
  • ChemBLAST: Old dog new tricks December 31, 2014
    Updated:07/01/2015 BLAST-Basic Local Alignment Tool was born in the 1990s (1,2) and has since been the bread and butter of …Continue reading →
    chembioinfo
  • Atom Atom Mapping (AAM) and Challenges March 18, 2014
    We have just released our long awaited AAM tool in the public domain…this was long over due! You can download the tool from …Continue reading →
    chembioinfo
  • EC-BLAST Tutorial for Hands-on Training February 4, 2014
    EC-BLAST Tutorial for Hands-on Training Publication:  EC-BLAST: a tool to automatically search and compare enzyme reactions, SA Rahman, SM Cuesta, N …Continue reading →
    chembioinfo
  • Shortest Path and Molecular Hashed Fingerprints July 23, 2012
    Shortest Path (SP) has been used in many aspects of graph traversing. The idea is to minimise the cost (number …Continue reading →
    chembioinfo
  • EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes April 11, 2012
    Enzymes have been part of our evolutionary machinery and it’s importance is ever increasing in our life. An enzymatic hierarchal …Continue reading →
    chembioinfo
  • Improved CDK Hashed Fingerprinter November 4, 2011
    Edited: 4th Nov, 10:20 AM In my previous post, I discussed the impact of the hashcode and random number generators …Continue reading →
    chembioinfo
  • Revisiting Molecular Hashed Fingerprints October 30, 2011
    Introduction Fingerprints have been widely used in various fields to find similar features. Now for those of you who are …Continue reading →
    chembioinfo
  • Thread safe SMSD September 14, 2011
    How can I run SMSD using Java Thread….is SMSD thread safe? The short answer is “YES” you can. Here is …Continue reading →
    chembioinfo
  • Indian Style Ginger Tea August 20, 2011
    Ingredients: 3 tsp. of Tea Leaves (you can use the Indian tea bags if preferred…Assam/Darjeeling etc) ½” piece of Ginger crushed …Continue reading →
    chembioinfo
I Voted

Blog at WordPress.com.

Cancel
loading Cancel
Post was not sent - check your email addresses!
Email check failed, please try again
Sorry, your blog cannot share posts by email.
%d bloggers like this: