EC-BLAST: A Novel Tool for Finding Chemically Similar Enzymes

Enzymes have always been part of our evolutionary machinery, and their importance in our lives is ever increasing. A hierarchical functional classification has been developed to cluster similar enzymes based on their chemistry (kindly refer to my previous blog post on enzymes). A parallel system involves sequence-based and protein structure-based classification. One of the most challenging issues in today’s bio/chemoinformatics is to automatically link sequence knowledge with enzymatic chemistry. Many methods in the literature address this issue, but it is hard to find a direct link that holds true for all cases. Very recently, in Prof. Janet Thornton’s group, we have come up with a web tool, “FunTree”, for linking enzyme superfamilies based on knowledge of their evolution, derived from sequences and structures (proteins and small molecules).

It is very difficult to find a one-to-one mapping between genes->proteins->enzymes, and it is equally mind-boggling to navigate this space. This is one of the reasons why we have many orphan enzymes, i.e. enzymes which do not have a sequence assigned to them yet. On the one hand, we have ever-growing sequence databases and sophisticated tools like BLAST and FASTA to compare them. On the other hand, the biochemical side of the story moves slowly, as we have a limited number of publicly available chemical databases and tools. In recent years, however, databases like BRENDA, KEGG, BioCyc, UniProt, EC->PDB and SwissProt have emerged to link sequence to chemistry. There are efforts to bring various resources of enzyme chemistry under one umbrella, and one such web portal is the “Enzyme Portal”. Likewise, there are a few curated databases linking enzyme function and reaction mechanism, such as MACiE, Rhea and SFLD.

The challenge for a biologist/chemist is to find a tool which can function like BLAST (as a magic black box) in finding similar enzymes in a reaction database (a needle in a haystack). The good news is that we have made some progress in this interesting area of research by coming up with a novel tool, “EC-BLAST”. The core idea behind this tool is to find similar enzymes ranked by the similarity of the bond changes, the reaction centres or the chemical structures of the participating reactions. One could start a search with a molecule/reaction name or its structure. The Atom-Atom Mapping (AAM) is generated algorithmically on the fly for a balanced input reaction, and the bond changes are automatically deduced and marked before any search is performed.

EC BLAST front page

The search results should give us better insight into the catalytic promiscuity of enzymes and complement the sequence-based results obtained from tools like BLAST and FASTA (where the chemistry is not necessarily retained in the results). This will help us to link the evolutionary and mechanistic aspects of enzymes, connecting biological findings with chemical knowledge.

Such tools will also help us gain better insight into toxicity studies (this can be a value-added parameter for the likes of ChEMBL/DrugBank) and in designing novel enzymes, retrosynthetic pathways, etc. Although the first glimpse of EC-BLAST was unveiled at ISMB 2011 in Vienna, where it won the “Killer Apps 2011” award, it has largely remained restricted to the EBI and collaborators. The response at ISMB 2011 (poster here) was very encouraging for us, and there has been an ever-increasing need, scope and demand for such a resource. Hence, we have now decided to go public with a beta version of our web portal service.

EC-Blast result page for bond change similarity searches.

Note: If you are interested in testing this service or in sending us your comments or feedback, please do let me know!

Improved CDK Hashed Fingerprinter

Edited: 4th Nov, 10:20 AM

In my previous post, I discussed the impact of the hash code and the random number generator on a hashed fingerprint. They play a major role in the uniform distribution of the bits in a fixed-length array and in the occurrence of bit clashes. To prove the concept, I prepared a test case of 1200 molecules and performed a substructure search using the default CDK Fingerprinter class and an improved version of the Fingerprinter class (using the Apache Commons HashCodeBuilder and the Mersenne Twister random number generator).
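
As a rough illustration of the idea (not the actual CDK code), the sketch below hashes each pattern, seeds a random number generator with that hash code and draws a bit position in a fixed-length array. The function names, the use of SHA-1 as the hash and the example path strings are assumptions made for this sketch only; Python's `random.Random` happens to be a Mersenne Twister.

```python
import hashlib
import random


def pattern_to_bit(pattern, size=1024):
    """Map one pattern (e.g. a path string such as 'C-C=O') to a bit position."""
    # Stable hash code: Python's built-in hash() for strings is salted per process.
    digest = hashlib.sha1(pattern.encode("utf-8")).digest()
    hash_code = int.from_bytes(digest[:8], "big")
    # random.Random is a Mersenne Twister; seeding it with the hash code
    # makes the chosen position reproducible for a given pattern.
    return random.Random(hash_code).randrange(size)


def fingerprint(patterns, size=1024):
    """Return the set of bit positions switched on by the given patterns."""
    return {pattern_to_bit(p, size) for p in patterns}


def bit_clashes(patterns, size=1024):
    """Number of patterns that landed on a bit already taken by another pattern."""
    unique = set(patterns)
    return len(unique) - len(fingerprint(unique, size))


def may_contain(target_bits, query_bits):
    """Substructure pre-screen: every query bit must also be set in the target."""
    return query_bits <= target_bits
```

A bit clash can make `may_contain` return true for a target that does not actually contain the query, which is where false positives come from; the screen itself never rejects a true substructure match.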

Each molecule was searched against every molecule in the dataset, including itself, and the results were recorded at intervals of 200 data points. The gold standard was the set of substructure search results from the SMSD.

Accuracy of the Fingerprints

New Fingerprinter has better accuracy (red line) than the CDK Fingerprinter (low FPR too!)

As expected, the improved version of the Fingerprinter class outperformed the present CDK Fingerprinter class. The number of false positives (FP) was reduced by 35-40% (due to fewer bit clashes), thereby increasing the accuracy of the results, while the true positives remained unchanged. This also had a positive overall impact on the speed of the search!
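
To see why fewer false positives raise the accuracy while the true positives stay the same, here is a small sketch of the arithmetic. The counts are hypothetical and chosen only for illustration (they are not the benchmark numbers), and the function name is an assumption of this sketch.

```python
def screening_metrics(tp, fp, tn, fn):
    """Accuracy and false positive rate (FPR) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
before = screening_metrics(tp=100, fp=200, tn=700, fn=0)
# 40% fewer false positives: 80 former FPs are now correctly rejected (TN).
after = screening_metrics(tp=100, fp=120, tn=780, fn=0)

print(before)
print(after)
```

With these made-up counts the accuracy moves from 0.80 to 0.88 and the FPR drops from about 0.22 to about 0.13, even though the number of true positives is untouched.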

The raw results and the Fingerprinter code are available via my GitHub account: https://github.com/asad/CDKHashFingerPrint.

The present code can further be optimised for lowering the number of false positives.

Thus, a better hash code and random number generator lead to an improved hashed fingerprint.

Improved Atom Typing in the Chemistry Development Kit (CDK)

Atom type test cases

Enriched CDK Atom Type Model vs. ChemAxon vs. Present CDK Atom Type Model

Improved Atom typing leads to better hydrogen handling in the CDK

Updated on: 28th July 2011

A chemically valid atom typing leads to better chemistry and consistent outputs from any chemo-informatics toolkit. In my previous post, I had highlighted the performance of the CDK atom typing on the KEGG dataset and the pressing need to improve it. Mr. Nimish Gopal (from IIT Roorkee, India) has taken up this herculean task to fix the missing CDK atom types (reported in the KEGG molecules) as part of his summer internship in Prof. Thornton’s group at the EMBL-EBI. Since I am deeply involved with this project, I thought it would be fruitful for the community to know about the progress we have made in this direction.

Aim: The aim of this project was to enrich the atom typing model in the CDK.

Assumption: A valid atom typing will lead to an accurate explicit hydrogen count.

Conclusion: We have successfully added 124 missing/curated atom types to the CDK. They range from metals to salts, etc. You can find the atom-type-enriched CDK on my GitHub CDK branch named atomtype.

Model: We performed cross-validation using ChemAxon as the gold standard. The KEGG molecules were used as test cases. Each KEGG mol file was read by the CDK; hydrogens were stripped and two cloned copies were generated. Explicit hydrogens were added using the CDK and ChemAxon on the respective copies. The explicit hydrogen counts were recorded and, if they were the same, a subgraph isomorphism search was performed on the two copies (to make sure the hydrogens were placed correctly).
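
The count-comparison step of this workflow can be sketched as follows. This is only an outline under stated assumptions: the function name is hypothetical, the IDs and counts are placeholders rather than real KEGG data, and the hydrogen adders and the isomorphism search themselves are not reproduced here.

```python
def compare_hydrogen_counts(counts_a, counts_b):
    """Split shared molecule IDs by whether two toolkits report the same H count."""
    same, different = [], []
    for mol_id in sorted(counts_a.keys() & counts_b.keys()):
        if counts_a[mol_id] == counts_b[mol_id]:
            same.append(mol_id)       # next step: isomorphism check
        else:
            different.append(mol_id)  # reported as a disagreement
    return same, different


# Placeholder IDs and counts, not real KEGG data.
cdk = {"mol_1": 4, "mol_2": 6, "mol_3": 2}
reference = {"mol_1": 4, "mol_2": 5, "mol_3": 2}
same, different = compare_hydrogen_counts(cdk, reference)
print(same, different)
```

Only the molecules in `same` would go on to the isomorphism check, since matching counts alone do not show that the hydrogens sit on the same atoms.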

Result: 15499 KEGG molecules were tested and only 5 of them gave different results with the CDK and ChemAxon explicit hydrogen adders. From the graphs it’s clear that the improved and enriched atom typing in the CDK outperforms the present CDK atom typing model. The CDK hydrogen adder based on the new enriched atom typing model also concurs with the ChemAxon hydrogen adder results.

The scatterplot and regression lines are linear, as the resulting explicit hydrogen counts are the same except for a few outliers

The failed cases are ambiguous in nature (C11065, C13932, C18368, C18380, C18384), and the two software packages take different approaches to handling them. ChemAxon adds hydrogens to each atom in a molecule that is perceived correctly and skips the ones (setting an error flag) that are not defined correctly, whereas the CDK adds hydrogens to each atom in a molecule but exits (throws an exception) as soon as it finds an untyped atom. Theoretically, they should end up giving the same results, but technically they differ.

The good news is that the CDK is now able to atom type all the valid molecules from the KEGG database (June 2011 release). I am sure that there are a few missing atom types which might crop up with other small molecule databases (e.g. ChEBI or PubChem).

Acknowledgement: 

  • Prof. Thornton for her support and guidance.
  • I must thank Gilleain, who helped Nimish to get well versed with the Java code hierarchy in the CDK.
  • As a CDK starter, Nimish also found the Groovy book on the CDK by Egon very helpful.
  • Egon’s blog post for reporting missing CDK Atom types.
  • ChemAxon for granting us the license to use its hydrogen adder.
  • The SMSD for performing the isomorphism between molecules with explicit hydrogens generated by the CDK and Chemaxon.
  • The EMBL for funding this project.

We are glad to learn about the strong interest shown by the CDK community in having this work integrated back into the CDK. Thank you all for your support. We (Nimish, Gilleain and I) have already submitted a CDK patch and it contains the following atom types.

Atom Typing with the CDK

Atom typing is an important and integral part of any chemoinformatics software. Most of the calculations, and more importantly the assignment of implicit/explicit hydrogen(s) depend on this. Thanks to Egon, Rajarshi and others, the CDK has improved a lot over the years.

I (with Nimish, a summer intern) did a quick test on the KEGG molecules (before KEGG becomes a paid service on 30th June, which is a pity!).

The good news is that only 1.37% of the molecules failed the “Atom Typing” test, compared with approximately 10% last year (mostly phosphates). Most of the failed atoms are metals, along with some non-metals (Cl, N, S, Fe, Mn, Zn, Cu, etc.).

Here are the raw results: https://gist.github.com/1016328

Wish list: I hope the CDK developers or chemists can fix these too!

Kindly comment and leave your suggestions.

How are enzymes classified?

a) How are enzymes classified?

Metabolism influences building or replacement of tissue, conversion of food to energy, disposal of waste materials, reproduction etc. “Catalysis” is defined as the acceleration of a chemical reaction by a substance which itself undergoes no permanent chemical change. Most biochemical reactions do not take place spontaneously and enzyme catalysis plays an important role in biochemical reactions necessary for all life processes. Without enzymes, these reactions would take place at a rate far too slow for effective metabolism.

Enzymes can be classified by the kind of chemical reaction they catalyze. One such scheme of enzyme classification is defined by IUBMB.

The IUBMB assigns a 4-digit code to each enzyme. Each code is prefixed by EC, followed by the digits.

For example: oxidoreductases, EC 1.1.1.1

  1. The first digit denotes the “Class” of the enzyme
  2. The second digit indicates the “Sub-class” of the enzyme
  3. The third digit gives the “Sub sub-class” of the enzyme
  4. The fourth digit in the code is the “Serial number” of the enzyme

The classification is as follows:

| Group Name | Type of Reaction Catalysed | Example |
|---|---|---|
| Oxidoreductases | Oxidation-reduction reactions | Alcohol oxidoreductase (EC 1.1) |
| Transferases | Transfer of functional groups | Methyltransferase (EC 2.1) |
| Hydrolases | Hydrolysis reactions | Lipase (EC 3.1) |
| Lyases | Addition to double bonds or single bonds | Decarboxylases (EC 4.1) |
| Isomerases | Isomerization reactions | Epimerases and Racemases (EC 5.1) |
| Ligases | Formation of bonds with ATP cleavage | Enzymes forming carbon-oxygen bonds (EC 6.1) |
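
The four-level code and the classes listed above can be read programmatically. The following is a minimal sketch, assuming a fully specified code of the form `EC a.b.c.d` and only the six classes in the table; the function and dictionary names are made up for this example.

```python
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(ec):
    """Split a code such as 'EC 1.1.1.1' into its four levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {ec!r}")
    ec_class, sub_class, sub_sub_class, serial = (int(p) for p in parts)
    if ec_class not in EC_CLASSES:
        raise ValueError(f"class {ec_class} is not in the table used here")
    return {
        "class": ec_class,
        "class_name": EC_CLASSES[ec_class],
        "sub_class": sub_class,
        "sub_sub_class": sub_sub_class,
        "serial": serial,
    }


print(parse_ec("EC 1.1.1.1"))
```

Partial codes such as `EC 1.1` (as used in the Example column) are deliberately rejected by this sketch, since they name a sub-class rather than a single enzyme.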

b) How can I find similar enzymes?

Any similarity search is based on the presence of similar patterns (similar bond changes and/or small molecules) shared between the query and target reactions. A larger number of shared patterns results in a higher similarity score, or equivalently a lower distance score. In bioinformatics, the concept of similarity or distance is used to find similar sequences based on amino acid similarity, structural topology, etc. In chemoinformatics, the similarity between small molecules/drug molecules (e.g. the Tanimoto score) is based on the presence of similar bonds and atoms in the query and target molecules.
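
As a small worked example of the Tanimoto score, the sketch below treats the query and the target as sets of patterns and divides the number of shared patterns by the number of distinct patterns overall. The pattern labels are invented for illustration, and returning 1.0 for two empty sets is a convention assumed here.

```python
def tanimoto(query, target):
    """Tanimoto similarity: shared patterns / all distinct patterns."""
    a, b = set(query), set(target)
    union = a | b
    if not union:
        return 1.0  # assumed convention for two empty pattern sets
    return len(a & b) / len(union)


def distance(query, target):
    """The matching distance score: 0.0 for identical pattern sets."""
    return 1.0 - tanimoto(query, target)


# Invented bond-change labels, used only to illustrate the calculation.
query = {"C-O cleaved", "O-H formed"}
target = {"C-O cleaved", "O-H formed", "C-N formed"}
print(tanimoto(query, target), distance(query, target))
```

Here two of the three distinct patterns are shared, so the similarity is 2/3 and the distance is 1/3.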

c) Literature

  1. Automatic Assignment of EC Numbers.
  2. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions.
  3. Automatic Determination of Reaction Mappings and Reaction Center Information. 2. Validation on a Biochemical Reaction Database.
  4. Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps.
  5. Chemical similarity searching.
  6. Quantitative comparison of catalytic mechanisms and overall reactions in convergently evolved enzymes: implications for classification of enzyme function.
  7. Using Reaction Mechanism to Measure Enzyme Similarity
  8. etc.

I reckon that in the near future we might see such concepts being adopted by the IUBMB itself to annotate and classify enzymes.

This would be vital in the study of the interactions between the components of biological systems (metabolites, enzymes and metabolic pathways), and how these interactions give rise to the function and behavior of that system.

As always, thoughts/suggestions are welcome!
