MolClass

From Mike Tyers Laboratory Software Documentation

Rapid Molecule Classification Based on Structure and Activity

MolClass generates computational models from small molecule datasets using structural features identified in hit and non-hit molecules. In contrast to existing experimental resources such as PubChem and ChemBank, MolClass aims to present the user with a likelihood value for each molecule entry. This creates an activity fingerprint that currently includes models for Ames mutagenicity, blood-brain barrier penetration, Caco-2 penetration (derived from Hou et al.), stem cell neurosphere proliferation (derived from Diamandis et al.), autofluorescence (derived from ChemBank data), fluconazole synergy (derived from Spitzer et al.) and a toxicity benchmark.

In addition, we uploaded some example datasets from PubChem to build a P. falciparum sensitivity model (derived from Yuan et al.) and a model from an Hsp90 co-chaperone disruptor screen. From the NCI-funded database ChemBank, we incorporated data from a Cell Cycle Inhibitor, a Beta Cell Transdifferentiation model, a Xenopus Actin Polymerization dataset and a Thrombin Activity Predictive Model (derived from ChEMBL data). These and future models can help to guide compound selection for follow-up screens and library design. Most computer-aided approaches overlook promiscuous binding to off-target proteins, which results in drug side effects; compounds that behave this way will be visible in the approach we have taken. We hope that our portlet will help to guide scientists in the systems biology and chemical biology communities.

Frequently Asked Questions

Why do I have to register to upload a dataset?

Users who wish to upload data and generate models must create a login so that datasets can be assigned to users. This helps to ensure the quality of uploaded datasets and allows us to offer help in case of processing errors. In addition, the calculations may take a few days, depending on the size of the submission and the number of models used. When model generation is complete, the user is notified by email.

How many parameters should I use to generate a machine learning model? Which model would you recommend to use?

For small molecule sets with fewer than 100 molecules, we suggest using either the descriptors or the structural features as input parameters. The type of learner also influences the bias of the resulting model. In our experience, RandomForest is the most robust with respect to feature space, followed by the implemented support vector machine algorithms (LibSVM and SMO). All others perform reasonably well, although J48 and Naive Bayes occasionally tend to 'overlearn' (overfit) the datasets.

My dataset has very few active molecules and many inactive ones. How can I build a balanced learner?

MolClass tolerates a maximum class imbalance ratio of 1:10 to build reliable models. Datasets with a stronger imbalance are subsampled before a predictive model is built.
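
As a rough illustration of what subsampling to a 1:10 ratio can look like, the sketch below randomly undersamples the inactive class. It is not MolClass's implementation: the function name `subsample_to_ratio`, the random selection strategy and the fixed seed are all assumptions made for this example.

```python
import random

def subsample_to_ratio(actives, inactives, max_ratio=10, seed=0):
    """Return (actives, inactives) with at most max_ratio inactives per active.

    Hypothetical helper: random undersampling is assumed here for simplicity.
    """
    limit = max_ratio * len(actives)
    if len(inactives) <= limit:
        # Already within the tolerated imbalance; keep everything.
        return list(actives), list(inactives)
    rng = random.Random(seed)
    return list(actives), rng.sample(list(inactives), limit)

# Made-up identifiers: 5 actives against 200 inactives (a 1:40 imbalance).
actives = [f"active_{i}" for i in range(5)]
inactives = [f"inactive_{i}" for i in range(200)]
kept_actives, kept_inactives = subsample_to_ratio(actives, inactives)
print(len(kept_actives), len(kept_inactives))
```

With 5 actives, the sketch keeps all actives and 50 of the 200 inactives. Random selection ignores structural diversity, so a diversity-aware selection may be preferable for real screening data.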

What is a good dataset size and what is the maximum I can submit?

We recommend that users submit, if possible, datasets of a few thousand molecules. The maximum single dataset submission size is currently limited to 20,000 molecules.

I have screened a vendor library designed for a few protein targets. Can I still build a model?

The diversity of a library plays a major role in model quality, since a probabilistic space depends directly on the given sample set. If a model has been learned from a focused chemical library, the descriptor space and chemical feature space will be very limited. Therefore, the likelihood values for these molecules in a different model tend to be small because of additional features that were represented in the dataset that the model is based on. If this scenario applies to a majority of likelihood values across several datasets in MolClass, it suggests that the model providing the values has low predictive power.

How can the likelihoods be interpreted?

The likelihood score is the logarithm of the odds ratio of being in class A versus class B. Most of the current models in MolClass treat class A versus class B as active versus inactive. For those models, the class tags used are either active versus inactive or an activity description versus mutual description. A few of the models represent small molecule activities as inhibitors (class A) and activators (class B) or vice versa. In those cases, a likelihood close to zero suggests inactivity.
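
The relationship between a class probability and a log-odds score can be sketched as follows. This is an illustration only: the function name `log_odds` is hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math

def log_odds(p_class_a):
    """Log odds of class A versus class B for a two-class model.

    Assumes P(B) = 1 - P(A) and the natural logarithm (an assumption).
    """
    if not 0.0 < p_class_a < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return math.log(p_class_a / (1.0 - p_class_a))

# Illustrative probabilities, not MolClass output.
for p in (0.9, 0.5, 0.1):
    print(p, round(log_odds(p), 3))
```

Under these assumptions, a probability of 0.5 gives a score of 0, probabilities above 0.5 give positive scores (favouring class A) and probabilities below 0.5 give negative scores (favouring class B).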

Where do the probabilities come from for non-probabilistic learners like decision trees?

Non-probabilistic approaches such as tree and rule learners can use the frequency of observed class values at the leaves of the tree (or the antecedents of the rules) to obtain probabilities. For the SVMs, Weka fits a logistic regression model to the classifier outputs to provide probabilities.
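
The frequency-based idea for trees can be sketched in a few lines. The function `leaf_probabilities` and its optional `smoothing` parameter are hypothetical and written for this example; they are not Weka's API, and the labels are made up.

```python
from collections import Counter

def leaf_probabilities(leaf_labels, classes, smoothing=0.0):
    """Estimate class probabilities from the training labels that reached one leaf.

    smoothing > 0 adds pseudo-counts to every class (an assumption added for
    illustration; it avoids probabilities of exactly 0 or 1 in pure leaves).
    """
    counts = Counter(leaf_labels)
    total = len(leaf_labels) + smoothing * len(classes)
    if total == 0:
        raise ValueError("empty leaf and no smoothing")
    return {c: (counts[c] + smoothing) / total for c in classes}

# A leaf reached by 8 active and 2 inactive training molecules.
leaf = ["active"] * 8 + ["inactive"] * 2
print(leaf_probabilities(leaf, ["active", "inactive"]))
print(leaf_probabilities(leaf, ["active", "inactive"], smoothing=1.0))
```

A new molecule that lands in this leaf would receive the leaf's class frequencies as its probabilities.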

If I use several classification algorithms can I somehow combine them?

Currently, MolClass only allows separate (independent) model submissions, each of which generates model-specific likelihoods. Users can decide whether to weight these likelihoods to calculate an overall score or to use a vote based on cutoffs.
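
Both options can be applied by the user after the model-specific likelihoods have been retrieved. The sketch below shows one way to do so; the function names, scores, weights and cutoff are made up for illustration and are not part of MolClass.

```python
def weighted_score(likelihoods, weights):
    """Weighted average of model-specific likelihood scores."""
    total_weight = sum(weights[name] for name in likelihoods)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(likelihoods[n] * weights[n] for n in likelihoods) / total_weight

def majority_vote(likelihoods, cutoff=0.0):
    """True if more than half of the models score above the cutoff."""
    votes = sum(1 for score in likelihoods.values() if score > cutoff)
    return votes * 2 > len(likelihoods)

# Hypothetical likelihoods for one molecule from three separately built models.
scores = {"RandomForest": 1.8, "LibSVM": 0.6, "J48": -0.4}
weights = {"RandomForest": 2.0, "LibSVM": 1.0, "J48": 1.0}
print(round(weighted_score(scores, weights), 3), majority_vote(scores))
```

A weighted score keeps the magnitude of each model's likelihood, whereas a vote only counts how many models pass the cutoff, so the two can disagree for borderline molecules.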

How many molecules and models can MolClass handle?

MolClass has no inherent limit on the size of datasets or the number of models it can process. The computational complexity is O(d*m), where d is the number of datasets and m is the number of models. Currently, the total storage space is limited to 1TB and the maximum storage per model is limited to 750MB. For performance reasons, the maximum dataset submission size is currently limited to 20,000 molecules.

Can I upload all PubChem Bioassays into MolClass?

The majority of bioassays in PubChem are small submission datasets from ChEMBL that contain fewer than 10 molecules from small-scale or targeted studies against specific targets. PubChem's BioAssay repository has about 200 datasets that we consider to be 'learnable' based on their size and their ratio of hits to non-hits. Since we submitted this publication, we have identified 30 new 'learnable' datasets in PubChem and we are currently adding those to MolClass. Furthermore, a large number of datasets in PubChem BioAssay are extremely imbalanced. For example, the dataset with AID 488975 has 2,634 hits discovered in a screen that contained a total of 303,873 molecules. Such datasets need to be subsampled using diversity measures to reduce the non-active class while conserving structural diversity. We will continually upload 'learnable' datasets from PubChem, ChemBank and ChEMBL for target-specific models to improve MolClass over time. Further, we will add novel libraries and screening collections used in the scientific community to cover the chemical space that can be experimentally sampled.
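
To make the degree of imbalance in this example concrete, the short calculation below uses the figures quoted for AID 488975 and the 1:10 ratio mentioned in the class imbalance answer. It is plain arithmetic and says nothing about which molecules a diversity-based subsampling would retain.

```python
total_molecules = 303_873   # AID 488975, as quoted above
hits = 2_634

non_hits = total_molecules - hits
imbalance = non_hits / hits      # non-hits per hit
target_non_hits = 10 * hits      # non-hits kept at a 1:10 ratio

print(non_hits, round(imbalance, 1), target_non_hits)
```

The screen has roughly 114 non-hits per hit, so reaching a 1:10 ratio would mean retaining 26,340 of the 301,239 non-hits.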

How does the tool relate to other approaches like OpenTox or AMBIT services?

OpenTox is a 'consortium to build an interoperable predictive toxicology framework which may be used as an enabling platform for the creation of predictive toxicology applications'. OpenTox's aim is tailored towards the new European regulation on chemicals and their safe use, called REACH (Registration, Evaluation, Authorisation and restriction of Chemicals). AMBIT is a REST service that provides access to models in the OpenTox framework, which currently include SMARTCyp, CramerRules, Skin Irritation, Eye Irritation, Verhaar scheme (modified) for predicting toxicity mode of action, DSSTox Carcinogenic Potency, an MLR model for Caco-2 and ECOSAR LC50 fish. MolClass is a tool that can contribute new models to the OpenTox framework through our REST service. Furthermore, MolClass can be useful for specific projects in any area of biomedical research, i.e. the tool is not focused solely on human and environmental toxicity.

MolClass REST service

We implemented a REST service for MolClass using the OpenTox API 1.2 ontology. Here are a few examples that return data in JSON format:


If you would like to link your compounds of interest to our HTML output, you can use molecule_detail.php with the compound parameter:

MolClass Current Models And Datasets

Please follow the link to find a short batch description of our current model (learned data) and library (test data) selection.

CDK Small Molecule Descriptors

Please follow the link to find descriptions of the 152 chemical properties calculated in MolClass.

MolClass Database Structure

Database Structure of MolClass.

Current Implementation

The latest version of MolClass (1.1) uses CDK 1.4.5, Weka 3.7.5 and PHP 4.3.

Known Limitations

Bugs Reported

Useful Online Resources
