This article was written by Nicolas Drizard from Iktos. The work presented is a team effort by Nawfal Tachfine, Maoussi Lhuillier-Akakpo, Brice Hoffmann, Nicolas Drizard and Nicolas Do Huu.

Data Center – Photo by Tanner Boriack on Unsplash

The MELLODDY project provides a unique platform for federated learning in drug discovery. Several pharma companies are contributing to its development, both to provide training data for the global model and to evaluate whether the global model performs better than one built solely on their own data. As is common with machine learning models, the more data is fed into the platform, the better the global models tend to be. We therefore decided to include an extra data source from the public domain, which broadens the chemical space seen by the platform. This public dataset is also useful for development and testing, since it can be easily shared among the partners, unlike proprietary chemical datasets.

Extraction And Processing

The public dataset was extracted from ChEMBL (1), which provides a curated database of drug-like molecules measured in various assays. We used version 25 of the database, which contains a total of 1.8M compounds and 1.1M assays.

We extracted three assay categories: ADME and toxicity assays, physico-chemical assays, and binding and functional assays. Because the ChEMBL dataset is very heterogeneous, we applied filters to restrict the extraction to relevant data, e.g. unit filtering and conversion, removal of non-numeric values, etc. Then, to reduce the number of prediction tasks, we removed measures with fewer than 50 values and, when possible, applied an aggregation step.
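To make these filtering steps concrete, here is a minimal sketch in plain Python. It is not the code of the extraction pipeline: the record fields (assay_id, compound_id, value, unit), the conversion table, and the choice of a median over replicate measurements as the aggregation step are all assumptions made for illustration. Only the threshold of 50 values comes from the text above.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical conversion factors to a common unit (nM); not taken from the pipeline.
TO_NANOMOLAR = {"nM": 1.0, "uM": 1e3, "mM": 1e6}
MIN_VALUES_PER_ASSAY = 50  # threshold mentioned in the article


def clean_measurements(records, min_values=MIN_VALUES_PER_ASSAY):
    """Filter, convert and aggregate raw measurement records.

    records: iterable of dicts with keys assay_id, compound_id, value, unit
    returns: dict assay_id -> {compound_id: aggregated value in nM}
    """
    replicates = defaultdict(list)
    for rec in records:
        factor = TO_NANOMOLAR.get(rec["unit"])
        if factor is None:
            continue  # unit filtering: skip units we cannot convert
        try:
            value = float(rec["value"])
        except (TypeError, ValueError):
            continue  # non-numeric value removal (e.g. ">10")
        if not math.isfinite(value):
            continue  # also drop NaN and infinite values
        replicates[(rec["assay_id"], rec["compound_id"])].append(value * factor)

    # Aggregation (assumed): one median value per (assay, compound) pair.
    per_assay = defaultdict(dict)
    for (assay_id, compound_id), values in replicates.items():
        per_assay[assay_id][compound_id] = median(values)

    # Keep only assays with enough measured compounds to be a prediction task.
    return {a: v for a, v in per_assay.items() if len(v) >= min_values}
```

In this sketch the size filter is applied after aggregation, so that replicates of the same compound are not counted twice; the actual pipeline may order these steps differently.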

We manually merged the physico-chemical assays whose experimental protocols (pH, solvent, etc.) were similar. For the binding and functional assays, ChEMBL provides a confidence score for the relationship between the target and the assay, and we retained only those with a reasonable score.

Compared to proprietary pharma data, public data are a collection of assays, each measured on a relatively small number of compounds, often from the same chemical series. These data are usually generated and shared by different contributors, so the same assay can be measured in different ways. To maintain a reasonable number of assays (given the number of compounds) and to keep them “predictable”, i.e. with a sufficient number of measured compounds, we decided to merge similar functional assays if they had the same units of measurement and comparable value distributions. This was achieved with the DBSCAN clustering algorithm, using the Kolmogorov-Smirnov statistic as the distance between assays. This merging reduced the total number of assays in the output public dataset by approximately 20%.
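The merging idea can be illustrated with a small, dependency-free sketch: compute the two-sample Kolmogorov-Smirnov statistic between the value distributions of every pair of assays sharing a unit, then run DBSCAN on the resulting distance matrix. The function names, the input layout, and the eps and min_samples values are assumptions chosen for the example, not the settings used for the published dataset. In practice one would likely rely on scipy.stats.ks_2samp and sklearn.cluster.DBSCAN with metric="precomputed" rather than these hand-written versions.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def dbscan(dist, eps, min_samples):
    """Minimal DBSCAN on a precomputed distance matrix; label -1 means noise."""
    n = len(dist)
    # Each point counts as its own neighbour, as in scikit-learn.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand from core points only
    return labels


def merge_assays(assays, eps=0.2, min_samples=2):
    """Group assays with the same unit and close value distributions.

    assays: dict assay_id -> (unit, list of measured values)
    returns: list of groups, each a list of assay ids to be merged
    """
    by_unit = {}
    for assay_id, (unit, _values) in assays.items():
        by_unit.setdefault(unit, []).append(assay_id)

    groups = []
    for ids in by_unit.values():
        dist = [[ks_statistic(assays[p][1], assays[q][1]) for q in ids] for p in ids]
        clusters = {}
        for assay_id, label in zip(ids, dbscan(dist, eps, min_samples)):
            if label == -1:
                groups.append([assay_id])  # unmatched assays stay on their own
            else:
                clusters.setdefault(label, []).append(assay_id)
        groups.extend(clusters.values())
    return groups
```

The KS statistic lies between 0 (identical empirical distributions) and 1 (no overlap), which makes eps easy to interpret as a tolerance on how different two distributions may be. Note that this sketch groups by unit only; any further criteria for deciding which assays are candidates for merging are outside its scope.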

Usage

The dataset was originally prepared for the reasons above, but once available it was put to use in a wider range of cases. A test environment built on it replicates a global run from a single pharma partner. A multi-partner setting has also been replicated, using a scaffold split to dispatch the structures among different “virtual” pharma partners; this simulates their occupancy of the chemical space and thus a federated run scenario for testing. This dataset is more practical and useful than one with randomly generated values, since it exhibits most of the data patterns observed in real-life data (activity cliffs, noise, chemical space sparsity, etc.). Furthermore, the single-partner studies, where one feature is developed and evaluated before integration into the platform, also benefit from the dataset. Finally, privacy studies have been conducted on this dataset in light of the yearly federated run set-up; for instance, the platform has been assessed against different scenarios of a reconstruction attack, in which the attacker aims to recover chemical structures seen during training while having access only to the model. Following these studies, it was decided to create a public partner using a selected subset of the public dataset, to limit the risk of any reconstruction attack. This partner is used like any pharma partner to train the multi-partner model.
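As an illustration of the scaffold split, the sketch below assigns whole scaffold groups to “virtual” partners, so that compounds sharing a scaffold never end up with two different partners. It assumes the scaffold of each compound has already been computed (for example a Murcko scaffold SMILES obtained with a cheminformatics toolkit such as RDKit). The function name and the greedy size-balancing rule are choices made for this example; the assignment rule used on the platform is not described here.

```python
from collections import defaultdict


def scaffold_split(compound_scaffolds, n_partners):
    """Dispatch compounds to virtual partners, one scaffold group at a time.

    compound_scaffolds: dict compound_id -> scaffold string (precomputed)
    returns: list of n_partners lists of compound ids
    """
    by_scaffold = defaultdict(list)
    for compound_id, scaffold in compound_scaffolds.items():
        by_scaffold[scaffold].append(compound_id)

    partners = [[] for _ in range(n_partners)]
    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(by_scaffold.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _scaffold, members in ordered:
        # Greedy balancing (assumed rule): give the group to the smallest partner.
        smallest = min(range(n_partners), key=lambda p: len(partners[p]))
        partners[smallest].extend(members)
    return partners
```

Because each partner receives entire chemical series, the partners occupy largely distinct regions of the chemical space, which is closer to a real federated setting than a random split of compounds.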

Conclusion

The extraction and analysis of this public dataset showed how important the data are in a machine learning pipeline developed for molecular data, and how much rigor is needed when dealing with heterogeneous data sources. The merging procedure we implemented still has some drawbacks: because the experimental conditions are not always well defined, certain assumptions had to be made to decide when to merge assays. Nonetheless, this processing workflow provides a tool to automatically extract, from the growing ChEMBL database, a curated multi-task dataset for machine learning benchmarks.

Give it a try! (2)

(1) https://www.ebi.ac.uk/chembl/

(2) https://github.com/melloddy/public_data_extraction

Find full content here: https://www.melloddy.eu/blog/preparing-public-dataset