ACS Fall 2023 Symposium on Chemical Spaces

Symposium on Exploration of the Chemical Space

We extend a warm invitation to participants of the American Chemical Society Fall Meeting 2023 to join us for an engaging symposium dedicated to the exploration of chemical space. The symposium, "Taking a Deep Dive into Chemical Space", will span two days, Wednesday, August 16th, and Thursday, August 17th, and will feature multiple sessions. Esteemed experts in the field will delve into the potential and methodologies employed in screening extensive compound collections, as well as share captivating case studies from the realm of drug discovery.

The symposium is organized by Dr. Christian Lemmen, CEO of BioSolveIT, and Dr. Paul Beroza, Distinguished Scientist at Genentech, and it will be hosted by the Division of Chemical Information (CINF).

The venue will be Nikko II, situated in the Hotel Nikko San Francisco. This event promises to deliver a comprehensive overview of the subject matter, offering captivating presentations and fruitful discussions facilitated by industry experts. Attendees can anticipate an intellectually stimulating environment where the latest advancements and discoveries in chemical spaces will be explored and shared.

"Taking a Deep Dive into Chemical Space" Symposium

Three exciting information-rich technologies are emerging at the interface of data and chemistry in drug discovery. First, synthesis-on-demand capabilities are uncovering huge portions of chemical space that can be computationally explored and readily realized. Second, technologies that reveal the secrets of protein structure – high-throughput crystallography, Cryo-EM, and computationally predictive methods like AlphaFold – provide a rich body of data that will likely bring structure-based drug design to a new level. Finally, whether you call it old-fashioned machine learning or fancy artificial intelligence, the ability of computational models to uncover patterns and meaning in data has never been stronger. At this symposium, we will hear from those at the forefront of these innovation themes and how they come together to enable a deep dive into chemical space.

Programme - Wednesday Morning August 16th

Welcome address by the organizers.
Profile-QSAR (pQSAR) is a massively-multitask, 2-level, stacked model. Every month, level-1 single-task random-forest regression (RFR) models are trained on 13,000 conventional Novartis pIC50 assays. Separate level-2 PLS regression models are then trained for each assay, now using the biological profile of predicted pIC50s from level-1 models as compound descriptors. The improvement is dramatic: 72% of the assays give successful pQSAR models vs. 8% of RFR single-assay models. The median correlation with experiment for pQSAR models is r² = 54%, comparable to 4-concentration IC50 experiments. pQSAR has contributed to hundreds of projects over the last 15 years in many capacities: virtual screening, hit-list triaging, hit expansion, MoA prediction, off-target safety profiling, promiscuity prediction, polypharmacology, virtual counter-screens, scaffold hopping, etc. pQSAR has recently been extended to generative chemistry, and to multi-endpoint gene-expression and high-content assays.
pQSAR has shown accuracy comparable to or better than modern deep learning and matrix factorization methods. However, each has unique practical advantages. This talk discusses several pQSAR advantages: that pQSAR is naturally adapted to federated models and to transfer learning; that it is embarrassingly parallel, running efficiently on standard clusters; that for single-endpoint pQSAR models, accuracy is assessed on the final models, leaving out only test-set data for the specific model currently being trained, rather than all test sets; and how the situation is more complex for multi-endpoint pQSAR models, especially those that combine multi- and single-endpoint models to expand the number of imputations vs. cold-start predictions.
The expansion of structural data and the emergence of dramatic potential in deep learning models create an opportunity for transformative changes in structure-based research and drug discovery. We are building a comprehensive platform, termed GYST, to leverage computational tools in protein research. The pre-computed, rich data environment accelerates traditional investigations and increases opportunities for cross-fertilization and comparative analysis, while the amassed data also aids the development of machine learning models, whose output can be readily accessed by a broad audience of end users. We describe the core capabilities of the platform and several use cases that demonstrate the transformative potential of these computational capabilities when applied at the proteome scale.
In the early stages of drug discovery programs, automated structure activity relationship (SAR) analysis traditionally relies on clustering, Quantitative-SAR models, or machine learning to identify new small molecule lead series from high-throughput screening assays. We propose a straightforward scaffold-based approach for identifying enriched chemical matter. Our technique picks chemotypes by breaking down hits into a network of scaffold and scaffold fragments, then uses rank-choice voting to select which scaffolds best represent their exemplars and confer enrichment. We will apply this method to publicly available high-throughput screening data to demonstrate the identification of enriched scaffolds present in confirmed protein target binders.
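The voting step can be illustrated with a small sketch. The abstract does not specify the voting rules, so the code below assumes plain instant-runoff voting in which every hit casts one ballot that ranks its own scaffolds; the function name, the ballot format, and the scaffold names are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def instant_runoff(ballots):
    """Return the winner under instant-runoff (ranked-choice) voting.

    Each ballot is a list of candidates ordered from most to least preferred.
    Here a ballot stands for one screening hit ranking its scaffolds
    (an assumption made for illustration).
    """
    remaining = {candidate for ballot in ballots for candidate in ballot}
    while remaining:
        # Count each ballot for its highest-ranked candidate still in the race.
        tally = Counter()
        for ballot in ballots:
            for candidate in ballot:
                if candidate in remaining:
                    tally[candidate] += 1
                    break
        total = sum(tally.values())
        leader, votes = max(tally.items(), key=lambda kv: (kv[1], kv[0]))
        # A strict majority wins; so does the last candidate standing.
        if votes * 2 > total or len(remaining) == 1:
            return leader
        # Eliminate the candidate with the fewest votes (ties broken by name).
        loser = min(remaining, key=lambda c: (tally.get(c, 0), c))
        remaining.remove(loser)
    return None


# Toy example: five hits, each ranking the scaffolds it contains.
ballots = [
    ["indole", "benzene"],
    ["indole", "benzene"],
    ["quinoline", "benzene"],
    ["quinoline", "benzene"],
    ["benzene", "quinoline"],
]
print(instant_runoff(ballots))
```

In this toy case no scaffold has a majority in the first round, so the least-supported scaffold is eliminated and its ballot transfers to that hit's next choice.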
Coffee break
Synthetically accessible molecules are important in early drug discovery as a reliable source for getting new starting points and for performing structure-activity relationship (SAR) by catalog of the existing hits. Several examples of chemical spaces have been reported: Enamine REAL Space, WuXi Galaxi, and Otava CHEMriya. These chemical spaces differ from each other by the type of data utilized to create the space, chemical space coverage, diversity of chemistry, success rate, lead time, and eventually the pricing. An application of machine learning (ML) can help build and explore ultra-large chemical spaces more effectively.
Here we introduce Chemspace Freedom Space 3.0. In creating the Freedom Space, we pursued novelty of compounds together with their accessibility and high deliverability. To achieve this, we performed ML-assisted classification of the reagents based on the reagent statistics provided by Enamine. The developed ML models were applied to over 200,000 in-stock reagents provided by different Chemspace suppliers to support the diversity of chemistry. We managed to achieve up to 83 percent recall of the “bad” reagents and up to 90 percent of “good” reagents after filtering. The ML-filtered reagents were combined utilizing eight common chemical transformations (acylation, reductive amination, and Suzuki coupling, to name a few), resulting in 5 billion molecules. The current version of the Freedom Space is 25 times larger than the previous version due to the application of ML. The success rate of 70 percent has been confirmed in a number of experiments.
The DEL-ML-CS workflow utilizes data from DNA-encoded library (DEL) screens for the generation of hits from small molecule chemical space (Enamine REAL, Chemspace Freedom) via the application of ML. We demonstrated the use of the workflow with the publicly available DOS-DEL-1 set (108,528 molecules) screened against carbonic anhydrase 9 (CAIX). A disynthon aggregation strategy was developed and applied to reduce noise in the data and normalize the counts. The regression predictions were used to rank the molecules from ChEMBL-32 (2,327,783 molecules). Our model placed 50% of the annotated CAIX actives in the top 0.56 percent of the ranked ChEMBL. We further applied the model to Chemspace Freedom and Enamine REAL datasets. The top 400 molecules were checked for novelty, synthesized, and tested against CAIX in a thermal shift assay (TSA), resulting in a 25 percent hit rate.
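To put the ChEMBL ranking result in perspective, the quoted figures can be turned into an enrichment factor. The short calculation below is a sketch that assumes the usual definition (fraction of actives recovered divided by fraction of the library inspected, so random selection gives 1); the function and variable names are illustrative.

```python
def enrichment_factor(fraction_of_actives_found, fraction_of_library_screened):
    """Enrichment over random selection: recall divided by the screened fraction."""
    if not 0 < fraction_of_library_screened <= 1:
        raise ValueError("screened fraction must be in (0, 1]")
    return fraction_of_actives_found / fraction_of_library_screened


# Figures quoted in the abstract.
library_size = 2_327_783   # ChEMBL-32 molecules that were ranked
top_fraction = 0.0056      # top 0.56 percent of the ranking
recall = 0.50              # share of annotated CAIX actives recovered

top_n = round(library_size * top_fraction)
ef = enrichment_factor(recall, top_fraction)
print(f"top {top_n} molecules, enrichment factor about {ef:.0f}")
```

Under that definition, recovering half of the annotated actives in roughly 13,000 of 2.3 million molecules corresponds to an enrichment of close to 90-fold over random picking.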
The transformer neural network architecture, first introduced in 2017 for machine translation, has gained widespread popularity across various scientific and technological fields because of its remarkable efficiency. A key feature of transformer-based artificial neural networks is the self-attention mechanism, which identifies and connects implicit relationships within input and output sequences.

Our studies discovered that transformer-based artificial neural networks excel in tasks such as converting chemical notations and optically recognizing molecular templates, specifically Markush structures. The talk will emphasize the applications of transformer-based architectures in chemistry, focusing on the aforementioned topics.

Furthermore, Large Language Models (LLMs), such as the GPT-series, LLaMA, and LaMDA, owe their exceptional general reasoning capabilities to transformer-based architectures. Despite their promise, the full potential of LLMs in chemistry remains largely untapped. We will discuss the capabilities of both vanilla and fine-tuned large language models in addressing typical cheminformatics challenges, including predicting organic compound properties and recognizing chemical entities.
The growing body of literature indicates that structural data in FBDD is still largely underutilized, and pure affinity metrics bias decision making in medicinal chemistry.

We have pioneered the “crystal structure first” approach and argue that the underutilization of structural data in FBDD is due to the unsystematic nature of the setup of soaking systems. Our unique target-to-hit approach leverages the SmartSoak® technology. SmartSoak® offers to structural biologists and medicinal chemists a systematic and efficient process for the setup of high-performance soaking systems.

In a case study, we use four protein kinase A (PKA) small-molecule fragment complexes as starting points for a template-based docking screen without prior knowledge of affinity. Here, Enamine's multibillion REAL Space was utilized. Out of the 106 chosen compounds, 93 molecules in total were successfully synthesized. At least forty compounds showed activity in validation assays, with the most active follow-up exhibiting an affinity increase of 13,500-fold. Six of the most promising binders quickly had their crystal structures determined, confirming the binding mode.

This innovative fragment-to-hit strategy achieved a 40% overall success rate in just 9 weeks. Since the early fragments would have been overlooked by the conventional industrial filters for fragment hit detection in a thermal shift assay, the results put into question the accepted fragment prescreening paradigm.
To improve the usefulness of large chemical spaces we developed a high-quality 3D pharmacophore method based on Cresset’s molecular field technology. Together with the company Cresset we established a sophisticated and reliable way to quickly search large combinatorial libraries with high fidelity.
The method uses a fragmentation and recombination approach to rapidly locate regions of the library space which are similar to the query molecule in terms of shape and electrostatics.
A Bayer proprietary space of about 200 million novel compounds can be searched in around 2 hours. The method has already been successfully adapted to Enamine’s REAL Space with ~20 billion structures.
Lead finding and optimization will benefit greatly from BaySpace3D as new leads and ideas can be derived from Bayer’s novel and synthetically well described virtual 3D library.

Programme - Wednesday Afternoon August 16th

Welcoming address by the organizers.
Structure-based docking algorithms can sample and score binding poses in seconds, making it possible to evaluate large chemical libraries, and this approach is not restricted to compounds that are physically available. The size of libraries with commercially available compounds is growing rapidly and more than 30 billion make-on-demand molecules are currently available from chemical suppliers. These libraries provide opportunities to identify potential therapeutic agents that can readily be synthesized and tested for activity. Structure-based methods are currently restricted to one billion molecules, requiring enormous computational resources. As even approximate scoring functions can no longer process the full commercial compound databases, further development of effective strategies for traversing these enormous chemical spaces is required. We present our work using machine learning guided docking screens to narrow down these vast chemical libraries to small target-specific regions in chemical space. We retrospectively benchmarked our protocols on data from ultra-large docking screens of several hundred million molecules against eight different protein targets. We carried out blind predictions against two G protein-coupled receptors (GPCRs) involved in neurological disorders. Classifiers in combination with the conformal prediction framework were able to reduce a multi-billion-scale chemical library of make-on-demand compounds to a subset of candidates. Explicit docking of these molecules, followed by experimental evaluation led to novel GPCR modulators. Our results demonstrate that machine learning can facilitate chemical space exploration and the same approach can be applied to other drug targets.
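The role of conformal prediction in such a workflow can be sketched in a few lines. The code below is a minimal inductive conformal predictor for a single class, written under stated assumptions: a nonconformity score (for example, one minus a classifier's predicted probability of being a top-scoring docker) is already available for a calibration set of known members of that class and for each library compound. The function names, scores, and compound identifiers are hypothetical and are not taken from the speakers' protocol.

```python
def conformal_p_value(calibration_scores, test_score):
    """p-value of a test compound for the class the calibration set represents.

    Computed as the share of calibration nonconformity scores that are at
    least as large as the test score, with the usual +1 correction.
    """
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def select_candidates(calibration_scores, test_scores, significance):
    """Keep compounds whose class membership is not rejected at `significance`."""
    return [
        name
        for name, score in test_scores.items()
        if conformal_p_value(calibration_scores, score) > significance
    ]


# Toy nonconformity scores: low means "looks like a good docker".
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
library = {"cmpd_A": 0.25, "cmpd_B": 0.95}

print(select_candidates(calibration, library, significance=0.2))
```

The significance level sets the accepted error rate, so lowering it retains more compounds for explicit docking while raising it shrinks the subset further.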
The use of ultra-large chemical libraries and generative AI has opened vast avenues to drug discovery. We examine the success of such methods in retrospective studies to explore how they can accelerate drug discovery chemistry. In particular, we address whether our current access to chemical space can assess the druggability of specific targets and the diversity of drug scaffolds. We also present simple, practical measures to increase our access to chemical space in ways that benefit drug discovery.
The popularity and widespread use of artificial intelligence (AI)-based methods have increased dramatically over the last decade and AI application in the chemistry domain has seen an exponential increase in research publications since 2015. Some of the most promising areas are the prediction of the bioactivity of novel molecules, 3D protein structures from sequence data, and suggestions of synthetic routes to complex target molecules. We have leveraged the power of generative AI algorithms and ultra-fast virtual screening methods into AIDDISON™, our AI Software for novel drug design. AIDDISON™ incorporates SA-space™, a synthetically accessible chemical space of approximately 25 billion virtual compounds, to drive the best synthesizable molecular design. We will present examples of using this technology for scaffold hopping, new molecule design, and lead optimization in our ongoing R&D oncology projects.
In the last few years, early drug discovery efforts have been transformed by the rapidly growing availability of 3D structures, a better understanding of atomistic mechanisms of signaling, and the development of Giga-scale virtual libraries of drug-like compounds. This talk will describe two new synergistic computer-driven approaches to structure-based ligand discovery. The first one, V-SYNTHES, enables rapid identification of novel chemotypes for GPCR hits and leads in Giga-scale REadily AvaiLable (REAL) libraries. This iterative synthon-based virtual screening technology was recently validated by the screening of 11 billion compounds in the prospective discovery of novel antagonists for Cannabinoid (CB) receptors and ROCK1 kinase. For CB receptors, chemical synthesis and experimental testing of compounds predicted by V-SYNTHES identified novel sub-micromolar antagonists, with the hit rate more than doubled compared to a standard virtual screening that required 100 times more computational resources. Optimization of the best lead series by a simple SAR-by-catalog screen in the same REAL Space identified CB2 sub-nanomolar antagonists with strong CB2/CB1 selectivity. The approach also shows promising results for other GPCRs and other classes of therapeutic targets in early discovery projects. The next generation, fully automated V-SYNTHES2.1 is being developed to include rapidly growing REAL Space (currently 173 billion compounds), potentially expanding to Tera-Scale (10¹² to 10¹⁵ compounds) screening in the next few years.
The second approach employs structure-guided bitopic derivatization of existing high-affinity ligand scaffolds to design new functional properties. A recent example involves the design of bitopic ligands targeting both orthosteric pockets and the allosteric sodium ion binding pocket in Class A GPCRs. Because this highly conserved site deep in the 7TM helical bundle is a key part of the Class A GPCR activation mechanism, the bitopic ligands acquire new functional properties. In the application to opioid receptors, we show that the extension of morphinan and fentanyl scaffolds to the sodium pocket can differentially modulate signaling towards specific functional pathways, involving distinct G-protein subtypes. Moreover, the functionally selective fentanyl-based design demonstrated effective analgesia without respiratory depression and other opioid side effects, while the approach is being tested in other class A GPCRs.
Coffee break.
The generative design of small molecules is a very exciting field that promises to make the lengthy and costly drug discovery process more efficient. Most deep-learning based generative methods, however, are either limited in their reward functions (e.g. 2D ligand-based QSAR models instead of 3D physics-based scoring functions) or in their ability to propose molecules that are chemically feasible. Here we introduce SAGE (Structurally Aware Generation), a method that uses genetic algorithms with 3D-physics based scoring to optimize molecules in a desired chemical space and with desirable drug-like properties. In one mode, SAGE performs the search in the > 20 billion compound Enamine REAL space. In a second, “close-in” mode, a desired starting core or scaffold is modified by “chemically reasonable” mutations. We show that SAGE is able to rapidly identify molecules in the Enamine REAL space similar to known inhibitors of the ROCK1 kinase, while also identifying novel compounds. We describe the application of SAGE’s close-in mode to an active portfolio project. SAGE is proving to be a valuable tool for both searching very large chemical spaces and for generating nearby analogs that are typically explored in lead optimization efforts.
Synthesized chemical space contains at least 450 000 unique ring systems, and judicious selection among them for use in drug discovery, balancing novelty against the probability of success, is a significant challenge. Current molecules in drugs and clinical trials only utilize 0.1% of this available pool of ring systems. Furthermore, 67% of small molecules in clinical trials comprise only ring systems found in marketed drugs, which mirrors previously published findings for newly approved drugs. We will show our triage mechanism and analysis of this large dataset of ring systems (including frequency, properties, and growth vectors) focussing on drugs and molecules in the clinic. Moreover, this analysis at UCB has delivered molecule starting points that have led to clinical candidate molecules for high value therapeutic PPI targets that were previously thought of as undruggable. We highlight simple systematic changes on existing drug and clinical trial ring systems to derive a small set of future clinical trial ring systems, which are predicted to cover approximately 50% of the new and novel ring systems entering clinical trials.
Machine-learning (ML) is an increasingly popular choice to identify bioactive, potent molecules in early-stage drug discovery. In structure-based approaches, ML can evaluate candidates by computationally generating and scoring co-complexes. Virtual, make-on-demand combinatorial chemical libraries provide a vast array of potential compounds for experimental validation; however, the sheer size of these libraries now exceeds trillions of molecules, making exhaustive searches impractical. To overcome this limitation, we present and compare two methods for traversal of virtual chemical libraries: Analog-Based Explorer (ABE) and Synthon-based Discovery (SbD).

ABE navigates the library by iteratively finding, scoring and selecting analogs. AtomNet®, a structure-based graph neural network trained on bioactivity measurements, is used to score based on predicted affinity for the target of interest. We incorporate fragment-based similarity search methods, including SpaceMACS, SpaceLight, and CSLVAE, to enable scaling to ultra-large libraries.

SbD is an alternative library exploration approach inspired by V-SYNTHES that uses AtomNet® to evaluate synthons individually, then combines favorable synthons together into products. We compared the performance of ABE and SbD and demonstrate how they complement each other in different contexts with predictions from different areas of chemical space.

To further optimize our search, we combined ABE and SbD in a serial manner, with the outputs of SbD fed into ABE for virtual hit expansion. Our results demonstrate the effectiveness of these methods in navigating virtual chemical libraries and identifying potential bioactive molecules, providing a valuable tool for early-stage drug discovery.

Programme - Thursday Morning August 17th

We will show Chemical Space Docking, a novel, synthon-based virtual screening method that efficiently leverages ultra-large databases. The approach combines two distinct advances: first, it avoids full library enumeration, making bigger chemical spaces accessible; second, products are evaluated by molecular docking, which utilises protein structural information. We applied Chemical Space Docking to identify inhibitors of ROCK1 kinase from almost one billion commercially available synthesis-on-demand compounds of the Enamine REAL space. Of 69 synthesized molecules, 39% had Ki values below 10 µM. Two leads were crystallized with the ROCK1 protein, and the structures showed excellent agreement with the docking poses. We will also discuss the efficiency of the method, which scales only with the number of synthons (building blocks) and therefore needs far less computational power and energy than conventional docking of fully enumerated libraries; it thus gives access to much bigger libraries and is orders of magnitude faster.
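The scaling argument can be made concrete with a toy calculation. The sketch below is a deliberate simplification and not the Chemical Space Docking algorithm itself: it compares how many items must be evaluated when scoring synthons individually versus scoring every enumerated product, then combines only the best-scoring synthons per position. The scoring function, synthon names, and score values are made up for illustration.

```python
from itertools import product
from math import prod


def evaluation_counts(synthon_pools):
    """Return (synthon-level evaluations, fully enumerated products)."""
    per_synthon = sum(len(pool) for pool in synthon_pools)
    enumerated = prod(len(pool) for pool in synthon_pools)
    return per_synthon, enumerated


def combine_top_synthons(synthon_pools, score, keep):
    """Shortlist the `keep` best-scoring synthons per position and combine them.

    `score` is a hypothetical callable; lower values are treated as better,
    as with typical docking scores.
    """
    shortlists = [sorted(pool, key=score)[:keep] for pool in synthon_pools]
    return list(product(*shortlists))


# Toy two-component reaction with made-up synthon names and scores.
toy_scores = {"acid_1": -7.2, "acid_2": -5.1, "acid_3": -6.4,
              "amine_1": -4.9, "amine_2": -8.0}
pools = [["acid_1", "acid_2", "acid_3"], ["amine_1", "amine_2"]]

print(evaluation_counts(pools))
print(combine_top_synthons(pools, toy_scores.get, keep=2))
```

For a two-component reaction with 1,000 and 2,000 synthons, the same counting gives 3,000 synthon-level evaluations against 2,000,000 enumerated products, and the gap widens with every additional reaction component.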
The relatively recent emergence of deep learning-based AI has opened the door to generation of fit-for-purpose molecules for drug discovery and other applications. The advantage of generation over prediction lies in the vast chemical space that can be considered, going beyond molecules known to exist or which have been explicitly imagined or enumerated. Here we determine the size, global diversity, local diversity, and fit for purpose of molecules generated for a common target using a variety of generative approaches which vary with respect to number of targets included in the training and use or absence of 3D protein structural information. The results provide guidance on the use of specific approaches for lead finding or lead optimization.
In recent years, the sizes of chemical spaces such as make-on-demand catalogs have grown rapidly. To handle these ever-increasing amounts of data, chemical fragment spaces have been introduced. Instead of the entire list of enumerated products, fragments are stored along with connection rules. Chemical fragment spaces are not only space-efficient by themselves, but many cheminformatics tasks can be implemented such that they scale with the number of fragments rather than with the number of enumerated molecules. Using fragment spaces, similarity and substructure searches are possible in seconds to minutes on regular desktop machines. To achieve synthetic accessibility, fragment spaces are usually built on expert knowledge about reliable reactions and their reactants. The space of SAVI is such a chemical space based on LHASA reaction rules, which have been developed and refined by organic chemists since the early 1980s. So far, this chemical space has only been available as an enumerated version, which makes chemical searching resource intensive. Here, we present the translation of SAVI into a chemical fragment space focusing on the semi-automated process, which we accordingly named "SAVI Space". This version of SAVI can now be efficiently searched, analyzed, and compared to other spaces such as Enamine REAL Space. Moreover, we can incorporate new or adapted reaction rules easily, and/or apply the reaction rules to new building blocks to enlarge the existing, or create entirely new, chemical spaces.
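The storage idea behind a fragment space can be sketched with a small data structure. The example below is an illustration only and does not reflect the actual SAVI Space format: each reaction holds one pool of fragment identifiers per reactant position, the number of products is computed from the pool sizes without enumeration, and products are generated lazily on request. The class names, reaction names, and fragment identifiers are hypothetical, and the count ignores duplicate products that different reactions might yield.

```python
from dataclasses import dataclass
from itertools import islice, product
from math import prod


@dataclass
class Reaction:
    name: str
    reactant_pools: list  # one list of fragment identifiers per reactant position


@dataclass
class FragmentSpace:
    reactions: list

    def stored_fragments(self):
        """Number of fragments that actually have to be stored."""
        return sum(len(pool) for r in self.reactions for pool in r.reactant_pools)

    def size(self):
        """Number of encoded products, computed without enumerating them."""
        return sum(prod(len(pool) for pool in r.reactant_pools) for r in self.reactions)

    def products(self):
        """Lazily yield (reaction name, fragment combination) pairs."""
        for r in self.reactions:
            for combo in product(*r.reactant_pools):
                yield r.name, combo


space = FragmentSpace([
    Reaction("amide_coupling", [["acid_1", "acid_2", "acid_3"],
                                ["amine_1", "amine_2", "amine_3", "amine_4"]]),
    Reaction("suzuki_coupling", [["halide_1", "halide_2"],
                                 ["boronate_1", "boronate_2", "boronate_3",
                                  "boronate_4", "boronate_5"]]),
])

print(space.stored_fragments(), space.size())
print(list(islice(space.products(), 2)))
```

Even in this tiny example 14 stored fragments encode 22 products; with realistic pool sizes the stored fragments grow additively while the encoded products grow multiplicatively.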
Recently, 'tangible' virtual libraries have made billions of molecules readily available. Prioritizing these molecules for synthesis and testing demands computational approaches, such as docking. Here we explore three ideas. 1) For many binding sites, there are many different ligands that can fit. How many molecules? How many chemotypes? 2) As the libraries grow, we find new molecules that we do not find in smaller databases. How long might this trend continue? 3) The molecules found on a deep dive often have interesting and unexpected properties. What are some of these? Examples will be presented to exemplify each of these ideas.
Rational drug discovery is seeing a new wave of deep learning models with increasing accuracy, while increasingly demanding more explainability out of such models. In this talk I will introduce a series of our works on explainable prediction of compound-protein interactions where intermolecular contact prediction underlies simultaneous affinity prediction.

First, to address the challenges to conventional structure-based docking approaches, such as the limited structure data and the expensive docking processes, we have developed DeepAffinity that integrates knowledge- and learning-based approaches using chemical identities and protein sequences. We propose a semi-supervised deep learning model that unifies recurrent and convolutional neural networks to exploit both unlabeled and labeled data, for jointly encoding molecular representations and predicting binding affinities. Furthermore, attention mechanisms are introduced for the interpretability.

Second, we have performed a large-scale assessment of sequence-based affinity predictors’ interpretability and found that commonly used attention mechanisms alone are inadequate. We thus regularize attentions with predicted 3D structural contexts and supervise them with non-bonded atomic contacts, which leads to DeepAffinity+. We further design physics-inspired DeepRelations with an intrinsically explainable architecture where various atomic-level contacts lead to molecular-level affinity prediction. DeepAffinity+ and DeepRelations significantly boost model interpretability without compromising accuracy in affinity prediction. We further demonstrate these models’ utility in contact-assisted docking, structure-free binding site prediction, and structure–activity relationship studies.

Last, thanks to the recent breakthroughs in protein structure prediction, we consider protein data as available in both modalities of 1D amino-acid sequences and predicted 2D contact maps and we introduce cross-modality protein embedding schemes. Moreover, we upgrade our previously used un-supervised learning to self-supervised learning, without the need of experimental affinity labels, to pre-train the embeddings. Our results indicate that our cross-modal and self-supervised framework could further improve the accuracy, the explanation, and the generalizability of affinity prediction especially for unseen proteins.
Ultra-large chemical spaces describing several billions of compounds are revolutionizing hit identification in early drug discovery. Because of their size, such chemical spaces cannot be fully enumerated and require ad-hoc computational tools for their exploitation. We here propose a structure-based approach (SpaceDock) to ultra-large chemical space navigation, in which commercial chemical building blocks are first docked to the target of interest. Applying simple organic chemistry-driven rules and topological constraints, suitable building blocks are then directly connected to enumerate full drug-like compounds under three-dimensional constraints of the target. The accuracy of the docking step was first verified on an in-house set of building blocks derived from the X-ray structure of protein-bound fragmented ligands, and appeared to be excellent and docking-tool independent. When applied to bespoke chemical spaces of various sizes targeting different proteins, SpaceDock was able to quickly generate virtual hits that are either close to known ligands of the investigated targets, or chemically novel. The approach is generic, can be applied to any docking algorithm and requires few computational resources to pick easily synthesizable compounds from multibillion chemical spaces.
The full value of an organization’s data is realized through the insights and decisions it informs. Drug discovery data is typically sparse (often >95% unmeasured) and uncertain, making it challenging to make data-based decisions with confidence. We miss opportunities because there are insufficient data on which to base decisions, or we are led astray by artifacts and anomalies.

Deep learning imputation fills in the gaps in a discovery organization’s database with high-quality predictions, providing a rich matrix of data to guide projects’ progression. Illustrated by three collaborations from biotech, fragrances, and agrochemicals, we will highlight unexpected benefits and research advantages. We will discuss the value this brings to support decision-making, ultimately reducing the time and cost of discovery cycles. These values include:
- Accurate prediction of complex endpoints, even where traditional machine-learning approaches fail
- Highlighting interesting, high-value results, such as activity cliffs, for further exploration
- Planning experiments to increase return on investments in expensive downstream assays
The emergence of massive on-demand chemical collections (ODCCs) is potentially transformative for drug discovery. Current ODCCs contain tens of billions of molecules (10¹⁰) and are expected to reach the trillion scale (10¹²) in the coming years. If used properly, virtual screening (VS) will not only be faster and cheaper than high-throughput screening (HTS), but also better, delivering better and more diverse hit molecules. But, first, it will be necessary to develop new tools capable of navigating such massive collections. To address this challenge, we have conceived a novel virtual screening strategy that explores the chemical universe from the bottom up. First, we perform a systematic search of low molecular weight compounds (up to 12-14 heavy atoms), identifying multiple diverse scaffolds of interest. Then, we perform substructure searches on ODCCs, extracting millions of molecules that contain those scaffolds, which are then subjected to further VS. Using a hierarchy of increasingly sophisticated computational methods, we have devised a protocol that is computationally efficient and also maximises diversity and success rate. An implementation of the concept, together with some prospective results, will be presented.