
AI-assisted Retrosynthetic Analysis

Synthetic organic chemistry is central to drug discovery and chemical biology. Retrosynthesis, the design of efficient synthetic routes to a given target, is one of the most complex problems in organic chemistry. Its goal is the efficient and environmentally friendly synthesis of valuable molecules through well-designed, feasible routes. Evidence shows that similar products tend to be produced by similar reactions (Reaxys or SciFinder). In general retrosynthesis planning, a proposed step can come from any reaction class. Key considerations include identifying a cascade of disconnection schemes, suitable building blocks, and functional group protection strategies. Additional considerations in synthetic route planning include, but are not limited to, cost, process complexity, reaction yield, workup difficulty, safety, and the toxicity of intermediates.

Artificial intelligence (AI), driven by improved computing power, data availability and algorithms, increasingly underpins chemical drug development. AI-assisted retrosynthetic analysis starts from the target compound and works backward, an approach that has been well reviewed over the years. AI approaches have also been reported for predicting reaction outcomes and optimizing reaction conditions. Retrosynthesis programs may play an important role in de novo molecular design and the automated synthesis of molecules.

Computer-aided Retrosynthetic Route Planning

Earlier retrosynthesis programs were mainly computer-aided retrosynthetic analysis tools. Template-based methods for retrosynthetic analysis rely on human knowledge of organic synthesis and on the encoding of organic and mechanistic rules.

Template-based, rule-based or similarity-based methods

  • Confirm the extent of generalization and abstraction.
  • Choose reaction databases.
  • Encode reaction templates or synthon generation rules.
  • Generalize subgraph matching rules (subgraph isomorphism problem).
  • Extract the meaningful context around the reaction center.
  • Baseline model (Liu et al.).
  • Synthia (formerly Chematica) developed by Grzybowski and coworkers.
  • ReactionPredictor from Baldi's group based on mechanistic views.

Formalized Analysis Workflow Based on Molecular Similarity

  • Retrieve reaction precedents from the knowledge base.
  • Extract templates. These templates contain only the atoms that are immediately involved in the reaction. Each atom is specified by its atomic identity, aromaticity, number of hydrogen atoms, and chirality if applicable.
  • Score and rank candidate precursors using quantitative similarity scores (Morgan2noFeat / Morgan3noFeat / Morgan2Feat / Morgan3Feat fingerprints with the Tanimoto metric, or graph neural networks).
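
The similarity-based ranking step above can be illustrated with a minimal sketch. It assumes each fingerprint is already available as a set of on-bit indices (in practice these would come from a cheminformatics toolkit such as RDKit); the names `tanimoto`, `rank_precedents` and the example bit sets are hypothetical and chosen only for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_precedents(target_fp, precedents):
    """Sort (name, fingerprint) precedents by descending similarity to the target."""
    scored = [(name, tanimoto(target_fp, fp)) for name, fp in precedents]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical on-bit sets; real ones would be Morgan-type fingerprints.
target = {1, 4, 7, 9}
knowledge_base = [
    ("precedent_A", {1, 4, 7, 8}),
    ("precedent_B", {2, 3, 5}),
    ("precedent_C", {1, 4, 7, 9}),
]
ranking = rank_precedents(target, knowledge_base)
```

Here `ranking` lists the precedents from most to least similar, so the reactions retrieved first are the ones whose products most resemble the target.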

Data Set Descriptions of Reaction Classes

  • heteroatom alkylation and arylation
  • C–C bond formation
  • heterocycle formation
  • protections
  • deprotections
  • reductions
  • oxidations
  • functional group interconversion (FGI)
  • functional group addition (FGA)

Limitations of Template-based Methods

  • Trade-off between generalization and specificity.
  • Template extraction algorithms consider only reaction centers and their neighboring atoms rather than the wider chemical environment of the molecule.
  • Unresolved problems in mapping atoms between products and reactants, and the abundance of distinct leaving groups for equivalent reaction sites.
  • Mechanisms outside the knowledge database cannot be predicted.

AI-assisted Retrosynthesis Planning

Chemical reactions are encoded as sentences using reaction SMILES within a natural language (NL) framework. Forward or retrosynthetic reaction prediction is then treated as a translation problem, using different types of neural machine translation architectures.
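
As a rough illustration of how a reaction SMILES string can be turned into a "sentence", the sketch below splits it into tokens with a regular expression. The pattern is an assumption written for this example and covers only a common subset of SMILES syntax; `SMILES_TOKEN` and `tokenize` are hypothetical names, not part of any library.

```python
import re

# Assumed token pattern covering a common subset of SMILES syntax.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"              # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"                  # two-letter atoms of the organic subset
    r"|%\d{2}"                 # two-digit ring-closure labels
    r"|[A-Za-z]"               # single-letter atoms (lower case = aromatic)
    r"|\d"                     # single-digit ring-closure labels
    r"|[=#\-+\\/:~@?>*$().]"   # bonds, branches, dots, reaction arrows
)


def tokenize(smiles):
    """Split a (reaction) SMILES string into tokens, failing on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


# Token-level "sentence" versus plain character-level splitting.
sentence = " ".join(tokenize("CCO>>CC=O"))
token_level = tokenize("c1ccccc1Br")
char_level = list("c1ccccc1Br")
```

Token-level splitting keeps multi-character symbols such as `Br` or `[NH3+]` together, whereas character-level splitting breaks them into separate letters; the two vocabularies therefore differ even for the same input string.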

Transformer-based retrosynthesis (template-free model)

We present a template-free approach, independent of reaction templates, rules, or atom mapping, for automatic retrosynthetic route planning.

Retrosynthetic Route Planning Strategies

  • Forward reaction prediction (Molecular Transformer).
  • Neural model based on the sequence-to-sequence (seq2seq) architecture (Liu et al.).
  • Template-free self-corrected retrosynthesis predictor (Zheng et al.).
  • Monte-Carlo tree search algorithms (Segler et al.).
  • One-step retrosynthetic model.
  • Multistep synthesis plans.
  • Hyper-graph exploration strategy for automatic retrosynthesis route planning.

Figure 1. Schematic of the multi-step retrosynthetic workflow (Philippe Schwaller et al., 2019).
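
To make the step from a one-step model to a multistep plan concrete, here is a minimal sketch. It uses a plain depth-limited exhaustive search, not Monte-Carlo tree search or hyper-graph exploration, and the one-step "model" is a hypothetical lookup table with made-up scores; `ONE_STEP`, `BUILDING_BLOCKS` and `plan` are illustrative names only.

```python
# Hypothetical one-step "model": product -> list of (precursors, score).
ONE_STEP = {
    "ester": [(("acid", "alcohol"), 0.9), (("acyl_chloride", "alcohol"), 0.7)],
    "acyl_chloride": [(("acid",), 0.8)],
    "acid": [(("nitrile",), 0.4)],
}
BUILDING_BLOCKS = {"acid", "alcohol"}  # assumed to be purchasable


def plan(target, max_depth=3):
    """Return (score, steps) for the best route found, or None if none terminates."""
    if target in BUILDING_BLOCKS:
        return 1.0, []          # nothing left to disconnect
    if max_depth == 0:
        return None             # depth budget exhausted
    best = None
    for precursors, step_score in ONE_STEP.get(target, []):
        score, steps = step_score, [(target, precursors)]
        for precursor in precursors:
            sub = plan(precursor, max_depth - 1)
            if sub is None:
                break           # this disconnection cannot be completed
            score *= sub[0]     # route score = product of step scores
            steps += sub[1]
        else:
            if best is None or score > best[0]:
                best = (score, steps)
    return best


route = plan("ester")
```

The search stops expanding a branch as soon as every precursor is a building block, which is the same termination criterion a more sophisticated tree or hyper-graph search would use.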

Evaluation metrics for scalable single-step retrosynthetic models

  • Round-trip accuracy
  • Coverage
  • Class Diversity
  • Jensen-Shannon Divergence (similarity of class probability distributions)
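
The four metrics can be sketched as follows, under one common reading of their definitions: round-trip accuracy is the fraction of suggested precursor sets that a forward model maps back to the target, coverage is the fraction of targets with at least one such suggestion, and class diversity is the average number of distinct reaction classes among them. The data layout, function names and example values are assumptions for illustration, and which two class distributions are compared by the Jensen-Shannon divergence is left to the caller.

```python
import math


def round_trip_accuracy(suggestions):
    """Fraction of all suggestions whose forward prediction recovers the target."""
    flags = [ok for items in suggestions.values() for ok, _ in items]
    return sum(flags) / len(flags)


def coverage(suggestions):
    """Fraction of targets with at least one round-trip-valid suggestion."""
    hits = sum(any(ok for ok, _ in items) for items in suggestions.values())
    return hits / len(suggestions)


def class_diversity(suggestions):
    """Average number of distinct reaction classes among valid suggestions."""
    counts = [len({cls for ok, cls in items if ok}) for items in suggestions.values()]
    return sum(counts) / len(counts)


def jensen_shannon_divergence(p, q):
    """JSD with base-2 logarithm (range 0 to 1) of two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical results: target -> list of (round_trip_ok, reaction_class).
suggestions = {
    "t1": [(True, "reduction"), (False, "oxidation")],
    "t2": [(False, "protection"), (False, "reduction")],
    "t3": [(True, "reduction"), (True, "C-C bond formation")],
}
```

With this toy data, half of the six suggestions survive the round trip, two of the three targets are covered, and the valid suggestions span one reaction class per target on average.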

Workflow


Figure 2 Workflow of Transformer-based Retrosynthesis

At CD ComputaBio, the approach we have developed mimics the retrosynthetic strategy that is defined implicitly by a corpus of known reactions, without the need to encode any chemical knowledge. We have also tried token- and character-based methods for tokenizing the SMILES strings used as model input, and used the open-source cheminformatics software RDKit to validate the model. CD ComputaBio has assessed the entire framework on several retrosynthetic problems to highlight its strengths and weaknesses.
