IBM RXN for Chemistry/IBM Research Europe and Science of Synthesis and Synfacts/Thieme Chemistry
Cooperation brings together machine learning and expert human-curated data with unprecedented results in synthesis reaction planning.
Stuttgart, November 2021 – The cooperation between IBM Research Europe and Thieme Chemistry, which was announced this summer, builds on the synergies between high-quality data (Science of Synthesis and Synfacts by Thieme) and state-of-the-art machine learning models for organic chemistry synthesis predictions (RXN for Chemistry by IBM) to create an unprecedented user experience. RXN for Chemistry, the cloud platform using artificial intelligence (AI), has recently been trained on the highest-quality, human-curated datasets from Thieme’s Science of Synthesis and Synfacts. IBM Research Europe and Thieme Chemistry now announce the first results of their cooperation, which were evaluated by seven eminent synthetic chemistry experts and their research groups from China, Germany, Switzerland, New Zealand, and the USA.
Organic compounds can react with each other in hundreds of thousands of different ways. Experiential knowledge is key for organic chemists to avoid spending countless hours in the laboratory on trial and error. To improve synthesis planning, IBM Research and Thieme Chemistry have combined the expert human-curated datasets from Thieme’s full-text resource for methods in synthetic organic chemistry, Science of Synthesis, and the reviewed content from the journal Synfacts with the artificial intelligence model called Molecular Transformer in RXN for Chemistry by IBM.
The Molecular Transformer, a neural machine translation model, was created to reliably predict the outcome of chemical reactions and was later enhanced to include retrosynthetic analysis. The model has proven to be very successful at learning the information of chemical reactivity present in datasets of chemical reactions. It is, however, limited to the content and correctness of these datasets.
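The translation framing can be illustrated with a minimal Python sketch. It assumes reactions are written as text in SMILES notation, with `>>` separating reactants from product, which this release does not state. The token pattern and the function names `tokenize` and `to_translation_pair` are hypothetical simplifications and not the tokenizer used by RXN for Chemistry.

```python
import re
from typing import List, Tuple

# Simplified, illustrative token pattern (an assumption, not the RXN tokenizer):
# bracket atoms, two-letter halogens, the reaction arrow, two-digit ring
# closures, then any remaining single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|>>|%\d{2}|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens, checking that nothing is lost."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("tokenization dropped characters from: " + smiles)
    return tokens


def to_translation_pair(reaction: str) -> Tuple[List[str], List[str]]:
    """Turn 'reactants>>product' into (source tokens, target tokens)."""
    reactants, product = reaction.split(">>")
    return tokenize(reactants), tokenize(product)


# Toy example: acetic acid and ethanol giving ethyl acetate.
source, target = to_translation_pair("CC(=O)O.OCC>>CC(=O)OCC")
print(" ".join(source))  # the "sentence" a forward model would read
print(" ".join(target))  # the "sentence" it would be trained to produce
```

In this picture, forward prediction translates the reactant tokens into product tokens, and retrosynthesis runs the translation in the opposite direction.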
Increased prediction accuracy
Science of Synthesis and Synfacts cover a wide area of reaction space. Typically, models trained on commercially available patent datasets perform poorly on many such reactions. Results show that Thieme-trained models on the RXN for Chemistry platform increase prediction accuracy by a factor of three for forward predictions, and a factor of nine for retrosynthesis.
An analysis of the datasets relying on reaction fingerprints and clustering algorithms revealed that the existing patent data and Thieme data are complementary to each other in terms of reaction coverage, with some chemistry covered exclusively in Science of Synthesis and Synfacts. Combining both during training enables us to maximize the knowledge learnt by the models.
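The idea of complementary coverage can be sketched in a few lines of Python. The sketch assumes each reaction record has already been assigned a cluster label by some fingerprinting and clustering step, which is not shown. The function name `coverage_overlap` and the toy labels are invented for illustration and do not reflect the actual analysis or its results.

```python
from typing import Dict, Iterable, Set


def coverage_overlap(
    patent_labels: Iterable[int], thieme_labels: Iterable[int]
) -> Dict[str, Set[int]]:
    """Compare which reaction clusters each dataset covers."""
    patent, thieme = set(patent_labels), set(thieme_labels)
    return {
        "shared": patent & thieme,       # chemistry present in both datasets
        "patent_only": patent - thieme,  # covered only by patent data
        "thieme_only": thieme - patent,  # covered only by Thieme data
    }


# Toy cluster labels, one per reaction record (hypothetical values).
patent_labels = [0, 0, 1, 2, 2, 3]
thieme_labels = [2, 3, 3, 4, 5]

overlap = coverage_overlap(patent_labels, thieme_labels)
combined = overlap["shared"] | overlap["patent_only"] | overlap["thieme_only"]
print(overlap)
print(len(combined), "clusters covered when both datasets are combined")
```

Non-empty `patent_only` and `thieme_only` sets are what "complementary" means here: training on the union covers more clusters than either dataset alone.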
As is shown in the table below, Science of Synthesis and Synfacts have a higher quality of chemical records, reflected by a larger percentage of usable records. This consistency in Thieme’s dataset facilitates the learning process of the AI models, resulting in more consistent predictions.
| Collection | Usable for AI |
| Commercial Patent Dataset | ~35% |
The RXN model retrained with Science of Synthesis and Synfacts data achieves a chemical accuracy of ~70% on the prediction of complex chemical records, and provides diverse retrosynthetic recommendations, with suggested reactions closely related to the ones present in Thieme's data.
The collaborative work between Thieme and IBM Research Europe shows the impact high-quality chemical reaction data can have on future AI chemical synthesis tools. Integrating high-quality, curated data from Science of Synthesis and Synfacts provides a unique opportunity to boost the performance of RXN for Chemistry to unprecedented levels, as it unlocks the knowledge contained in hundreds of thousands of chemical reaction records.
Insightful feedback from synthetic chemistry experts
Seven highly renowned organic synthesis experts and their groups from around the world agreed to evaluate the retrained models. The experts will continue to provide insightful feedback to IBM Research Europe and Thieme during this collaboration, enabling improvements to the models and their usage, as well as creating a unique forum for exchange between machine learning experts and the synthetic organic chemistry community:
“This innovative IBM/Thieme Chemistry platform provides an efficient tool for synthetic chemistry researchers to provide validation for their own retrosynthetic plans whilst also being presented with alternative solutions. It enables a rigorous assessment for the retrosynthetic design phase of a given synthesis which no doubt will pay dividends when the selected synthetic plan is implemented.”
Prof. Dame Margaret Brimble (University of Auckland, New Zealand)
“A sustainable future for synthesis will include minimizing the number of unproductive strategies that are pursued by running only those reactions that lead to a productive end. This is only possible through the marrying of computer designed and human designed efforts, which makes this collaboration with IBM and Thieme Chemistry exciting.”
Prof. Richmond Sarpong (University of California, Berkeley, USA)
Also involved in testing the retrained models were Prof. Alois Fürstner (MPI Mülheim, Germany), Prof. Karl Gademann and Prof. Cristina Nevado (University of Zurich, Switzerland), Prof. Ang Li (Shanghai Institute of Organic Chemistry, China), Prof. Dirk Trauner (New York University, USA) and their research groups.
On December 1st, 2021, IBM Research Europe and Thieme Chemistry will be holding a free Web seminar, where the outcome of their collaboration will be outlined. The teams will compare the performance of language models trained on the highest-quality commercially available datasets (Science of Synthesis and Synfacts) to that of models trained on publicly available patent reaction records, with a specific focus on retrosynthetic and chemical prediction tasks.
If you are interested in participating, please register here: Web seminar “Powering Molecular Transformers with High Quality Data”
Would you be interested in using IBM RXN for Chemistry, trained on Science of Synthesis/Synfacts data, as a cloud service if it should become available later? Please contact firstname.lastname@example.org.
IBM RXN for Chemistry is available for free at: https://rxn.res.ibm.com
About IBM Research
For more than seven decades, IBM Research has defined the future of information technology with more than 3,000 researchers in 16 locations across five continents. Scientists from IBM Research have produced six Nobel Laureates, 10 U.S. National Medals of Technology, five U.S. National Medals of Science, six Turing Awards, 19 inductees in the National Academy of Sciences and 20 inductees into the U.S. National Inventors Hall of Fame.
IBM Research has been developing data-driven chemistry solutions based on language models for over four years. In 2018, IBM launched RXN for Chemistry. The cloud platform uses an artificial intelligence model called Molecular Transformer, which applies neural machine translation to predict the outcome of a chemical reaction and thus improve synthesis planning in organic chemistry.
For more information, please visit www.research.ibm.com