Disrupting pharmaceuticals with Machine Learning, supercomputers, and Big Data
Artificial Intelligence for Increased Drug Discovery Efficiency – Machine Learning, Supercomputers, and Big Data
The increasing cost of the drug development process and the declining returns on investment for new drugs pose significant problems for the pharmaceutical industry. Emerging technologies have the potential to address these problems by dramatically improving the efficiency of both research and manufacturing. In particular, artificial intelligence (AI) holds great promise as a tool for enhancing drug discovery research, and pharmaceutical companies are already investing in AI-based applications in this area. The market for healthcare-related AI applications is projected to grow to $8 billion by 2022, driven primarily by investments in AI technology that can improve the drug discovery process.
The quality of an artificial intelligence solution depends on three factors: the machine learning algorithms used, the computing power leveraged to run the algorithms, and the large amounts of pre-clinical and clinical data processed by the algorithms. An increasing number of companies provide machine learning tools that are specifically designed to work on pharmaceutical applications, accelerating the process of disease target identification, compound screening, de novo drug design, and improving predictions of the clinical efficacy and toxicity of new drugs, as well as other drug characteristics, such as absorption, distribution, metabolism, and excretion (ADME). These tools have become more powerful due to recent advances in AI algorithms and expanding access to powerful computing resources, such as supercomputers and novel GPU-based AI accelerators. Though rare, even quantum computers are being used for AI-driven drug development. Access to good data remains a significant limitation on the progress of these AI systems, however, as their ability to generate insights depends greatly on the quality and amount of data that can be supplied to them. There is an increasing effort among public and private companies to aggregate and standardize the data sources commonly used for drug development. With the advent of advanced research equipment (e.g., next-generation sequencers), the ongoing digitisation of healthcare organizations, and the emerging Internet-of-Things infrastructure, an increasingly wide variety of high-quality medical data will likely play an important role in improving AI-based approaches for drug discovery.
Many pharmaceutical companies are partnering with machine learning, computing, and big data companies to investigate AI-based research techniques. For example, Johnson & Johnson is moving into Phase IIb clinical trials with a drug that an AI-based analysis identified as potentially effective in treating other medical conditions. It is important, however, to understand the advantages and disadvantages of different AI systems, as they are often optimized for specific applications. By investing in a suite of different systems, pharmaceutical companies will be able to use AI-accelerated solutions in most stages of the drug development process—from discovery to clinical trials—and may also be able to identify disruptive new treatments for complex diseases.
The drug development process is becoming less efficient. Factors such as the high research and development costs for new drugs (around $2 billion for each approved treatment), low success rates for clinical trials involving new products (less than 12%), and smaller returns on investment—due to reductions in overall healthcare expenditures and the increasing focus on rare diseases—have all impacted the development process for new pharmaceutical products. About 15-20% of the costs incurred during the development of new drugs are spent during the discovery phase, often amounting to hundreds of millions of dollars. Reducing the time and the cost of the drug discovery process and increasing the success rate of clinical trials for new drugs are imperative goals for the pharmaceutical industry.
Using computer simulations for drug development—also known as in silico screening, design, and testing—has great potential to reduce development costs and increase the success rate of drug development pipelines. This idea is not new, as computer-based analysis methods such as homology modelling, molecular docking, quantitative structure-activity relationship modelling, and molecular dynamics simulations have been in use since the 1990s. The advent of modern predictive analytic tools, however, and particularly those related to Artificial Intelligence, has led to a dramatic increase in the power of in silico methods. The term Artificial Intelligence (AI) is used here to refer to a variety of predictive analytic methods, including predictive modelling, machine learning, neural networks, deep learning, and data mining.
Pharmaceutical companies are investing in artificial intelligence solutions to enhance disease target identification, compound screening, de novo drug design, and to develop potency/toxicity predictions. The healthcare AI market is currently valued at around $700 million, and is expected to grow at a compound annual growth rate (CAGR) of 53%, reaching a market value of $8 billion by 2022.
Drug discovery applications make up the largest share of the healthcare AI market (over 35%), while other innovative applications for AI are being developed in fields such as medical imaging, diagnostics, therapy planning, and hospital workflow management.
The performance of these predictive tools relies on three key components: the algorithm (i.e., the core infrastructure), the computing power (i.e., the engine), and the data (i.e., the fuel). Rapid developments have recently occurred on all three of these fronts, creating powerful tools that pharmaceutical companies can use to gain a better understanding of complex diseases and discover advanced treatments.
Machine learning - The core infrastructure
Vast computing power and large amounts of data are not enough to perform predictive modelling. To effectively process “big data” and identify new insights, efficient algorithms are also required. Algorithm development is evolving at a rapid pace due to the revolution in artificial intelligence. At the heart of this revolution is machine learning—a powerful set of methods for discovering patterns in data sets. One of the most advanced areas of machine learning is “deep learning”, which relies on hierarchical artificial neural networks to discover subtle and complex relationships in data. Deep learning lends itself exceptionally well to drug discovery because of its ability to identify complex relationships in large or small datasets of raw, unprocessed data. This approach is advantageous for identifying new disease targets, generating novel leads, and predicting drug outcomes. There are different ways that machine learning algorithms can “learn” from data. Unsupervised machine learning techniques can help to identify hidden patterns in medical and biological research data without requiring the researcher to specify a particular target in advance. The patterns identified by this approach can then be used in a variety of medical applications, such as pursuing new disease targets. Virtual screening and de novo drug design can be achieved with reinforcement machine learning, using methods such as molecular modelling and quantum chemistry. Supervised machine learning can be used with drug and clinical trial data to make predictions of a product’s efficacy and toxicity, as well as predictions of a finished product’s key characteristics, such as its absorption, distribution, metabolism, and excretion (ADME). By leveraging the right AI algorithms, much of the drug development process can be done in silico, leading to cost savings and lower risks for companies pursuing new research.
Machine Learning (ML): This is a subset of artificial intelligence that focuses on computer programs that can adapt or “learn” when exposed to new data. This “learning”, or progressive improvement in performance, can be achieved with labelled training examples (supervised learning), with no feedback (unsupervised learning), or with performance feedback (reinforcement learning). These methods lead to the creation of accurate and precise predictive algorithms that would be too complex for humans to develop (Figure 1).
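Two of these paradigms can be illustrated with a short sketch on synthetic data: supervised learning fits a known input-to-label relationship, while unsupervised learning finds structure with no labels at all. Every number below is invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Supervised learning: least-squares fit of labelled data ---
X = rng.uniform(0, 10, size=(100, 1))                # e.g. a compound descriptor
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, 100)    # e.g. a measured activity label
A = np.hstack([X, np.ones((100, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)         # recovers slope ~2, intercept ~1

# --- Unsupervised learning: 2-means clustering of unlabelled points ---
pts = np.vstack([rng.normal(0, 0.5, (50, 2)),        # blob near (0, 0)
                 rng.normal(5, 0.5, (50, 2))])       # blob near (5, 5)
centers = np.array([pts[0], pts[-1]])                # crude initialisation
for _ in range(10):                                  # Lloyd's algorithm
    labels = ((pts[:, None] - centers) ** 2).sum(-1).argmin(1)
    centers = np.array([pts[labels == k].mean(0) for k in range(2)])
```

The supervised fit recovers the coefficients that generated the labels; the clustering step recovers the two groups without ever seeing a label.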
Artificial Neural Networks (ANN): This is an information processing algorithm that greatly improves machine learning performance. It is inspired by how biological nervous systems, such as our brains, process information. This approach typically uses a network of nodes (or artificial neurons) that are stacked in different layers and are connected together to process input, modulate each other, and generate output. The modulation happens by the algorithm working to find the optimal value for each node to generate the best possible output for a set of input values. Although these algorithms can now be run on desktop computers, supercomputers and AI accelerators significantly increase their potential to process large amounts of data and utilize more complex network designs (described in “Supercomputers – The Engine”, in this report).
Deep Learning (DL): This subset of ANN is a more recent development in artificial intelligence, and is characterised by the use of multiple “hidden layers” of nodes. The hierarchy of layers enables the algorithm to create more complicated patterns of connections in higher layers based on simpler lower layers. Due to its capability to model high-level abstractions in data through multiple non-linear transformations, it has exponentially accelerated machine learning performance (Figure 2).
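The layered structure described above can be sketched as a bare-bones forward pass in NumPy. The weights here are random placeholders, not trained values; a real network would learn them via backpropagation. The layer sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(0.0, x)          # the non-linear transformation at each layer

# Shapes: 4 inputs -> hidden layer of 8 -> hidden layer of 8 -> 1 output.
layers = [(rng.normal(size=(4, 8)), np.zeros(8)),    # hidden layer 1
          (rng.normal(size=(8, 8)), np.zeros(8)),    # hidden layer 2
          (rng.normal(size=(8, 1)), np.zeros(1))]    # output layer

def forward(x):
    # Each hidden layer builds on the (already transformed) layer below it,
    # which is what lets deeper layers represent more complex patterns.
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = relu(x)
    return x

out = forward(np.ones((1, 4)))         # one sample with 4 input features
```

Stacking more "hidden" `(W, b)` pairs deepens the hierarchy; the code shape stays the same, which is why deep learning scales so naturally with more compute.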
Figure 1. Types of machine learning: supervised, unsupervised, and reinforcement learning.
Figure 2. Neural network vs deep learning neural network.
One prominent example of an advanced machine learning system is Google’s DeepMind. This system uses a convolutional neural network with a form of model-free reinforcement learning, which means that no predefined model of the environment/data is provided. Instead, the neural network algorithm teaches itself from the data it receives, and learns how to use it to achieve the best outcomes. One of DeepMind’s recent systems, AlphaGo Zero, taught itself the complex board game Go from scratch, surpassing the human-trained AlphaGo versions that had defeated the world’s best players. DeepMind is now applying similar techniques to protein structure prediction.
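Model-free reinforcement learning of the kind described can be shown in miniature with tabular Q-learning on a hypothetical five-state corridor (an environment invented here for illustration; DeepMind's systems replace the table with a deep neural network, but the learning-from-experience principle is the same).

```python
import numpy as np

# The agent is never given the environment's rules; it learns purely from
# observed (state, action, reward, next-state) experience -- model-free RL.
# Action 0 moves left, action 1 moves right; reaching state 4 pays reward 1
# and ends the episode.
N_STATES, GOAL = 5, 4
Q = np.zeros((N_STATES, 2))                      # value table: states x actions
rng = np.random.default_rng(0)
alpha, gamma, eps = 0.5, 0.9, 0.5                # learning rate, discount, exploration

for _ in range(500):                             # episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy: explore sometimes, otherwise act on current values
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s2 == GOAL else 0.0
        # the Q-learning update uses only experience, never a model
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

policy = np.argmax(Q, axis=1)                    # learned greedy policy
```

After training, the greedy policy moves right from every non-goal state, discovered entirely through trial and error.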
An increasing number of companies are creating AI-based solutions that can aid in the drug development process. This list highlights several notable companies, and describes the services they provide:
What: predicts the bioactivity of small molecules
How: uses deep learning on convolutional neural networks (AtomNet™) for molecular modelling
Partners: AbbVie (confidential application), Merck (confidential application)
What: produces better target selection, designs new molecules, and optimises compounds
How: deep learning (Judgment Augmented Cognition System™) to mine and analyse biomedical information from clinical trial data and academic papers
Partners: Johnson & Johnson (drug repurposing; currently in Phase IIb trials for a drug that may treat symptoms of drowsiness related to Parkinson’s disease)
What: developing precision medicine solutions to provide patient-specific predictions of drug efficacy and toxicity
How: deep learning (Interrogative Biology™) to assess patients’ adaptive-omic biological data
Partners: AstraZeneca (targets and leads for neurodegenerative diseases), Sanofi (biomarkers for flu vaccine performance)
What: assists in small molecule drug design and pre-assessment for potency, selectivity, and ADME
How: uses machine learning (trade secret) with data from various experimental, structural, and clinical databases
Partners: GlaxoSmithKline (small molecules for 10 disease-related targets), Sanofi (small molecules for metabolic diseases), Sumitomo Dainippon Pharma (small molecule against two GPCR receptors), Evotec partnership which includes Bayer, Sanofi, Roche/Genentech, Johnson & Johnson, and UCB (small molecules for immuno-oncology)
What: works on drug discovery and repurposing, biomarker identification, and clinical trial design
How: uses deep learning on generative adversarial networks (DeepPharma™) to assess massive multi-omics data
Partners: GlaxoSmithKline (biological targets and pathways)
What: works on small molecule drug discovery and optimisation, including activity and toxicity prediction
How: machine learning (trade secret) that can use both small and large databases
Partners: Boehringer Ingelheim (leads for infectious diseases), Merck (leads for cardiovascular diseases), Servier (small molecule modulator design for cardiovascular disease target), Takeda (oncology, gastric, and central nervous system disorders)
What: cellular disease models for target discovery and activity/toxicity predictions
How: deep learning (trade secret) to analyse in-house experimental biology data
Partners: Takeda (leads for rare diseases), Sanofi (drug repurposing for genetic diseases)
What: discovering, screening, and prioritizing drug candidates
How: machine learning (DUMA™) with gene expression measurements, protein interaction networks, and clinical records
Partners: Asian Liver Center at Stanford (leads for hepatocellular carcinoma), Santen (leads for glaucoma)
Other notable companies include (1) Roche/Genentech and GNS Healthcare (cancer drug targets), (2) the Accelerating Therapeutics for Opportunities in Medicine (ATOM) consortium with GlaxoSmithKline (moving from drug target to patient-ready therapy in less than a year), (3) Deep Genomics, a start-up at Johnson & Johnson Innovation (using antisense oligonucleotides to manipulate cell biology and treat diseases), and (4) Turbine, a start-up formed at Bayer Open Innovation (developing molecular models of cancer biology for better biomarkers).
Supercomputers - The engine
Machine learning algorithms must be run on a computing platform. Although simple machine learning algorithms can be run on desktop computers, supercomputers can significantly increase the power of machine learning methods, due to their capabilities to execute more complex AI algorithms and to work with larger datasets.
Increases in computing power have driven improvements in the performance of predictive modelling and artificial intelligence applications. It is projected that a supercomputer will reach 1 exaFLOPS—one billion billion floating-point operations per second—by 2021. This level of processing power is believed to be comparable to that of the human brain, and would allow for very powerful data analytics and predictive modelling.
Currently, China’s “Sunway TaihuLight” is the world’s fastest supercomputer, with an estimated performance of 93 petaFLOPS. It has been used for a number of commercial applications, including oil prospecting, weather forecasting, industrial design, and pharmaceutical research. However, it consumes a massive 15.4 MW of power in operation. IBM’s better-known “Watson” is a cluster of 90 servers, with an estimated 80 teraFLOPS of total processing power. IBM has allowed Watson to be used for a variety of commercial applications, including drug discovery, clinical development, and disease diagnosis. Pfizer, for example, is accelerating its immuno-oncology research with IBM Watson. IBM’s fastest supercomputer, “Sequoia”, operates at 20 petaFLOPS. This computer is used by Atomwise (see above) and allows the AtomNet™ AI algorithm to evaluate 8.2 million compounds in a matter of days.
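The FLOPS figures quoted above can be made concrete with a small benchmark sketch: time a dense matrix multiply and divide the operation count by the elapsed time. The matrix size and repeat count below are arbitrary choices, and the result reflects whatever CPU and BLAS library the sketch runs on, not any of the machines named above.

```python
import time
import numpy as np

# An n x n matrix multiply costs roughly 2 * n**3 floating-point operations
# (n multiplies plus n adds for each of the n*n output entries).
n, reps = 512, 10
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
for _ in range(reps):
    a @ b                                       # dense matmul, the classic HPC kernel
elapsed = time.perf_counter() - start

flops = reps * 2 * n**3 / elapsed               # sustained FLOPS estimate
print(f"~{flops / 1e9:.1f} gigaFLOPS sustained on this machine")
```

A typical laptop lands in the tens of gigaFLOPS on this kernel, which puts the petaFLOPS-class machines above (millions of times faster) in perspective.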
Nvidia is changing the supercomputing paradigm by introducing new computing models that will accelerate both artificial intelligence and high-performance computing (HPC) applications. These new advancements caused their stock to soar by 81.3% in 2017. Nvidia provides heterogeneous computing solutions that use multiple graphics processing units (GPUs) as co-processors, each functioning as a distinct fast compute node. Because their technology massively reduces the power consumption required to reach higher processing speeds, it has also made running artificial intelligence applications more accessible—in desktops, notebooks, servers, and supercomputers.
In 2017, Nvidia released its Volta processors, which use a so-called “tensor” microarchitecture, also used by Google’s AlphaGo Zero, that is optimized for deep learning. This microarchitecture is used in their consumer GPU model called ‘Titan V’, which delivers about 15 teraFLOPS on a classical benchmark and 120 teraFLOPS on a tensor benchmark, while using only 600W of power. Nvidia also supplies their Volta processors as part of their GPU-based cloud services, their data centre GPU ‘Tesla V100’, and their desktop AI supercomputer system called ‘DGX-1’. With the new tensor cores, the DGX-1 system delivers an astonishing 960 teraFLOPS, and can drastically increase the performance of machine learning applications. BenevolentAI (see above) has been using the previous version of DGX-1 (170 teraFLOPS) for their Judgment Augmented Cognition System™, making in silico drug discovery faster and more efficient than ever.
The next frontier of computing is quantum computing. Companies are racing to produce stable and application-ready quantum systems. Quantum computers use quantum bits, or qubits—often encoded in single particles—to store information. This approach would enable exponential computing power in a small device with low power consumption. A system with only 50 qubits, for example, could theoretically outperform current supercomputers (Figure 3 & 4). Keeping these qubits stable, however, has proven to be a difficult engineering challenge.
The qubits must be kept near absolute zero (below -273°C), in a high vacuum, and shielded from forces such as vibration and magnetic fields. And although quantum computers are deemed powerful for simulations, analytics, and data processing, the software tools and applications necessary to make use of this power are still under development.
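The "50 qubits" threshold mentioned above reflects exponential scaling: simulating n qubits classically requires a state vector of 2**n complex amplitudes, doubling with every added qubit. The short sketch below makes this concrete with a Hadamard-gate example (a minimal illustration, not tied to any vendor's hardware).

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # single-qubit Hadamard gate

def uniform_superposition(n_qubits):
    # Build the n-qubit gate as a Kronecker product of single-qubit gates,
    # then apply it to the all-zeros basis state |00...0>.
    gate = H
    for _ in range(n_qubits - 1):
        gate = np.kron(gate, H)                # Hadamard on every qubit
    state = np.zeros(2 ** n_qubits, dtype=complex)
    state[0] = 1.0
    return gate @ state                        # equal superposition of 2**n states

state = uniform_superposition(3)               # vector of 2**3 = 8 amplitudes
```

Storing 50 qubits this way would need 2**50 amplitudes at 16 bytes each, roughly 18 petabytes of memory, which is why machines in that range are expected to escape classical simulation.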
D-Wave, founded in 1999, was the first private company to produce quantum computers. Its D-Wave 2000Q provides 2,048 qubits and consumes only 25 kW of power. Although D-Wave’s systems have been used by Google, NASA, Volkswagen, and other organisations, they are not programmable for general-purpose applications and can only be used for specific tasks, such as optimization, sampling, anomaly detection, and image analysis. D-Wave’s qubits are also fragile and difficult to manipulate. Companies such as IBM (50 qubits), Intel (49 qubits), and Rigetti (19 qubits) have more credible roadmaps for general-purpose quantum computing, and their systems are cloud-based and open source. Thousands of scientists have already used them to run simulations and machine learning tasks. In the pharma industry, Amgen and Biogen are investing in quantum computing for quantum chemistry simulations that can be used for drug discovery. Biogen has partnered with Accenture Labs and 1QBit, a quantum computing software company, to produce tailored molecular-comparison quantum software, hoping to speed up drug discovery for neurological disorders.
Figure 3. Computing Power in Perspective - Floating Operations Per Second (FLOPS) Benchmarking.
Big data - The fuel
Supercomputers and deep learning algorithms can only produce insights from the data that is provided to them. Often those in the best position to take advantage of AI are those who have access to the best data, rather than those with the best algorithms or the most powerful processors. It is true that modern machine learning algorithms can analyse a variety of unstructured data sources, such as the massive database of peer-reviewed life science articles on PubMed. However, more valuable insights can be found by examining more structured sources of information, and this is certainly the case for applications like drug discovery.
We live in the age of the ‘information explosion’: 90% of all data that currently exists has been created in the last two years, and each day the world produces about 2.5 exabytes, or 2.5x10¹⁸ bytes, of new information. Much of this data is, however, dispersed, inaccessible, and uncurated. Different private and public organisations are now focusing on aggregating medical data so that it can be used more effectively. For drug discovery specifically, there are numerous public databases that can be mined. These databases fall into three categories:
1) Molecular Biology Databases used to identify disease targets, including -omics data (genomic, transcriptomic, proteomic, metabonomic), molecular interactions, gain and loss of function, and microscopy images. Databases: dbSNP, dbVar, COSMIC, 1000 Genomes Project, TCGA, Gene Expression Omnibus, ArrayExpress, Cancer Genome Atlas, GTEx Portal, Encode, Human Protein Atlas, Human Proteome Map, Cancer Cell Line Encyclopaedia, Project Achilles, etc.
2) Structure-Function Databases used to identify novel drug leads, including molecular structures, drug-target interaction, and structure-function relations. Databases: LINCS, Connectivity Map, ChEMBL, PubChem, etc.
3) Clinical Trial Databases used to predict drug responses, including drug efficacy, toxicity, and ADME. Databases: Cancer Therapeutics Response Portal, ImmPort, ClinicalTrials.gov, PharmaGKB, etc.
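As a toy illustration of why structured databases of these kinds are so useful, the sketch below joins two invented tables shaped like categories 1 and 2 above: one mapping genes to diseases, one mapping compounds to gene targets with measured potency. Real sources such as ChEMBL or dbSNP expose far richer schemas; every record and threshold here is hypothetical.

```python
import csv
import io

# Hypothetical molecular-biology table: which disease does each gene drive?
targets_csv = """gene,disease
EGFR,lung cancer
BRAF,melanoma
"""

# Hypothetical structure-function table: which gene does each compound hit,
# and how potently (IC50 in nanomolar; lower = more potent)?
activity_csv = """compound,gene,ic50_nm
cmpd-001,EGFR,12.5
cmpd-002,BRAF,3.1
cmpd-003,EGFR,450.0
"""

targets = {row["gene"]: row["disease"]
           for row in csv.DictReader(io.StringIO(targets_csv))}

# Join the two tables: keep only potent hits (IC50 below 100 nM) and look up
# the disease that each compound's target is associated with.
hits = [(row["compound"], targets[row["gene"]])
        for row in csv.DictReader(io.StringIO(activity_csv))
        if float(row["ic50_nm"]) < 100.0]
print(hits)  # [('cmpd-001', 'lung cancer'), ('cmpd-002', 'melanoma')]
```

The same join, run across millions of curated records instead of five invented ones, is essentially what the data-mining stages of AI-driven target identification automate.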
There are also private companies that are monetizing their work aggregating and structuring data. Often these companies use machine learning tools to mine and curate data. Both Innoplexus and NuMedii, for example, tap into molecular, biological, and clinical databases to provide annotated, curated, and normalised data that can be more easily used for drug discovery. Other companies are tackling the vast amount of data that will be generated by new sources, like next-generation sequencers. Verily, an Alphabet company, has a genetic data collection initiative, and is working with Biogen to identify the causes of multiple sclerosis, and with AstraZeneca to identify ways to prevent and reverse coronary heart disease. 23andMe, a consumer-focused genetic analysis company, is also engaging in partnerships with Lundbeck and Pfizer to use its collection of genetic data in the drug development process. Much more effort is needed, however, to centralise and standardize the data produced by various biological and medical research institutes. On this front, sharing services such as the EU’s CORBEL are leading the way.
Patient data, such as insurance data, public health data, mobile health data, patient reporting data, omics data, electronic health record (EHR) data, familial data, and environmental data, are also available for analysis. These sources could provide insights into disease progression and treatment, and support the development of new approaches to healthcare, such as outcome-based models and patient-facing services. Data mining is often necessary when working with these kinds of sources, because about 80% of healthcare data is unstructured. Significant concerns remain, however, regarding the privacy protections for patient health data. For example, Google DeepMind’s agreement to access kidney failure data from the UK National Health Service’s records led to a backlash due to concerns about patient privacy. Companies such as IQVIA approach this problem by investing in robust privacy and security measures. IQVIA buys and curates data from pharmacy suppliers and EHR systems. Pharma companies can then use IQVIA’s data to optimize their clinical development strategies.
With the advent of the Internet-of-Things (IoT), the amount of patient-specific data will increase rapidly. Although mining this data for insights may prove challenging, it could eventually lead to an even better understanding of health and disease.
IoT health solutions such as clinical-grade biometric sensors, home monitors, and fitness wearables will add to the vast amounts of data that can be used to predict novel disease targets and repurpose drugs.
Proteus Digital Health, for example, uses ingestible sensors on pills to not only track adherence but also to track symptoms. Companies such as Quantus and MC10 produce clinical-grade wearable biometric sensors that can track a patient’s various vital signs.
The computational tools provided by AI have incredible potential to improve pharmaceutical research in a variety of areas, including disease target identification, compound screening, de novo drug design, and clinical prediction. The recognition of this potential is evident from the increasing number of AI technology providers, and the numerous examples of AI-based approaches being tested or used in the pharmaceutical industry. For instance, BenevolentAI and Johnson & Johnson used a machine learning solution to identify a drug, now in Phase IIb clinical trials, that could potentially be repurposed to treat symptoms of drowsiness related to Parkinson’s disease. Dozens of other pharma and biotech companies have initiated collaborations with AI companies, aiming to capitalize on the opportunities created by recent advances in machine learning and supercomputing. It will be important, however, for companies to invest in the most effective machine learning tools for their specific applications, as each approach has its own strengths and weaknesses and is often best suited to a particular task. With ever-increasing computational power from supercomputer innovation, novel GPU-based AI accelerators, and quantum computing, the impact of AI-based approaches on the drug development process will only increase. We are only at the start of the data era. With more data pouring in from advanced research technology (e.g., next-generation sequencing), the ongoing digitisation of healthcare, and Internet-of-Things devices, more insights will be gained. Soon, artificial intelligence will not only perform most parts of drug development—from discovery to clinical trials—but may also find disruptive new treatments for complex diseases.
KEY INNOVATORS IN THE TECHNOLOGY AREA
Proteus Digital Health
ADOPTERS IN PHARMA
Johnson & Johnson
Sumitomo Dainippon Pharma