Pipeline Physics

Gary Summers, PhD, President, Pipeline Physics LLC
1700 University Blvd, #936, Round Rock, TX 78665-8016
gary.summers@PipelinePhysics.com | 503-332-4095


Estimating probabilities of success - it's not so successful

Some estimates find the cost of developing a drug to exceed $2 billion. One exceptional expense is the high rate of failures in phase I, II and III clinical trials. The failure rates are high because each phase advances many unmarketable (unsafe or ineffective) compounds downstream. These events are called false-positives. Additionally, each phase cancels some marketable (safe and effective) compounds. These events are called false-negatives. False-positives are harmful because each one spends downstream resources on a compound that will eventually fail, and false-negatives are harmful because each one eliminates a compound that could have been a successful product.

A phase's ability to distinguish marketable from unmarketable compounds is called its resolution. Higher resolution diminishes the false-positive and false-negative rates, so increasing resolution may be the most effective method of reducing drug development costs. Meanwhile, any practice that reduces resolution is extremely costly.

Current best practices recommend evaluating compounds with expected net present values (eNPV), and this metric requires probability estimates. For example, Figure 1 presents a decision tree for estimating the eNPV of a compound being considered for phase I trials. It requires four probabilities: the probability of technical success for phase I, phase II, phase III and the new drug application (NDA). The resolution produced by eNPVs depends on these probability estimates, so we should ask, "How do these probabilities affect a phase's resolution?" To answer this question, let's look at probability estimates and forecasts.

(The resolution produced by eNPVs depends even more on revenue estimates, which can be highly erroneous. See my discussion, "Revenue forecasting errors dominate project evaluations.")


Figure 1: A decision tree for a compound entering clinical trials in drug development. The four chance nodes (branching points) represent the three phases of clinical trials and FDA approval. The top branch of each node represents success and the bottom branch represents failure. For example, the drug has a 70% chance of success in phase I clinical trials, and if successful, it has a 45% chance of success in phase II trials. The red numbers within the decision tree represent the costs of each stage, while the green number represents revenue (all presented as present values). For example, phase I clinical trials cost $3 million and phase II trials cost $6.5 million.
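For readers who want to reproduce the rollback behind Figure 1, here is a minimal sketch in Python. Only the phase I and phase II success probabilities (70% and 45%) and costs ($3 million and $6.5 million) come from the caption; the phase III and NDA probabilities, their costs, and the revenue are hypothetical placeholders rather than values from the figure.

```python
# Sketch of the eNPV rollback for a decision tree like Figure 1.
# All values are present values in $ millions; lines marked "assumed"
# are placeholders, since only the phase I/II numbers appear in the text.
stages = [
    # (probability of success, cost)
    (0.70, 3.0),    # phase I   (from Figure 1)
    (0.45, 6.5),    # phase II  (from Figure 1)
    (0.55, 20.0),   # phase III (assumed)
    (0.85, 2.0),    # NDA       (assumed)
]
revenue = 250.0     # present value of revenue if approved (assumed)

def expected_npv(stages, revenue):
    """Roll back from the right: each chance node is worth
    -cost + p * (value of continuing); failure branches are worth zero."""
    value = revenue
    for p, cost in reversed(stages):
        value = -cost + p * value
    return value

print(f"eNPV at the phase I decision: ${expected_npv(stages, revenue):.1f} million")
```

The rollback multiplies the stage probabilities along the success path, which is why the resolution of the resulting eNPV depends so heavily on how those probabilities are estimated.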

Probability estimates

Perhaps the most common method of obtaining probabilities is to ask experts to estimate them, often with guidance from decision scientists via a process called elicitation. The field of decision theory offers many strong papers that provide excellent advice for eliciting probability estimates. A voluminous portion of this literature studies biases that arise from managerial and expert judgment, which cause probability estimates to be either too high or too low. As I describe below, biased probabilities are probably a minor problem. The major problem, which is unstudied but is a subject of my current research, is diminished resolution. Estimating probabilities may diminish resolution considerably when compared to the resolution provided by raw clinical data. If so, the "best" practice of evaluating compounds via eNPVs would be harmful, significantly raising drug development costs and preventing many safe and effective drugs from reaching patients.

Before addressing resolution, let's look at the lesser problem: biases. Scholars test for biases by grouping probability estimates together and seeing if the group's predictions match results. For example, suppose one estimated the probability of technical success for many compounds in phase I, II and III trials and the NDA. Subsequently, one groups the probability estimates into categories of 0%-10%, 11%-20%, 21%-30%, 31%-40%, 41%-50%, 51%-60%, 61%-70%, 71%-80%, 81%-90% and 91%-100%. Once the compounds' results are known, one can compare each group's predicted success rate to its actual success rate. Figure 2 illustrates these plots, which we'll call calibration plots.
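Here is a minimal sketch of how such a calibration table can be computed, assuming one has each compound's estimated probability and its eventual pass/fail result; the eight compounds at the end are made-up, illustrative data.

```python
import numpy as np

def calibration_table(predicted, outcome, n_bins=10):
    """Group probability estimates into bins (0%-10%, 11%-20%, ...) and compare
    each group's mean predicted success rate with its actual success rate."""
    predicted = np.asarray(predicted, dtype=float)
    outcome = np.asarray(outcome, dtype=float)   # 1 = success, 0 = failure
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (predicted > lo) & (predicted <= hi) if lo > 0 else (predicted <= hi)
        if in_bin.any():
            rows.append((lo, hi, predicted[in_bin].mean(), outcome[in_bin].mean(), int(in_bin.sum())))
    return rows   # (bin low, bin high, mean predicted, actual success rate, count)

# Hypothetical example: eight compounds with estimated probabilities and results.
estimates = [0.15, 0.25, 0.30, 0.45, 0.55, 0.65, 0.80, 0.90]
results   = [0,    0,    1,    0,    1,    1,    1,    1   ]
for lo, hi, pred, actual, n in calibration_table(estimates, results):
    print(f"{lo:.0%}-{hi:.0%}: predicted {pred:.0%}, actual {actual:.0%} (n={n})")
```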

In Figure 2, the left plot shows well-calibrated probabilities. The predictions perfectly match the results, so the points lie on the diagonal line. The middle plot shows pessimistic biases. The predicted success rates are less than the actual success rates, so the points lie below the diagonal line. The right plot shows optimistic biases. The predicted success rates are greater than the actual success rates, so the points lie above the diagonal line.


Figure 2: Calibration charts showing well-calibrated probability estimates, pessimistic probability estimates and optimistic probability estimates.

Experts at Eli Lilly performed this analysis using more than 730 probability estimates. The estimates were made by an independent review board, guided by experts who were trained in reducing biases.

Eli Lilly's predicted success rates were very close to the actual success rates. On their calibration plot, the points were remarkably close to the diagonal line. Perhaps other companies' estimates are not as well calibrated, but Eli Lilly has proven that companies can reduce biases and produce well-calibrated probability estimates. These exceptional results are published in the following paper, which is available from the Decision Analysis Affinity Group:

Andersen, J. (2011), "Probability elicitation and calibration in a research and development portfolio: a 13-year case study," presented at the 17th Annual Decision Analysis Affinity Group Conference.

Unfortunately, well-calibrated probability estimates are not sufficient for success. (They might not even be necessary.) This fact is easy to show. Suppose 40% of phase II compounds advance to phase III. Assigning every compound a 40% probability of success produces perfectly calibrated probabilities, but these probabilities are useless. They provide no ability to distinguish marketable from unmarketable compounds, so they produce the highest possible false-positive rates and false-negative rates and thus the highest possible drug development costs.

If calibration does not ensure success, what quality must probabilities possess to make drug development more productive? To illustrate it, let's change the probabilities in a subtle but important way. Instead of predicting success in a phase, suppose experts estimate the probability of each compound being marketable. To build an example, suppose 34% of the compounds evaluated in phase II are marketable, all the probability estimates are well calibrated, and consider three different probability assignments for the compounds: every compound receives the 34% base rate; each compound receives either a 20% or a 50% chance; or each compound correctly receives either a 0% or a 100% chance.

Figure 3 presents the calibration plots for these estimates. Assigning a 34% chance of success to each compound provides no ability to distinguish the marketable compounds from their unmarketable counterparts (green triangle). Assigning each compound either a 20% or a 50% chance provides moderate resolution (blue diamonds). Compared to the blanket 34% estimates, false-positive and false-negative rates fall, as do drug development costs. Correctly assigning each compound a 0% or 100% chance perfectly categorizes the compounds and produces perfect (infinite) resolution (orange squares). Perfect resolution eliminates all false-positives and false-negatives and minimizes drug development costs.


Figure 3: Calibration chart showing probability estimates with perfect, moderate and no resolution.
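The link between resolution and error rates can also be checked with a small simulation, following the example above: 34% of compounds are marketable, and each probability assignment is scored by the lowest combined false-positive and false-negative rate that any "advance if the estimate is at least t" rule can achieve. The population size and the 25% misassignment rate in the moderate case are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
marketable = rng.random(n) < 0.34                 # 34% of compounds are marketable

def best_total_error(estimates, marketable):
    """Smallest (false-positive rate + false-negative rate) achievable by any
    'advance if estimate >= t' rule; a crude summary of resolution."""
    best = 1.0
    for t in np.unique(np.concatenate(([0.0], estimates, [1.01]))):
        advance = estimates >= t
        fp = np.mean(advance & ~marketable)       # unmarketable compounds advanced
        fn = np.mean(~advance & marketable)       # marketable compounds cancelled
        best = min(best, fp + fn)
    return best

# 1. No resolution: every compound receives the 34% base rate.
no_res = np.full(n, 0.34)
# 2. Moderate resolution: most marketable compounds receive 50%, most others 20%.
flipped = rng.random(n) < 0.25                    # fraction misassigned (assumed)
moderate = np.where(marketable ^ flipped, 0.50, 0.20)
# 3. Perfect resolution: marketable compounds receive 100%, the rest 0%.
perfect = np.where(marketable, 1.0, 0.0)

for name, est in [("no resolution", no_res), ("moderate", moderate), ("perfect", perfect)]:
    print(f"{name:>13}: best achievable FP + FN rate = {best_total_error(est, marketable):.1%}")
```

With the blanket 34% estimate, any threshold either advances every compound or cancels every compound, so the error rates cannot fall below the base rates; the 0%/100% assignment drives both error rates to zero.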

As Figure 3 illustrates, what matters is not calibration but resolution: how well one can distinguish marketable from unmarketable compounds by "spreading" the estimated probabilities towards their extreme values of 0% and 100%. The further the spread, the better the resolution.

From the discussion of Eli Lilly's probability estimates, notice how their experts based the probability estimates on the results of clinical trials. One can ask the following questions: Do the clinical trial results themselves provide high resolution? And do the probability estimates derived from those results retain that resolution?

If the answers are yes and no, respectively, then using eNPV, the common metric of decision analysis and financial theory, harms drug development, increasing false-positive and false-negative rates and raising costs.

My current research is answering the above questions to determine the compound evaluation methods that produce the highest resolution.

Technical note: biases and resolution are related in the following way. Trying to "spread" the probabilities more than one's information and cognitive processes permit creates the following biases: for compounds that receive high probabilities (of being marketable), one overestimates the probabilities; for compounds that receive low probabilities, one underestimates the probabilities.
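A small simulation can illustrate this relationship; the range of evidence-supported probabilities, the 2.5x "stretch" factor and the bins are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_p = rng.uniform(0.2, 0.6, n)          # the probabilities the evidence actually supports
outcome = rng.random(n) < true_p           # realized success or failure

# "Over-spread" estimates: push each probability away from the base rate
# further than the evidence warrants (the 2.5x stretch is an assumption).
base = true_p.mean()
spread = np.clip(base + 2.5 * (true_p - base), 0.0, 1.0)

for lo, hi in [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]:
    in_bin = (spread >= lo) & (spread < hi)
    if in_bin.any():
        print(f"estimates {lo:.0%}-{hi:.0%}: mean estimate {spread[in_bin].mean():.0%}, "
              f"actual success rate {outcome[in_bin].mean():.0%}")
```

In the high bins the estimates exceed the actual success rates, and in the low bins they fall short of them, which is exactly the bias pattern described above.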

Probability forecasts

Figure 1 presents a curious situation. The decision tree requires estimating downstream probabilities. Specifically, the compound is being considered for phase I, but the evaluation method requires estimating the (conditional) probabilities of success in phase II, phase III and the NDA. The information needed for these estimates does not yet exist. For example, estimating the probability of success in phase III is best done with the results from phase II trials, but the compound has not yet entered phase I. For this reason I call these downstream probabilities forecasts, rather than estimates. (From the above discussion, notice how Eli Lilly did not try to forecast downstream probabilities.)

To consider the resolution produced by forecasting downstream probabilities, one needs the irresolution ratio, which is:


(average error in an estimate) / (total variation in the estimates) = (variation from noise) / (variation from noise + variation from signal)

The irresolution ratio is an important but underutilized metric in project portfolio management (PPM). It identifies the best attributes to use in scoring models, measures the quality of project evaluations and identifies the best method of project evaluation. Additionally, it's used to calculate the reliability of information, an important metric described in my discussion, "Where's the feedback?"

All data contains signal and noise. The irresolution ratio compares the noise to the signal. Let's look closely at its parts. The numerator measures the error in estimates of a value, such as the errors in forecasting revenues or in estimating projects' probabilities of success. The denominator presents the total variation in an estimated variable, which is the variation from the signal plus the variation from the noise.

Let's first address the variation of the signal. A weak signal varies little over projects. For example, if one assigns all projects the same probability of success, the estimates do not distinguish the projects. Then the variation of the signal is zero and the irresolution ratio is one, its worst value.

At the opposite extreme the variation of the signal is much larger than the variation of the noise. Suppose one accurately and precisely estimates projects' probabilities of success, so estimation errors are insignificant. The irresolution ratio is very small, close to zero. Then the estimated probabilities contribute greatly to PPM, aptly distinguishing projects by their likelihood of success.
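Here is a minimal sketch of how the irresolution ratio could be computed, assuming squared error as the error measure; the true probabilities and noise levels are simulated, illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

def irresolution_ratio(estimates, true_values):
    """(average squared estimation error) / (total variation in the estimates),
    i.e. variation from noise over variation from noise plus signal."""
    noise_var = np.mean((estimates - true_values) ** 2)
    total_var = np.var(estimates)
    return noise_var / total_var if total_var > 0 else 1.0   # no signal: worst value

true_p = rng.uniform(0.1, 0.7, n)    # the "signal": each project's true success probability

# Precise estimates: small errors, so most of the variation is signal (ratio near zero).
precise = np.clip(true_p + rng.normal(0.0, 0.02, n), 0.0, 1.0)
# Noisy guesses: large errors, so most of the variation is noise (ratio is high).
noisy = np.clip(true_p + rng.normal(0.0, 0.30, n), 0.0, 1.0)
# The historical base rate assigned to every project: no signal at all.
constant = np.full(n, true_p.mean())

print(f"precise estimates       : {irresolution_ratio(precise, true_p):.2f}")
print(f"noisy guesses           : {irresolution_ratio(noisy, true_p):.2f}")
print(f"constant historical rate: {irresolution_ratio(constant, true_p):.2f}")
```

The noisy guesses stand in for expert forecasts made without the needed information, and the constant value stands in for assigning every project the historical success rate, the two situations discussed next.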

Now, let's consider estimating the downstream probabilities. Recall from Figure 1 that experts must estimate the (conditional) probabilities of success for phase II, phase III and the NDA, but since the compound has yet to enter phase I, they lack the information they need. Their estimates will contain large errors, so the irresolution ratio for the estimates will be close to one.

An alternative to expert guesstimates is to estimate the downstream probabilities by using historical data. The success rate of previous projects at a gate is assigned to each new proposal. The problem with this approach is that each project receives the same estimate, so the variation of the signal is zero and the irresolution ratio is one.

One can see the catch-22. Expert estimates produce a large numerator in the irresolution ratio, and historical data produces a small denominator. Combining the two estimates fails as well, whether one combines them in a weighted average or by using Bayes' law. In both cases an estimate of success at downstream gates contains more noise than signal, so managers are unable to distinguish valuable projects from less valuable ones. Likely, decision trees like the one in Figure 1 have low resolution, increase false-positive and false-negative rates and raise drug development costs.


After reading my discussions, many managers wish to share their experiences, thoughts and critiques of my ideas. I always welcome and reply to their comments.

Please share your thoughts with me via my contact page. I will send a reply to you via email; if you prefer to be contacted by phone, fax or postal mail, please say so in your message.

