A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier model (it can also be extended to multiclass classification) at varying threshold values.
The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting. The ROC can also be thought of as a plot of the statistical power as a function of the Type I error of the decision rule (when the performance is calculated from just a sample of the population, these quantities are estimated rather than known exactly).
A contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term contingency table was first used by Karl Pearson in 1904.
A true negative (TN) has occurred when both the prediction outcome and the actual value are n, and a false negative (FN) is when the prediction outcome is n while the actual value is p. To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but does not actually have the disease.
For a 2 × 2 contingency table, the simplest measure of association is the odds ratio. Given two events, A and B, the odds ratio is defined as the ratio of the odds of A in the presence of B to the odds of A in the absence of B, or equivalently (due to symmetry), the ratio of the odds of B in the presence of A to the odds of B in the absence of A. Two events are independent if and only if the odds ratio is 1.
To draw a ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed (as functions of some classifier parameter). The TPR defines how many correct positive results occur among all positive samples available during the test. The FPR, on the other hand, defines how many incorrect positive results occur among all negative samples available during the test. A ROC space is defined by FPR and TPR as x and y axes, respectively, depicting the relative trade-off between true positives (benefits) and false positives (costs).
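As a minimal sketch (the data and function name below are illustrative, not from the article), the (FPR, TPR) pairs that make up a ROC curve can be computed by sweeping a decision threshold over the classifier's scores:

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a decision threshold.

    scores: real-valued classifier outputs; labels: 1 = positive, 0 = negative.
    """
    p = sum(labels)              # number of positive samples
    n = len(labels) - p          # number of negative samples
    points = [(0.0, 0.0)]       # threshold above every score: nothing flagged
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n, tp / p))
    return points

# Toy example: three positives, one negative.
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 1])
```

Each lowering of the threshold can only move the point up or to the right, which is why the resulting curve is monotone.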
Mathematically, specificity can be written as: specificity = TN / (TN + FP). A positive result in a test with high specificity can be useful for "ruling in" disease, since the test rarely gives positive results in healthy patients. A test with 100% specificity will recognize all patients without the disease by testing negative, so a positive test result would definitively rule in the presence of the disease.
Consider a test used to screen people for a disease. Each person taking the test either has or does not have the disease. The test outcome can be positive (classifying the person as having the disease) or negative (classifying the person as not having the disease). The test results for each subject may or may not match the subject's actual status. In that setting, after obtaining the numbers of true positives, false positives, true negatives, and false negatives, the sensitivity and specificity of the test can be calculated.
In information retrieval, the positive predictive value is called precision, and sensitivity is called recall. Unlike the specificity vs. sensitivity tradeoff, these measures are both independent of the number of true negatives, which is generally unknown and much larger than the actual numbers of relevant and retrieved documents.
Factoring in prevalence, this hypothetical test has a high false positive rate, and it does not reliably identify colorectal cancer in the overall population of asymptomatic people (PPV = 10%). On the other hand, it demonstrates very accurate detection of cancer-free individuals (NPV ≈ 99.5%). Therefore, when used for routine colorectal cancer screening with asymptomatic adults, a negative result supplies important data for the patient and doctor.
A confirmatory test should have a high specificity. This is especially important when people who are identified as having a condition may be subjected to more testing, expense, stigma, anxiety, etc. The terms "sensitivity" and "specificity" were introduced by American biostatistician Jacob Yerushalmy in 1947. There are different definitions within laboratory quality control, wherein "analytical sensitivity" is defined as the smallest amount of substance in a sample that can accurately be measured by an assay.
A test with a higher specificity has a lower type I error rate. The graphical illustration described here is meant to show the relationship between sensitivity and specificity. The black, dotted line in the center of the graph is where the sensitivity and specificity are the same. As one moves to the left of the black dotted line, the sensitivity increases, reaching its maximum value of 100% at line A, while the specificity decreases.
(The integral boundaries are reversed because a large threshold $T$ corresponds to a lower value on the x-axis.) Here $X_1$ is the score for a positive instance, $X_0$ is the score for a negative instance, and $f_0$ and $f_1$ are the probability densities defined in the previous section. If $X_0$ and $X_1$ follow two Gaussian distributions, then $A=\Phi\left((\mu_1-\mu_0)/\sqrt{\sigma_1^2+\sigma_0^2}\right)$.
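Under the Gaussian assumption above, the closed form for $A$ can be evaluated with only the standard library (the function name is mine, not from the article):

```python
import math
from statistics import NormalDist

def binormal_auc(mu0, sigma0, mu1, sigma1):
    # A = Phi((mu1 - mu0) / sqrt(sigma1**2 + sigma0**2))
    # math.hypot(a, b) computes sqrt(a**2 + b**2).
    return NormalDist().cdf((mu1 - mu0) / math.hypot(sigma1, sigma0))
```

Identical score distributions give chance-level $A = 0.5$; widening the gap between the class means raises $A$ toward 1.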
C can be adjusted so that it reaches a maximum of 1.0 when there is complete association in a table of any number of rows and columns by dividing C by $\sqrt{\frac{k-1}{k}}$, where k is the number of rows or columns when the table is square, or by $\sqrt[4]{\frac{r-1}{r}\times\frac{c-1}{c}}$, where r is the number of rows and c is the number of columns.
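This adjustment can be sketched in code; note that the definition of C itself as $\sqrt{\chi^2/(N+\chi^2)}$ is assumed here from standard usage, since it is not stated in this excerpt:

```python
import math

def contingency_C(chi2, n):
    # Pearson's contingency coefficient (standard definition, assumed).
    return math.sqrt(chi2 / (n + chi2))

def adjusted_C(chi2, n, k):
    # Divide C by sqrt((k - 1) / k) so a square k x k table can reach 1.0.
    return contingency_C(chi2, n) / math.sqrt((k - 1) / k)
```

For complete association in a 2 × 2 table, χ² equals N, so C = √(1/2) ≈ 0.707 and the adjusted value is exactly 1.0, matching the 0.707 ceiling quoted later in the text.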
However, a negative result from a test with high specificity is not necessarily useful for "ruling out" disease. For example, a test that always returns a negative result will have a specificity of 100% because specificity does not consider false negatives. Such a test would return negative for patients with the disease, making it useless for "ruling out" the disease. A test with a higher specificity has a lower type I error rate.
For more on the use of a contingency table for the relation between two ordinal variables, see Goodman and Kruskal's gamma. The degree of association between the two variables can be assessed by a number of coefficients. The following subsections describe a few of them. For a more complete discussion of their uses, see the main articles linked under each subsection heading.
The score follows a probability density $f_1(x)$ if the instance actually belongs to class "positive", and $f_0(x)$ otherwise. Therefore, the true positive rate is given by $\mathrm{TPR}(T)=\int_T^\infty f_1(x)\,dx$.
The AUC is equal to the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one (assuming "positive" ranks higher than "negative"). In other words, when given one randomly selected positive instance and one randomly selected negative instance, the AUC is the probability that the classifier will be able to tell which one is which. This can be seen as follows: the area under the curve is given by $A=\int_{-\infty}^{\infty}\mathrm{TPR}(T)\,f_0(T)\,dT=P(X_1>X_0)$ (the integral boundaries are reversed because a large threshold $T$ corresponds to a lower value on the x-axis).
The closer a result from a contingency table is to the upper left corner, the better it predicts, but the distance from the random-guess line in either direction is the best indicator of how much predictive power a method has. If the result is below the line (i.e., the method is worse than a random guess), all of the method's predictions must be reversed in order to utilize its power, thereby moving the result above the line.
For a single prediction point, DeltaP′ = Informedness = 2·AUC − 1, whilst DeltaP = Markedness represents the dual (viz. predicting the prediction from the real class), and their geometric mean is the Matthews correlation coefficient. Whereas ROC AUC varies between 0 and 1, with an uninformative classifier yielding 0.5, the alternative measures known as Informedness, Certainty, and the Gini coefficient (in the single-parameterization case) assign 0 to chance-level performance.
Sometimes it is more useful to look at a specific region of the ROC curve rather than at the whole curve. It is possible to compute a partial AUC. For example, one could focus on the region of the curve with a low false positive rate, which is often of prime interest for population screening tests. Another common approach for classification problems in which P ≪ N (common in bioinformatics applications) is to use a logarithmic scale for the x-axis.
Consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive (p) or negative (n). There are four possible outcomes from a binary classifier. If the outcome from a prediction is p and the actual value is also p, then it is called a true positive (TP); however, if the actual value is n, then it is said to be a false positive (FP).
Suppose we observe individuals from a very large population as part of a study of sex differences in handedness. A contingency table can be created to display the numbers of individuals who are male right-handed and left-handed, and female right-handed and left-handed. Such a contingency table is shown below. The numbers of the males, females, and right- and left-handed individuals are called marginal totals.
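A contingency table like this can be tallied directly from raw observations; the sketch below uses invented data, not the article's counts:

```python
from collections import Counter

def crosstab(pairs):
    """Build a contingency table from (row_label, column_label) pairs."""
    table = {}
    for row, col in pairs:
        table.setdefault(row, Counter())[col] += 1
    return table

# Hypothetical observations of (sex, handedness).
observations = [("male", "right"), ("male", "left"), ("female", "right"),
                ("female", "right"), ("male", "right")]
table = crosstab(observations)
```

Marginal totals are then just row sums (`sum(table[r].values())`) and column sums across rows.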
The sensitivity at line A is 100% because at that point there are zero false negatives, meaning that all the negative test results are true negatives. When moving to the right, the opposite applies: the specificity increases until it reaches line B, where it becomes 100%, while the sensitivity decreases. The specificity at line B is 100% because the number of false positives is zero at that line, meaning all the positive test results are true positives.
The number of sick people is 37 + 8 = 45, which gives a sensitivity of 37 / 45 = 82.2%. There are 40 − 8 = 32 TN, so the specificity comes out to 32 / 35 = 91.4%. The red dot indicates the patient with the medical condition. The red background indicates the area where the test predicts the data point to be positive. The number of true positives in this figure is 6, and the number of false negatives is 0 (because all positive-condition cases are correctly predicted as positive).
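The arithmetic in these figure walk-throughs reduces to two ratios; a small sketch (the helper name is mine):

```python
def sens_spec(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, specificity

# High-sensitivity / low-specificity figure: TP=32, FN=3, TN=37, FP=8.
high_sens = sens_spec(32, 3, 37, 8)
# Low-sensitivity / high-specificity figure: TP=37, FN=8, TN=32, FP=3.
high_spec = sens_spec(37, 8, 32, 3)
```

The two figures simply swap the roles of the counts, which is why the percentages 91.4% and 82.2% trade places.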
Classification is a mapping of instances to certain classes/groups. Because the classifier or diagnosis result can be an arbitrary real value (continuous output), the classifier boundary between classes must be determined by a threshold value (for instance, to determine whether a person has hypertension based on a blood pressure measure); or it can be a discrete class label, indicating one of the classes.
Sensitivity is a measure of how well a test can identify true positives, and specificity is a measure of how well a test can identify true negatives. If the true status of the condition cannot be known, sensitivity and specificity can be defined relative to a "gold standard test" which is assumed correct. For all testing, both diagnosis and screening, there is usually a trade-off between sensitivity and specificity, such that higher sensitivities will mean lower specificities and vice versa. A test which reliably detects the presence of a condition will have a high sensitivity.
The (0,1) point is also called a perfect classification. A random guess would give a point along a diagonal line (the so-called line of no-discrimination) from the bottom left to the top right corners (regardless of the positive and negative base rates). An intuitive example of random guessing is a decision by flipping coins. As the size of the sample increases, a random classifier's ROC point tends towards the diagonal line.
The true-positive rate is also known as sensitivity or probability of detection. The false-positive rate is also known as the probability of false alarm and equals (1 − specificity). The ROC is also known as a relative operating characteristic curve, because it is a comparison of two operating characteristics (TPR and FPR) as the criterion changes. The ROC curve was first developed by electrical and radar engineers during World War II for detecting enemy objects in battlefields, starting in 1941, which led to its name ("receiver operating characteristic").
Here $\chi^2$ is computed as in Pearson's chi-squared test, and N is the grand total of observations. φ varies from 0 (no association between the variables) to 1 or −1 (complete association or complete inverse association), provided it is based on frequency data represented in 2 × 2 tables. Its sign equals the sign of the product of the main diagonal elements of the table minus the product of the off-diagonal elements.
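Both the odds ratio and the phi coefficient can be computed directly from the four cells of a 2 × 2 table; in the sketch below the cell layout [[a, b], [c, d]] is my own convention:

```python
import math

def odds_ratio(a, b, c, d):
    # Odds of the column event in row 1 divided by its odds in row 2.
    return (a * d) / (b * c)

def phi_coefficient(a, b, c, d):
    # For a 2 x 2 table this equals sign(ad - bc) * sqrt(chi2 / N).
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

An evenly filled table gives an odds ratio of 1 and φ of 0, i.e., independence; concentrating counts on the main diagonal pushes the odds ratio above 1 and φ toward +1.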
Because all positive-condition cases are correctly predicted as positive, the sensitivity is 100% (from 6 / (6 + 0)). This situation is also illustrated in the previous figure, where the dotted line is at position A (the left-hand side is predicted as negative by the model, the right-hand side is predicted as positive by the model). When the dotted line, the test cut-off line, is at position A, the test correctly predicts all positive cases.
An estimate of d′ can also be found from measurements of the hit rate and false-alarm rate. It is calculated as $d' = Z(\text{hit rate}) - Z(\text{false-alarm rate})$, where the function Z(p), p ∈ [0, 1], is the inverse of the cumulative Gaussian distribution. d′ is a dimensionless statistic; a higher d′ indicates that the signal can be more readily detected. The relationship between sensitivity, specificity, and similar terms can be understood using a contingency table.
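The d′ estimate from hit and false-alarm rates can be sketched with the standard library's inverse normal CDF (function name mine):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    z = NormalDist().inv_cdf   # inverse cumulative Gaussian, Z(p)
    return z(hit_rate) - z(false_alarm_rate)
```

Equal hit and false-alarm rates give d′ = 0 (no separation between signal and noise); a symmetric pair such as 0.84 and 0.16 gives roughly d′ ≈ 2, i.e., means about two noise standard deviations apart.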
Receiver operating characteristic - Misplaced Pages Continue
Since TPR is equivalent to sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs. (1 − specificity) plot. Each prediction result or instance of a confusion matrix represents one point in the ROC space. The best possible prediction method would yield a point in the upper left corner, coordinate (0,1) of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives).
It is evident that the x coordinate is 0.2 and the y coordinate is 0.3. However, these two values are insufficient to construct all entries of the underlying two-by-two contingency table. An alternative to the ROC curve is the detection error tradeoff (DET) graph, which plots the false negative rate (missed detections) vs. the false positive rate (false alarms) on non-linearly transformed x- and y-axes.
If the odds ratio is greater than 1, the events are positively associated; if the odds ratio is less than 1, the events are negatively associated. The odds ratio has a simple expression in terms of probabilities: given the joint probability distribution $p_{ij}$, the odds ratio is $OR = p_{11}p_{00}/(p_{10}p_{01})$. A simple measure, applicable only to the case of 2 × 2 contingency tables, is the phi coefficient (φ), defined by $\varphi = \sqrt{\chi^2/N}$.
C suffers from the disadvantage that it does not reach a maximum of 1.0 under complete association; notably, the highest it can reach in a 2 × 2 table is 0.707. It can reach values closer to 1.0 in contingency tables with more categories; for example, it can reach a maximum of 0.870 in a 4 × 4 table. It should, therefore, not be used to compare associations in different tables if they have different numbers of categories.
A sensitive test will have fewer Type II errors. Similarly to the domain of information retrieval, in the research area of gene prediction the number of true negatives (non-genes) in genomic sequences is generally unknown and much larger than the actual number of genes (true positives). The convenient and intuitively understood term specificity has therefore frequently been used in this research area with the mathematical formula for precision.
It is often claimed that a highly specific test is effective at ruling in a disease when positive, while a highly sensitive test is deemed effective at ruling out a disease when negative. This has led to the widely used mnemonics SPPIN and SNNOUT, according to which a highly specific test, when positive, rules in disease (SP-P-IN), and a highly sensitive test, when negative, rules out disease (SN-N-OUT). Both rules of thumb are, however, inferentially misleading.
This assumption of very large numbers of true negatives versus positives is rare in other applications. The F-score can be used as a single measure of performance of the test for the positive class. The F-score is the harmonic mean of precision and recall: F = 2 · precision · recall / (precision + recall). In the traditional language of statistical hypothesis testing, the sensitivity of a test is called the statistical power of the test, although the word power in that context has a more general usage that is not applicable in the present context.
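The harmonic-mean definition in code, using hypothetical counts:

```python
def f_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # identical to sensitivity
    return 2 * precision * recall / (precision + recall)
```

Note that true negatives do not appear anywhere in the formula, which is exactly the independence from TN described above.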
$G_1$ is referred to as the Gini index or Gini coefficient, but it should not be confused with the measure of statistical dispersion that is also called the Gini coefficient. $G_1$ is a special case of Somers' D. It is also common to calculate the Area Under the ROC Convex Hull (ROC AUCH = ROCH AUC), since any point on the line segment between two prediction results can be achieved by randomly using one or the other with appropriate probabilities.
In other words, the two variables are not independent. If there is no contingency, the two variables are said to be independent. The example above is the simplest kind of contingency table, one in which each variable has only two levels; this is called a 2 × 2 contingency table. In principle, any number of rows and columns may be used. There may also be more than two variables, but higher-order contingency tables are difficult to represent visually. The relation between ordinal variables, or between ordinal and categorical variables, may also be represented in contingency tables, although such a practice is rare.
Another issue is that reducing the ROC curve to a single number ignores the fact that the curve is about the tradeoffs between the different systems or performance points plotted, not the performance of an individual system, and it also ignores the possibility of concavity repair; related alternative measures such as Informedness or DeltaP are therefore recommended. These measures are essentially equivalent to the Gini coefficient for a single prediction point.
Another choice is the tetrachoric correlation coefficient, but it is only applicable to 2 × 2 tables. Polychoric correlation is an extension of the tetrachoric correlation to tables involving variables with more than two levels. Tetrachoric correlation assumes that the variable underlying each dichotomous measure is normally distributed. The coefficient provides "a convenient measure of [the Pearson product-moment] correlation when graduated measurements have been reduced to two categories."
The transformation function is the quantile function of the normal distribution, i.e., the inverse of the cumulative normal distribution. It is, in fact, the same transformation as zROC (below), except that the complement of the hit rate, the miss rate or false negative rate, is used. This alternative spends more graph area on the region of interest: most of the ROC area is of little interest, and one primarily cares about the region tight against the y-axis.
$\mathcal{D}^0$ is the set of negative examples, and $\mathcal{D}^1$ is the set of positive examples. In the context of credit scoring, a rescaled version of AUC is often used: $G_1 = 2\,\mathrm{AUC} - 1$.
The number of true negatives is then 26, and the number of false positives is 0. This results in 100% specificity (from 26 / (26 + 0)). Therefore, sensitivity or specificity alone cannot be used to measure the performance of the test. In medical diagnosis, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate).
The ROC curve is thus the sensitivity as a function of the false positive rate. Given that the probability distributions for both true positives and false positives are known, the ROC curve is obtained as the cumulative distribution function (CDF, the area under the probability distribution from $-\infty$ to the discrimination threshold) of the detection probability.
The tetrachoric correlation coefficient should not be confused with the Pearson correlation coefficient computed by assigning, say, values 0.0 and 1.0 to represent the two levels of each variable (which is mathematically equivalent to the φ coefficient). The lambda coefficient is a measure of the strength of association of the cross tabulations when the variables are measured at the nominal level. Values range from 0.0 (no association) to 1.0 (the maximum possible association). Asymmetric lambda measures the percentage improvement in predicting the dependent variable when the value of the independent variable is known.
For example, a test could agree with the gold standard four times out of four, but a single additional test against the gold standard that gave a poor result would imply a sensitivity of only 80%. A common way to handle this is to state the binomial proportion confidence interval, often calculated using a Wilson score interval. Confidence intervals for sensitivity and specificity can be calculated, giving the range of values within which the correct value lies at a given confidence level (e.g., 95%).
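A sketch of the Wilson score interval for an estimated sensitivity (the function name is mine; only the standard library is used):

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Four agreements out of five trials: point estimate 80%, but a wide interval.
low, high = wilson_interval(4, 5)
```

With only five trials the 95% interval spans roughly 0.38 to 0.96, which illustrates why the 80%-vs-100% swing described above is unsurprising.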
The area under the ROC curve is also called the c-statistic. The Total Operating Characteristic (TOC) also characterizes diagnostic ability while revealing more information than the ROC. For each threshold, the ROC reveals two ratios, TP/(TP + FN) and FP/(FP + TN); in other words, it reveals hits/(hits + misses) and false alarms/(false alarms + correct rejections).
In a DET plot, because the miss rate is used instead of its complement, the hit rate, this region of interest is the lower left corner. Furthermore, DET graphs have the useful property of linearity and a linear threshold behavior for normal distributions. The DET plot is used extensively in the automatic speaker recognition community, where the name DET was first used.
This CDF is plotted on the y-axis versus the CDF of the false positive probability on the x-axis. ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones, independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to the cost/benefit analysis of diagnostic decision making.
The AUC is closely related to the Mann–Whitney U statistic, which tests whether positives are ranked higher than negatives. For a predictor $f$, an unbiased estimator of its AUC is the Wilcoxon–Mann–Whitney statistic $\mathrm{AUC}(f)=\frac{1}{|\mathcal{D}^0|\,|\mathcal{D}^1|}\sum_{t_0\in\mathcal{D}^0}\sum_{t_1\in\mathcal{D}^1}\mathbf{1}[f(t_0)<f(t_1)]$, where $\mathbf{1}[f(t_0)<f(t_1)]$ denotes an indicator function that returns 1 if $f(t_0)<f(t_1)$ and 0 otherwise.
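This estimator can be sketched directly; ties are counted as half here, a common convention that I add as an assumption beyond the strict-inequality indicator above:

```python
def auc_estimate(f, negatives, positives):
    """Wilcoxon-Mann-Whitney estimate of AUC for scoring function f.

    Fraction of (negative, positive) pairs the model orders correctly;
    ties count as 0.5 (assumed convention).
    """
    total = len(negatives) * len(positives)
    wins = sum(1.0 if f(t1) > f(t0) else 0.5 if f(t1) == f(t0) else 0.0
               for t0 in negatives for t1 in positives)
    return wins / total

score = lambda x: x   # identity scorer over precomputed scores
```

A perfect ranker scores 1.0, a constant scorer 0.5, and a perfectly inverted ranker 0.0, matching the probabilistic interpretation of AUC.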
It is common to use the ROC AUC statistic for model comparison. This practice has been questioned, because AUC estimates are quite noisy and suffer from other problems. Nonetheless, the coherence of AUC as a measure of aggregated classification performance has been vindicated in terms of a uniform rate distribution, and AUC has been linked to a number of other performance metrics such as the Brier score.
The ROC is often reduced to a summary statistic. However, any attempt to summarize the ROC curve into a single number loses information about the pattern of tradeoffs of the particular discriminator algorithm. The area under the curve (often referred to simply as the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
The analysis of ROC performance in graphs with this warping of the axes was used by psychologists in perception studies halfway through the 20th century, where it was dubbed "double probability paper". If a standard score is applied to the ROC curve, the curve will be transformed into a straight line. This z-score is based on a normal distribution with a mean of zero and a standard deviation of one. In memory strength theory, one must assume that the zROC is not only linear, but has a slope of 1.0.
If one performed a binary classification, obtained a ROC AUC of 0.9, and decided to focus only on this metric, they might over-optimistically believe their binary test was excellent. However, on examining the precision and negative predictive value, they might discover that those values are low. The ROC AUC summarizes sensitivity and specificity, but says nothing about precision and negative predictive value.
Indeterminate test results can either be excluded from the analysis (the number of exclusions should be stated when quoting sensitivity) or can be treated as false negatives (which gives the worst-case value for sensitivity and may therefore underestimate it). A test with a higher sensitivity has a lower type II error rate. Consider the example of a medical test for diagnosing a disease: specificity refers to the test's ability to correctly reject healthy patients without the condition.
Moreover, that portion of the AUC corresponds to a very high or very low confusion-matrix threshold, which is rarely of interest to scientists performing a binary classification in any field. Another criticism of the ROC and its area under the curve is that they say nothing about precision and negative predictive value. A high ROC AUC, such as 0.9, might correspond to low values of precision and negative predictive value, such as 0.2 and 0.1, in the [0, 1] range.
The grand total (the total number of individuals represented in the contingency table) is the number in the bottom right corner. The table allows users to see at a glance that the proportion of men who are right-handed is about the same as the proportion of women who are right-handed, although the proportions are not identical. The strength of the association can be measured by the odds ratio; the population odds ratio is estimated by the sample odds ratio.
The number of sick people in the data set is equal to TP + FN, or 32 + 3 = 35. The sensitivity is therefore 32 / 35 = 91.4%. Using the same method, we get TN = 40 − 3 = 37, and the number of healthy people is 37 + 8 = 45, which results in a specificity of 37 / 45 = 82.2%. For the figure that shows low sensitivity and high specificity, there are 8 FN and 3 FP. Using the same method as for the previous figure, we get TP = 40 − 3 = 37.
The diagnostic power of any test is determined by the prevalence of the condition being tested, the test's sensitivity, and its specificity. The SNNOUT mnemonic has some validity when the prevalence of the condition in question is extremely low in the tested sample. The tradeoff between specificity and sensitivity is explored in ROC analysis as a trade-off between TPR and FPR (that is, recall and fallout). Giving them equal weight optimizes informedness = specificity + sensitivity − 1 = TPR − FPR.
In the case of a balanced coin, it will tend to the point (0.5, 0.5). The diagonal divides the ROC space: points above the diagonal represent good classification results (better than random), while points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor. Consider four prediction results from 100 positive and 100 negative instances.
The significance of the difference between the two proportions can be assessed with a variety of statistical tests, including Pearson's chi-squared test, the G-test, Fisher's exact test, Boschloo's test, and Barnard's test, provided the entries in the table represent individuals randomly sampled from the population about which conclusions are to be drawn. If the proportions of individuals in the different columns vary significantly between rows (or vice versa), it is said that there is a contingency between the two variables.
Specificity is the ability of the test to correctly identify those without the disease (true negative rate). If 100 patients known to have a disease are tested and 43 test positive, then the test has 43% sensitivity. If 100 with no disease are tested and 96 return a completely negative result, then the test has 96% specificity. Sensitivity and specificity are prevalence-independent test characteristics: their values are intrinsic to the test and do not depend on the disease prevalence in the population.
A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease. Consider an experiment with P positive instances and N negative instances of some condition. The four outcomes can be formulated in a 2 × 2 contingency table or confusion matrix, as follows. The contingency table can be used to derive several evaluation metrics (see infobox).
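Tallying the four outcomes from paired actual/predicted labels can be sketched as follows (names are illustrative):

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN from boolean actual/predicted labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1            # true positive
        elif not a and p:
            fp += 1            # false positive
        elif not a and not p:
            tn += 1            # true negative
        else:
            fn += 1            # false negative
    return tp, fp, tn, fn
```

Every downstream metric in this article (TPR, FPR, precision, NPV, F-score) is a ratio of these four counts.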
A test with 100% sensitivity will recognize all patients with the disease by testing positive. In this case, a negative test result would definitively rule out the presence of the disease in a patient. However, a positive result in a test with high sensitivity is not necessarily useful for "ruling in" disease. Suppose a "bogus" test kit is designed to always give a positive reading. When used on diseased patients, all patients test positive, giving the test 100% sensitivity.
Consider the example of a medical test for diagnosing a condition. Sensitivity (sometimes also called the detection rate in a clinical setting) refers to the test's ability to correctly detect ill patients out of those who do have the condition. Mathematically, this can be expressed as: sensitivity = TP / (TP + FN). A negative result in a test with high sensitivity can be useful for "ruling out" disease, since such a test rarely misdiagnoses those who do have the condition.
The false positive rate is given by $\mathrm{FPR}(T)=\int_T^\infty f_0(x)\,dx$. The ROC curve plots $\mathrm{TPR}(T)$ parametrically versus $\mathrm{FPR}(T)$ with $T$ as the varying parameter.
Varying the threshold (the vertical line in the figure) changes the true positive rate, which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have. Several studies criticize certain applications of the ROC curve and its area under the curve as measurements for assessing binary classifications when they do not capture the information relevant to the application.
Consider a group with P positive instances and N negative instances of some condition. The four outcomes can be formulated in a 2 × 2 contingency table or confusion matrix, along with derivations of several metrics using the four outcomes. Related calculations: this hypothetical screening test (fecal occult blood test) correctly identified two-thirds (66.7%) of patients with colorectal cancer. Unfortunately, factoring in prevalence rates reveals that this hypothetical test has a high false positive rate.
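The effect of prevalence on PPV and NPV can be sketched from sensitivity and specificity alone. The numbers below are plausible values consistent with the quoted PPV ≈ 10% and NPV ≈ 99.5%, not the article's exact table:

```python
def ppv_npv(sensitivity, specificity, prevalence):
    # Per-capita expected counts in each confusion-matrix cell.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    return ppv, npv

# Assumed figures: 66.7% sensitivity, 91% specificity,
# 1.5% disease prevalence among asymptomatic adults.
ppv, npv = ppv_npv(0.667, 0.91, 0.015)
```

With a rare condition, even a reasonably specific test yields a low PPV, because false positives from the large healthy majority swamp the true positives.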
Plots of the four results above in the ROC space are given in the figure. The result of method A clearly shows the best predictive power among A, B, and C. The result of B lies on the random-guess line (the diagonal), and it can be seen in the table that the accuracy of B is 50%. However, when C is mirrored across the center point (0.5, 0.5), the resulting method C′ is even better than A. This mirrored method simply reverses the predictions of whatever method produced contingency table C.
The main criticism of the ROC curve in these studies concerns the incorporation of areas with low sensitivity and low specificity (both lower than 0.5) in the calculation of the total area under the curve (AUC). According to the authors of these studies, that portion of the area under the curve corresponds to confusion matrices where binary predictions obtain bad results, and it should therefore not be included in the assessment of overall performance.
the level of sensitivity and specificity is the test cutoff point. As previously described, moving this line results in a trade-off between the level of sensitivity and specificity. The left-hand side of this line contains the data points that test below the cutoff point and are considered negative (the blue dots indicate the False Negatives (FN), the white dots True Negatives (TN)). The right-hand side of
the line shows the data points that test above the cutoff point and are considered positive (red dots indicate False Positives (FP)). Each side contains 40 data points. For the figure that shows high sensitivity and low specificity, there are 3 FN and 8 FP. Using the fact that positive results = true positives (TP) + FP, we get TP = positive results - FP, or TP = 40 - 8 = 32. The number of sick people in
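The arithmetic above can be checked mechanically; `positives`, `FP`, and `FN` below are the figures quoted in the text:

```python
# Checking the worked arithmetic: 40 points test positive, of which
# 8 are false positives; 3 false negatives sit on the other side.
positives, FP, FN = 40, 8, 3

TP = positives - FP    # positive results = TP + FP
sick = TP + FN         # everyone who actually has the condition

print(TP, sick)        # 32 35
```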
the magnitude of which gives the probability of an informed decision between the two classes (> 0 represents appropriate use of information, 0 represents chance-level performance, < 0 represents perverse use of information). The sensitivity index or d′ (pronounced "dee-prime") is a statistic used in signal detection theory. It provides the separation between the means of
the mathematical formula for precision and recall as defined in biostatistics. Specificity thus defined (as positive predictive value) and sensitivity (true positive rate) represent major parameters characterizing the accuracy of gene prediction algorithms. Conversely, the term specificity in the sense of true negative rate would have little, if any, application in the genome analysis research area. Contingency table In statistics,
the other hand, TOC shows the total information in the contingency table for each threshold. The TOC method reveals all of the information that the ROC method provides, plus additional important information that ROC does not reveal, i.e. the size of every entry in the contingency table for each threshold. TOC also provides the popular AUC of the ROC. These figures are the TOC and ROC curves using
the other system with probabilities proportional to the relative length of the opposite component of the segment. It is also possible to invert concavities – just as in the figure the worse solution can be reflected to become a better solution; concavities can be reflected in any line segment, but this more extreme form of fusion is much more likely to overfit the data. The machine learning community most often uses
the patient and doctor, such as ruling out cancer as the cause of gastrointestinal symptoms or reassuring patients worried about developing colorectal cancer. Sensitivity and specificity values alone may be highly misleading. The 'worst-case' sensitivity or specificity must be calculated in order to avoid reliance on experiments with few results. For example, a particular test may easily show 100% sensitivity if tested against
the percentage improvement in predicting the dependent variable. Symmetric lambda measures the percentage improvement when prediction is done in both directions. The uncertainty coefficient, or Theil's U, is another measure for variables at the nominal level. Its values range from −1.0 (100% negative association, or perfect inversion) to +1.0 (100% positive association, or perfect agreement). A value of 0.0 indicates
the population of interest. Positive and negative predictive values, but not sensitivity or specificity, are values influenced by the prevalence of disease in the population that is being tested. These concepts are illustrated graphically in this applet Bayesian clinical diagnostic model, which shows the positive and negative predictive values as a function of the prevalence, sensitivity and specificity. It
the population of the true positive class, but it will fail to correctly identify the data point from the true negative class. Similar to the previously explained figure, the red dot indicates the patient with the medical condition. However, in this case, the green background indicates that the test predicts that all patients are free of the medical condition. The number of data points that are true negative
the predictions of whatever method or test produced the C contingency table. Although the original C method has negative predictive power, simply reversing its decisions leads to a new predictive method C′ which has positive predictive power. When the C method predicts p or n, the C′ method would predict n or p, respectively. In this manner, the C′ test would perform the best. The closer
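The reflection through (0.5, 0.5) follows from the fact that flipping every prediction swaps TP with FN and FP with TN. A small check with hypothetical counts for a below-chance method:

```python
# Hypothetical counts for a "perverse" method C: reversing every
# prediction swaps TP<->FN and FP<->TN, which reflects the ROC point
# through the centre (0.5, 0.5).
TP, FN, FP, TN = 10, 30, 25, 15

tpr_c = TP / (TP + FN)           # 0.25
fpr_c = FP / (FP + TN)           # 0.625

# Method C': predict the opposite of C on every instance.
tpr_c_prime = FN / (FN + TP)     # C's misses become C''s hits
fpr_c_prime = TN / (TN + FP)

assert tpr_c_prime == 1 - tpr_c and fpr_c_prime == 1 - fpr_c
print(tpr_c_prime, fpr_c_prime)  # 0.75 0.375 — now above the diagonal
```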
the presence of a condition, resulting in a high number of true positives and low number of false negatives, will have a high sensitivity. This is especially important when the consequence of failing to treat the condition is serious and/or the treatment is very effective and has minimal side effects. A test which reliably excludes individuals who do not have the condition, resulting in a high number of true negatives and low number of false positives, will have
the product of the off-diagonal elements. φ takes on the minimum value −1.0 or the maximum value of +1.0 if and only if every marginal proportion is equal to 0.5 (and two diagonal cells are empty). Two alternatives are the contingency coefficient C, and Cramér's V. The formulae for the C and V coefficients are: k being the number of rows or the number of columns, whichever
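A minimal sketch with hypothetical counts; it relies on the identity χ² = nφ², which holds for 2×2 tables, so for this table Cramér's V comes out equal to |φ|:

```python
import math

# Hypothetical 2x2 counts. For a 2x2 table, chi-squared = n * phi^2.
a, b, c, d = 43, 9, 44, 4
n = a + b + c + d

# phi: (diagonal product - off-diagonal product) over the square
# root of the product of the four marginal totals.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

chi2 = n * phi ** 2
C = math.sqrt(chi2 / (chi2 + n))     # contingency coefficient
V = math.sqrt(chi2 / (n * (2 - 1)))  # Cramér's V with k = 2

print(f"phi={phi:.3f}  C={C:.3f}  V={V:.3f}")
```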
the random guess line. In binary classification, the class prediction for each instance is often made based on a continuous random variable X, which is a "score" computed for the instance (e.g. the estimated probability in logistic regression). Given a threshold parameter T, the instance is classified as "positive" if X > T, and "negative" otherwise. X follows
the same data and thresholds. Consider the point that corresponds to a threshold of 74. The TOC curve shows the number of hits, which is 3, and hence the number of misses, which is 7. Additionally, the TOC curve shows that the number of false alarms is 4 and the number of correct rejections is 16. At any given point in the ROC curve, it is possible to glean values for the ratios of false alarms/(false alarms + correct rejections) and hits/(hits + misses). For example, at threshold 74, it
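Reading those four entries back out of the table at threshold 74 and forming the two ratios:

```python
# The contingency-table entries at threshold 74, as given in the text.
hits, misses = 3, 7
false_alarms, correct_rejections = 4, 16

far = false_alarms / (false_alarms + correct_rejections)  # x on the ROC
hit_rate = hits / (hits + misses)                         # y on the ROC

print(far, hit_rate)   # 0.2 0.3
```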
the sensitivity and specificity for the test can be calculated. If it turns out that the sensitivity is high then any person who has the disease is likely to be classified as positive by the test. On the other hand, if the specificity is high, any person who does not have the disease is likely to be classified as negative by the test. An NIH web site has a discussion of how these ratios are calculated. Consider
the signal and the noise distributions, compared against the standard deviation of the noise distribution. For normally distributed signal and noise with means and standard deviations μ_S and σ_S, and μ_N and σ_N, respectively, d′
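A sketch of the computation as the text describes it, with hypothetical parameter values; note that the common equal-variance form of d′ further assumes σ_S = σ_N:

```python
# Hypothetical signal/noise parameters. Following the text, d' scales
# the separation of the means by the noise standard deviation; the
# common equal-variance form assumes sigma_S == sigma_N.
mu_S, sigma_S = 2.0, 1.0
mu_N, sigma_N = 1.0, 1.0

d_prime = (mu_S - mu_N) / sigma_N
print(d_prime)   # 1.0
```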
the single parameterization or single system case) all have the advantage that 0 represents chance performance whilst 1 represents perfect performance, and −1 represents the "perverse" case of full informedness always giving the wrong response. Bringing chance performance to 0 allows these alternative scales to be interpreted as Kappa statistics. Informedness has been shown to have desirable characteristics for Machine Learning versus other common definitions of Kappa such as Cohen's Kappa and Fleiss' Kappa. Sometimes it can be more useful to look at
the slope will be 1.0. If the standard deviation of the target strength distribution is larger than the standard deviation of the lure strength distribution, then the slope will be smaller than 1.0. In most studies, it has been found that zROC curve slopes consistently fall below 1, usually between 0.5 and 0.9. Many experiments yielded a zROC slope of 0.8. A slope of 0.8 implies that the variability of
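The claim can be checked arithmetically: under the standard signal-detection model, the zROC slope equals the ratio of the lure (noise) standard deviation to the target (signal) standard deviation, so:

```python
# Under the signal-detection model, zROC slope = sd_lure / sd_target,
# so a slope of 0.8 implies sd_target = 1/0.8 = 1.25 times sd_lure,
# i.e. 25% more variability in the target strength distribution.
slope = 0.8
sd_ratio = 1 / slope
print(f"target sd is {(sd_ratio - 1) * 100:.0f}% larger")
```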
the smallest amount of substance in a sample that can accurately be measured by an assay (synonymous with detection limit), and "analytical specificity" is defined as the ability of an assay to measure one particular organism or substance, rather than others. However, this article deals with diagnostic sensitivity and specificity as defined at top. Imagine a study evaluating a test that screens people for
the storage of the data can be done in a smarter way (see Lauritzen (2002)). In order to do this one can use concepts from information theory, which derives the information solely from the probability distribution; this distribution can be expressed easily from the contingency table via the relative frequencies. A pivot table is a way to create contingency tables using spreadsheet software. Suppose there are two variables, sex (male or female) and handedness (right- or left-handed). Further suppose that 100 individuals are randomly sampled from
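As a sketch of building such a table programmatically, the tally below uses the sex-by-handedness example with invented counts (summing to 100 individuals, as in the example):

```python
from collections import Counter

# Invented raw observations (100 individuals in total) tallied into a
# sex-by-handedness contingency table, pivot-table style.
observations = (
    [("male", "right")] * 43 + [("male", "left")] * 9 +
    [("female", "right")] * 44 + [("female", "left")] * 4
)
table = Counter(observations)

cols = ("right", "left")
print(f"{'':>8}" + "".join(f"{c:>7}" for c in cols))
for row in ("male", "female"):
    print(f"{row:>8}" + "".join(f"{table[(row, c)]:>7}" for c in cols))
```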
the target strength distribution is 25% larger than the variability of the lure strength distribution. True positive rate In medicine and statistics, sensitivity and specificity mathematically describe the accuracy of a test that reports the presence or absence of a medical condition. If individuals who have the condition are considered "positive" and those who do not are considered "negative", then sensitivity
the test 100% sensitivity. However, sensitivity does not take into account false positives. The bogus test also returns positive on all healthy patients, giving it a false positive rate of 100%, rendering it useless for detecting or "ruling in" the disease. The calculation of sensitivity does not take into account indeterminate test results. If a test cannot be repeated, indeterminate samples either should be excluded from
the varying parameter. For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (green vertical line in
the zROC is not only linear, but has a slope of 1.0. The normal distributions of targets (studied objects that the subjects need to recall) and lures (non-studied objects that the subjects attempt to recall) are the factor causing the zROC to be linear. The linearity of the zROC curve depends on the standard deviations of the target and lure strength distributions. If the standard deviations are equal,
was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I published in 1904. A crucial problem of multivariate statistics is finding the (direct-)dependence structure underlying the variables contained in high-dimensional contingency tables. If some of the conditional independences are revealed, then even
was soon introduced to psychology to account for the perceptual detection of stimuli. ROC analysis has been used in medicine, radiology, biometrics, forecasting of natural hazards, meteorology, model performance assessment, and other areas for many decades and is increasingly used in machine learning and data mining research. A classification model (classifier or diagnosis)