In statistical analysis of binary classification and information retrieval systems, the F-score or F-measure is a measure of predictive performance. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.
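In code, these two ratios fall directly out of the confusion-matrix counts. A minimal sketch in Python (the function names are illustrative, not from any particular library):

```python
def precision(tp: int, fp: int) -> float:
    # True positives over all samples predicted positive.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # True positives over all samples that are actually positive.
    return tp / (tp + fn)

# Example: 8 true positives, 2 false positives, 4 false negatives.
print(precision(8, 2))  # 0.8
print(recall(8, 4))
```

With these counts the precision is 0.8 and the recall is about 0.67.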
The F1 score is the harmonic mean of the precision and recall. It thus symmetrically represents both precision and recall in one metric. The more generic F_β score applies additional weights, valuing one of precision or recall more than the other. The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and
In a right triangle with legs a and b and altitude h from the hypotenuse to the right angle, h is half the harmonic mean of a and b. Let t and s (t > s) be the sides of the two inscribed squares in a right triangle with hypotenuse c. Then s equals half the harmonic mean of c and t. Let a trapezoid have vertices A, B, C, and D in sequence and have parallel sides AB and CD. Let E be
A 2×2 contingency table, with rows corresponding to actual value – condition positive or condition negative – and columns corresponding to classification value – test outcome positive or test outcome negative. From tallies of the four basic outcomes, there are many approaches that can be used to measure the accuracy of a classifier or predictor. Different fields have different preferences. A common approach to evaluation
A balanced measurement of performance. Macro F1 is a macro-averaged F1 score aiming at a balanced performance measurement. To calculate macro F1, two different averaging formulas have been used: the F1 score of the class-wise arithmetic means of precision and recall, or the arithmetic mean of class-wise F1 scores; the latter exhibits more desirable properties. Micro F1 is the harmonic mean of micro precision (total true positives divided by total predicted positives) and micro recall (total true positives divided by total actual positives). Since in multi-class evaluation
A flow chart for determining which pair of indicators should be used when. Otherwise, there is no general rule for deciding. There is also no general agreement on how the pair of indicators should be used to decide on concrete questions, such as when to prefer one classifier over another. One can take ratios of a complementary pair of ratios, yielding four likelihood ratios (two column ratios of ratios, two row ratios of ratios). This
A form of dichotomization in which a continuous function is transformed into a binary variable. Tests whose results are of continuous values, such as most blood values, can artificially be made binary by defining a cutoff value, with test results being designated as positive or negative depending on whether the resultant value is higher or lower than the cutoff. However, such conversion causes
A loss of information, as the resultant binary classification does not tell how much above or below the cutoff a value is. As a result, when converting a continuous value that is close to the cutoff to a binary one, the resultant positive or negative predictive value is generally higher than the predictive value given directly from the continuous value. In such cases, the designation of the test as being either positive or negative gives
A positive real factor β, where β is chosen such that recall is considered β times as important as precision, is: F_β = (1 + β²) · precision · recall / (β² · precision + recall). In terms of Type I and Type II errors this becomes: F_β = (1 + β²) · TP / ((1 + β²) · TP + β² · FN + FP). Two commonly used values for β are 2, which weighs recall higher than precision, and 0.5, which weighs recall lower than precision. The F-measure
594-402: A range of cars one measure will produce the harmonic mean of the other – i.e., converting the mean value of fuel economy expressed in litres per 100 km to miles per gallon will produce the harmonic mean of the fuel economy expressed in miles per gallon. For calculating the average fuel consumption of a fleet of vehicles from the individual fuel consumptions, the harmonic mean should be used if
660-429: A roundtrip journey (see above). In any triangle , the radius of the incircle is one-third of the harmonic mean of the altitudes . For any point P on the minor arc BC of the circumcircle of an equilateral triangle ABC, with distances q and t from B and C respectively, and with the intersection of PA and BC being at a distance y from point P, we have that y is half the harmonic mean of q and t . In
726-414: A set of non-identical numbers is subjected to a mean-preserving spread — that is, two or more elements of the set are "spread apart" from each other while leaving the arithmetic mean unchanged — then the harmonic mean always decreases. For the special case of just two numbers, x 1 {\displaystyle x_{1}} and x 2 {\displaystyle x_{2}} ,
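For two numbers, the harmonic mean reduces to 2·x₁·x₂/(x₁ + x₂), and a mean-preserving spread indeed lowers it. A quick numerical check (a sketch, not library code):

```python
def harmonic_mean2(x1: float, x2: float) -> float:
    # Two-number special case: 2*x1*x2 / (x1 + x2).
    return 2 * x1 * x2 / (x1 + x2)

# Spreading 6, 4 to 8, 2 keeps the arithmetic mean at 5
# but lowers the harmonic mean.
print(harmonic_mean2(6, 4))  # 4.8
print(harmonic_mean2(8, 2))  # 3.2
```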
#1732772694555792-444: A speed y , then its average speed is the arithmetic mean of x and y , which in the above example is 40 km/h. Average speed for the entire journey = Total distance traveled / Sum of time for each segment = xt+yt / 2t = x+y / 2 The same principle applies to more than two segments: given a series of sub-trips at different speeds, if each sub-trip covers
858-416: A wall at height A and the other leaning against the opposite wall at height B , as shown. The ladders cross at a height of h above the alley floor. Then h is half the harmonic mean of A and B . This result still holds if the walls are slanted but still parallel and the "heights" A , B , and h are measured as distances from the floor along lines parallel to the walls. This can be proved easily using
924-463: A weighted arithmetic mean, high data points are given greater weights than low data points. The weighted harmonic mean, on the other hand, correctly weights each data point. The simple weighted arithmetic mean when applied to non-price normalized ratios such as the P/E is biased upwards and cannot be numerically justified, since it is based on equalized earnings; just as vehicles speeds cannot be averaged for
is H(x₁, x₂, …, xₙ) = n / (1/x₁ + 1/x₂ + ⋯ + 1/xₙ). It is the reciprocal of the arithmetic mean of the reciprocals, and vice versa: H(x₁, …, xₙ) = 1 / A(1/x₁, …, 1/xₙ), where the arithmetic mean is A(x₁, x₂, …, xₙ) = (1/n) Σᵢ xᵢ. The harmonic mean
is a Schur-concave function, and is greater than or equal to the minimum of its arguments: for positive arguments, min(x₁, …, xₙ) ≤ H(x₁, …, xₙ) ≤ n · min(x₁, …, xₙ). Thus,
is also used in machine learning. However, the F-measures do not take true negatives into account, hence measures such as the Matthews correlation coefficient, informedness, or Cohen's kappa may be preferred to assess the performance of a binary classifier. The F-score has been widely used in the natural language processing literature, such as in the evaluation of named entity recognition and word segmentation. The F1 score
is equal to 2.4 hours, to drain the pool together. This is one-half of the harmonic mean of 6 and 4: 2·6·4/(6 + 4) = 4.8. That is, the appropriate average for the two types of pump is the harmonic mean, and with one pair of pumps (two pumps), it takes half this harmonic mean time, while with two pairs of pumps (four pumps) it would take a quarter of this harmonic mean time. In hydrology,
is equivalent to two thin lenses of focal length f_hm, their harmonic mean, in series. Expressed as optical power, two thin lenses of optical powers P₁ and P₂ in series are equivalent to two thin lenses of optical power P_am, their arithmetic mean, in series. The weighted harmonic mean is the preferable method for averaging multiples, such as the price–earnings ratio (P/E). If these ratios are averaged using
is found, invert it so as to find the "true" average trip speed. For each trip segment i, the slowness sᵢ = 1/speedᵢ. Then take the weighted arithmetic mean of the sᵢ's weighted by their respective distances (optionally with the weights normalized so they sum to 1 by dividing them by trip length). This gives the true average slowness (in time per kilometre). It turns out that this procedure, which can be done with no knowledge of
is met by the P4 metric definition, which is sometimes indicated as a symmetrical extension of F1. While the F-measure is the harmonic mean of recall and precision, the Fowlkes–Mallows index is their geometric mean. The F-score is also used for evaluating classification problems with more than two classes (multiclass classification). A common method is to average the F-score over each class, aiming at
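The class-averaged F-score described above can be sketched as follows, assuming per-class (TP, FP, FN) counts; the two macro-averaging conventions generally give different values (function names are illustrative):

```python
def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall (0 if both are 0).
    return 2 * p * r / (p + r) if (p + r) else 0.0

def macro_f1_of_means(stats):
    # F1 of the class-wise arithmetic means of precision and recall.
    ps = [tp / (tp + fp) for tp, fp, fn in stats]
    rs = [tp / (tp + fn) for tp, fp, fn in stats]
    return f1(sum(ps) / len(ps), sum(rs) / len(rs))

def mean_of_class_f1(stats):
    # Arithmetic mean of class-wise F1 scores.
    return sum(f1(tp / (tp + fp), tp / (tp + fn))
               for tp, fp, fn in stats) / len(stats)

stats = [(8, 2, 4), (3, 3, 1)]  # (TP, FP, FN) per class
print(macro_f1_of_means(stats))
print(mean_of_class_f1(stats))
```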
is primarily done for the column (condition) ratios, yielding likelihood ratios in diagnostic testing. Taking the ratio of one of these groups of ratios yields a final ratio, the diagnostic odds ratio (DOR). This can also be defined directly as (TP×TN)/(FP×FN) = (TP/FN)/(FP/TN); this has a useful interpretation – as an odds ratio – and is prevalence-independent. There are a number of other metrics, most simply
is problematic. One way to address this issue (see, e.g., Siblini et al., 2020) is to use a standard class ratio r₀ when making such comparisons. The F-score is often used in the field of information retrieval for measuring search, document classification, and query classification performance. It is particularly relevant in applications which are primarily concerned with
is related to the field of binary classification where recall is often termed "sensitivity". The precision-recall curve, and thus the F_β score, explicitly depends on the ratio r of positive to negative test cases. This means that comparison of the F-score across different problems with differing class ratios
is related to the other Pythagorean means, as seen in the equation below. This can be seen by interpreting the denominator to be the arithmetic mean of the product of numbers n times but each time omitting the j-th term. That is, for the first term, we multiply all n numbers except the first; for the second, we multiply all n numbers except the second; and so on. The numerator, excluding the n, which goes with
is the Dice coefficient of the set of retrieved items and the set of relevant items. David Hand and others criticize the widespread use of the F1 score since it gives equal importance to precision and recall. In practice, different types of misclassifications incur different costs. In other words, the relative importance of precision and recall is an aspect of the problem. According to Davide Chicco and Giuseppe Jurman,
is to begin by computing two ratios of a standard pattern. There are eight basic ratios of this form that one can compute from the contingency table, which come in four complementary pairs (each pair summing to 1). These are obtained by dividing each of the four numbers by the sum of its row or column, yielding eight numbers, which can be referred to generically in the form "true positive row ratio" or "false negative column ratio". There are thus two pairs of column ratios and two pairs of row ratios, and one can summarize these with four numbers by choosing one ratio from each pair –
is used to categorize new probabilistic observations into said categories. When there are only two categories, the problem is known as statistical binary classification. Some of the methods commonly used for binary classification are: Each classifier is best in only a select domain based upon the number of observations, the dimensionality of the feature vector, the noise in the data, and many other factors. For example, random forests perform better than SVM classifiers for 3D point clouds. Binary classification may be
1914-426: The Matthews correlation coefficient . Other metrics include Youden's J statistic , the uncertainty coefficient , the phi coefficient , and Cohen's kappa . Statistical classification is a problem studied in machine learning in which the classification is performed on the basis of a classification rule . It is a type of supervised learning , a method of machine learning where the categories are predefined, and
1980-505: The accuracy or Fraction Correct (FC), which measures the fraction of all instances that are correctly categorized; the complement is the Fraction Incorrect (FiC). The F-score combines precision and recall into one number via a choice of weighing, most simply equal weighing, as the balanced F-score ( F1 score ). Some metrics come from regression coefficients : the markedness and the informedness , and their geometric mean ,
2046-840: The arithmetic mean is always the greatest of the three and the geometric mean is always in between. (If all values in a nonempty data set are equal, the three means are always equal.) It is the special case M −1 of the power mean : H ( x 1 , x 2 , … , x n ) = M − 1 ( x 1 , x 2 , … , x n ) = n x 1 − 1 + x 2 − 1 + ⋯ + x n − 1 {\displaystyle H\left(x_{1},x_{2},\ldots ,x_{n}\right)=M_{-1}\left(x_{1},x_{2},\ldots ,x_{n}\right)={\frac {n}{x_{1}^{-1}+x_{2}^{-1}+\cdots +x_{n}^{-1}}}} Since
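The harmonic mean and its power-mean form M₋₁ agree, as a quick check confirms (function names are illustrative, not library code):

```python
def harmonic_mean(xs):
    # Reciprocal of the arithmetic mean of the reciprocals.
    return len(xs) / sum(1 / x for x in xs)

def power_mean(xs, p):
    # Generalized power mean M_p for nonzero p.
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

print(harmonic_mean([1, 4, 4]))   # 2.0
print(power_mean([1, 4, 4], -1))  # 2.0
```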
2112-441: The arithmetic mean of the reciprocals of the numbers, that is, the generalized f-mean with f ( x ) = 1 x {\displaystyle f(x)={\frac {1}{x}}} . For example, the harmonic mean of 1, 4, and 4 is The harmonic mean H of the positive real numbers x 1 , x 2 , … , x n {\displaystyle x_{1},x_{2},\ldots ,x_{n}}
2178-414: The gene pool limiting the genetic variation present in the population for many generations to come. When considering fuel economy in automobiles two measures are commonly used – miles per gallon (mpg), and litres per 100 km. As the dimensions of these quantities are the inverse of each other (one is distance per volume, the other volume per distance) when taking the mean value of the fuel economy of
2244-595: The inequality of arithmetic and geometric means , this shows for the n = 2 case that H ≤ G (a property that in fact holds for all n ). It also follows that G = A H {\displaystyle G={\sqrt {AH}}} , meaning the two numbers' geometric mean equals the geometric mean of their arithmetic and harmonic means. For the special case of three numbers, x 1 {\displaystyle x_{1}} , x 2 {\displaystyle x_{2}} and x 3 {\displaystyle x_{3}} ,
2310-481: The recall (true positives per real positive) is often used as an aggregated performance score for the evaluation of algorithms and systems: the F-score (or F-measure). This is used in information retrieval because only the positive class is of relevance , while number of negatives, in general, is large and unknown. It is thus a trade-off as to whether the correct positive predictions should be measured in relation to
2376-458: The weighted harmonic mean is defined by The unweighted harmonic mean can be regarded as the special case where all of the weights are equal. The prime number theorem states that the number of primes less than or equal to n {\displaystyle n} is asymptotically equal to the harmonic mean of the first n {\displaystyle n} natural numbers . In many situations involving rates and ratios ,
2442-466: The F 1 score is less truthful and informative than the Matthews correlation coefficient (MCC) in binary evaluation classification. David M W Powers has pointed out that F 1 ignores the True Negatives and thus is misleading for unbalanced classes, while kappa and correlation measures are symmetric and assess both directions of predictability - the classifier predicting the true class and
2508-636: The alloy (exclusive of typically minor volume changes due to atom packing effects) is the weighted harmonic mean of the individual densities, weighted by mass, rather than the weighted arithmetic mean as one might at first expect. To use the weighted arithmetic mean, the densities would have to be weighted by volume. Applying dimensional analysis to the problem while labeling the mass units by element and making sure that only like element-masses cancel makes this clear. If one connects two electrical resistors in parallel, one having resistance x (e.g., 60 Ω ) and one having resistance y (e.g., 40 Ω), then
2574-407: The appearance of an inappropriately high certainty, while the value is in fact in an interval of uncertainty. For example, with the urine concentration of hCG as a continuous value, a urine pregnancy test that measured 52 mIU/ml of hCG may show as "positive" with 50 mIU/ml as cutoff, but is in fact in an interval of uncertainty, which may be apparent only by knowing the original continuous value. On
2640-438: The area formula of a trapezoid and area addition formula. In an ellipse , the semi-latus rectum (the distance from a focus to the ellipse along a line parallel to the minor axis) is the harmonic mean of the maximum and minimum distances of the ellipse from a focus. In computer science , specifically information retrieval and machine learning , the harmonic mean of the precision (true positives per predicted positive) and
2706-1604: The arithmetic mean, is the geometric mean to the power n . Thus the n -th harmonic mean is related to the n -th geometric and arithmetic means. The general formula is H ( x 1 , … , x n ) = ( G ( x 1 , … , x n ) ) n A ( x 2 x 3 ⋯ x n , x 1 x 3 ⋯ x n , … , x 1 x 2 ⋯ x n − 1 ) = ( G ( x 1 , … , x n ) ) n A ( 1 x 1 ∏ i = 1 n x i , 1 x 2 ∏ i = 1 n x i , … , 1 x n ∏ i = 1 n x i ) . {\displaystyle H\left(x_{1},\ldots ,x_{n}\right)={\frac {\left(G\left(x_{1},\ldots ,x_{n}\right)\right)^{n}}{A\left(x_{2}x_{3}\cdots x_{n},x_{1}x_{3}\cdots x_{n},\ldots ,x_{1}x_{2}\cdots x_{n-1}\right)}}={\frac {\left(G\left(x_{1},\ldots ,x_{n}\right)\right)^{n}}{A\left({\frac {1}{x_{1}}}{\prod \limits _{i=1}^{n}x_{i}},{\frac {1}{x_{2}}}{\prod \limits _{i=1}^{n}x_{i}},\ldots ,{\frac {1}{x_{n}}}{\prod \limits _{i=1}^{n}x_{i}}\right)}}.} If
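The general formula relating H to Gⁿ and the arithmetic mean of the leave-one-out products can be verified numerically, as in this sketch (assumes positive inputs; names are illustrative):

```python
import math

def harmonic_mean(xs):
    # n over the sum of reciprocals.
    return len(xs) / sum(1 / x for x in xs)

def identity_rhs(xs):
    # G(x)^n divided by the arithmetic mean of the n products
    # that each omit one x_j.
    n = len(xs)
    g_pow_n = math.prod(xs)           # G^n equals the plain product
    omit = [g_pow_n / x for x in xs]  # product omitting x_j
    return g_pow_n / (sum(omit) / n)

xs = [2.0, 3.0, 6.0]
print(harmonic_mean(xs))  # 3.0
print(identity_rhs(xs))   # 3.0
```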
2772-429: The duration of that portion, while for the harmonic mean, the corresponding weight is the distance. In both cases, the resulting formula reduces to dividing the total distance by the total time.) However, one may avoid the use of the harmonic mean for the case of "weighting by distance". Pose the problem as finding "slowness" of the trip where "slowness" (in hours per kilometre) is the inverse of speed. When trip slowness
2838-489: The effect is the same as if one had used two resistors with the same resistance, both equal to the harmonic mean of x and y (48 Ω): the equivalent resistance, in either case, is 24 Ω (one-half of the harmonic mean). This same principle applies to capacitors in series or to inductors in parallel. However, if one connects the resistors in series, then the average resistance is the arithmetic mean of x and y (50 Ω), with total resistance equal to twice this,
2904-410: The entire journey = Total distance traveled / Sum of time for each segment = 2 d / d / x + d / y = 2 / 1 / x + 1 / y However, if the vehicle travels for a certain amount of time at a speed x and then the same amount of time at
2970-508: The fleet uses miles per gallon, whereas the arithmetic mean should be used if the fleet uses litres per 100 km. In the USA the CAFE standards (the federal automobile fuel consumption standards) make use of the harmonic mean. In chemistry and nuclear physics the average mass per particle of a mixture consisting of different species (e.g., molecules or isotopes) is given by the harmonic mean of
3036-512: The harmonic mean can be written as: In this special case, the harmonic mean is related to the arithmetic mean A = x 1 + x 2 2 {\displaystyle A={\frac {x_{1}+x_{2}}{2}}} and the geometric mean G = x 1 x 2 , {\displaystyle G={\sqrt {x_{1}x_{2}}},} by Since G A ≤ 1 {\displaystyle {\tfrac {G}{A}}\leq 1} by
3102-538: The harmonic mean can be written as: Three positive numbers H , G , and A are respectively the harmonic, geometric, and arithmetic means of three positive numbers if and only if the following inequality holds If a set of weights w 1 {\displaystyle w_{1}} , ..., w n {\displaystyle w_{n}} is associated to the data set x 1 {\displaystyle x_{1}} , ..., x n {\displaystyle x_{n}} ,
3168-408: The harmonic mean cannot be made arbitrarily large by changing some values to bigger ones (while having at least one value unchanged). The harmonic mean is also concave for positive arguments, an even stronger property than Schur-concavity. For all positive data sets containing at least one pair of nonequal values , the harmonic mean is always the least of the three Pythagorean means, while
3234-489: The harmonic mean is similarly used to average hydraulic conductivity values for a flow that is perpendicular to layers (e.g., geologic or soil) - flow parallel to layers uses the arithmetic mean. This apparent difference in averaging is explained by the fact that hydrology uses conductivity, which is the inverse of resistivity. In sabermetrics , a baseball player's Power–speed number is the harmonic mean of their home run and stolen base totals. In population genetics ,
3300-411: The harmonic mean is used when calculating the effects of fluctuations in the census population size on the effective population size. The harmonic mean takes into account the fact that events such as population bottleneck increase the rate genetic drift and reduce the amount of genetic variation in the population. This is a result of the fact that following a bottleneck very few individuals contribute to
3366-421: The harmonic mean of a list of numbers tends strongly toward the least elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones. The arithmetic mean is often mistakenly used in places calling for the harmonic mean. In the speed example below for instance, the arithmetic mean of 40 is incorrect, and too big. The harmonic mean
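The round-trip claim is easy to confirm by computing total distance over total time and comparing it with the two-number harmonic mean (an illustrative sketch, not library code):

```python
def round_trip_average_speed(d: float, x: float, y: float) -> float:
    # Total distance over total time for an out-and-back trip.
    total_time = d / x + d / y
    return 2 * d / total_time

def harmonic_mean2(x: float, y: float) -> float:
    return 2 * x * y / (x + y)

print(round_trip_average_speed(60, 60, 20))  # 30.0
print(harmonic_mean2(60, 20))                # 30.0
```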
#17327726945553432-462: The harmonic mean provides the correct average . For instance, if a vehicle travels a certain distance d outbound at a speed x (e.g. 60 km/h) and returns the same distance at a speed y (e.g. 20 km/h), then its average speed is the harmonic mean of x and y (30 km/h), not the arithmetic mean (40 km/h). The total travel time is the same as if it had traveled the whole distance at that average speed. This can be proven as follows: Average speed for
3498-402: The harmonic mean, amounts to the same mathematical operations as one would use in solving this problem by using the harmonic mean. Thus it illustrates why the harmonic mean works in this case. Similarly, if one wishes to estimate the density of an alloy given the densities of its constituent elements and their mass fractions (or, equivalently, percentages by mass), then the predicted density of
3564-408: The individual species' masses weighted by their respective mass fraction. Binary classification Binary classification is the task of classifying the elements of a set into one of two groups (each called class ). Typical binary classification problems include: When measuring the accuracy of a binary classifier, the simplest way is to count the errors. But in the real world often one of
3630-406: The intersection of the diagonals , and let F be on side DA and G be on side BC such that FEG is parallel to AB and CD. Then FG is the harmonic mean of AB and DC. (This is provable using similar triangles.) One application of this trapezoid result is in the crossed ladders problem , where two ladders lie oppositely across an alley, each with feet at the base of one sidewall, with one leaning against
3696-632: The lowest possible value is 0, if precision and recall are zero. The name F-measure is believed to be named after a different F function in Van Rijsbergen's book, when introduced to the Fourth Message Understanding Conference (MUC-4, 1992). The traditional F-measure or balanced F-score ( F 1 score ) is the harmonic mean of precision and recall: A more general F score, F β {\displaystyle F_{\beta }} , that uses
3762-476: The number of predicted positives or the number of real positives, so it is measured versus a putative number of positives that is an arithmetic mean of the two possible denominators. A consequence arises from basic algebra in problems where people or systems work together. As an example, if a gas-powered pump can drain a pool in 4 hours and a battery-powered pump can drain the same pool in 6 hours, then it will take both pumps 6·4 / 6 + 4 , which
3828-486: The other four numbers are the complements. The row ratios are: The column ratios are: In diagnostic testing, the main ratios used are the true column ratios – true positive rate and true negative rate – where they are known as sensitivity and specificity . In informational retrieval, the main ratios are the true positive ratios (row and column) – positive predictive value and true positive rate – where they are known as precision and recall . Cullerne Bown has suggested
3894-468: The overall amount of false positives equals the amount of false negatives, micro F1 is equivalent to Accuracy . Harmonic mean In mathematics , the harmonic mean is a kind of average , one of the Pythagorean means . It is the most appropriate average for ratios and rates such as speeds, and is normally only used for positive arguments. The harmonic mean is the reciprocal of
3960-405: The positive class and where the positive class is rare relative to the negative class. Earlier works focused primarily on the F 1 score, but with the proliferation of large scale search engines, performance goals changed to place more emphasis on either precision or recall and so F β {\displaystyle F_{\beta }} is seen in wide application. The F-score
4026-400: The same distance , then the average speed is the harmonic mean of all the sub-trip speeds; and if each sub-trip takes the same amount of time , then the average speed is the arithmetic mean of all the sub-trip speeds. (If neither is the case, then a weighted harmonic mean or weighted arithmetic mean is needed. For the arithmetic mean, the speed of each portion of the trip is weighted by
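For sub-trips of unequal distance, the distance-weighted harmonic mean of the speeds gives the true average speed, since it is exactly total distance over total time. A sketch (illustrative names, not library code):

```python
def weighted_harmonic_mean(values, weights):
    # sum(w) / sum(w / v); with distances as weights this is
    # total distance divided by total time.
    return sum(weights) / sum(w / v for v, w in zip(values, weights))

# 90 km at 60 km/h (1.5 h) then 10 km at 20 km/h (0.5 h):
speeds = [60.0, 20.0]
distances = [90.0, 10.0]
print(weighted_harmonic_mean(speeds, distances))  # 50.0
```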
#17327726945554092-422: The sum of x and y (100 Ω). This principle applies to capacitors in parallel or to inductors in series. As with the previous example, the same principle applies when more than two resistors, capacitors or inductors are connected, provided that all are in parallel or all are in series. The "conductivity effective mass" of a semiconductor is also defined as the harmonic mean of the effective masses along
4158-409: The three crystallographic directions. As for other optic equations , the thin lens equation 1 / f = 1 / u + 1 / v can be rewritten such that the focal length f is one-half of the harmonic mean of the distances of the subject u and object v from the lens. Two thin lenses of focal length f 1 and f 2 in series
4224-403: The true class predicting the classifier prediction, proposing separate multiclass measures Informedness and Markedness for the two directions, noting that their geometric mean is correlation. Another source of critique of F 1 is its lack of symmetry. It means it may change its value when dataset labeling is changed - the "positive" samples are named "negative" and vice versa. This criticism
4290-686: The two classes is more important, so that the number of both of the different types of errors is of interest. For example, in medical testing, detecting a disease when it is not present (a false positive ) is considered differently from not detecting a disease when it is present (a false negative ). Given a classification of a specific data set, there are four basic combinations of actual data category and assigned category: true positives TP (correct positive assignments), true negatives TN (correct negative assignments), false positives FP (incorrect positive assignments), and false negatives FN (incorrect negative assignments). These can be arranged into
4356-627: Was derived so that F β {\displaystyle F_{\beta }} "measures the effectiveness of retrieval with respect to a user who attaches β {\displaystyle \beta } times as much importance to recall as precision". It is based on Van Rijsbergen 's effectiveness measure Their relationship is F β = 1 − E {\displaystyle F_{\beta }=1-E} where α = 1 1 + β 2 {\displaystyle \alpha ={\frac {1}{1+\beta ^{2}}}} . This
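The relationship F_β = 1 − E with α = 1/(1 + β²) can be checked numerically against the usual form of F_β (a sketch; function names are illustrative):

```python
def f_beta(p: float, r: float, beta: float) -> float:
    # F_beta = (1 + beta^2) * p * r / (beta^2 * p + r).
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def e_measure(p: float, r: float, beta: float) -> float:
    # Van Rijsbergen's effectiveness: E = 1 - 1/(alpha/p + (1-alpha)/r),
    # with alpha = 1/(1 + beta^2).
    alpha = 1 / (1 + beta * beta)
    return 1 - 1 / (alpha / p + (1 - alpha) / r)

p, r, beta = 0.8, 0.6, 2.0
print(f_beta(p, r, beta))
print(1 - e_measure(p, r, beta))  # same value
```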