Evaluation Metrics

A metric learning reality check



  1. accuracy


Precision \ Recall \ ROC \ AUC

- Performance Measures:

A balanced confusion matrix is preferable to a degenerate one, where one row holds all the counts and the other only zeros, or one column holds all the counts and the other only zeros. An algorithm with lower classification accuracy but a better-balanced confusion matrix can therefore be the better choice.

Precision: the number of true positives divided by the total number of positive predictions.

Precision = True Positives / (True Positives + False Positives)

Low precision indicates many false positives.

Recall: the number of true positives divided by the number of positive class values in the test data.

Recall (sensitivity) = True Positives / (True Positives + False Negatives)

Low recall indicates many false negatives.

F1 Harmonic Mean Score

F1_Score = 2 * ((Precision * Recall) / (Precision + Recall))

F1 helps select a model based on a balance between precision and recall.
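A minimal sketch of the three formulas above, computed from raw binary labels. The helper names (`confusion_counts`, `precision_recall_f1`) and the toy labels are made up for illustration; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1; 0.0 is returned when a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * (precision * recall) / total if total else 0.0
    return precision, recall, f1


# Toy data (hypothetical): 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here 3 of the 5 positive predictions are correct (precision 0.6) and 3 of the 4 actual positives are found (recall 0.75).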

In a multi-class problem there are several ways to calculate F1; some are more appropriate for balanced data, others for imbalanced data.

  1. The best link yet - micro, macro, weighted (macro for balanced data, micro and weighted for imbalanced data)

  2. Micro-averaged F1 equals accuracy in a single-label multi-class problem
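To make the three averaging schemes concrete, here is a rough pure-Python sketch. The function names and the imbalanced toy labels are assumptions for illustration; it assumes every sample has exactly one label, which is the setting where micro-F1 coincides with accuracy.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """TP / FP / FN per class for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gets a false positive
            counts[t]["fn"] += 1  # true class gets a false negative
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multiclass_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)  # number of true samples per class
    per_class = {c: f1_from_counts(**k) for c, k in counts.items()}
    # Macro: unweighted mean, every class counts the same.
    macro = sum(per_class.values()) / len(per_class)
    # Weighted: mean weighted by class support.
    weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
    # Micro: pool the counts over all classes first.
    micro = f1_from_counts(
        sum(k["tp"] for k in counts.values()),
        sum(k["fp"] for k in counts.values()),
        sum(k["fn"] for k in counts.values()),
    )
    return {"macro": macro, "weighted": weighted, "micro": micro}


# Toy imbalanced data (hypothetical): 6 x "a", 3 x "b", 1 x "c".
y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]

scores = multiclass_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores, accuracy)
```

The rare class "c" is never predicted, so macro-F1 is pulled down sharply, while micro-F1 matches the plain accuracy of 0.7.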


  • Accuracy = (1 – Error) = (TP + TN)/(PP + NP) = Pr(C), the probability of a correct classification, where PP and NP are the numbers of actually positive and actually negative cases.

  • Sensitivity (recall) = TP/(TP + FN) = TP/PP = the ability of the test to detect disease in a population of diseased individuals.

  • Specificity = TN/(TN + FP) = TN / NP = the ability of the test to correctly rule out the disease in a disease-free population.

(What are they?) Sensitivity and specificity against ROC and AUC.

ROC curve and AUC in weka - explains what the curve should look like for the negative or positive predictions, compared with what is actually plotted.
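Sensitivity and specificity connect to the ROC curve directly: each threshold gives one point with TPR = sensitivity on the y-axis and FPR = 1 - specificity on the x-axis. A small sketch, assuming binary 0/1 labels with both classes present; the function names and scores are invented for illustration and this is not how weka computes it.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the threshold score by score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Handle tied scores together so they produce a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (hypothetical scores for the positive class).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```

The AUC here is 8/9: of the 9 (positive, negative) pairs, 8 have the positive scored higher than the negative.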

Mean F1? How do we calculate it?

Multiclass Precision / Recall, part 1

Precision at K, formulas, examples, git 1, git 2, git 3 (suggestive, recommendation application)
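As a quick illustration of precision at K for a recommendation list: the fraction of the top K recommended items that are relevant. The function name and item ids below are hypothetical, and dividing by `k` (rather than by the number of items actually returned) is an assumption; implementations differ on this.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical ranked recommendations and ground-truth relevant items.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}

p_at_1 = precision_at_k(recommended, relevant, 1)
p_at_3 = precision_at_k(recommended, relevant, 3)
p_at_5 = precision_at_k(recommended, relevant, 5)
print(p_at_1, p_at_3, p_at_5)
```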



  • One day, your girlfriend asks you: ‘Sweetie, do you remember all the birthday surprises from me?’

  • This simple question puts your life in danger. To extend your life, you need to recall all 10 surprise events from memory.

  • So, recall is the ratio of the number of events you can correctly recall to the number of all correct events. If you can recall all 10 events correctly, then your recall ratio is 1.0 (100%). If you can recall 7 events correctly, your recall ratio is 0.7 (70%).


  • For example, you give 15 answers: 10 events are correct and 5 are wrong. This means you can recall all the events, but not very precisely.

  • So, precision is the ratio of the number of events you can correctly recall to the number of all events you recall (a mix of correct and wrong recalls). In other words, it is how precise your recall is.

  • From the previous example (10 real events, 15 answers: 10 correct answers, 5 wrong answers), you get 100% recall but your precision is only 66.67% (10 / 15).

Confusion-matrix-wise: the bottom row is recall (% correct out of the positive cases), the right column is precision (% correct out of the POS predictions), and the diagonal gives % accuracy.

F1 score:

  • conveys the balance between the precision and the recall

  • 2*((precision*recall)/(precision+recall))

Yet another(pretty good) source

Another (bad) source for explaining precision, recall, accuracy, true positive rate, etc.

(How to use precision and recall?) Answer by Aurélien Géron:

  • In a binary classifier, the decision function is the function that produces a score for the positive class.

  • In a logistic regression classifier, that decision function is simply a linear combination of the input features.

  • If that score is greater than some threshold that you choose, then the classifier "predicts" the positive class, or else it predicts the negative class.

  • If you want your model to have high precision (at the cost of a low recall), then you must set the threshold pretty high. This way, the model will only predict the positive class when it is absolutely certain. For example, you may want this if the classifier is selecting videos that are safe for kids: it's better to err on the safe side.

  • Conversely, if you want high recall (at the cost of a low precision) then you must use a low threshold. For example, if the classifier is used to detect intruders in a nuclear plant, then you probably want to detect all actual intruders, even if it means getting a lot of false alarms (called "false positives").

  • If you make a few assumptions about the distribution of the data (i.e., the positive and negative class are separated by a linear boundary plus Gaussian noise), then computing the logistic of the score gives you the probability that the instance belongs to the positive class. A score of 0 corresponds to the 50% probability. So by default, a LogisticClassifier predicts the positive class if it estimates the probability to be greater than 50%. In general, this sounds like a reasonable default threshold, but really it all depends on what you want to do with the classifier.

  • If the assumptions I mentioned above were perfect, then if the Logistic Classifier outputs a probability of X% for an instance, it means there is exactly X% chance that it's positive. But in practice, the assumptions are imperfect, so I try to always make it clear that we are talking about an "estimated probability", not an actual probability.
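The threshold trade-off described in the answer above can be sketched in a few lines. Everything here is illustrative: the decision scores and labels are made up, `sigmoid`, `predict` and `precision_recall` are hypothetical helpers, and the outputs are estimated probabilities in the sense discussed above.

```python
import math


def sigmoid(score):
    """Logistic function: maps a decision score to an estimated probability."""
    return 1.0 / (1.0 + math.exp(-score))


def predict(scores, threshold=0.5):
    """Predict the positive class when the estimated probability exceeds the threshold."""
    return [1 if sigmoid(s) > threshold else 0 for s in scores]


def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # 0.0 when undefined (no positive predictions / no positive cases).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical decision-function scores and true labels.
scores = [-3.0, -1.5, -0.5, 0.2, 0.4, 1.0, 1.8, 3.0]
y_true = [0, 0, 1, 0, 1, 1, 1, 1]

results = {}
for threshold in (0.1, 0.5, 0.9):
    results[threshold] = precision_recall(y_true, predict(scores, threshold))
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.1 to 0.9 moves precision from about 0.71 up to 1.0 while recall falls from 1.0 to 0.2. On real data precision does not have to rise monotonically with the threshold, but the general trade-off is the same.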


(RMSE - what is it?) - It is important to remember that RMSE has the same unit as the dependent variable (DV). This means there is no absolute good or bad threshold; you can only define one relative to your DV. For data ranging from 0 to 1000, an RMSE of 0.7 is small, but if the range is 0 to 1, it is not that small anymore. All else being equal, though, the smaller the RMSE, the better.

(R^2 vs RMSE) - R-squared is conveniently scaled, typically between 0 and 1 (it can go negative when a model fits worse than simply predicting the mean), whereas RMSE is not scaled to any particular values. This can be good or bad; R-squared is easier to interpret, but with RMSE we explicitly know how much our predictions typically deviate from the actual values in the dataset, in the units of the DV. So in a way, RMSE tells you more.
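A small sketch of both measures, to show the scale dependence mentioned above. The function names and numbers are made up for illustration; R-squared is computed as 1 - SS_res / SS_tot and assumes the true values are not all identical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the dependent variable."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 390]

# The same data expressed in a unit that is 100 times larger.
y_true_scaled = [v / 100 for v in y_true]
y_pred_scaled = [v / 100 for v in y_pred]

print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
print(rmse(y_true_scaled, y_pred_scaled), r_squared(y_true_scaled, y_pred_scaled))
```

Rescaling the DV changes the RMSE from 10 to about 0.1, while R-squared stays at about 0.992 in both cases.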

I also found this video really helpful.

Kappa - measures accuracy while accounting for imbalanced datasets, by correcting the observed agreement for the agreement expected by chance
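A rough sketch of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (plain accuracy) and p_e is the agreement expected by chance from the label frequencies. The function name and the imbalanced toy data are assumptions for illustration; the degenerate case p_e = 1 is not handled.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that label and prediction match if independent.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical imbalanced data: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10

# Model A always predicts the majority class: 90% accuracy, no real skill.
pred_majority = [0] * 100

# Model B gets 88 negatives and 8 positives right: 96% accuracy.
pred_better = [0] * 88 + [1] * 2 + [1] * 8 + [0] * 2

kappa_majority = cohen_kappa(y_true, pred_majority)
kappa_better = cohen_kappa(y_true, pred_better)
print(kappa_majority, kappa_better)
```

The majority-class model scores 90% accuracy but a kappa of 0, which is the behaviour that makes kappa useful on imbalanced data.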


