Performance Evaluation Metrics

Apr 8, 2021

Evaluating a machine learning model is as important as building it, if not more so. The only way to know that our model draws accurate conclusions is to evaluate its performance against a number of criteria and analyze the results.

This article aims to give readers a brief overview of the commonly used performance evaluation metrics. Before moving on to the metrics themselves, let’s look at the basic terms and definitions you’ll need to follow along.

A true positive is an outcome where the model correctly predicts the positive class. Example: A COVID-19 positive person is correctly predicted to be positive.

A true negative is an outcome where the model correctly predicts the negative class. Example: A COVID-19 negative person is correctly predicted to be negative.

A false positive is an outcome where the model incorrectly predicts the positive class. Example: A COVID-19 negative person is incorrectly predicted to be positive.

A false negative is an outcome where the model incorrectly predicts the negative class. Example: A COVID-19 positive person is incorrectly predicted to be negative.

Classification Accuracy

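Classification accuracy, or simply accuracy, is the ratio of the number of correct predictions to the total number of input samples.

In terms of the outcomes defined above, accuracy can be described as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.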

Classification accuracy can be misleading, especially with imbalanced classes, that is, when the classes contain very different numbers of samples.
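To see why accuracy can mislead, here is a minimal sketch. The counts are made up for illustration (95 negative samples, 5 positive), and the helper name accuracy is our own, not from any library.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
# A model that always predicts "negative" gets all 95 negatives right
# and misses all 5 positives.
always_negative = accuracy(tp=0, tn=95, fp=0, fn=5)
print(always_negative)  # 0.95
```

The score looks good even though this model never detects a single positive case.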

Confusion Matrix

A confusion matrix is a table that summarizes the performance of a classification algorithm by counting its true positives, true negatives, false positives, and false negatives. It helps overcome the limitations of classification accuracy: it not only indicates whether the model is appropriate, but also tells us where the model performs poorly so that we can work on those specific areas.
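As a rough sketch, the four cells of a binary confusion matrix can be counted directly from the true and predicted labels. The function name confusion_counts and the labels below are invented for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Made-up labels: 1 = COVID-19 positive, 0 = COVID-19 negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

Every metric in the rest of this article can be computed from these four counts.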

Error Rate

Error rate (ERR) is calculated as the number of incorrect predictions divided by the total number of samples in the dataset. Ideally, it should be 0. Its maximum value is 1.
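A small worked example, using hypothetical counts, shows the calculation. The helper name error_rate is our own.

```python
def error_rate(tp, tn, fp, fn):
    """Incorrect predictions (FP + FN) divided by all predictions."""
    total = tp + tn + fp + fn
    return (fp + fn) / total if total else 0.0

# Hypothetical counts: 2 wrong predictions out of 8 samples.
err = error_rate(tp=3, tn=3, fp=1, fn=1)
print(err)  # 0.25
```

Since every prediction is either correct or incorrect, the error rate is always 1 minus the accuracy.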

Recall (Sensitivity)

Recall is the fraction of actual positive events that the model predicts correctly.
In other words, recall tells us how often the prediction is correct when the actual value is positive. It is calculated as the number of correct positive predictions divided by the total number of actual positives, i.e., Recall = TP / (TP + FN). In an ideal scenario, it should be 1.
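A minimal sketch of the calculation, with made-up counts and our own helper name recall:

```python
def recall(tp, fn):
    """Correct positive predictions divided by all actual positives."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0

# Hypothetical: 60 people actually have COVID-19, the model catches 40.
covid_recall = recall(tp=40, fn=20)
print(round(covid_recall, 3))  # 0.667
```

Returning 0.0 when there are no actual positives is a convention chosen for this sketch; the ratio is undefined in that case.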

Precision

Precision is the fraction of predicted positive events that are actually positive.
In other words, it is the probability that a prediction is correct given that the model predicted the positive class, i.e., Precision = TP / (TP + FP). Ideally, its value should be 1.
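The same kind of sketch works for precision. The counts are hypothetical and the helper name precision is our own.

```python
def precision(tp, fp):
    """Correct positive predictions divided by all positive predictions."""
    predicted_positives = tp + fp
    return tp / predicted_positives if predicted_positives else 0.0

# Hypothetical: the model flags 50 people as positive, 40 of them correctly.
covid_precision = precision(tp=40, fp=10)
print(covid_precision)  # 0.8
```

Note that the denominator is the number of positive predictions, not the number of actual positives, which is what separates precision from recall.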

If we push precision as high as possible, i.e., almost equal to 1, the recall of our model typically decreases, because the model becomes more conservative about predicting the positive class and so produces more false negatives. For some machine learning models, we need precision and recall to be balanced. In such a scenario, we calculate another metric called the F1-Score.
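One way to see this trade-off is to vary the decision threshold of a model that outputs scores. The sketch below assumes such a model; the scores, labels, and the helper precision_recall_at are all made up for illustration.

```python
# Hypothetical model scores (higher = more likely positive) and true labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute both metrics."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.25, 0.55, 0.75):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.3f} recall={r:.3f}")
# threshold=0.25: precision=0.625 recall=1.000
# threshold=0.55: precision=0.800 recall=0.800
# threshold=0.75: precision=1.000 recall=0.600
```

On this toy data, raising the threshold removes false positives (precision rises) but also turns some true positives into false negatives (recall falls).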

F1-Score

It is the harmonic mean of precision and recall. It is sometimes described as a weighted average of precision and recall in which both receive equal weight.

F1-Score = 2 * (Recall * Precision) / (Recall + Precision)
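The formula above translates directly into code. In this sketch the helper name f1_score and the input values are our own choices for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (recall * precision) / (recall + precision)

# Balanced case: the F1-Score matches the shared value.
balanced = f1_score(0.8, 0.8)

# Lopsided case: perfect precision, very poor recall.
lopsided = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(round(balanced, 3))         # 0.8
print(round(lopsided, 3))         # 0.182
print(round(arithmetic_mean, 3))  # 0.55
```

Unlike a simple average, the harmonic mean stays low when either precision or recall is low, so a model cannot score well on F1 by maximizing only one of them.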

Fβ-Score

It is a generalization of the F1-Score that does not have to give equal weight to precision and recall. The parameter β lets us weight one more than the other: β > 1 gives more weight to recall, and β < 1 gives more weight to precision.
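The article does not spell out a formula here; the sketch below uses the standard definition, Fβ = (1 + β²) * Precision * Recall / (β² * Precision + Recall). The helper name f_beta and the input values are made up for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator

# Hypothetical model with good precision but weaker recall.
p, r = 0.8, 0.5
f1 = f_beta(p, r, beta=1)     # same as the F1-Score
f2 = f_beta(p, r, beta=2)     # recall counts more, so the weak recall hurts
f05 = f_beta(p, r, beta=0.5)  # precision counts more, so the score improves

print(round(f1, 3), round(f2, 3), round(f05, 3))  # 0.615 0.541 0.714
```

With β = 1 the expression reduces to the F1-Score, which is a quick sanity check on the implementation.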

Specificity

Specificity (SP) is calculated as the number of correct negative predictions divided by the total number of negatives. It is also called true negative rate (TNR). Ideally, specificity should be 1.
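A short sketch with hypothetical counts; the helper name specificity is our own.

```python
def specificity(tn, fp):
    """Correct negative predictions divided by all actual negatives."""
    actual_negatives = tn + fp
    return tn / actual_negatives if actual_negatives else 0.0

# Hypothetical: 40 people are actually COVID-19 negative,
# and the model correctly clears 30 of them.
covid_specificity = specificity(tn=30, fp=10)
print(covid_specificity)  # 0.75
```

Specificity plays the same role for the negative class that recall plays for the positive class.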

False Positive Rate

False positive rate (FPR) is calculated as the number of incorrect positive predictions divided by the total number of negatives. Its ideal value is 0.
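Because FPR and specificity share the same denominator (the total number of negatives), they always add up to 1. The sketch below checks this on made-up counts; the helper name false_positive_rate is our own.

```python
def false_positive_rate(fp, tn):
    """Incorrect positive predictions divided by all actual negatives."""
    actual_negatives = fp + tn
    return fp / actual_negatives if actual_negatives else 0.0

# Hypothetical: 40 actual negatives, 10 of them wrongly flagged as positive.
fpr = false_positive_rate(fp=10, tn=30)
true_negative_rate = 30 / (30 + 10)  # specificity for the same counts

print(fpr)                       # 0.25
print(fpr + true_negative_rate)  # 1.0
```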