10 essential ways to evaluate Machine learning model performance

The goal of a machine learning model is to learn patterns that generalize well to unseen data instead of just memorizing the data it was trained on. When your model is ready, you use it to predict the answer on the evaluation or test data set and then compare the predicted target to the actual answer (ground truth). This is the typical approach to evaluating model performance. However, this comparison between predicted and actual values can be performed with a number of different metrics, and the choice of metric depends on the ML problem at hand.

Let’s understand this further through classification and regression problems.

The actual output of many binary classification algorithms is a prediction score (for example, logistic regression provides a probability score). The score indicates the model’s confidence that a given observation belongs to the positive class. To decide whether an observation should be classified as positive or negative, you, as a consumer of this score, pick a classification threshold (cut-off) and compare the score against it. Observations with scores higher than the threshold are predicted as the positive class, and those with lower scores are predicted as the negative class.

For example, probability scores are predicted in the range [0, 1]. The score threshold for classifying examples as 1 or 0 (positive or negative class) is generally set to 0.5 by default. However, you can review the implications of choosing different score thresholds and pick one that matches your business need.
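As a minimal sketch of how such a threshold is applied, assuming scikit-learn, a synthetic dataset, and logistic regression purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# Probability of the positive class for each test observation, in [0, 1].
scores = model.predict_proba(X_test)[:, 1]

# Default cut-off of 0.5; a different threshold trades precision against recall.
threshold = 0.5
y_pred = (scores >= threshold).astype(int)
```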

Score Distribution for a Binary Classification Model (Reference — AWS SageMaker)

The predictions now fall into four groups based on the actual known answer and the predicted answer. This representation of actual versus predicted values is termed the Confusion Matrix.

Typical metrics for classification problems are Accuracy, Precision, Recall, False Positive Rate, and F1-measure, all of which are derived from the confusion matrix. Each metric measures a different aspect of the predictive model.

1. Accuracy — measures the fraction of all predictions, both positive and negative, that the model got right.

Accuracy = (True Positive + True Negative)/(True Positive + True Negative + False Positive + False Negative)
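A quick sketch of deriving the confusion matrix and accuracy with scikit-learn, using a small toy set of labels for illustration:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy ground truth and predictions, purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# scikit-learn lays the binary confusion matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy, accuracy_score(y_true, y_pred))   # both 0.7
```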

2. Precision — measures the fraction of actual positives among the examples predicted as positive. It is the probability that a predicted ‘Yes’ is actually a ‘Yes’. It is a measure of relevance, showing the percentage of predicted positives that are correct.

Precision = True Positive/(True Positive + False Positive)

This metric is useful when false positives are costly, for example in medical screening or drug testing.
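Using the same toy labels as above, precision can be computed directly with scikit-learn; the values are illustrative only:

```python
from sklearn.metrics import precision_score

# Same toy labels as before: TP = 2, FP = 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))   # 2 / (2 + 1) ≈ 0.67
```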

3. Recall — measures how many actual positives were predicted as positive. It is the probability that an actual ‘Yes’ case is predicted correctly. It is also known as Sensitivity, True Positive Rate (TPR), or Completeness.

Sensitivity = True Positive/(True Positive + False Negative)

This metric is particularly useful when false negatives are the costlier mistake, for example in fraud detection. In the banking and financial services (BFS) domain, it is vital that your model does not classify an actual fraudulent transaction as non-fraudulent.
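A matching sketch for recall on the same toy labels:

```python
from sklearn.metrics import recall_score

# Same toy labels as before: TP = 2, FN = 2.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(recall_score(y_true, y_pred))   # 2 / (2 + 2) = 0.5
```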

4. Specificity — This metric is helpful when the negative class is more important to you. It is also known as the True Negative Rate.

Specificity = True Negative/(True Negative + False Positive)

5. False Positive Rate (FPR) — This metric gives the number of false positives (0s predicted as 1s) divided by the total number of negatives. From the formula below, you can see that FPR is nothing but (1 - Specificity).

FPR = False Positive/(False Positive + True Negative)
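scikit-learn has no dedicated specificity or FPR function, but both fall out of the confusion matrix; a small sketch, again on the same toy labels:

```python
from sklearn.metrics import confusion_matrix

# Same toy labels as before: TN = 5, FP = 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)   # 5/6 ≈ 0.83
fpr = fp / (fp + tn)           # 1/6 ≈ 0.17, i.e. 1 - specificity
print(specificity, fpr)
```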

It is crucial to consider the overall business problem you are trying to solve when deciding which metric to maximize or minimize. Based on the requirement, you can follow the ‘Sensitivity-Specificity’ view or the ‘Precision-Recall’ view.

6. F1 Score — It is the harmonic mean of precision and recall. The F1 Score is useful when you want to strike a balance between precision and recall and there is an uneven class distribution (a large number of actual negatives).

F1 Score = 2 * (Precision * Recall)/(Precision + Recall)

F1 Score = 2 * True Positive/(2 * True Positive + False Positive + False Negative)
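A short sketch confirming that the built-in F1 score matches the harmonic mean of precision and recall on the same toy labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Same toy labels as before: precision = 2/3, recall = 1/2.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(f1_score(y_true, y_pred), 2 * p * r / (p + r))   # both ≈ 0.57
```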

7. ROC Curve — The ROC curve shows the trade-off between the True Positive Rate (TPR, or Recall) and the False Positive Rate (FPR). As established by the formulas above, TPR and FPR are nothing but Sensitivity and (1 - Specificity), so the curve can also be viewed as a trade-off between sensitivity and specificity. The plot of the true positive rate against the false positive rate is known as the ROC curve. As the sample graph below shows, higher values of TPR also come with higher values of FPR, which may not be desirable, so it is all about finding a balance between these two metrics. A good ROC curve is one that hugs the upper-left corner of the graph; the higher the Area Under the Curve (AUC) of the ROC curve, the better your model.
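A rough sketch of plotting a ROC curve and computing AUC from probability scores, reusing the synthetic model and data from the thresholding sketch:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data and model, purely for illustration.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")   # diagonal = random guessing
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```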

8. Metrics for multi-class classification — Unlike binary classification problems, you do not need to choose a probability score threshold to make predictions. The predicted answer is the class (i.e., label) with the highest predicted score. In some cases, you might want to use the predicted answer only if it is predicted with a high score; in that case, you might choose a threshold on the predicted scores for deciding whether to accept the predicted answer.

Typical metrics used for multi-class classification are the same as the metrics used in the binary case. Each metric is calculated for a class by treating it as a binary problem, with all the other classes grouped into a second class. The binary metric is then averaged over all classes to get either a macro average (treat each class equally) or a weighted average (weighted by class frequency). The confusion matrix for multiple classes is a table that shows each class in the evaluation data and the number or percentage of correct and incorrect predictions. For example, below is a sample confusion matrix for 0-to-9 digit classification.
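A sketch of macro versus weighted averaging on a multi-class problem, using the 0-9 handwritten digits dataset as a stand-in; the model choice is illustrative only:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# The 0-9 digits dataset as an example multi-class problem.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

y_pred = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, y_pred))               # 10 x 10 table of actual vs. predicted
print(f1_score(y_test, y_pred, average="macro"))      # every class weighted equally
print(f1_score(y_test, y_pred, average="weighted"))   # weighted by class frequency
print(classification_report(y_test, y_pred))          # per-class precision/recall/F1
```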

For regression tasks, the typical accuracy metric is the root mean square error (RMSE).

9. RMSE — This metric measures the distance between the predicted numeric target and the actual numeric answer (ground truth), which is referred to as the residual or error term. It is common practice to review the residuals for regression problems.
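A minimal sketch of computing RMSE, assuming a synthetic regression dataset and a plain linear model for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, purely for illustration.
X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# RMSE = square root of the mean squared residual.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
```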

Residuals represent the portion of the target that the model is unable to predict. A positive residual indicates that the model is underestimating the target (the actual target is larger than the predicted target). A negative residual indicates an overestimation (the actual target is smaller than the predicted target).

Plotting a histogram of the residuals on the evaluation data is one way to analyze the error terms. A distribution that is bell-shaped and centered at zero indicates that the model makes mistakes in a random manner and does not systematically over- or under-predict any particular range of target values. If the residuals do not form a zero-centered bell shape, there is some structure in the model’s prediction error. For example, the graph below shows residuals that are approximately normally distributed.

Residual Plot using Seaborn library
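A rough sketch of how such a residual histogram might be produced with Seaborn, assuming the y_test and y_pred arrays from the RMSE sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# y_test and y_pred are assumed to come from the RMSE sketch above.
residuals = y_test - y_pred   # positive => model under-predicts, negative => over-predicts
sns.histplot(residuals, kde=True)
plt.xlabel("Residual (actual - predicted)")
plt.show()
```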

10. R² (Coefficient of Determination) — This metric measures the proportion of variance in the target that is explained by the model.

R² = 1 - Residual Sum of Squares (RSS)/Total Sum of Squares (TSS)

For example, the values below show that the model explains 83% of the variance in the test data.

Sample R² value derived using Sklearn library
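A short sketch of computing R² with scikit-learn and checking it against the 1 - RSS/TSS definition, again assuming the y_test and y_pred arrays from the regression sketch above:

```python
import numpy as np
from sklearn.metrics import r2_score

# y_test and y_pred are assumed to come from the regression sketch above.
r2 = r2_score(y_test, y_pred)

# Equivalent to 1 - RSS/TSS.
rss = np.sum((y_test - y_pred) ** 2)
tss = np.sum((y_test - np.mean(y_test)) ** 2)
print(r2, 1 - rss / tss)   # identical values
```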

The aforementioned metrics are just a few from the vast array of metrics available for evaluating model performance.

You might not get the desired predictive model in the first iteration (read: “Machine Learning — Why it is an iterative process?”), or you might want to improve your model to get even better predictions. Obtaining an ML model that matches your needs usually involves iterating through the ML process, trying out a few variations, and evaluating repeatedly on the selected metrics. To improve performance, you need to bring the right balance of bias and variance to the model by following these steps: “5 ways to achieve right balance of Bias and Variance in ML model”.

Happy Reading!!
