Project Perseus AI ~ Post #3 ~ Evaluating Models and Learning TensorFlow

What Did I Learn This Week?

This week in my educational journey in machine learning, I learned how to track model metrics such as accuracy and precision, learned how another kind of model works, called binary classification, and finally picked up some TensorFlow basics.

TensorFlow is a really neat library for the Python programming language that lets you easily set up and train models using its Keras API. I started going through some videos that taught me the basics of how to import data, compile a model, fit it, and evaluate its effectiveness.

So, I've boiled down everything that I've learned in the past couple weeks into a post that briefly describes the ML concepts I went through and the code I used to practice those concepts.

Let's get started!

Basics of a Binary Classification Model

A binary classification model is one that seeks to categorize data points as one of two possible options. For example, it could be used to designate a sample as a "spam email" or a "not spam email." Perhaps a doctor might use one to analyze data and output whether a patient's tumor is "malignant" or "not malignant." As you can see, one category in each pair plays the role of the positive class and the other the negative class. The categories are represented in the data as either a 0 or a 1.

However, the model usually does not output the words "spam" or "not spam." It will usually output a number between 0 and 1. This number should then be rounded to either 0 or 1. Then a question arises: at what point should the rounding take place? At 0.5? Or should we make sure the model is more confident, say 0.75, before we actually round up to 1? The answer to this question is called the model's classification threshold.

The classification threshold is massively important because that's the value that determines how many false positives or false negatives your model makes. A tally of the true positives, false positives, true negatives, and false negatives can then be assembled. This table is called the model's confusion matrix. When a dataset contains many more actual positives than actual negatives (or vice versa), the data is considered imbalanced.

To conclude this line of thought, the classification threshold can be adjusted based on whether you want your model to prioritize minimizing false positives or minimizing false negatives. This is unique to each model and use case, since one of the two (false negatives or false positives) may be more detrimental than the other.
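To make that concrete, here's a tiny sketch of how a threshold turns raw model scores into the four confusion-matrix counts. The scores and labels below are made up purely for illustration, not from any real model:

```python
import numpy as np

# Made-up scores (the model's output between 0 and 1) and true labels.
scores = np.array([0.95, 0.80, 0.60, 0.40, 0.30, 0.10])
labels = np.array([1, 1, 0, 1, 0, 0])  # 1 = positive, 0 = negative

threshold = 0.5                              # the classification threshold
preds = (scores >= threshold).astype(int)    # "round" each score to 0 or 1

# Tally the four cells of the confusion matrix.
tp = np.sum((preds == 1) & (labels == 1))    # true positives
fp = np.sum((preds == 1) & (labels == 0))    # false positives
tn = np.sum((preds == 0) & (labels == 0))    # true negatives
fn = np.sum((preds == 0) & (labels == 1))    # false negatives

print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")  # TP=2  FP=1  TN=2  FN=1
```

If you raise the threshold to 0.7 in this toy example, the false positive with a score of 0.60 becomes a true negative, but the actual positive with a score of 0.40 is still missed, so the trade-off never fully goes away.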


Model Metrics

Now that we have some of the basics out of the way, we can move on to discussing how a binary classification model's success is evaluated. How can the doctor know whether to trust the model when it predicts that a tumor is malignant? Obviously, in this case a false negative is much more dangerous than a false positive. How does the doctor account for that in the model?

To start this off, we're going to discuss accuracy, which is super simple:

$$\text{Accuracy} = \frac{\text{Correct Classifications}}{\text{Total Classifications}}$$

It's just asking: out of all the cases in the dataset, how many did the model get right?

This is a really simple starting point. Always start with this. But rarely should you end here. Instead, the next metric we can look at is Recall, or the True Positive Rate (TPR):


$$\text{Recall} = \frac{\text{Correctly Classified Actual Positives}}{\text{All Actual Positives}}$$

Recall is the proportion of all actual positives that were correctly classified as positives. It's pretty much measuring how successful your model is at catching positives. Focusing on this metric means that you want your model to catch as many of the actual positives as possible. For instance, if "malignant" is the positive class here (labeled as 1), a doctor would want to focus on this metric, since any actual malignant case that the model doesn't recognize could be very dangerous for the diagnosis.


Let's now move on to the False Positive Rate (FPR):


$$\text{FPR} = \frac{\text{Incorrectly Classified Actual Negatives}}{\text{All Actual Negatives}}$$

The FPR is a measure of how many actual negatives were classified incorrectly as positives by the model out of all the actual negatives in the dataset. It's like tracking the model's rate of false alarms. If a dataset is imbalanced so that there are very few actual negatives, this metric becomes less useful.


Next up we have Precision:


$$\text{Precision} = \frac{\text{Correctly Classified Actual Positives}}{\text{Everything Classified as Positive}}$$

This just measures what fraction of everything the model classified as positive is actually positive.

I always got precision a little confused with recall when I was learning these, so I figured I would clarify the difference between them.

Recall is a measure of how many positives the model correctly classified out of all of the actual positives in a dataset.

Precision is a measure of how many positives the model correctly classified out of everything the model classified as positive. In other words, it measures how the true positives stack up against the incorrectly classified positives (false positives).
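To tie the metrics together, here's a small sketch that computes all four straight from the confusion-matrix counts. I'm reusing the made-up counts from the earlier sketch, so the numbers are purely illustrative:

```python
# Confusion-matrix counts from the earlier made-up example.
tp, fp, tn, fn = 2, 1, 2, 1

accuracy  = (tp + tn) / (tp + fp + tn + fn)  # correct classifications / total classifications
recall    = tp / (tp + fn)                   # TPR: caught positives / all actual positives
fpr       = fp / (fp + tn)                   # false alarms / all actual negatives
precision = tp / (tp + fp)                   # correct positives / everything flagged positive

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}  FPR={fpr:.2f}  precision={precision:.2f}")
# accuracy=0.67  recall=0.67  FPR=0.33  precision=0.67
```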

Now that we've learned these things, we're ready to discuss the last topic before I move into a more code-heavy section!


ROC Curves and AUC

These terms refer to Receiver Operating Characteristic curves and Area Under the Curve, respectively.

The ROC curve is a visual representation of a model's performance at every value of the classification threshold (from 0.0 to 1.0). It is plotted as True Positive Rate (TPR) vs. False Positive Rate (FPR), so that's TPR on the y-axis and FPR on the x-axis.

The area under an ROC curve represents the probability that the model will rank a randomly chosen positive example from the dataset above a randomly chosen negative one. For instance, a model that has a 50% chance of doing this (no better than a coin toss) has an AUC of 0.50, which corresponds to a perfectly diagonal line from (0,0) to (1,1). Notice that this line is the line y = x, and therefore TPR = FPR.
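To make that probability interpretation concrete, here's a tiny brute-force check: take the made-up scores from the earlier sketches, pair every actual positive with every actual negative, and count how often the positive gets the higher score.

```python
import itertools

# Made-up scores, split by the true label (same illustrative numbers as before).
pos_scores = [0.95, 0.80, 0.40]  # scores the model gave to actual positives
neg_scores = [0.60, 0.30, 0.10]  # scores the model gave to actual negatives

# Count the positive/negative pairs where the positive is ranked higher.
wins = sum(p > n for p, n in itertools.product(pos_scores, neg_scores))
auc = wins / (len(pos_scores) * len(neg_scores))

print(auc)  # 8 of the 9 pairs are ranked correctly, so AUC is about 0.89
```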

Models in the real world will have more complex ROC curves, and analyzing this graph is an essential part of figuring out how you want to optimize your model for your given use case.

[Figure: an example ROC curve, plotted as TPR vs. FPR, with threshold values marked along the curve]

For example, looking at the above ROC curve, if you really needed to limit the false positive rate, you would have to use a larger threshold value. However, as you move down the curve (which is traced out by sweeping the threshold), raising the threshold also decreases the true positive rate, so there are always trade-offs. Still, in a use case where false positives are especially detrimental, it would make sense to do this.
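If you want to draw an ROC curve like this one yourself, here's a minimal sketch using scikit-learn's metrics helpers and matplotlib (my own choice of tools here, not necessarily what the notebook uses), with the same made-up scores standing in for a real model's output:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Same made-up scores and labels as in the earlier sketches.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

fpr, tpr, thresholds = roc_curve(labels, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(labels, scores)

plt.plot(fpr, tpr, marker="o", label=f"Model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Coin toss (AUC = 0.50)")  # the y = x baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```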

Well, okay, that may have been a lot, but I hope you learned something from it! Binary classification can seem complex at first, but it's actually pretty elegant once you understand the basics. That's enough of the theory, though! It's time to discuss what I coded up with the TensorFlow Python package this week!


TensorFlow Basics


Instead of going through the code here on the blog, I'm going to give you a link that will take you to a Google Colab notebook where I have detailed the process from start to finish:


I hope that this code example helps to solidify some of the concepts I've been going over.
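And if you just want a quick taste of that workflow without leaving the blog, here's a minimal sketch of the compile/fit/evaluate steps on made-up data. The layer sizes and random data are placeholders of my own, not what the notebook actually uses:

```python
import numpy as np
import tensorflow as tf

# Made-up data: 1000 samples with 10 features each, labeled 0 or 1.
x = np.random.rand(1000, 10).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

# A tiny binary classifier: the sigmoid output is a score between 0 and 1.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile: choose the optimizer, the loss, and the metrics to track.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[
        tf.keras.metrics.BinaryAccuracy(threshold=0.5),  # uses the classification threshold
        tf.keras.metrics.Precision(),
        tf.keras.metrics.Recall(),
        tf.keras.metrics.AUC(),
    ],
)

# Fit: train for a few epochs, holding out 20% of the data for validation.
model.fit(x, y, epochs=5, validation_split=0.2, verbose=2)

# Evaluate: report the loss and the metrics above.
print(model.evaluate(x, y, verbose=0))
```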

With that, I'd like to thank you for reading this far and I hope that you learned something from it!

See you next time!


Where thoughts orbit stars and dreams power suns...
