Here at DryvIQ, we take pride in our ability to detect sensitive information in documents and images and classify document types. But how exactly do we measure how well our models are doing?
Accuracy is often the first word used in these discussions. You have likely noticed that many AI platform companies claim that their models have excellent accuracy scores. But accuracy as a standalone metric can often be very misleading; in fact, many models that mislabel the majority of sensitive documents as “not containing sensitive information” can have nearly perfect accuracy scores! So how is this possible, and which metrics should a truly “accurate” AI model be evaluated against? Let me explain.
How can AI accuracy be misleading?
In many industries, documents containing sensitive content make up a tiny percentage of all the documents a company owns. Even if 10% of a company’s documents are sensitive, a model that labels EVERY document as “not sensitive” would be correct 90% of the time. This means its accuracy is 90%! If only accuracy is reported, this model sounds like it is performing well when, under the hood, a very different story is unfolding.
As another example, say a company has 200 resumes in a collection of 100,000 documents. A model that predicts that no document is a resume would be correct on the other 99,800 documents, giving it an accuracy of 99.8% for detecting whether a document is a resume. A model could be rigged this way to have great accuracy on every document type that doesn’t make up a large percentage of the total. (This is also a horrible model!)
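The first example can be sketched in a few lines of Python; the 10% sensitive rate and the always-negative “model” below are illustrative assumptions, not real DryvIQ data:

```python
# A "model" that labels every document as not sensitive.
# With 10% of documents sensitive, it is right 90% of the time
# while catching zero sensitive documents.
labels = [True] * 10 + [False] * 90        # 10 of 100 documents are sensitive
predictions = [False] * len(labels)        # always predict "not sensitive"

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
sensitive_found = sum(p and y for p, y in zip(predictions, labels))

print(f"accuracy: {accuracy:.0%}")                      # 90%
print(f"sensitive documents found: {sensitive_found}")  # 0
```

The resume example works the same way: 99,800 correct “not a resume” predictions out of 100,000 documents, with zero resumes found.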
What should we look for instead?
Since we can be easily fooled by models with high accuracy, we need another way to measure how well a model is performing. We will introduce two metrics here—precision and recall. In short, we use precision when our most important goal is reducing the number of false positives, and recall when that goal is reducing the number of false negatives. If these concepts are unfamiliar to you, don’t worry; they are described in the next section.
If you don’t want to get into the nitty-gritty, here’s a summary of why we use precision and recall to evaluate models here at DryvIQ:
In the case of detecting sensitive content (a binary classification problem, where each document is labeled yes/no), high precision implies that few non-sensitive documents were labeled as “sensitive,” and high recall implies that few sensitive documents were labeled as “not sensitive.” In the case of document classification (a multi-class problem, where each document is assigned one of many possible labels), precision and recall are reported as averages of the respective metric over the document types.
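As a sketch of how per-type averaging works for document classification, here is a macro average over three hypothetical document types; the labels and predictions below are made up for illustration:

```python
# Hypothetical labels and predictions for a three-type classifier.
actual    = ["resume", "invoice", "resume", "contract", "invoice", "contract"]
predicted = ["resume", "invoice", "invoice", "contract", "invoice", "resume"]

def per_class_metrics(label):
    """Precision and recall for one document type, one-vs-rest."""
    tp = sum(a == label and p == label for a, p in zip(actual, predicted))
    fp = sum(a != label and p == label for a, p in zip(actual, predicted))
    fn = sum(a == label and p != label for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = sorted(set(actual))
per_class = [per_class_metrics(label) for label in labels]

# Macro average: every document type counts equally, however rare it is.
macro_precision = sum(p for p, _ in per_class) / len(labels)
macro_recall = sum(r for _, r in per_class) / len(labels)

print(f"macro precision: {macro_precision:.1%}")  # 72.2%
print(f"macro recall:    {macro_recall:.1%}")     # 66.7%
```

Averaging over types (rather than over documents) keeps a rare document type like resumes from being drowned out by the common ones.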
How exactly do we calculate precision/recall?
We first need to become familiar with four terms, which we will describe with respect to sensitive document detection:
- True positive: A document that does contain sensitive information and is labeled as “containing sensitive information”
- False positive: A document that does NOT contain sensitive information but is labeled as “containing sensitive information”
- True negative: A document that does NOT contain sensitive information and is labeled as “NOT containing sensitive information”
- False negative: A document that does contain sensitive information but is labeled as “NOT containing sensitive information”
This can be summarized using the table below:

|                          | Actually sensitive | Actually not sensitive |
|--------------------------|--------------------|------------------------|
| Labeled “sensitive”      | True positive      | False positive         |
| Labeled “not sensitive”  | False negative     | True negative          |
Precision is calculated as:

Precision = True Positives / (True Positives + False Positives)

In the case of sensitive content detection, precision is the percentage of documents predicted as containing sensitive information that actually do.
Recall is calculated as:

Recall = True Positives / (True Positives + False Negatives)
In the case of sensitive content detection, “recall” is the percentage of documents containing sensitive content that were predicted as containing sensitive content.
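Putting the two definitions together, here is a minimal sketch; the counts are illustrative, not real scan results:

```python
def precision(tp, fp):
    # Of everything flagged as sensitive, what fraction actually was?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything actually sensitive, what fraction did we flag?
    return tp / (tp + fn)

# Illustrative counts: 80 sensitive docs caught, 20 missed, 10 false alarms.
tp, fp, fn = 80, 10, 20
print(f"precision: {precision(tp, fp):.0%}")  # 89%
print(f"recall:    {recall(tp, fn):.0%}")     # 80%
```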
Which metric should you care about?
The short answer: it depends on the task at hand! A model with high precision returns few false positives, whereas a model with high recall returns few false negatives, but optimizing both at once is a balancing act.
The balancing act behind DryvIQ’s superior accuracy metrics
In a perfect world, we would have perfect scores of 100% for both precision and recall. In reality, there is a trade-off. For instance, to detect US phone numbers, we look for 10 consecutive digits in specific patterns. By restricting those patterns, we can significantly reduce the number of false positives, increasing precision. But if we become too restrictive, we also decrease the number of true positives and increase the number of false negatives. Although this can continue to increase precision, it significantly decreases recall.
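The trade-off can be illustrated with two toy phone-number patterns; the regexes, sample snippets, and labels below are assumptions for illustration, not DryvIQ’s actual detectors:

```python
import re

# Two toy detectors for unformatted US phone numbers (illustrative only).
loose = re.compile(r"\b\d{10}\b")             # any standalone run of 10 digits
strict = re.compile(r"\b(?:734|202)\d{7}\b")  # only two whitelisted area codes

# (text, is_really_a_phone_number) -- hypothetical labeled snippets
samples = [
    ("call 7345550199", True),
    ("fax 2025550123", True),
    ("cell 4155550111", True),
    ("invoice 1234567890", False),
    ("tracking 0008675309", False),
]

def score(pattern):
    """Return (precision, recall) for a pattern over the samples."""
    tp = fp = fn = 0
    for text, is_phone in samples:
        hit = pattern.search(text) is not None
        if hit and is_phone:
            tp += 1
        elif hit:
            fp += 1
        elif is_phone:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print("loose :", score(loose))   # finds every real number, plus the false alarms
print("strict:", score(strict))  # no false alarms, but misses the 415 number
```

The loose pattern has perfect recall but flags invoice and tracking numbers; the over-restricted pattern eliminates those false positives at the cost of missing a real phone number.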
So, with this dichotomy, what metric should we look at?
False negatives MUST be minimized: sensitive documents cannot go undetected, so here at DryvIQ, recall is of utmost importance. But we are also aware that false positives are a major inconvenience to those reviewing scan results, so maximizing precision is our second-highest priority.
DryvIQ’s models and thresholds are carefully crafted to best balance these metrics; that’s what we mean when we refer to our stellar accuracy.
Schedule a demo to see the DryvIQ platform in action today!