According to a Gartner prediction: “Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not scale in the organisation.” One of the primary reasons machine learning is poorly understood in corporate environments is the disconnect between business value and statistics. In my company, we championed a simple framework that gave all parties a shared language for reasoning about the various scenarios around an ML project.

Business leaders care about money, and data scientists (DSes) care about statistical properties. Connecting the two is the critical element of the framework. Both concerns trace back to the same root cause: errors and their costs. Statistical models make two types of mistakes: False Positives (FP) and False Negatives (FN). A False Positive is when something is mistakenly selected; a False Negative is when something is missed. The fundamental recognition was that the monetary value (or the risk) of these two types of error is not the same. The cost of a False Positive is the extra effort (labour cost) of handling the incorrectly selected item. In contrast, the cost of a False Negative is the reputation risk arising from the incompleteness of the resulting product.
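To make this asymmetry tangible, here is a minimal sketch with made-up unit costs (the numbers are purely illustrative, not figures from any real project):

```python
# Illustrative only: the two error types carry very different unit costs.
fp_count, fn_count = 40, 5   # errors made by a hypothetical candidate model

COST_PER_FP = 2.0    # assumed labour cost of reviewing a wrongly selected item
COST_PER_FN = 50.0   # assumed reputation cost of a missed relevant item

fp_cost = fp_count * COST_PER_FP
fn_cost = fn_count * COST_PER_FN
print(f"FP cost: {fp_cost:.0f}, FN cost: {fn_cost:.0f}, total: {fp_cost + fn_cost:.0f}")
# FP cost: 80, FN cost: 250 -- here, five misses outweigh forty bad selections
```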

Let’s take a simple example to make this concrete: the case of an analyst searching for relevant content for clients. Previously selected content forms a declarative problem definition: examples of what “relevant” means, rather than explicit rules. Such problems are hard to solve imperatively, which makes this a good candidate for a machine learning solution. Partial automation would mean that an upstream model labels all content as “relevant/not relevant”, and the analyst decides which of the model-selected items should go to the clients.
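As a rough sketch, this partial automation could look like the following (the `model.predict_proba` interface and the threshold value are assumptions for illustration, not a real API):

```python
# Sketch of partial automation: the model pre-filters, the analyst decides.
def review_queue(items, model, threshold=0.5):
    """Yield only the items the model tags as relevant; the analyst makes the final call."""
    for item in items:
        score = model.predict_proba(item)  # assumed: model returns a relevance score in [0, 1]
        if score >= threshold:
            yield item                     # goes to the analyst (False Positives cost labour here)
        # items below the threshold never reach the analyst (False Negatives hide here)
```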

Partial automation is usually a convenient middle step in machine learning projects: it sits halfway between the fully manual and the fully automated solution, so the various scenarios can be evaluated within the same framework. What are the two types of errors here, and what are their consequences?

False Positives and Precision

A false positive means that the model tagged some content as relevant when it is actually not. An analyst must read it to make that judgement, which costs time and money. If this happens frequently, the analyst may be physically unable to sift through all of the content, and the product becomes unfeasible unless more analysts are hired. False Positive errors therefore primarily drive labour cost. Precision is the quality of the selected articles: the proportion of selected items that are truly relevant, or in other words, the lack of false positives expressed as a percentage.
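In code, precision is simply the ratio of true positives to everything the model selected; a minimal sketch with invented counts:

```python
def precision(tp: int, fp: int) -> float:
    """Share of selected items that are truly relevant (the lack of false positives)."""
    return tp / (tp + fp)

print(precision(tp=80, fp=20))  # 0.8 -> 80% of the analyst's queue is worth reading
```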

False Negatives and Recall

When the model makes a false negative error, the analysts downstream have no chance of selecting a relevant piece because they won’t see it at all. That means the client won’t see it either, which damages the quality of the service and can eventually destroy its value for them. There is no way to compensate for this downstream; the only remedy is a better model. Recall is the completeness of the selected articles: the proportion of relevant items that were actually selected, or in other words, the lack of missed items expressed as a percentage.
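Recall is the mirror image: the ratio of true positives to everything that was actually relevant, again with invented counts:

```python
def recall(tp: int, fn: int) -> float:
    """Share of relevant items that were actually selected (the lack of missed items)."""
    return tp / (tp + fn)

print(recall(tp=80, fn=10))  # ~0.89 -> roughly 11% of relevant content never reaches a client
```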

Making It Actionable

For a given model, DSes can set hyperparameters (most commonly the decision threshold) to trade precision against recall. From a statistical perspective, precision and recall measure different, mutually incompatible quantities, so it is not possible to decide mathematically which configuration is “best”: that is a purely business question. What matters more to the business: one more hour of analyst work spared, or one more slightly disappointed customer? These are not mathematical questions at all.

To aid the business decision, we can draw the various options on a chart:

PR Curve
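A chart like this can be produced directly from held-out labels and model scores; here is a minimal sketch using scikit-learn and matplotlib (the synthetic data and the 0.85 acceptability limit are placeholders, not values from the project):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Synthetic stand-ins for held-out ground truth and model relevance scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = 0.4 * y_true + 0.6 * rng.random(500)  # scores loosely correlated with the labels

precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recalls, precisions)                       # every point is one threshold choice
plt.axvline(x=0.85, linestyle="--", color="red",    # hypothetical client acceptability limit
            label="Client acceptability limit")
plt.xlabel("Recall (completeness)")
plt.ylabel("Precision (quality)")
plt.legend()
plt.show()
```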

The chart depicts several scenarios, each with a different business value:

  • Solution 0:
    • Low precision, 100% recall: everything is selected, and analysts have to filter the content manually. Clearly not practical, but listed as a theoretical option because it ensures that the end-users miss nothing.
  • Solution 1:
    • Higher precision, slightly lower recall: by sacrificing a small amount of relevant content, a massive amount of irrelevant content can be filtered out automatically. This frees up the labour that was previously required to filter that content and makes the first step toward automation.
  • Solution 2:
    • Even higher precision, lower recall: with careful consideration, further irrelevant content can be discarded automatically, but this is the point where the lost relevant content must be examined to decide whether the system is still acceptable.
  • Solution 3:
    • By selecting parameters that achieve even higher precision and spare even more labour cost, the resulting system misses too much relevant content (recall is too low) and becomes unfeasible. Where exactly the “Client acceptability limit” lies should be decided by UX and product managers, as they represent the end-users in the process.
  • Solution 4:
    • With the current model, no hyperparameter selection can place the system at the position of Solution 4. DSes must create a better model to deliver the performance characteristics of this solution.

Once DSes create a better model, the whole tradeoff negotiation between precision and recall must restart, as new options appear on the horizon. If a better model is not possible, the best available option (from a business point of view) should be selected by setting the appropriate hyperparameters. If that still doesn’t yield an economically feasible solution, a go/no-go decision must be made about the project.
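Under assumed unit costs and an assumed acceptability limit, that selection step can even be reduced to arithmetic. A sketch, where all numbers are invented for illustration:

```python
COST_PER_FP = 2.0    # assumed labour cost of reviewing a wrongly selected item
COST_PER_FN = 50.0   # assumed reputation cost of a missed relevant item
MIN_RECALL = 0.85    # client acceptability limit, set by UX/PM, not by DS

def total_cost(fp: int, fn: int) -> float:
    return fp * COST_PER_FP + fn * COST_PER_FN

# Candidate operating points: (name, FP count, FN count, recall) -- made-up numbers
candidates = [
    ("Solution 1", 400, 10, 0.950),
    ("Solution 2", 150, 25, 0.875),
    ("Solution 3",  60, 60, 0.700),
]

feasible = [c for c in candidates if c[3] >= MIN_RECALL]
if feasible:
    name, fp, fn, _ = min(feasible, key=lambda c: total_cost(c[1], c[2]))
    print(f"Pick {name}: total error cost {total_cost(fp, fn):.0f}")
else:
    print("No acceptable operating point: improve the model or make a go/no-go call")
```

In practice, the unit costs are the hardest inputs to obtain, which is precisely why this negotiation needs both business and DS at the table.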

Summary

This is a convenient framework linking business value and statistics. Its terminology lets non-technical team members participate in clarifying the business case for an end-to-end ML system with a couple of simple terms. Experience shows that an organisation can adopt phrases like “precision-sensitive problem”, meaning labour cost should be reduced, or “Is this a recall issue?”, meaning some critical content is being missed. This keeps all parties on the same page about what should be done next. The framework applies to more complicated modelling problems as well.

Key takeaways

  • False Positive and False Negative errors have fundamentally different business costs and risks.
  • For the same model, multiple business options are available through the precision-recall tradeoff.
  • Selecting from these options is a business question, not a mathematical one.
  • The shared vocabulary enables business and UX/PM teams to contribute domain-specific knowledge to DS projects.