One of the key and most overlooked aspects of Machine Learning is data labelling. I have written about this here before, most recently in “Data Science Risk Categorisation”, but as I was collecting my thoughts for our free eBook, Machine Learning Product Manual, I decided to revisit the topic one more time. In this article, I will describe how we organised our labelling efforts and what else we tried that didn’t work. Follow me on Twitter if you are interested in our new e-book:

What actually is Machine Learning?

It is worth repeating here what industrial machine learning is in principle: Machine Learning is a data-defined product that converts domain knowledge that is hard to express explicitly (with hand-written rules) into declarative solutions. The primary benefit of Machine Learning is this declarative nature, which separates the problem (the domain knowledge) from the implementation (the statistical model) through labelled data that acts as an interface between the two.
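This separation can be sketched in code. The dataset below carries the domain knowledge, and the model implementation is a pluggable detail behind it; the `keyword_fit` “model” and all data are hypothetical, chosen only to keep the sketch self-contained.

```python
# A minimal sketch of labelled data as the interface between domain and model.
# The dataset encodes the domain knowledge; any model that consumes the same
# (text, label) pairs can be swapped in without touching that knowledge.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class LabelledExample:
    text: str    # raw input from the domain
    label: bool  # domain expert's judgement, per the specification

def train(dataset: List[LabelledExample],
          fit: Callable[[List[Tuple[str, bool]]], object]) -> object:
    """The statistical model is an implementation detail behind the dataset."""
    return fit([(ex.text, ex.label) for ex in dataset])

# A trivial illustrative "model" that memorises positive texts.
def keyword_fit(pairs):
    positives = {text for text, label in pairs if label}
    return lambda text: text in positives

dataset = [LabelledExample("rates rise", True),
           LabelledExample("cat video", False)]
model = train(dataset, keyword_fit)
```

Swapping `keyword_fit` for a real classifier changes nothing upstream of `train`, which is the point of the interface.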

Labelling in general

Armed with this framework, we can look at how labelling is done in a typical setup. The Data Scientist is tasked with automating a problem, given an already existing set of data of unknown quality. The goal is full automation at the end of a Waterfall-like process: clean data -> train model -> evaluate model; repeat until success. The Data Scientist works alone and meets with others only at the evaluation phase. The project is not part of the fabric of the enterprise; it is “in the lab”. Interaction with the organisation is minimal, so the project team cannot benefit from the knowledge the business has accumulated over the years.

Enter the Domain Expert

Once we realised the faults of the above process, we proposed a new method involving Domain Experts (financial analysts who evaluate market conditions at investment banks). The Machine Learning problem was to label articles as relevant to a topic or not, so that our customers need only read content that is relevant to them. First, we created a small initial dataset, and the Domain Experts labelled it according to a pre-agreed specification.

We evaluated whether they adhered to the specification by asking them to relabel a small portion and checking whether the new labels differed from the original ones. A mismatch can mean one of two things: the specification is not objective enough, or the labeller interprets the specification differently. We held several clarification meetings where we discussed together the items that had been labelled in two different ways. The specification itself was a written document that we kept updating to keep the labelling objective.
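The relabelling check above can be sketched as a simple disagreement rate; the labels and function name here are illustrative, not from our production tooling.

```python
# A hedged sketch of the relabelling check: a sample of items is labelled
# twice, and the disagreement rate flags either an ambiguous specification
# or a labeller interpreting it differently.
def disagreement_rate(original, relabelled):
    """Fraction of items whose two labels differ (lower is better)."""
    assert len(original) == len(relabelled)
    diffs = sum(1 for a, b in zip(original, relabelled) if a != b)
    return diffs / len(original)

original   = [1, 0, 1, 1, 0, 1, 0, 0]
relabelled = [1, 0, 0, 1, 0, 1, 1, 0]
rate = disagreement_rate(original, relabelled)  # 2 of 8 differ -> 0.25

# The items that disagree become the agenda for a clarification meeting.
conflicts = [i for i, (a, b) in enumerate(zip(original, relabelled)) if a != b]
```

A chance-corrected statistic such as Cohen's kappa would be a natural refinement, but the raw rate is enough to decide whether a clarification meeting is needed.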

Once this initial labelling phase was done, the model was trained and deployed. But because of the complexity of the task and the small amount of labelled content, the model didn’t perform well enough.

To deliver an end-to-end solution, we employed an augmented process instead of an automated one. We lowered the model’s threshold for identifying content as relevant (more on this in “How to Connect Data Science to Business Value”). This resulted in many False Positives (articles we flagged as relevant that actually were not) but few False Negatives (relevant articles we missed). The Domain Experts then evaluated the positive articles by hand; luckily, this amount of content was manageable. We recorded their interactions and used them as an extra labelled set.
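The augmented process can be sketched as a triage step: lower the decision threshold so the model over-predicts “relevant”, then route every predicted positive to a Domain Expert for review. The 0.3 threshold and the scores below are illustrative assumptions, not figures from our system.

```python
# Sketch of the augmented process: a deliberately low threshold trades more
# false positives for fewer false negatives, and humans review the positives.
REVIEW_THRESHOLD = 0.3  # illustrative; a default cutoff would be ~0.5

def triage(scored_articles, threshold=REVIEW_THRESHOLD):
    """Split scored articles into a human-review queue and a discard pile."""
    review, discard = [], []
    for article, score in scored_articles:
        (review if score >= threshold else discard).append(article)
    return review, discard

scored = [("rates article", 0.9), ("maybe relevant", 0.35), ("noise", 0.1)]
review_queue, discarded = triage(scored)
# The experts' verdicts on review_queue become extra labelled data
# for the next training round.
```

The viability of this setup depends entirely on the review queue staying small enough for the experts to clear by hand, which is why the threshold choice matters.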

Labelling Bias from Incentives

When we used the newly labelled data, the model’s performance was worse than we expected from the larger dataset. We spot-checked the labelling and realised that the Domain Experts were applying a different specification in production than during labelling. In production, their job is to improve the user experience and provide content that maximises it. For example, a duplicated article was labelled negative simply because another copy had previously been labelled positive, and the Domain Expert didn’t want to send the same article to the user twice. This biased the labels in the dataset, and the model was unable to distil the proper rules for identifying the positive articles.
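One possible spot-check for this kind of incentive bias is to group duplicate articles and flag groups whose labels conflict. The sketch below keys duplicates by exact text for simplicity (a real system would need fuzzy matching), and all data is illustrative.

```python
# Flag texts that received both a positive and a negative label -- a symptom
# of production incentives (e.g. "don't resend duplicates") leaking into
# the labels, rather than of the specification being applied.
from collections import defaultdict

def conflicting_duplicates(labelled):
    """Return texts that were labelled both relevant and not relevant."""
    by_text = defaultdict(set)
    for text, label in labelled:
        by_text[text].add(label)
    return sorted(t for t, labels in by_text.items() if len(labels) > 1)

labelled = [
    ("fed raises rates", True),
    ("fed raises rates", False),  # duplicate labelled negative to avoid resending
    ("cat video", False),
]
flagged = conflicting_duplicates(labelled)
```

Every flagged group is a candidate for re-labelling against the written specification rather than against the delivery incentive.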

Enter the Data Acquisition Team

It was clear that we had made a mistake by combining the task of delivering content to the user with the task of selecting relevant content. These two responsibilities must be separated, so we formed the Data Acquisition Team. This team has only one task: to continuously create the dataset that enables the product to deliver the best user experience. Because the Data Acquisition Team can only influence the product through the machine learning model, the Domain Experts’ incentives did not affect their actions.

In the new process, the Data Scientists worked with the Domain Experts to create the specification and trained the Data Acquisition Team in how to implement it. The Data Acquisition Team worked with the Data Science Team to create the initial dataset, fine-tune the specification, and estimate the initial and ongoing data requirements for the model to maintain the specified performance standards.

After the model was deployed, the Data Acquisition Team consulted the Domain Team on the False Positives (and also on the False Negative rate, which is a much harder problem; see “How to Connect Data Science to Business Value”), and together they suggested features to improve the models. The Data Science Team implemented these and requested labelling from the Data Acquisition Team. Meanwhile, the Domain Team continued its work of selecting articles for its clients.


We gained two important benefits by separating the Domain Team and the Data Acquisition Team:

  • We were able to deploy an MVP model fast and maintain continuous delivery of successive improvements.
  • Separation of Concern was improved:
    • The Domain Team could concentrate on serving the clients and passing feedback to the Data Acquisition Team.
    • The Data Acquisition Team could improve the objective performance of the model.

These and many other techniques will be in our free eBook: Machine Learning Product Manual. Please follow me on Twitter for updates on the topic: @xLaszlo.