Model Development Best Practices: Data Labeling 101

Data labeling is perhaps the most critical part of any machine learning classification problem. While labeling structured data can usually lean on tangible, well-defined, analytical interpretations, labeling unstructured data such as text, as is common in NLP problems, more often rests on less transparent heuristics that live in annotators' heads. Successful NLP annotation projects must therefore account for all of the weird and wonderful variation in a labeling team's thought processes and interpretations of the data as the project progresses.

One way to account for this is to ensure that the level of complexity encoded in the labels is right for both the analytic use case and the business requirements. By reducing the complexity of each individual classification problem, say through model localisation, we reduce the depth of expertise our annotators need and increase the speed with which they can apply labels.

Even with simplified label definitions, it is still important to continuously monitor annotator alignment: a team that starts on the same page will not necessarily stay there over time or across annotators. In fact, concept drift, where annotators' working definitions of the target concepts gradually shift, is ubiquitous in data labeling projects.
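As a minimal sketch of what such monitoring might look like in practice (the article does not prescribe a specific tool), pairwise Cohen's kappa between annotators on a shared batch of items is one common starting point. The annotator names and labels below are purely hypothetical.

```python
# Illustrative sketch: pairwise inter-annotator agreement on a shared batch.
# Assumes each annotator labeled the same items in the same order.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels from three annotators on the same ten documents.
labels_by_annotator = {
    "annotator_a": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
    "annotator_b": [1, 0, 1, 0, 0, 0, 1, 0, 1, 1],
    "annotator_c": [1, 1, 1, 0, 0, 1, 1, 0, 0, 1],
}

# Pairwise Cohen's kappa flags drifting pairs early; tracking these scores
# over successive batches shows whether alignment degrades over time.
for (name_a, labels_a), (name_b, labels_b) in combinations(labels_by_annotator.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
```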

Some teams handle this problem by enforcing minimum annotator agreement thresholds; Figure 1 shows how these thresholds affect model quality.

Figure 1. Model quality as a function of the number of training positives. Coloured lines represent varying annotator agreement thresholds for positive sample inclusion, showing that higher thresholds yield higher-quality positives, so fewer positives are required overall.

While enforcing minimum thresholds on annotation acceptance will certainly improve data quality, the effort required to source the requisite number of labels is high, since multiple team members end up annotating the same data.
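To make the thresholding step concrete, the sketch below shows one possible way to retain only positives that clear a minimum agreement level. The document IDs, annotator names, and the 0.8 threshold are illustrative assumptions, not values from the article.

```python
# Illustrative sketch: keep only positives that meet a minimum annotator
# agreement threshold. All identifiers and the threshold are hypothetical.
from collections import defaultdict

# Hypothetical raw annotations: (document_id, annotator, label).
annotations = [
    ("doc-1", "annotator_a", 1), ("doc-1", "annotator_b", 1), ("doc-1", "annotator_c", 1),
    ("doc-2", "annotator_a", 1), ("doc-2", "annotator_b", 0), ("doc-2", "annotator_c", 1),
    ("doc-3", "annotator_a", 1), ("doc-3", "annotator_b", 0), ("doc-3", "annotator_c", 0),
]

MIN_AGREEMENT = 0.8  # fraction of annotators who must mark the sample positive

votes = defaultdict(list)
for doc_id, _annotator, label in annotations:
    votes[doc_id].append(label)

accepted_positives = [
    doc_id
    for doc_id, doc_votes in votes.items()
    if sum(doc_votes) / len(doc_votes) >= MIN_AGREEMENT
]
print(accepted_positives)  # only doc-1 clears the 0.8 threshold
```

Note the trade-off the article describes: every document here consumed three annotators' time to produce, at most, one accepted positive.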

Innovations in machine learning can help here, though. Rather than duplicating effort, annotators can focus on mining for target samples across the maximum volume of data, while issues such as annotator disagreement and concept drift are monitored using semantic mapping solutions and visualisations of those mappings. One such concept drift monitoring pipeline is illustrated in Figure 2.

Figure 2. Generic semantic mapping visualisation pipeline, illustrating how annotation data sets can be interrogated for semantic concept drift. The visualisation can be designed to capture inter-annotator concept drift by colour-coding points by annotator, or to inspect temporal concept drift within and across annotators by capturing annotation timestamps.
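A rough sketch of such a pipeline might look like the following. The embedding model, the UMAP projection, and the example texts are assumptions for illustration, not the specific stack behind Figure 2.

```python
# Illustrative sketch of a semantic mapping pipeline in the spirit of Figure 2:
# embed annotated texts, project to 2D, and colour by annotator to surface
# inter-annotator concept drift. All data and model choices are hypothetical.
import matplotlib.pyplot as plt
import umap
from sentence_transformers import SentenceTransformer

# Hypothetical annotation export: labeled texts plus the annotator who labeled each.
texts = [
    "refund was denied after two weeks",
    "agent resolved my issue quickly",
    "my account was locked without warning",
    "card declined twice at checkout",
    "still waiting on my refund",
    "very helpful support team",
    "cannot log in to online banking",
    "payment failed with no explanation",
]
annotators = ["annotator_a", "annotator_b", "annotator_a", "annotator_b",
              "annotator_a", "annotator_b", "annotator_a", "annotator_b"]

# 1. Embed the annotated texts into a shared semantic space.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

# 2. Reduce to two dimensions for plotting (n_neighbors kept small for this toy sample).
projection = umap.UMAP(n_neighbors=3, random_state=42).fit_transform(embeddings)

# 3. Colour points by annotator; colouring by annotation timestamp instead
#    would surface temporal drift within and across annotators.
colours = ["tab:blue" if a == "annotator_a" else "tab:orange" for a in annotators]
plt.scatter(projection[:, 0], projection[:, 1], c=colours)
plt.title("Annotated samples in semantic space, coloured by annotator")
plt.savefig("semantic_map.png")
```

If one annotator's positives begin clustering away from the rest of the team's, that region of the map is a natural place to start a label-definition review.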

In summary, strong subject matter expert (SME) oversight, constant data scientist input, and cutting-edge machine learning practice are all essential to successfully delivering any labeling project. Investing effort and resources at this early stage greatly improves your chances of eventually delivering a powerful machine learning classification model for your business.

Written By
Kevin Keenan

Director, Data Science (FinServ)