Labeling
In this project, we augmented a manually created data set with a training set that was bootstrapped from the annotations created by idealo’s rule-based algorithm. Thus, we obtained two kinds of labelled data: strong and weak labels.
To acquire the strong labels, we manually assigned relevant pieces of information (so-called attribute values, e.g. “1 cm”) within a product description from a merchant’s catalogue (a so-called offer text) to the respective attribute (e.g. “length”).
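Such span annotations are commonly converted into token-level tags for training. The following sketch illustrates this with a BIO tagging scheme; the function name, tokenization, and example offer text are illustrative assumptions, not the project’s actual pipeline.

```python
# Hypothetical sketch: turning a manually annotated attribute span
# (e.g. "1 cm" labelled as "length") into token-level BIO tags.

def spans_to_bio(tokens, spans):
    """spans: list of (start_token, end_token_exclusive, attribute)."""
    tags = ["O"] * len(tokens)  # "O" marks background (non-attribute) tokens
    for start, end, attr in spans:
        tags[start] = f"B-{attr}"            # beginning of the value span
        for i in range(start + 1, end):
            tags[i] = f"I-{attr}"            # inside of the value span
    return tags

offer = "Wooden ruler length 1 cm red".split()
# An annotator marked tokens 3..4 ("1 cm") as the value of "length".
print(spans_to_bio(offer, [(3, 5, "length")]))
```

Running this prints `["O", "O", "O", "B-length", "I-length", "O"]`, i.e. one class label per token of the offer text.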
Since many products are already contained in idealo’s product catalogue, we could use that information to automatically create millions of additional labels, which serve as extra training data for our models. Despite providing a tremendous increase in labelled data, we refer to these labels as weak: they were not created by human labellers and are therefore prone to incompleteness and may also contain errors, such as false positives.
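One simple way to bootstrap such weak labels is to match attribute values already stored in the catalogue against the offer text. The sketch below assumes plain substring matching; all names and data are illustrative, and a production system would need normalization and disambiguation.

```python
# Hypothetical sketch of weak labelling: known attribute values from the
# product catalogue are searched for in the offer text.

def weak_label(offer_text, catalogue_values):
    """catalogue_values: dict mapping attribute name -> known value string.
    Returns (attribute, start_char, end_char) for each value found."""
    labels = []
    for attr, value in catalogue_values.items():
        pos = offer_text.find(value)
        if pos != -1:
            labels.append((attr, pos, pos + len(value)))
    return labels

offer = "Oak shelf, length 30 cm, colour brown"
known = {"length": "30 cm", "colour": "brown"}
print(weak_label(offer, known))
```

This also makes the weakness of these labels concrete: a value like “30 cm” may accidentally match an unrelated mention elsewhere in the text (a false positive), and any value that is phrased differently in the offer goes unlabelled (incompleteness).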
Models
Our two best-performing models each follow a well-known problem setting in Natural Language Processing (NLP): the first model performs a semantic segmentation over all classes at once (all attributes plus one non-attribute background class), while the second model frames the task as question answering, “asking” for each attribute individually. Both models build on the state-of-the-art approach in NLP: the Bidirectional Encoder Representations from Transformers (BERT) architecture (blog article, paper).
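The difference between the two framings can be sketched by looking at the model inputs. In the segmentation setting there is a single input and one class per token; in the question-answering setting there is one (question, context) pair per attribute. The attribute list, function names, and question template below are illustrative assumptions.

```python
# Illustrative sketch of the two problem framings.
ATTRIBUTES = ["length", "colour"]

def segmentation_input(offer_tokens):
    # One forward pass; the model predicts one class per token out of
    # len(ATTRIBUTES) + 1 classes (all attributes plus a background class).
    return offer_tokens, len(ATTRIBUTES) + 1

def qa_inputs(offer_text):
    # One (question, context) pair per attribute, BERT QA style:
    # the model extracts the answer span for each question separately.
    return [(f"What is the {attr}?", offer_text) for attr in ATTRIBUTES]

tokens, n_classes = segmentation_input("Oak shelf length 30 cm".split())
print(n_classes)  # 3: length, colour, background
print(qa_inputs("Oak shelf length 30 cm"))
```

The trade-off is that segmentation needs only one pass per offer text, while the question-answering formulation needs one pass per attribute but can generalize more easily to attributes phrased as questions.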
BERTseg model