Automatically extract numerical attributes from product descriptions in order to enrich the existing database.
Product description in free-form text
Extracted numeric product attributes
Enriched product catalog with more structured information for users
idealo provides a product and price comparison website. For that purpose, idealo aggregates the product catalogues of several thousand partner merchants. To make items comparable across different merchants, a pre-defined and standardized set of product attributes is extracted from all items in the partner catalogs.
Currently, this attribute extraction is accomplished by a rule-based matching procedure. Together with idealo, we implemented and analyzed various machine-learning-based approaches to replace or enrich the current rule-based procedure. In the first project phase, we focused on the extraction and interpretation of numeric attributes.
Because of the large inventory (more than 300 million items) and the many-faceted classification of these items (more than a thousand categories and several thousand individual attributes), manually labelling representative training and evaluation data sets would require an unreasonable effort.
In this project, we augmented a manually created data set with a training set that was bootstrapped from the annotations created by idealo’s rule-based algorithm. Thus, we obtained two kinds of labelled data: strong and weak labels.
To acquire the strong labels, we manually assigned relevant information within a product description in a merchant’s catalogue (a so-called offer text) to the respective attribute. Such a piece of information is called an attribute value, e.g. “1 cm” for the attribute “length”.
Since many products are already contained in idealo’s product catalogue, we could use that information to automatically create millions of additional labels. These automatically generated labels serve as additional training data for our models. Although they increase the amount of labelled data tremendously, we refer to these labels as weak: they are not created by human labellers, so they are prone to incompleteness and may contain errors such as false positives.
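The idea of bootstrapping weak labels from known catalogue values can be sketched as follows. This is a minimal illustration, not idealo’s rule-based algorithm: the function name `weak_labels`, the whitespace tokenization, the exact-match rule, and the example data are all assumptions made for this sketch.

```python
def weak_labels(offer_text, catalog_attributes):
    """Label each whitespace token with an attribute name or "O" (background).

    catalog_attributes maps an attribute name to the value already stored
    in the catalogue for this product (hypothetical data format).
    """
    tokens = offer_text.split()
    labels = ["O"] * len(tokens)
    for attribute, value in catalog_attributes.items():
        value_tokens = [v.lower() for v in value.split()]
        n = len(value_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            # Compare case-insensitively and ignore surrounding punctuation.
            window = [t.strip(".,;()").lower() for t in tokens[start:start + n]]
            if window == value_tokens:
                for i in range(start, start + n):
                    labels[i] = attribute
    return list(zip(tokens, labels))


# Made-up example: the catalogue stores "220 V", the offer text says "220 Volt".
labelled = weak_labels(
    "Kabel 1 cm lang, 220 Volt",
    {"length": "1 cm", "voltage": "220 V"},
)
```

In this example “1 cm” is labelled as “length”, while “220 Volt” stays unlabelled because it does not literally match the stored value “220 V”. This is the kind of incompleteness that makes such labels weak.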
Our two best-performing models each follow a well-known problem setting in Natural Language Processing (NLP). The first model performs a semantic segmentation of the text with respect to all classes (all attributes plus one non-attribute background class) at once. The second model follows a question-answering style and “asks” for each attribute individually. Both models build on a state-of-the-art approach in NLP: Bidirectional Encoder Representations from Transformers (BERT) architectures (blog article, paper).
The first model is a BERT segmentation model (BERTSeg). As you see in the image above, the model classifies each “text piece” (also known as a token) as belonging to one of the attributes or to the non-attribute background, using a classification layer that is trained separately for each product category. While the category-specific layer has to be learned from the labelled data, we profit from a pretrained BERT encoder, which already provides a semantic “understanding” of the different parts of the given texts. After training on the weakly and strongly labelled data, we can segment the texts in different categories, as seen in the picture below.
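The role of the category-specific classification layer can be illustrated with a toy sketch. It assumes that the encoder has already produced one vector per token; the two-dimensional vectors, the weights in `HEADS`, and the category name “sockets” are invented for illustration and do not come from the project.

```python
def classify_tokens(token_vectors, head):
    """Apply one category-specific linear layer and pick the best class per token.

    head maps a class label to (weights, bias). Taking the argmax of the raw
    scores gives the same result as taking the argmax after a softmax.
    """
    predictions = []
    for vec in token_vectors:
        scores = {
            label: sum(w * x for w, x in zip(weights, vec)) + bias
            for label, (weights, bias) in head.items()
        }
        predictions.append(max(scores, key=scores.get))
    return predictions


# One hypothetical head per product category; all heads share the same encoder.
HEADS = {
    "sockets": {
        "background": ([0.0, 0.0], 0.5),
        "voltage": ([1.0, 0.0], 0.0),
        "length": ([0.0, 1.0], 0.0),
    }
}

tokens = ["Steckdose", "220", "Volt"]
# Stand-in for the encoder output: one vector per token (made-up numbers).
token_vectors = [[0.1, 0.0], [0.9, 0.1], [0.8, 0.2]]
labels = classify_tokens(token_vectors, HEADS["sockets"])
```

With these made-up numbers, “Steckdose” falls into the background class and both “220” and “Volt” are assigned to “voltage”. A single pass over the text yields labels for all attributes of the category at once.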
Our second BERT model uses a question-answering approach (BERTQ+A). Again, we make use of a pretrained BERT architecture and fine-tune the model to our specific use case. Here, we compute a separate BERT embedding for each pair of attribute and offer text. As shown in the image below, the attribute “Spannung (V)” is concatenated with an offer text “Steckdose 220 Volt [...]”.
For each “text piece” (or token) and attribute, the model answers the yes/no question “Is this token an attribute value of this attribute?”. This answer is given in the form of a binary probability that the token represents a corresponding attribute value.
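A minimal sketch of this question-answering setup is given below. It assumes the usual BERT sentence-pair layout with `[CLS]` and `[SEP]` markers, whitespace tokens instead of real subword tokens, and a decision threshold of 0.5; the helper names and the probabilities are hypothetical and only stand in for the model output.

```python
def build_pair(attribute, offer_tokens):
    """Concatenate the attribute name (the "question") with the offer text."""
    return ["[CLS]"] + attribute.split() + ["[SEP]"] + offer_tokens + ["[SEP]"]


def extract_values(offer_tokens, probabilities, threshold=0.5):
    """Merge consecutive tokens whose probability reaches the threshold."""
    values, current = [], []
    for token, p in zip(offer_tokens, probabilities):
        if p >= threshold:
            current.append(token)
        elif current:
            values.append(" ".join(current))
            current = []
    if current:
        values.append(" ".join(current))
    return values


offer_tokens = ["Steckdose", "220", "Volt"]
model_input = build_pair("Spannung (V)", offer_tokens)

# Made-up per-token answers to "Is this token a value of 'Spannung (V)'?"
probabilities = [0.02, 0.97, 0.91]
values = extract_values(offer_tokens, probabilities)
```

Here `values` contains the single span “220 Volt”. Every further attribute needs its own input pair and its own pass through the model.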
Because the attribute name (including its unit) influences the embedding of the offer text, the model can generalize to a basic degree to unseen attributes.
A typical output of the BERTQ+A model is presented below, where each attribute was identified separately. The BERTQ+A model is more powerful on this task than the BERTSeg model. However, because a separate prediction is needed for every attribute-offer pair, it is also slower.
The extraction performance of our solution surpasses the performance of the existing rule-based matching procedure. While we match the precision of the current system, we exceed it in terms of recall by a large margin.
This is a significant achievement, as rule-based procedures are very strict and therefore inherently have a high precision.
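For readers less familiar with the two metrics, the sketch below shows how precision and recall can be computed over sets of extracted (attribute, value) pairs. The function and all data are purely illustrative assumptions; the numbers are not results from the project.

```python
def precision_recall(predicted, gold):
    """Precision and recall over sets of (attribute, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up reference annotations for one offer text.
gold = {
    ("voltage", "220 V"),
    ("length", "1 cm"),
    ("width", "2 cm"),
    ("height", "3 cm"),
}

# A strict extractor finds few values, but all of them are correct.
strict = {("voltage", "220 V"), ("length", "1 cm")}
# A context-based extractor finds more values at the same precision.
contextual = {("voltage", "220 V"), ("length", "1 cm"), ("width", "2 cm")}

strict_scores = precision_recall(strict, gold)
contextual_scores = precision_recall(contextual, gold)
```

In this toy case both extractors reach a precision of 1.0, while recall rises from 0.5 to 0.75 because more of the reference values are found.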
The fact that the recall of our BERT-based solution is higher than that of the rule-based approach showcases a major advantage of machine-learning-based methods in the field of Information Extraction: there is no need to explicitly define every edge case that needs to be extracted. Instead, attribute values can be identified by their context, and consequently more values can be found.
This leads to an enriched product catalog with more structured information available for users.