BERT for question answering (Part 2)

Mattes Mollenhauer (PhD)


In the first part of this blog post we saw how the BERT architecture works in general. Now we are going to examine a particular practical application of BERT: automated question answering.

BERT for question answering: SQuAD

The SQuAD dataset is a benchmark problem for text comprehension and question answering models.

There are two commonly used versions. SQuAD 1.0/1.1 consists of roughly 100,000 questions related to snippets of about 500 Wikipedia articles containing the answers to the individual questions. The data is labeled, i.e., each question is annotated with the corresponding answer span (start and end position of the answer) within the associated text snippet.

SQuAD 2.0 contains the question/answer pairs from SQuAD 1.1 plus roughly 50,000 additional questions which are labeled as unanswerable with respect to the corresponding target text. Instead of only finding the correct answer span, the model now needs to assess whether a reasonable answer is present in the text at all before deducing the textual span of the answer.

An excerpt of the data:

"question": "When were the Normans in Normandy?",
    "id": "56ddde6b9a695914005b9629",
    "answers": [
        { "text": "10th and 11th centuries", "answer_start": 94 },
        { "text": "in the 10th and 11th centuries", "answer_start": 87 },
        { "text": "10th and 11th centuries", "answer_start": 94 },
        { "text": "10th and 11th centuries", "answer_start": 94 }
    ]

is an example question with potential answers to the context given by

"context": "The Normans (Norman: Nourmands; French:
Normands; Latin: Normanni) were the people who in the
10th and 11th centuries gave their name to Normandy, 
a region in France. They were descended from Norse 
(\"Norman\" comes from \"Norseman\") raiders and 
pirates from Denmark, Iceland and Norway who,
under their leader Rollo, agreed to swear fealty to
King Charles III of West Francia. Through generations
of assimilation and mixing with the native Frankish
and Roman-Gaulish populations, 
their descendants would gradually merge with the 
Carolingian-based cultures of West Francia. The
distinct cultural and ethnic identity of the Normans
emerged initially in the first half of the 10th
century, and it continued to evolve over the
succeeding centuries."

Note that multiple (partly redundant) answers can be given.

The original BERT paper proposed both SQuAD 1.1 as well as SQuAD 2.0 as tasks.

We first address the simpler task of determining the answer without a prior "impossibility" classification. Both the question (segment $$A$$) and the context data (segment $$B$$) are fed to the model, separated by the [SEP] token. Two vectors $$S \in \mathbb{R}^H$$ and $$E \in \mathbb{R}^H$$, representing the start and the end of the answer, are introduced as additional parameters during fine-tuning. For each of the tokens $$t_1, \dots, t_l$$ in segment $$B$$, the hidden vector representation $$T_i \in \mathbb{R}^H$$ is computed. Start and end probabilities are determined via softmax functions as

$$P_i^S := \frac{\exp(S \cdot T_i)}{ \sum_j \exp(S \cdot T_j) },\; P_i^E := \frac{\exp(E \cdot T_i)}{ \sum_j \exp(E \cdot T_j) }.$$
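These probabilities are plain softmax functions over the dot products of the token representations with the start and end vectors. A minimal numerical sketch (with made-up toy dimensions and random values in place of real BERT hidden states):

```python
import numpy as np

# Toy dimensions: hidden size H and l context tokens (not real BERT values).
H, l = 8, 5
rng = np.random.default_rng(0)

S = rng.normal(size=H)       # start vector, learned during fine-tuning
E = rng.normal(size=H)       # end vector, learned during fine-tuning
T = rng.normal(size=(l, H))  # hidden representations T_1, ..., T_l of segment B

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

P_start = softmax(T @ S)     # P_i^S = exp(S·T_i) / sum_j exp(S·T_j)
P_end = softmax(T @ E)       # P_i^E = exp(E·T_i) / sum_j exp(E·T_j)
```

Each of the two resulting vectors is a probability distribution over the context tokens, assigning every token a probability of being the start (respectively the end) of the answer.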

Training now proceeds by maximizing the objective

$$\sum_k (\log(P^S_{\xi_k}) + \log(P^E_{\zeta_k})),$$

where $$\xi_k$$ and $$\zeta_k$$ are the correct start and end indices of example $$k$$ in the chosen batch.

At inference time, start and end index candidates are predicted by solving the optimization problem

$$\max_{i,j} (S \cdot T_i + E \cdot T_j)$$

subject to the constraint $$i < j$$; the optimal indices under this score are returned as the final start and end points.
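This constrained maximization can be written out directly as a brute-force search over all ordered index pairs (a sketch with toy random values standing in for real hidden states):

```python
import numpy as np

# Toy setup mirroring the notation above: hidden size H, l context tokens.
H, l = 8, 5
rng = np.random.default_rng(1)
S, E = rng.normal(size=H), rng.normal(size=H)
T = rng.normal(size=(l, H))

start_scores = T @ S             # S · T_i for every token i
end_scores = T @ E               # E · T_j for every token j

best_score, best_span = -np.inf, None
for i in range(l):
    for j in range(i + 1, l):    # enforce the constraint i < j
        score = start_scores[i] + end_scores[j]
        if score > best_score:
            best_score, best_span = score, (i, j)
```

In practice the search is restricted to spans of a bounded maximum length, but the principle is the same: maximize the sum of the start and end scores over admissible index pairs.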

In the case of SQuAD 2.0, it first needs to be determined whether the question is answerable before performing the inference as above. To this end, the hidden representation of the [CLS] token is used. A "null answer" score is introduced analogously to the prediction score above as

$$S \cdot C + E \cdot C,$$

where $$C \in \mathbb{R}^H$$ is the hidden representation of the [CLS] token. The final model predicts an unanswerable question whenever

$$S \cdot C + E \cdot C + \tau \geq \max_{i,j} (S \cdot T_i + E \cdot T_j),$$

where $$\tau \in \mathbb{R}$$ is a precomputed threshold chosen such that this decision rule maximizes the $$F1$$ score on the training data. If the model decides that the question is answerable, the standard inference score on the right-hand side is used.
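The decision rule compares the null-answer score (shifted by the threshold) against the best non-null span score. A minimal sketch with toy values (the threshold is a placeholder here; in practice it is tuned as described above):

```python
import numpy as np

# Toy quantities in the notation of the text, plus the [CLS] representation C.
H, l = 8, 5
rng = np.random.default_rng(3)
S, E, C = rng.normal(size=H), rng.normal(size=H), rng.normal(size=H)
T = rng.normal(size=(l, H))
tau = 0.0                                # placeholder threshold

null_score = S @ C + E @ C               # score of the "null answer"

# Best non-null span score under the constraint i < j.
best_span_score = max(
    (T[i] @ S) + (T[j] @ E)
    for i in range(l) for j in range(i + 1, l)
)

# Predict "unanswerable" whenever the shifted null score dominates.
unanswerable = null_score + tau >= best_span_score
```

Raising $$\tau$$ makes the model more willing to abstain; lowering it makes the model answer more aggressively.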

Domain specific use case: biomedical research papers

BERT has been adapted to many particular domains. A specific example is BioBERT, which has been designed for biomedical text mining and is especially aimed at research papers. BioBERT is essentially a version of BERT which has been initialized with the original BERT weights and further pretrained on domain-specific biomedical text corpora, for example papers from the PubMed database. As a result, tasks like named entity recognition can be performed to detect abbreviations and technical biomedical terms in texts.

The BioBERT NER model can be tested for example in this web application. The model extracts named entities with respect to their corresponding categories, for example "drugs/chemicals" or "species".

A particularly interesting application is answering questions related to biomedical research papers with BioBERT based on the original idea of solving SQuAD 1.1/2.0 with BERT (see here for the original paper which proposed BioBERT for biomedical question answering).

The needed data for finetuning comes from the BioASQ challenge, which proposes a benchmark dataset for biomedical semantic indexing and question answering. The general idea is simple: just take the pretrained version of BioBERT and perform finetuning on the BioASQ data with the architectural design that is used for solving SQuAD 1.1/2.0 with the original version of BERT.

A distinction is generally made between yes/no questions and so-called factoid questions (that is, questions targeted at answers contained in specific components of the corresponding text, just as in SQuAD). Yes/no questions are answered by using a final sigmoid layer on top of the hidden representation $$C \in \mathbb{R}^H$$ of the [CLS] token, which predicts the probability of the answer "yes":

$$P_{yes} = \frac{1}{1 + \exp(-C \cdot W)},$$

where $$W \in \mathbb{R}^H$$ is the weight vector in the output layer of the model. As objective function, the binary cross entropy with respect to yes/no labels is used.
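This output layer is just a logistic regression on top of the [CLS] representation. A small sketch with random toy values in place of real hidden states:

```python
import numpy as np

H = 8
rng = np.random.default_rng(2)
C = rng.normal(size=H)                       # [CLS] hidden representation
W = rng.normal(size=H)                       # output-layer weight vector

p_yes = 1.0 / (1.0 + np.exp(-(C @ W)))       # P_yes = sigmoid(C · W)
answer = "yes" if p_yes >= 0.5 else "no"

# Binary cross-entropy loss against a yes/no label y in {0, 1}:
y = 1.0
loss = -(y * np.log(p_yes) + (1 - y) * np.log(1 - p_yes))
```

During fine-tuning, the binary cross-entropy computed this way is averaged over the batch and minimized with respect to $$W$$ and the encoder parameters.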

Practical application: COVID-19 and CORD-19

A specific application of BioBERT is knowledge extraction from the CORD-19 dataset, which contains a wide range of academic articles from various research fields including epidemiology and virology. CORD-19 was initiated as an open-source dataset for researchers looking for insights about the 2019 coronavirus pandemic. In a corresponding Kaggle challenge, a list of potentially relevant scientific questions about the novel coronavirus was published. The CORD-19 dataset is updated regularly and contains ~63,000 articles (most of them with full text) at the time of writing this article.

Let's see how the BioASQ model performs on CORD-19 papers when we ask it typical questions from the Kaggle challenge. For our first test, we choose the abstract of this paper:

G. Kampf, D. Todt, S. Pfaender, E. Steinmann, Persistence of coronaviruses on inanimate surfaces and their inactivation with biocidal agents, Journal of Hospital Infection, Volume 104, Issue 3, 2020, Pages 246-251, ISSN 0195-6701,

The answers to the questions which we will ask below are highlighted in bold.

Currently, the emergence of a novel human coronavirus, SARS-CoV-2, has become a global health concern causing severe respiratory tract infections in humans. Human-to-human transmissions have been described with incubation times **between 2-10 days**, facilitating its spread **via droplets, contaminated hands or surfaces**. We therefore reviewed the literature on all available information about the persistence of human and veterinary coronaviruses on inanimate surfaces as well as inactivation strategies with biocidal agents used for chemical disinfection, e.g. in healthcare facilities. The analysis of 22 studies reveals that human coronaviruses such as Severe Acute Respiratory Syndrome (SARS) coronavirus, Middle East Respiratory Syndrome (MERS) coronavirus or endemic human coronaviruses (HCoV) can persist on inanimate surfaces like metal, glass or plastic for **up to 9 days**, but can be efficiently inactivated by surface disinfection procedures with **62-71% ethanol, 0.5% hydrogen peroxide or 0.1% sodium hypochlorite** within **1 minute**. Other biocidal agents such as 0.05-0.2% benzalkonium chloride or 0.02% chlorhexidine digluconate are less effective. As no specific therapies are available for SARS-CoV-2, early containment and prevention of further spread will be crucial to stop the ongoing outbreak and to control this novel infectious thread.

It contains several specific answers to typical questions in terms of numerical values and is well suited for our scenario. We formulate our questions using synonyms not contained in the text, so that the model cannot immediately match the answer via a shared signal word. As an example, where the text contains the term "biocidal agent", we use "sanitizer" in our question; instead of "persist", we use "survive", and so on.

The model is able to associate different timespans contained in the text with the corresponding question.

Question: How long can the coronavirus survive on surfaces?

Answer: "up to 9 days"

Question: How long does it take to disinfect surfaces?

Answer: "1 minute"

Question: How long is the latent period?

Answer: "between 2-10 days"

Additionally, it correctly extracts series of items (even with corresponding agent concentration values in the case of sanitizers). 

Question: Which sanitizer is the best to disinfect surfaces?

Answer: "62-71% ethanol, 0.5% hydrogen peroxide or 0.1% sodium hypochlorite"

Question: How is the coronavirus spread?

Answer: "via droplets, contaminated hands or surfaces"

Although these results are really impressive, some caveats need to be considered. Since the BioASQ model we used is trained on SQuAD 1.1, it cannot decide whether the target text fails to provide information about a specific question. When we ask for information which is clearly not contained in the text, we simply get the answer span with the highest score. Depending on the context, these answers may make more or less sense:

Question: What are the risk factors for the coronavirus?

Answer: "droplets, contaminated hands or surfaces"

Question: How tall is the Eiffel Tower?

Answer: "between 2-10 days"

Additionally, it is not straightforward to perform text mining on a large number of papers at once with BioBERT. In order to extract answers from multiple papers, each question would need to be fed into the model together with every single paper from the corpus. Afterwards, an answer scoring would need to be performed to decide on the "best" possible answer contained in the corpus. Such an approach is typically suboptimal for reasons of computational inefficiency alone. However, alternatives have been developed exactly for the scenario of question answering over a large corpus of documents. This is beyond the scope of this article, but the reader may consult for example this paper for a real-time question answering model.
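The brute-force approach just described can be sketched as a loop over the corpus followed by picking the highest-scoring answer. Here `answer_with_score` is a hypothetical stand-in for a single-document QA model call; the word-overlap scoring below is a placeholder, not the real model:

```python
def answer_with_score(question, document):
    # Hypothetical placeholder: a real implementation would run the QA model
    # on (question, document) and return its answer span and span score.
    overlap = set(question.split()) & set(document.split())
    return document[:20], float(len(overlap))

def best_answer(question, corpus):
    # Run the model once per document, then keep the highest-scoring answer.
    scored = [answer_with_score(question, doc) for doc in corpus]
    return max(scored, key=lambda pair: pair[1])

corpus = [
    "coronaviruses persist on surfaces for up to 9 days",
    "the Eiffel Tower is located in Paris",
]
ans, score = best_answer("How long do coronaviruses persist on surfaces?", corpus)
```

The cost of this scheme grows linearly with the corpus size for every single question, which is exactly the inefficiency that indexed real-time approaches avoid.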


  • A Neural Named Entity Recognition and Multi-Type Normalization Tool for Biomedical Text Mining; Kim et al., 2019.

  • Attention Is All You Need; Vaswani et al., 2017.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Devlin et al., 2018.

  • BioBERT: a pre-trained biomedical language representation model for biomedical text mining; Lee et al., 2019.

  • Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation; Wu et al., 2016.

  • Neural machine translation by jointly learning to align and translate; Bahdanau et al., 2015.

  • Pre-trained Language Model for Biomedical Question Answering; Yoon et al., 2019.

  • Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index; Seo et al., 2019.

  • The Annotated Transformer; Rush, Nguyen and Klein.