Enhancing Search with Question Answering

Angela Maennel


What machine learning papers call open-domain question answering is nothing other than answering a question based on a large collection of texts, for example answering a website visitor's question using the website's content.

Due to recent progress in machine reading comprehension, open-domain question answering systems have drastically improved. They used to rely on redundancy of information but now they are able to “read” more carefully. Modern systems are able to quote a section of text that answers the question or even reformulate it. What is still an aspiration is to generate longer, paragraph-length answers or to use multiple sources to puzzle together an answer.

[Figure: example of question answering with machine learning]

Google recently built such a feature into its search engine. If it finds a passage that answers the question typed into the search field, the first result shows the corresponding website with the passage highlighted.

There are many different systems that tackle open-domain question answering. Here I will go into detail on one system in particular, DrQA (Chen et al. 2017). This system splits the task into two parts, for each of which it is easier to get data than for the combined task. I will also explain how this idea can be used to create a question answering system for a website from an already existing search function.

Quick Overview of DrQA

The input for DrQA is a database of texts and a question. The system tries to find a passage in the collection of texts that answers the question. The output is this passage.

The task of finding a fitting passage is split into two parts:

  1. A Document Retriever that selects the small number of documents that are most relevant to the question.

  2. A Document Reader that extracts a passage from each document passed on by the Document Retriever. This should be the passage in the document that is most relevant to the question, and the Document Reader should also give each passage a score for how well it answers the question.

The passage with the highest score is then displayed to the user. In the DrQA system the Document Retriever always retrieves exactly 5 documents, but depending on the Document Retriever, one might want to adjust this number.
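The two-stage idea can be sketched in a few lines of Python. This is only an illustration of the control flow, not DrQA's actual code: the `Retriever` and `Reader` interfaces and the function name `answer` are assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (not part of DrQA):
# a retriever maps (question, k) to a list of up to k documents,
# a reader maps (question, document) to a (passage, score) pair.
Retriever = Callable[[str, int], List[str]]
Reader = Callable[[str, str], Tuple[str, float]]


def answer(question: str, retrieve: Retriever, read: Reader,
           k: int = 5) -> Tuple[str, float]:
    """Return the best-scoring passage among the top-k retrieved documents."""
    documents = retrieve(question, k)
    if not documents:
        return "", 0.0  # nothing retrieved, so there is nothing to read
    candidates = [read(question, doc) for doc in documents]
    # Keep the (passage, score) pair with the highest score.
    return max(candidates, key=lambda pair: pair[1])
```

The default `k = 5` mirrors the number of documents DrQA retrieves; with a different retriever another value may work better.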

How to use Question Answering to enhance your search

Having not only a search function but a question answering system can save users a lot of time and headaches. Instead of having to go through the search results manually and scan the pages for what they are looking for, users are shown the page and passage where the answer is located.

In case there is no explicit answer, or the document reader can’t find the answer, the usual search results can be shown.
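One simple way to implement this fallback is a score threshold. The sketch below is an assumption about how such a rule could look; the function name `respond`, the response format and the threshold value `0.5` are invented for illustration and would need tuning on real data.

```python
from typing import Dict, List, Optional, Tuple


def respond(best: Optional[Tuple[str, float]], search_results: List[str],
            min_score: float = 0.5) -> Dict[str, object]:
    """Show the passage only if the reader is confident enough."""
    # `best` is the reader's (passage, score) pair, or None if it found nothing.
    if best is not None and best[1] >= min_score:
        return {"type": "answer", "passage": best[0], "results": search_results}
    # Otherwise fall back to the usual search results.
    return {"type": "search", "results": search_results}
```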

If you already have a website search function, it can be turned into a question answering system by supplementing it with a suitable document reader. Depending on the content, the document reader can be trained on public question answering datasets based on Wikipedia, or the language model might have to be fine-tuned. For fine-tuning, text from the website or publicly available texts that are similar to the website's content can be used.

If you are now curious and want to see a Document Reader in action, check out our question answering demo.

If you are curious about how a Document Retriever or Document Reader is set up you can find that out in the next two sections of this blog post.

Document Retriever

Since we assume the database to be large, the Document Retriever has relatively little time to check if any one document is relevant for a given question. Since we also assume that the database won’t change (at least not quickly), one can extract features from each document and store them to make the response time shorter.

This is much like coming up with a title for a scientific article: while it takes some time to come up with a good name, it makes it a lot easier for people to find the articles that are relevant to their research question.

Here, rather than titles, vectors are usually used. To compare a document to a question, the question is also represented as a vector, and the angle or dot product between the two vectors is then commonly used as a measure of relevance (if you want to know more about how and why vector representations are used, I recommend reading this blog post). There are a variety of different Document Retrievers, but they all follow this core principle. The main difference between them is in how they decide on the representation of documents and questions.
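As a small illustration of this comparison step, the following sketch ranks documents by the cosine of the angle between their vectors and the question vector. The function names and the plain-list vector format are assumptions chosen for this example; a real system would use sparse matrices or a vector index.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def rank(question_vec: Sequence[float],
         doc_vecs: Sequence[Sequence[float]], k: int = 5) -> List[int]:
    """Return the indices of the k documents most similar to the question."""
    scores = [cosine(question_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Because the document vectors can be computed once and stored, only the question vector and the similarity scores have to be computed at query time.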

A feature that is commonly used is Term Frequency-Inverse Document Frequency (TF-IDF for short). The heuristic behind it is that if a term such as “neural network” appears in the question and a given document, but is rare across all documents, then this should be a strong indication that the document is relevant. If, on the other hand, the term “neural network” is contained in all documents (because they are, for example, from a machine learning conference), then this piece of information says little about the relative relevance of the document.

This can be used to create a vector representation in the following way: We take a set of terms $$T$$ (each term appears at least once in the collection of documents). We will represent documents and questions as $$|T|$$-dimensional vectors, one entry corresponding to each term. For the question an entry corresponding to a term $$t$$ is $$1$$ if the term appears in the question and otherwise $$0$$. For a document $$d$$ the entry corresponding to a term $$t$$ is equal to the following expression:

$$\mathrm{TF}{\text -}\mathrm{IDF}(t,d,D) = \mathrm{TF}(t,d) \cdot \mathrm{IDF}(t,D) = \frac{freq_{t,d}}{\sum_{t' \in d} freq_{t', d}} \cdot \log\left(\frac{|D|}{|\{d' \in D: t \in d' \}|}\right)$$

  • $$freq_{t,d}$$ is the number of times the term $$t$$ appears in the document $$d$$

  • $$D$$ is the collection of all documents
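To make the formula concrete, here is a minimal sketch that computes TF-IDF entries for a small collection of already tokenised documents. It assumes the common convention of normalising the term count by the total number of term occurrences in the same document; the function name and the dictionary-based sparse representation are choices made for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)          # freq_{t,d}
        total = sum(counts.values())   # sum of freq_{t',d} over all terms in d
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

A term that occurs in every document gets the weight $$\log(1) = 0$$, which matches the heuristic that such a term says nothing about relative relevance.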

Very recently, training neural networks to come up with vector representations has yielded even better results than using algorithms with “hand-picked” features, like TF-IDF (see for example Karpukhin et al. 2020). However, these neural approaches need large amounts of training data and computing power to come up with their representations, while algorithms using “hand-picked” features already deliver good results.

In DrQA, a TF-IDF representation was used, where the set of terms was made up of single words and terms consisting of two consecutive words occurring in the documents.
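A term set of this kind (single words plus pairs of consecutive words) could be extracted roughly as follows. The lowercasing and whitespace tokenisation are simplifying assumptions for this sketch, not a description of DrQA's tokeniser.

```python
from typing import List


def terms(text: str) -> List[str]:
    """Return all single words and all pairs of consecutive words in text."""
    words = text.lower().split()
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + bigrams
```

For the input `"Neural network models"` this yields the three single words plus the two-word terms `"neural network"` and `"network models"`.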

In principle, any search algorithm can be used as a Document Retriever. A website usually offers more than unstructured text. Using structures such as titles and links might be one way to improve the Document Retriever. If it is an established website, user behaviour might also help.

Document Reader

As I mentioned earlier, the Document Reader is the part where significant progress was made. Deep learning played a key role in this development, as did a neural network architecture component called an attention mechanism, or just attention. Using an attention mechanism was first proposed by Bahdanau et al. in 2015. The idea behind it is to let the neural network learn to focus its attention on certain key features, such as the subject and verb in a sentence or the faces of humans in pictures.

A very successful language model called BERT heavily relies on such attention mechanisms. My colleague Mattes Mollenhauer wrote a blog post on how to use BERT to create a document reader. Mattes also recorded a webinar if you prefer video format.

In the following let me explain how the document reader used in DrQA is set up. It can be split into three parts:

  • a paragraph encoding,

  • a question encoding and

  • calculating a prediction from these encodings.

Before any encodings are applied, the sentences are split into words, punctuation is removed and beginning/end of sentence markers are added.

Like most neural networks, the ones used here require a fixed input size. That is why a maximum length is chosen for both the question and the context (document). If the question or context is shorter than the maximum length, “empty” markers are added to the end. For the question, a relatively long maximum length is chosen (e.g. 100 words). A question that is too long would be cut off, but in practice this almost never happens. Documents tend to be long and can vary greatly in length, and an overly long maximum length tends to decrease performance. Therefore the document is split into paragraphs of at most the maximum length (300-400 words often works well).
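The padding and splitting described here could look roughly like this. The marker string `"<pad>"` and the function names are assumptions made for this sketch, and splitting at fixed positions ignores sentence boundaries, which a real system would probably respect.

```python
from typing import List

PAD = "<pad>"  # hypothetical "empty" marker


def pad_or_truncate(tokens: List[str], max_len: int) -> List[str]:
    """Cut off tokens beyond max_len, or fill up with PAD markers."""
    return tokens[:max_len] + [PAD] * max(0, max_len - len(tokens))


def split_into_chunks(tokens: List[str], max_len: int) -> List[List[str]]:
    """Split a long document into consecutive chunks of exactly max_len tokens."""
    return [
        pad_or_truncate(tokens[start:start + max_len], max_len)
        for start in range(0, len(tokens), max_len)
    ]
```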

Paragraph encoding

Paragraphs are translated into a sequence of vectors. To do this, first a word embedding is used (a learned map from words to vectors). Note that a word embedding will map the word “bank” in “river bank” and “bank credit” to the same vector, even though they have quite different meanings. The sequence of vectors is then refined by a bidirectional RNN (Recurrent Neural Network). This neural network architecture was especially designed to take the context (on both sides, hence “bidirectional”) into account. If you would like to learn more about RNNs, I can recommend the webinar of my colleague Fabian Gringel. While the word embedding is a more general representation, this refinement is trained on the question answering task and therefore extracts features that are important for deciding whether a word is part of the passage that answers the question.

Usually a word embedding trained on one language processing task works well on a wide range of language processing tasks. Since question answering data is time-consuming to label, the word embedding is trained on a task for which data can be labeled automatically. Most commonly the task of guessing a blacked-out word in a text is used. To create labeled data for this task, one can take any text, choose words at random, save these words separately as labels and then black them out in the text.
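Creating such automatically labeled data takes only a few lines. In the sketch below the marker `"<mask>"`, the function name and the masking probability of 0.15 are illustrative assumptions.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "<mask>"  # hypothetical marker for a blacked-out word


def make_masked_example(words: List[str], mask_prob: float = 0.15,
                        rng: Optional[random.Random] = None
                        ) -> Tuple[List[str], Dict[int, str]]:
    """Black out random words; return the masked text and the hidden words."""
    rng = rng or random.Random()
    masked: List[str] = []
    labels: Dict[int, str] = {}  # position -> original word
    for position, word in enumerate(words):
        if rng.random() < mask_prob:
            labels[position] = word  # save the label before blacking out
            masked.append(MASK)
        else:
            masked.append(word)
    return masked, labels
```

No human labeling is needed: the labels are simply the words that were removed.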

For the paragraph encoder, a pre-trained word embedding is used, but several features are added which encode

  • if the word is part of the question,

  • if it is similar to a word in the question,

  • the frequency of the word across all documents,

  • the type of token it is (beginning/end marker, part of speech or named entity).

To check whether a word is similar to a word in the question, a type of attention mechanism is used.
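Two of these features can be sketched as follows: a flag for words that also occur in the question, and a simplified attention-style "soft" match. The attention variant here uses plain dot products between word vectors and has no learned parameters, which is a simplification for illustration and not the exact mechanism used in DrQA; all function names are assumptions.

```python
import math
from typing import List, Sequence


def exact_match(paragraph: Sequence[str], question: Sequence[str]) -> List[float]:
    """1.0 for every paragraph word that also occurs in the question."""
    question_words = {word.lower() for word in question}
    return [1.0 if word.lower() in question_words else 0.0 for word in paragraph]


def softmax(scores: Sequence[float]) -> List[float]:
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aligned_question_embedding(word_vec: Sequence[float],
                               question_vecs: Sequence[Sequence[float]]
                               ) -> List[float]:
    """Attention-weighted average of the question word vectors.

    Question words whose vectors are similar to word_vec get a high weight.
    """
    scores = [sum(a * b for a, b in zip(word_vec, q)) for q in question_vecs]
    weights = softmax(scores)
    return [
        sum(w * q[j] for w, q in zip(weights, question_vecs))
        for j in range(len(word_vec))
    ]
```

Unlike the exact-match flag, the soft match can also react to words that are similar to, but not identical with, a question word, such as "car" and "vehicle".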

Question Encoding

The Question Encoding is quite similar: again the words in the question are translated to vectors, this time using a pre-trained word embedding without any added features. Then the representation of the question as a sequence of vectors is refined using an RNN. Finally the sequence of vectors is combined into a single vector by taking a weighted sum.


To get a prediction from the question vector and the sequence of vectors representing a paragraph, two classifiers are trained: one that classifies whether a given position is the start of the relevant passage and one that classifies whether it is the end.

With these two classifiers, the scores for all position pairs $$(i,j)$$ that are no further than a maximum passage length $$m$$ apart are calculated and compared. The score for a pair $$(i,j)$$ is equal to the product of the estimated probability that $$i$$ is the start of a passage that answers the question and the estimated probability that $$j$$ is the end of the passage.
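Given the two lists of estimated probabilities, the search over position pairs can be written as a simple double loop. This sketch additionally assumes that the end position may not lie before the start position, and the function name is made up for this example.

```python
from typing import Sequence, Tuple


def best_span(start_probs: Sequence[float], end_probs: Sequence[float],
              max_len: int) -> Tuple[int, int, float]:
    """Return (i, j, score) maximising start_probs[i] * end_probs[j]
    over all pairs with i <= j <= i + max_len."""
    best_i, best_j, best_score = 0, 0, -1.0
    for i, p_start in enumerate(start_probs):
        last = min(i + max_len, len(end_probs) - 1)
        for j in range(i, last + 1):
            score = p_start * end_probs[j]
            if score > best_score:
                best_i, best_j, best_score = i, j, score
    return best_i, best_j, best_score
```

With start probabilities `[0.1, 0.7, 0.2]`, end probabilities `[0.1, 0.2, 0.7]` and `max_len = 1`, the best pair is `(1, 2)` with a score of about `0.49`.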

Note that for almost all of these pairs the score will be very low, since often the relevant passage is not contained in the paragraph at all.

The pair with the highest score is chosen and then compared to the passages with the highest score from other paragraphs in the same document. The passages with the highest score out of each document are compared and the one with the highest score out of these five is then displayed to the user.

While it might seem like a minor detail that for each passage a score is needed, it is actually a lot more difficult to give a score to each passage that can be compared across different documents than to just select the most relevant passage out of each paragraph.


While the goal of creating a system that is capable of answering complex questions based on a collection of texts is still out of reach, there has been much progress in that direction. Current question answering systems can answer questions whose answer can be found explicitly in one of the texts. Since this is often the case when using a website search, including a question answering system lets users find the information they are looking for more quickly.