Summary of Multi-Paragraph Reading Comprehension Paper

About Paper

Key Contributions


The recent success of neural models at answering questions given a related paragraph suggests that they have the potential to extract answers automatically from whole documents. Training and testing these models on document-level input is computationally expensive, so in practice a paragraph-level model is typically adapted to process document-level input.

Two basic approaches for the task

  1. Pipelined approaches select a single paragraph from the input documents, which is then passed to the paragraph model to extract an answer.
  2. Confidence-based methods apply the model to multiple paragraphs and return the answer with the highest confidence.
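
The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, not the paper's code: `rank_paragraph` and `read_paragraph` are hypothetical callables standing in for a paragraph ranker and a paragraph-level QA model that returns an (answer, confidence) pair.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions for illustration, not from the paper):
#   rank_paragraph(question, paragraph) -> relevance score (higher is better)
#   read_paragraph(question, paragraph) -> (answer text, confidence score)
Ranker = Callable[[str, str], float]
Reader = Callable[[str, str], Tuple[str, float]]


def pipelined_answer(question: str, paragraphs: List[str],
                     rank_paragraph: Ranker, read_paragraph: Reader) -> str:
    """Pick one paragraph first, then run the reader on that paragraph only."""
    best = max(paragraphs, key=lambda p: rank_paragraph(question, p))
    answer, _ = read_paragraph(question, best)
    return answer


def confidence_answer(question: str, paragraphs: List[str],
                      read_paragraph: Reader) -> str:
    """Run the reader on every paragraph and keep the most confident answer."""
    candidates = [read_paragraph(question, p) for p in paragraphs]
    answer, _ = max(candidates, key=lambda c: c[1])
    return answer
```

The pipelined variant runs the reader once, so its quality depends entirely on the ranker; the confidence variant runs the reader on every paragraph and therefore depends on confidence scores being comparable across paragraphs.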

Pipelined Method

Paragraph level QA model

The paragraph-level model above uses the following layers, which are color-coded on the right side of the figure.


The character-level and word-level embeddings are then concatenated and passed to the next layer. The model does not update the word embeddings during training.


A shared bi-directional GRU (Bi-GRU) is used to map the question and passage embeddings to context-aware embeddings.


The bi-directional attention mechanism is used to build a query-aware context representation.


The output of the previous layer is passed through another bi-directional GRU. The same attention mechanism is then applied again, this time between the passage and itself.


In the last layer of the model, the model computes answer start scores and answer end scores for each token.

Confidence Method

The authors adapted this model to the multi-paragraph setting by using the un-normalized and un-exponentiated score given to each span as a measure of the model’s confidence. The answer span with the highest confidence is selected, where confidence is measured as the sum of the span’s start score and end score, as shown below.

Paragraph level QA model
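
The confidence-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-token start and end scores are taken as given lists of floats, and the `max_len` cap on span length is a hypothetical parameter added here to keep the search small.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 8) -> Tuple[int, int, float]:
    """Return (start, end, confidence) of the highest-scoring span.

    Confidence is the raw sum start_scores[i] + end_scores[j] with
    i <= j < i + max_len; no softmax or exponentiation is applied.
    """
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


def best_span_over_paragraphs(
        paragraph_scores: List[Tuple[List[float], List[float]]]
) -> Tuple[int, int, int, float]:
    """Return (paragraph index, start, end, confidence) of the best span overall.

    paragraph_scores holds one (start_scores, end_scores) pair per paragraph.
    """
    results = []
    for idx, (starts, ends) in enumerate(paragraph_scores):
        i, j, conf = best_span(starts, ends)
        results.append((idx, i, j, conf))
    return max(results, key=lambda r: r[3])
```

Because the raw scores are compared directly across paragraphs, this selection only works well if the model's scores are calibrated between paragraphs.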

The two key reasons why the model’s confidence scores might not be well calibrated are:

  1. The pre-softmax scores for all spans can be arbitrarily increased or decreased by a constant value without changing the resulting softmax probability distribution. As a result, nothing prevents models from producing scores that are arbitrarily all larger or all smaller for one paragraph than another.
  2. If the model only sees paragraphs that contain answers, it might become too confident in heuristics or patterns that are only effective when it is known a priori that an answer exists.
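
The first point can be checked numerically. The sketch below uses made-up scores (not values from the paper) to show that adding a constant to every score leaves the softmax distribution unchanged while making the raw scores much larger.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative scores for two paragraphs; paragraph_b is paragraph_a
# shifted by a constant, so the model is no more "certain" about it.
paragraph_a = [2.0, 1.0, 0.5]
paragraph_b = [s + 100.0 for s in paragraph_a]

probs_a = softmax(paragraph_a)
probs_b = softmax(paragraph_b)
```

A per-paragraph softmax loss sees `probs_a` and `probs_b` as the same distribution, yet a selector that compares raw scores would always prefer `paragraph_b`.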

To address these problems, the authors experimented with four approaches for training models. In all four approaches, they sample paragraphs that do not contain an answer as additional training points.


In the shared-normalization approach, all paragraphs are processed independently, but a modified objective function is used in which the normalization factor of the softmax operation is shared between all the paragraphs from the same context. The key idea is that this forces the model to produce scores that are comparable between paragraphs.
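
A shared normalization factor can be sketched as below. This is a minimal illustration under assumptions, not the authors' implementation: only start scores are shown (end scores would be handled the same way), the function and argument names are hypothetical, and when an answer starts at several positions their probabilities are summed.

```python
import math
from typing import List


def log_sum_exp(values: List[float]) -> float:
    """Stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def shared_norm_loss(paragraph_scores: List[List[float]],
                     answer_positions: List[List[int]]) -> float:
    """Negative log-likelihood of the answer starts under one shared softmax.

    paragraph_scores[p][i] is the start score of token i in paragraph p.
    answer_positions[p] lists the token indices in paragraph p where an
    answer starts (empty if the paragraph contains no answer).
    """
    # The denominator covers every token of every paragraph in the context.
    all_scores = [s for scores in paragraph_scores for s in scores]
    answer_scores = [paragraph_scores[p][i]
                     for p, positions in enumerate(answer_positions)
                     for i in positions]
    if not answer_scores:
        raise ValueError("at least one paragraph must contain an answer")
    return log_sum_exp(all_scores) - log_sum_exp(answer_scores)
```

With this objective, raising every score in a paragraph without an answer increases the loss, so the model is pushed to keep scores on a common scale across paragraphs.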


The authors also experimented with concatenating all paragraphs sampled from the same context during training. The motivation is to test whether showing the model more text at once improves performance.

No-Answer option

Another method the authors tried allows the model to select a special “no-answer” option for each paragraph.
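
One plausible way to realize such an option is to let an extra score compete with the span scores in the same softmax. The sketch below assumes exactly that; how the no-answer score is produced is not covered here, so `no_answer_score` is simply taken as an input, and all names are hypothetical.

```python
import math
from typing import List, Optional


def log_sum_exp(values: List[float]) -> float:
    """Stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def no_answer_loss(span_scores: List[float], no_answer_score: float,
                   gold_span: Optional[int]) -> float:
    """Negative log-likelihood with an extra "no-answer" candidate.

    span_scores[k] is the score of candidate span k in the paragraph.
    gold_span is the index of the correct span, or None if the paragraph
    contains no answer (then the no-answer candidate is the target).
    """
    all_scores = span_scores + [no_answer_score]
    target = no_answer_score if gold_span is None else span_scores[gold_span]
    return log_sum_exp(all_scores) - target
```

For a paragraph without an answer, the loss falls as `no_answer_score` rises relative to the span scores, which gives the model an explicit way to express that nothing in the paragraph should be selected.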


The authors also considered training the model with a sigmoid cross-entropy loss as the objective function. Since the scores are then computed independently of one another, they will be comparable between different paragraphs.
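
A sigmoid cross-entropy objective over per-token scores can be sketched as follows. This is a minimal illustration assuming binary labels (1 where an answer starts, 0 elsewhere); the function name and the label encoding are assumptions made here.

```python
import math
from typing import List


def sigmoid_cross_entropy(scores: List[float], labels: List[int]) -> float:
    """Sum of independent binary cross-entropy terms, one per token.

    Uses the numerically stable form
        max(x, 0) - x * y + log(1 + exp(-|x|)),
    which equals -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    """
    return sum(max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
               for x, y in zip(scores, labels))
```

Each token's term depends only on its own score, so no normalization over the paragraph is involved and shifting all scores by a constant changes the loss.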



When a paragraph-level QA model is applied across multiple paragraphs, training on sampled paragraphs that do not contain an answer, together with a shared-norm objective function, can be very beneficial. Combining this with the suggested paragraph selection methods, the summed training objective, and the proposed model design advances the state of the art on TriviaQA by a large margin.