Machine reading has become an important field of research, partly because of the huge amount of unstructured textual data now available on the web. Being able not only to extract relevant information from this data automatically but also to understand it is a challenging yet very valuable capability.
How well a system reads is most often evaluated using question answering as a proxy task, in which the reading model has to answer questions that come with associated pieces of text. Deep reading models have recently demonstrated promising results, and several have already shown superhuman performance on some popular reading corpora [1, 2].
However, in existing datasets [1, 2] the information relevant to a given question is often concentrated in a very narrow part of each document. This favours machine reading models designed to learn how to cleverly extract a span of the source document, based on its similarity with the question, rather than looking for evidence throughout the entire document. An efficient reading machine should be able to detect the passages in a document that are relevant to a question but, more importantly, it should be able to reason across those passages to inform its answers.
To tackle this limitation, we propose ReviewQA, a relational reasoning question-answering dataset based on hotel reviews. The dataset leverages a corpus of hotel reviews originally proposed in [3, 4]. It contains more than 500 000 natural language questions over a total of 100 000 hotel reviews. The questions in ReviewQA are designed so that answering them requires different competencies, and each question comes with an associated type that characterises the required competency. You can use the corpus to benchmark state-of-the-art models (as we have done) and get an overview of their strengths and weaknesses on the set of tasks. In contrast to most recent datasets, in ReviewQA the answer to a question doesn’t need to be extracted from a document. It’s selected from a set of candidates that contains all the possible answers to the questions of the dataset.
Overview of tasks
We propose the evaluation of 8 reasoning competencies based on hotel reviews. Table 1 below describes these tasks with one possible corresponding question.
How ReviewQA was constructed
To construct ReviewQA, we leveraged a dataset of hotel reviews originally proposed for a sentiment analysis task [3, 4]. Each individual review comes with a set of aspects, each rated from 1 to 5 stars, as shown in the image below.
We use this data to automatically generate natural language questions based on the ratings. The questions in our corpus evaluate the ability of a machine reader to understand a review and the relations between the different aspect ratings. Note that ReviewQA no longer contains explicit information on the ratings (the stars), only the user-generated comment together with the associated natural language questions, as shown below.
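To make the idea of generating questions from ratings more concrete, here is a minimal sketch in Python. It is not the generation procedure used for ReviewQA: the function name `generate_questions`, the two question templates and the aspect names are all hypothetical and chosen only for illustration. It assumes each review's ratings are available as a mapping from aspect name to a star value between 1 and 5.

```python
from typing import Dict, List, Tuple


def generate_questions(ratings: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs derived only from the aspect ratings.

    Both templates below are hypothetical examples, not the ReviewQA templates.
    """
    qa_pairs: List[Tuple[str, str]] = []
    aspects = sorted(ratings)  # fixed order so the output is deterministic

    # Single-aspect question: the answer is the rating itself.
    for aspect in aspects:
        qa_pairs.append((f"What is the rating of the aspect {aspect}?", str(ratings[aspect])))

    # Relational question: compare every pair of aspects.
    for i, first in enumerate(aspects):
        for second in aspects[i + 1:]:
            if ratings[first] == ratings[second]:
                continue  # skip ties so the answer is unambiguous
            better = first if ratings[first] > ratings[second] else second
            qa_pairs.append((f"Which aspect is rated higher, {first} or {second}?", better))

    return qa_pairs


# Hypothetical ratings attached to one review.
example_ratings = {"service": 5, "location": 3, "cleanliness": 5}
for question, answer in generate_questions(example_ratings):
    print(question, "->", answer)
```

In a setup like this, only the review text and the generated question would be shown to the reader model; the ratings themselves stay hidden and serve only to produce the gold answers.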
ReviewQA contains more than 500 000 questions over 100 000 reviews. The task is ‘projective machine reading’ (whereby an answer doesn’t have to be extracted from a document but selected from among all the possible answers in the dataset). Each review contains on average around 200 words.
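The difference between projective and extractive reading can be sketched in a few lines of Python. The snippet below is only an illustration under stated assumptions: the candidate list `CANDIDATES` is a hypothetical stand-in for the real answer set, and the scores are assumed to come from some model that has already encoded the review and the question. The point is that the output is a distribution over a fixed, dataset-wide set of answers rather than a span of the document.

```python
import math
from typing import List, Sequence

# Hypothetical global answer set; in practice it would be built once from
# all the answers that occur in the dataset.
CANDIDATES = ["1", "2", "3", "4", "5", "yes", "no", "service", "location"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_answer(scores: Sequence[float], candidates: Sequence[str] = CANDIDATES) -> str:
    """Select the most probable answer among the fixed candidates.

    `scores` holds one unnormalised score per candidate, as a model would
    produce after reading the review and the question.
    """
    if len(scores) != len(candidates):
        raise ValueError("expected exactly one score per candidate answer")
    probabilities = softmax(scores)
    best = max(range(len(candidates)), key=probabilities.__getitem__)
    return candidates[best]


# Made-up scores for one (review, question) pair.
print(project_answer([0.1, 0.2, 0.0, 2.5, 0.3, 0.0, 0.0, 0.0, 0.0]))
```

An extractive reader would instead predict start and end positions inside the review, which is not possible here when the answer (a rating, for example) never appears verbatim in the text.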
Models and results
Table 2 presents the results of the 4 models we tested on the dataset: a logistic regression, an LSTM network [5], an End-to-End Memory Network [6] and a Deep projective reader. The latter is a projective reader we designed from the building blocks of an R-Net [7], a state-of-the-art extractive reader. An overview of its architecture is displayed in Figure 2 and more details can be found in our paper ‘ReviewQA: a relational aspect-based opinion reading dataset’.
All the models were jointly trained on the 8 tasks and we present the results on the overall dataset as well as for each individual task.
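For readers who want to produce the same kind of breakdown for their own model, the following Python sketch computes an overall accuracy together with one accuracy per task. It is an assumption about how such an evaluation could be organised, not our evaluation script: the function name `accuracy_by_task` and the `(task_id, predicted, gold)` tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def accuracy_by_task(
    examples: Iterable[Tuple[Hashable, str, str]],
) -> Tuple[float, Dict[Hashable, float]]:
    """Compute overall and per-task accuracy.

    Each example is a (task_id, predicted_answer, gold_answer) tuple.
    """
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)

    for task_id, predicted, gold in examples:
        total[task_id] += 1
        correct[task_id] += int(predicted == gold)

    per_task = {task_id: correct[task_id] / total[task_id] for task_id in total}
    n_examples = sum(total.values())
    overall = sum(correct.values()) / n_examples if n_examples else 0.0
    return overall, per_task


# Made-up predictions for two tasks.
predictions = [(1, "yes", "yes"), (1, "no", "yes"), (2, "5", "5")]
overall, per_task = accuracy_by_task(predictions)
print(f"overall: {overall:.3f}")
for task_id, accuracy in sorted(per_task.items()):
    print(f"task {task_id}: {accuracy:.3f}")
```

Note that the overall figure is computed over all questions, so tasks with more questions weigh more heavily than in a simple average of the per-task accuracies.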
Download the ReviewQA dataset.
Alternatively, it can be found on GitHub. As explained earlier, to construct ReviewQA we used a dataset of hotel reviews originally proposed in [3, 4] for a sentiment analysis task. The original reviews can be downloaded at: http://www.cs.virginia.edu/~hw5x/Data/LARA/TripAdvisor/TripAdvisorJson.tar.bz2
We hope ReviewQA will encourage the research community to further develop reasoning models and evaluate them on this set of tasks.
[1] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev and Percy Liang, SQuAD: 100,000+ Questions for Machine Comprehension of Text.
[2] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder and Li Deng, MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.
[3] Hongning Wang, Yue Lu and ChengXiang Zhai, Latent aspect rating analysis without aspect keyword supervision.
[4] Hongning Wang, Yue Lu and ChengXiang Zhai, Latent aspect rating analysis on review text data: a rating regression approach.
[5] Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory.
[6] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus, End-To-End Memory Networks.
[7] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang and Ming Zhou, Gated Self-Matching Networks for Reading Comprehension and Question Answering.