DuReader: A Chinese Machine Reading Comprehension Dataset from Real-World Applications

Wei He, Kai Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang,
Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, Haifeng Wang
Baidu Inc., Beijing, China
{hewei06, liukai20, lvyajuan, zhaoshiqi, xiaoxinyan, liuyuan04, wangyizhong01,
wu_hua, sheqiaoqiao, liuxuan, wutian,

Abstract

In this paper, we introduce DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, aiming to tackle real-world MRC problems. In comparison with prior datasets, DuReader has the following characteristics: (a) the questions and the documents are all extracted from real application data, and the answers are human generated; (b) it provides rich annotations for question types, especially yes-no and opinion questions, which account for a large proportion of real users' questions but have not been well studied before; (c) it provides multiple answers for each question. The first release of DuReader contains 200k questions, 1,000k documents, and 420k answers, which, to the best of our knowledge, is the largest Chinese MRC dataset so far. Experimental results show that there is a large gap between the state-of-the-art baseline systems and human performance, which indicates that DuReader is a challenging dataset that deserves future study. The dataset and the code of the baseline systems are publicly available. (The DuReader dataset is available at http://ai.baidu.com/broad/subordinate?dataset=dureader; the code of the baseline systems is open-sourced at https://github.com/baidu/DuReader.)

1 Introduction

For human beings, reading comprehension is a basic ability for acquiring knowledge. We believe it is one of the crucial abilities a machine must have in order to acquire knowledge by reading the whole web and to answer open-domain questions. Such an ability is considered to be of great value for next-generation search engines and intelligent agent products. However, Machine Reading Comprehension (MRC) is extremely challenging, since it involves several difficult tasks such as comprehension, inference and summarization.

Recently, several MRC datasets have been released, greatly inspiring research in this field. A series of neural network models, such as Match-LSTM [\citenameWang and Jiang2017], BiDAF [\citenameSeo et al.2016] and R-net [\citenameWang et al.2017], have been proposed, achieving promising results on a variety of MRC evaluation tasks.

Table 1: Comparison of some properties of existing datasets (CNN/Daily Mail [\citenameHermann et al.2015], HFL-RC [\citenameCui et al.2016], RACE [\citenameLai et al.2017], NewsQA [\citenameTrischler et al.2017], SQuAD [\citenameRajpurkar et al.2016], TriviaQA [\citenameJoshi et al.2017], MS-MARCO [\citenameNguyen et al.2016]) vs. DuReader.
Table 2: Examples for question types from two views.

However, most existing MRC datasets have limitations due to their synthetic data, simplified tasks or constrained domains. Therefore, studies on these datasets still differ from real-world comprehension tasks. In particular, cloze-style MRC [\citenameHermann et al.2015, \citenameHill et al.2015, \citenameCui et al.2016] simplifies the task into word prediction on synthesized cloze data. Multiple-choice MRC [\citenameLai et al.2017] tests comprehension ability via option selection on examination data. Question-answering-based MRC [\citenameTrischler et al.2017, \citenameRajpurkar et al.2016, \citenameJoshi et al.2017] usually casts reading comprehension as the prediction of a span in a news article, a Wikipedia entry or another document for a given question. Although such simplifications and constraints facilitate data construction and model design, they bring some undesired problems. By analyzing questions real users submitted to the Baidu search engine, we found that current datasets cover only some types of questions, leaving other types, such as opinion questions and complex description questions, not well studied. Furthermore, recent studies [\citenameChen et al.2016, \citenameJia and Liang2017] have shown that current MRC models can achieve high performance on many of these datasets with limited comprehension or inference ability.

Therefore, it is necessary to build real-world, open-domain reading comprehension datasets. An English dataset, MS-MARCO [\citenameNguyen et al.2016], was released under this consideration, in which the questions and documents were collected from a search engine and the answers were generated by human annotators. In this paper, we propose DuReader, a new large-scale, human-annotated Chinese MRC dataset, aiming to tackle real-world MRC problems. Besides the merit that its questions are open-domain and extracted from real application data, DuReader has the following characteristics compared with previous datasets.

  1. DuReader provides rich annotations for question types. In particular, DuReader annotates yes-no and opinion questions, which account for a large proportion of real users' questions but have not been well studied before. Answering opinion questions usually requires inference over and summarization of multiple pieces of evidence, which is challenging even for humans.

  2. DuReader collects documents from the search results of the Baidu search engine and from a community question answering site, Baidu Zhidao. All the content on Zhidao is contributed by its users, making its documents more colloquial and different from common web pages.

  3. DuReader provides multiple answers for each question. Most previous datasets have only one answer for each question, while in the real world, one question may have one or several answers depending on what the question is. DuReader can thus reflect real-world applications better than other datasets.

The first release of DuReader contains 200k questions, 1M documents and more than 420k human-summarized answers. To the best of our knowledge, DuReader is the largest Chinese MRC dataset so far. The comparison of some key properties of DuReader and the existing datasets is shown in Table 1.

We implemented two state-of-the-art MRC models, i.e., Match-LSTM [\citenameWang and Jiang2017] and BiDAF [\citenameSeo et al.2016], on DuReader. We find that the performance of these models is far inferior to that of humans, which suggests there is large room for researchers to improve MRC models on the DuReader dataset.

2 Analysis of Questions in Search Engine

In this section, we analyze the distribution of questions in Baidu search data. We randomly sample 1,000 questions from one day's search log, and manually annotate the questions from two different views. From the view of the answer type a question belongs to, we classify the questions into three kinds: Entity, Description and YesNo. For Entity questions, the answers are expected to be a single entity or a list of entities. For Description questions, the answers are usually multi-sentence summaries. This kind of question includes how/why questions, questions comparing the functions of two or more objects, questions inquiring about the merits/demerits of goods, etc. As for YesNo questions, the answers are expected to be affirmative or negative, with supporting evidence.

After a deep investigation into the questions, we found that whichever answer type a question belongs to, it can be further classified as Fact or Opinion, depending on whether it is about a fact or an opinion. (According to the definition of opinion in Wikipedia (https://en.wikipedia.org/wiki/Opinion), an opinion is a judgment, viewpoint, or statement that is not conclusive. It may deal with subjective matters in which there is no conclusive finding. What distinguishes fact from opinion is that facts are more likely to be verifiable, i.e., can be agreed to by the consensus of experts.) Table 2 shows some examples.

For each question in the sampled data, we label it from two views: one is the answer type it belongs to, the other is whether it is about fact or opinion. In this way, the questions can be classified into six classes. The distribution of the questions in the sampled data is shown in Table 3.

Table 3: Distribution of question types in Baidu search data.

From the distribution, some interesting phenomena are observed:

  1. The Entity-Fact questions, also known as factoid questions, which have been widely studied in previous work, account for only 23.4%.

  2. Over half of the questions (52.5%) are Description questions. Previous studies mostly focus on Description-Fact questions.

  3. YesNo questions account for 15.6%, with one half about fact and the other half about opinion.

  4. More than one-third of the questions are Opinion questions, which were seldom addressed in previous research.

To the best of our knowledge, this is the first time an MRC dataset has been analyzed from these two different views. Some of the question types have been widely studied in previous work, while others, especially YesNo questions and Opinion questions, are expecting more attention from researchers. Hopefully, our dataset can promote further research on them.

3 DuReader Dataset

In this section, we introduce the data collection and annotation process of DuReader. The DuReader dataset can be considered as a set of quadruples {q, t, D, A}, defined as follows: (a) the question q; (b) the question type t; (c) the relevant document set D = {d_1, d_2, …, d_n}, which contains n documents; (d) the reference answer set A, which is generated by human annotators.
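To make the quadruple concrete, the sketch below shows what a single sample might look like when loaded as a Python dictionary. The field names are illustrative only, not the official schema of the released files.

```python
# Illustrative sketch of one DuReader-style sample; field names are hypothetical,
# not the official schema of the released JSON files.
sample = {
    "question": "...",                       # q: the question text
    "question_type": "YES_NO",               # t: Entity / Description / YesNo
    "fact_or_opinion": "OPINION",            # second-view label: Fact / Opinion
    "documents": [                           # D: the relevant document set
        {"title": "...", "paragraphs": ["...", "..."]},
    ],
    "answers": ["...", "..."],               # A: human-summarized reference answers
}
```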

3.1 Data Collection and Annotation

3.1.1 Data Collection

To collect questions for DuReader, we first randomly sampled frequently occurring queries from Baidu search engine query logs. Questions were filtered from the queries using a binary classifier and were then double-checked by human annotators. 200k questions are reserved in this release.

We then collected relevant documents for the questions. Two sources were explored, i.e., the search results of the Baidu search engine and Baidu Zhidao (https://zhidao.baidu.com), the largest Chinese community question answering (CQA) site in the world. Specifically, the 200k questions were divided into two sets. For the first half, we searched each question with the Baidu search engine and kept the top-5 search results, while for the second half, we searched each question with Zhidao's site search and also kept the top-5 results. The reason we use Zhidao as a source of relevant documents is that the User Generated Content (UGC) nature of Zhidao makes its documents different from random web pages on the Internet. In particular, for opinion questions, there are more answers in Zhidao.

For each document, we extracted the title and main content, which were then word-segmented using the open API of the Baidu AI platform (http://ai.baidu.com/tech/nlp/lexical).

3.1.2 Question Type Annotation

Figure 1: Distribution of answer numbers per question in DuReader.

According to the two-view question types introduced in Section 2, the annotators were asked to label each question in a two-pass style. In the first pass, the annotators classified all the questions into three types: Entity, Description and YesNo questions. In the second pass, the annotators labeled each question as either Fact or Opinion. The distribution of questions of different types in DuReader is shown in Table 4. Note that the distribution of question types in DuReader (Table 4) is different from that in Baidu Search (Table 3). This is mainly because the statistics in Table 4 are type-based, since we keep only one instance of identical questions in the dataset, while the statistics in Table 3 are frequency-based.

3.1.3 Answer Annotation

For the answer annotation, we employed crowdsourcing workers to generate answers for each question based on the relevant documents. Specifically, each question and its relevant documents were shown to an annotator, who was asked to generate answers in his/her own words by reading and summarizing the documents. If more than one answer could be found in the relevant documents, the annotator was required to write down all the answers. Answers that were very similar to each other were merged into one. The answers were spot-checked to guarantee that the accuracy is high enough.

Figure 2: Distribution of edit distance between answers and original documents.
Table 4: Distribution of question types in DuReader.

Specifically, for the Entity questions, the answers include both the entities and the sentences containing them. For the YesNo questions, the answers include the opinion types (Yes, No or Depend) as well as the supporting sentences. (See the last example in Table 9.)

3.2 Data Analysis

Statistics on length. On average, each question and answer has 4.8 and 69.6 words, respectively. The average length of the documents is 396.0 words, about five times longer than those in MS-MARCO. The reason is that we kept the full body of each relevant document, whereas MS-MARCO only uses selected paragraphs.

Statistics on answer numbers. Figure 1 shows the statistics over the number of answers. We can see that 1.5% of Baidu Search questions have zero answers, but this number increases to 9.7% for Baidu Zhidao. Meanwhile, Baidu Zhidao has a larger proportion of multi-answer questions than Baidu Search (70.8% vs. 62.2%), which may be explained by the fact that UGC answers are more subjective and exhibit more variety.

Difficulty analysis of DuReader. In order to understand the difficulty of answering the questions in DuReader, we calculate the distribution of the minimum edit distance (ED) between the human-generated answers and the original documents. (Here ED is the minimum edit distance between the answer and any consecutive span in the document.) The larger the ED, the more summarization and modification has been performed by the annotators, which requires more complex methods to model the problem. For span-answer datasets, such as SQuAD [\citenameRajpurkar et al.2016], NewsQA [\citenameTrischler et al.2017] and TriviaQA [\citenameJoshi et al.2017], the ED score should be zero, since the answers are directly extracted from the documents. Figure 2 shows the distribution of ED scores between the answers and documents in DuReader, compared with those in MS-MARCO. We can see that for 77.1% of the answers in MS-MARCO the ED is below 3. In contrast, 51.3% of DuReader answers have an ED score over 10, from which we can infer that DuReader requires more complex techniques such as text summarization and generation.
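As a concrete reference, the following is a minimal sketch (our own reimplementation, not the authors' script) of the minimum edit distance between an answer and any consecutive span of a document, using standard approximate substring matching. Whether the distance is computed over characters or words is an open detail here; the function simply operates on whatever sequences it is given.

```python
# Minimum edit distance between `answer` and any consecutive span of `document`.
# The first DP row is all zeros so a match may start anywhere in the document,
# and the minimum over the last row lets it end anywhere.
def min_edit_distance_to_span(answer, document):
    m, n = len(answer), len(document)
    prev = [0] * (n + 1)                      # starting anywhere in the document is free
    for i in range(1, m + 1):
        curr = [i] + [0] * n                  # matching i answer tokens to an empty span
        for j in range(1, n + 1):
            cost = 0 if answer[i - 1] == document[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # drop a token from the answer
                          curr[j - 1] + 1,    # insert a document token into the answer
                          prev[j - 1] + cost) # substitute or match
        prev = curr
    return min(prev)                          # ending anywhere in the document is free

# Example: the answer appears verbatim as a span, so the distance is 0.
assert min_edit_distance_to_span("北京", "我住在北京市") == 0
```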

Table 5: Performance of typical MRC systems on the DuReader dataset.

4 Experiments

In this section, we implement MRC systems with two state-of-the-art models. BLEU [\citenamePapineni et al.2002] and Rouge [\citenameLin2004] are used as the basic evaluation metrics. Furthermore, with the rich annotations in our dataset, including questions and answers of various types, we conduct comprehensive evaluations from different aspects.

Table 6: Model performance with gold paragraphs.
Table 7: Performance on various question types.

4.1 Baseline Systems

We implement two typical state-of-the-art models as baseline systems.

Match-LSTM Match-LSTM is a widely used MRC model and has been well explored in recent studies [\citenameWang and Jiang2017]. To find an answer in a passage, it goes through the passage sequentially and dynamically aggregates the matching of an attention-weighted question representation to each token of the passage. Finally, an answer pointer layer is used to find an answer span in the passage.

BiDAF BiDAF is a promising MRC model, and its improved version has achieved the best single-model performance on the SQuAD dataset [\citenameSeo et al.2016]. It uses both context-to-question attention and question-to-context attention in order to highlight the important parts of both the question and the context. After that, the so-called attention flow layer is used to fuse all useful information and obtain a vector representation for each position.
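For illustration, the following is a rough sketch of the bidirectional attention computation described above. It uses a simple dot-product similarity in place of BiDAF's trilinear similarity function, so it should be read as a simplification rather than the exact model.

```python
import torch
import torch.nn.functional as F

def bidaf_attention(context, question):
    # context: (batch, T, d) encoded passage; question: (batch, J, d) encoded question.
    sim = torch.bmm(context, question.transpose(1, 2))   # (batch, T, J) similarity matrix
    # Context-to-question: each passage position attends over the question words.
    c2q = torch.bmm(F.softmax(sim, dim=2), question)      # (batch, T, d)
    # Question-to-context: attend over passage positions by their best question match.
    b = F.softmax(sim.max(dim=2).values, dim=1)            # (batch, T)
    q2c = torch.bmm(b.unsqueeze(1), context)               # (batch, 1, d)
    q2c = q2c.expand(-1, context.size(1), -1)              # tile to (batch, T, d)
    # Fuse the original passage encoding with both attention directions.
    return torch.cat([context, c2q, context * c2q, context * q2c], dim=2)
```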

For the setup, we randomly initialize the word embeddings with a dimension of 300 and set the hidden vector size to 150 for all layers. We employ the Adam algorithm [\citenameKingma and Ba2014] to train both models, with an initial learning rate of 0.001 and a batch size of 32. Since every question may have multiple corresponding passages, a simple heuristic strategy is employed to select a representative paragraph from each passage in order to improve training and testing efficiency. During training, this paragraph is the one that achieves the highest recall score when compared against the annotated answers. For testing, since the answers are not available, we compute the recall score against the question instead.
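A minimal sketch of this recall-based paragraph selection heuristic is given below. The recall definition (fraction of reference tokens covered by the paragraph) and the function names are our assumptions; the released baseline code may differ in detail.

```python
# Sketch of recall-based paragraph selection; all arguments are token lists.
def token_recall(paragraph_tokens, reference_tokens):
    if not reference_tokens:
        return 0.0
    covered = set(paragraph_tokens) & set(reference_tokens)
    return len(covered) / len(set(reference_tokens))

def select_paragraph(paragraphs, answers=None, question=None):
    # At training time, compare against the annotated answers; at test time,
    # where answers are unavailable, fall back to the question tokens.
    references = answers if answers else [question]
    def score(paragraph):
        return max(token_recall(paragraph, ref) for ref in references)
    return max(paragraphs, key=score)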

4.2 Results and Analysis

Table 8: Performance of opinion-aware model on YesNo questions.

We evaluate the reading comprehension task via character-level BLEU-4 [\citenamePapineni et al.2002] and Rouge-L [\citenameLin2004], which are widely used for evaluating the quality of language generation. The experimental results on the test set are shown in Table 5. For comparison, we also evaluate the Selected Paragraph baseline, which directly selects the paragraph achieving the highest recall score as the answer. We also assess human performance by asking a new annotator to annotate the test data and treating his first answer as the prediction.
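For reference, the following is a small sketch of a character-level Rouge-L score based on the standard LCS F-measure. The beta value and the handling of multiple references are assumptions on our part; the official evaluation script may differ.

```python
# Longest common subsequence length via dynamic programming.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

# Rouge-L F-measure; passing raw strings makes it character-level,
# since Python iterates over characters.
def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```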

The results demonstrate that current reading comprehension models achieve an impressive improvement over the Selected Paragraph baseline, which confirms the effectiveness of these models. However, there is still a large performance gap between these models and humans. An interesting finding comes from the comparison between the results on Search and Zhidao data. We notice that the reading comprehension models obtain much higher scores on the Zhidao data. This shows that it is much harder for the models to comprehend open-domain web articles than to find answers in passages from a question answering community. In contrast, human performance on these two subsets shows little difference, which suggests that human reading skill is more stable across different types of documents.

As described in Section 4.1, the representative paragraph of each passage is selected based on the question during testing. To analyze the effect of this strategy and obtain an upper bound for the baseline models, we re-evaluate our systems on the gold paragraphs, each of which is the paragraph with the highest recall score against the annotated answers. Comparing Table 6 with Table 5, we can see that using gold paragraphs significantly boosts the overall performance. Moreover, directly using the gold paragraph obtains a very high Rouge-L score, which is expected because each gold paragraph is selected based on recall, which is closely related to Rouge-L. Still, we find that the baseline models can obtain much better performance with respect to BLEU, which means the models have learned to refine the answers. This experiment shows that paragraph selection is a crucial problem in real applications, while most current MRC datasets assume that the answer can be found in a given small passage. Thus, DuReader provides the full body text of the evidence documents to stimulate research in a real-world setting.

To gain more insight into the characteristics of our dataset, we report the performance across different question types in Table 7. We can see that both the models and humans achieve relatively good performance on Description questions, while YesNo questions seem to be the hardest to model. We believe this is because Description questions are commonly answered with long text on the same topic, which is favored by BLEU and Rouge. In contrast, the answers to YesNo questions are relatively short, and could be a simple Yes or No in some cases. Even more interestingly, the answers to some YesNo questions are quite subjective, and some may even be contradictory depending on the evidence collected from different passages. Therefore, even the human annotators cannot reach a high level of agreement on these questions.

4.3 Opinion-aware Evaluation

Because of the characteristics of YesNo questions, we found it unsuitable to directly use BLEU or Rouge to evaluate the performance on these questions, because these metrics cannot reflect the agreement between answers. For example, two contradictory answers such as "You can do it" and "You can't do it" get high agreement scores under these metrics. A natural idea is to formulate this subtask as a classification problem. However, as described in Section 3, multiple different judgments could be made based on the evidence collected from different passages, especially when the question is of opinion type. In real-world settings, we definitely do not want a smart model to give an arbitrary Yes or No answer for such questions.

To tackle this, we propose a novel opinion-aware evaluation method that requires the evaluated system to not only output an answer in natural language, but also give it an opinion label. We also have the annotators provide an opinion label for each answer they generated. In this way, every answer is paired with an opinion label (Yes, No or Depend) so that we can categorize the answers by their labels. Finally, the predicted answers are evaluated via BLEU or Rouge against only the reference answers with the same opinion label. With this opinion-aware evaluation method, a model that can predict a good answer in natural language and give it the correct opinion label will get a higher score.
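The idea can be sketched as follows: a predicted answer is scored only against reference answers carrying the same opinion label, with `score_fn` standing in for BLEU-4 or Rouge-L. The function and variable names are illustrative, not taken from the evaluation script.

```python
# Opinion-aware scoring sketch.
# references: list of (answer_text, opinion_label) pairs, labels in {"Yes", "No", "Depend"}.
def opinion_aware_score(pred_answer, pred_label, references, score_fn):
    matched = [ans for ans, label in references if label == pred_label]
    if not matched:
        return 0.0   # a wrong (or missing) opinion label gets no credit
    return max(score_fn(pred_answer, ref) for ref in matched)
```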

In order to classify the answers into different opinion polarities, we add a classifier. We slightly modify the Match-LSTM model, replacing the final pointer network layer with a fully connected layer. This classifier is trained with the gold answers and their corresponding opinion labels. We compare a reading comprehension system equipped with such an opinion classifier against a pure reading comprehension system without it, and the results are shown in Table 8. We can see that opinion classification does help under our evaluation method. Also, classifying the answers correctly is much harder for questions of opinion type than for those of fact type.
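As a rough illustration (not the authors' implementation), the classification head could look like the following PyTorch module, which replaces the pointer layer with a fully connected layer over the three opinion labels. The max-pooling over the matching states is our own simplifying assumption.

```python
import torch
import torch.nn as nn

class OpinionHead(nn.Module):
    """Maps the matching layer's states to logits over {Yes, No, Depend}."""
    def __init__(self, hidden_size=150, num_labels=3):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, match_states):
        # match_states: (batch, seq_len, hidden_size) output of the Match-LSTM layer.
        pooled = match_states.max(dim=1).values   # simple pooling over the passage (assumed)
        return self.fc(pooled)                    # logits over the three opinion labels
```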

4.4 Discussion

As shown in the experiments, the current state-of-the-art models still underperform human beings by a large margin on our dataset. There is considerable room for improvement in several directions.

First, the state-of-the-art models formulate reading comprehension as a span selection task. However, as shown by the DuReader dataset, human beings actually summarize answers with their own comprehension. How to summarize or generate answers deserves more research. Current methods use a simple paragraph selection strategy, which results in a great degradation of comprehension accuracy compared with the gold-paragraph performance. It is necessary to design novel and efficient whole-document representation models for the real-world MRC problem.

Second, there are some new features in our dataset that have not been extensively studied before, such as yes-no questions and opinion questions requiring multi-document MRC. New methods are needed for opinion recognition, cross-sentence reasoning, and multi-document summarization. Hopefully, DuReader's rich annotations will be useful for studying these potential directions.

Third, as the first release of the dataset, it is far from perfect and leaves much room for improvement. For example, we currently annotate opinion tags only for yes-no questions; we will also annotate opinion tags for description and entity questions. We would like to gather feedback from the research community to improve DuReader continually.

Overall, it is necessary to propose new algorithms and models to tackle real-world reading comprehension problems. We hope that the DuReader dataset will be a good starting point for facilitating MRC research.

Table 9: Examples from the DuReader dataset.

5 Conclusion and Future Work

We introduce DuReader, a new large-scale, open-domain Chinese dataset for machine reading comprehension. Unlike existing Chinese MRC datasets, DuReader contains questions and possible answers from real-world applications, with the aim of promoting MRC research in a real-world setting. In particular, DuReader contains rich annotations of questions, documents and answers. It is the first time the questions are annotated from two different views, among which yes-no and opinion questions account for a large proportion but have not been well studied yet. For each question, we provide documents from both Baidu Search and Baidu Zhidao, and multiple answers with supporting evidence, possible entities and opinions labeled. Hopefully, these annotations can help facilitate MRC research. Preliminary experimental results show that there exists a significant gap between the performance of state-of-the-art models and that of humans on this dataset.

In future work, we will steadily update our dataset by enlarging its size and enriching the annotations based on feedback from the community. We expect DuReader to be a valuable resource for the development of MRC technologies and applications.

References

  • [\citenameChen et al.2016] Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2358–2367.
  • [\citenameCui et al.2016] Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. 2016. Consensus attention-based neural networks for Chinese reading comprehension.
  • [\citenameHermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
  • [\citenameHill et al.2015] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.
  • [\citenameJia and Liang2017] Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2011–2021.
  • [\citenameJoshi et al.2017] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR.
  • [\citenameKingma and Ba2014] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR.
  • [\citenameLai et al.2017] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
  • [\citenameLin2004] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.
  • [\citenameNguyen et al.2016] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • [\citenamePapineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • [\citenameRajpurkar et al.2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • [\citenameSeo et al.2016] Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR.
  • [\citenameTrischler et al.2017] Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.
  • [\citenameWang and Jiang2017] Shuohang Wang and Jing Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. In ICLR, pages 1–15.
  • [\citenameWang et al.2017] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198.
