Machine Reading Comprehension Datasets

Contributed by Luxi Xing and Yuqiang Xie. IIE, CAS.

Introduction

Based on the answer style of each dataset, we split the datasets into the following categories and give details for each.

  1. Extractive
  2. Multi-Choice
  3. Generative
  4. Sequential
  5. Others
    • Cloze-Style

Extractive

| Dataset | Language | Domain | #Train (Q / Doc) | #Dev (Q / Doc) | #Test (Q / Doc) | Year | Features |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SQuAD v1 | English | Wikipedia | 87599 / 442 | 10570 / 48 | 9533 / 46 | 2016 | |
| SQuAD v2 | English | Wikipedia | 130319 / 442 | 11873 / 35 | 8862 / 28 | 2018 | includes unanswerable questions |
| TriviaQA | English | Web, Wikipedia | 528979 / 61888 | 68621 / 9951 | 65059 / 9509 | 2017 | avg. document length is 2895 words |
| HotpotQA | English | Wikipedia | 90564 | 7405 | 7405 | 2017 | requires multiple supporting documents to answer |
| Natural Questions | English | Wikipedia | 307k | 8k | 8k | 2019 | whole Wikipedia articles; long answers |
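
For the extractive datasets above, the standard metrics are exact match (EM) and token-level F1, as popularized by the official SQuAD evaluator. Below is a minimal sketch of that style of scoring; the normalization mirrors the SQuAD convention (lowercasing, stripping punctuation and articles), but this is an illustration, not the official script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles (a/an/the), and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """EM: the normalized strings must be identical."""
    return normalize(prediction) == normalize(gold)


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted span and a gold span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # For unanswerable questions (SQuAD v2), both must be empty to score 1.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # True
print(round(f1_score("in Paris, France", "Paris"), 2))  # 0.5
```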

Multi-Choice

| Dataset | Language | Domain | #Train (Q / Doc) | #Dev (Q / Doc) | #Test (Q / Doc) | Year | Features |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RACE | English | Multi | 87866 / 25137 | 4887 / 1389 | 4934 / 1407 | 2017 | high and middle school subsets |
| MCScript | English | InScript | 9731 / 1470 | 1411 / 219 | 2797 / 430 | 2018 | |
| ARC | English | Science | 14M corpus / 7787 questions | | | 2018 | includes a hard Challenge set |
| OpenBookQA | English | Science | 4957 | 500 | 500 | 2018 | multi-hop; commonsense; science facts |
| MultiRC | English | 7 domains | 9872 / 871 | | | 2018 | multiple correct answers |
| QAngaroo (WikiHop) | English | Wikipedia | 43738 | 5129 | 2451 | 2017 | multiple evidence pieces; multiple options |
| DREAM | English | | | | | 2019 | dialogue-based multi-choice |

Generative

Free-form answer generation

| Dataset | Language | Domain | #Train | #Dev | #Test | Year | Features |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MS MARCO v1 | English | Web | 100k | | | 2016 | |
| MS MARCO v2 | English | Web | 100k | | | 2018 | |
| NarrativeQA | English | Books/movies (via Wikipedia summaries) | 32747 | 3461 | 10557 | 2018 | answers based on the summary |
| DuReader | Chinese | Web | | | | 2017 | |
  • The metrics for evaluating generative-style QA datasets are ROUGE-L and BLEU, as sketched below.
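
To make these metrics concrete, here is a minimal sketch that computes ROUGE-L (the LCS-based F-measure) from scratch and delegates BLEU to NLTK. This is an illustration only; leaderboards use the official evaluation scripts, and the beta = 1.2 recall weight is a common choice, not something specified in this post.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


def rouge_l(prediction: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-measure: F = (1 + beta^2) * P * R / (R + beta^2 * P),
    where P and R are LCS precision and recall."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


reference = "the tower was designed by gustave eiffel"
prediction = "the tower was built by gustave eiffel"
print(round(rouge_l(prediction, reference), 3))  # ~0.857
# Smoothed sentence-level BLEU via NLTK (papers usually report corpus-level BLEU).
print(round(sentence_bleu([reference.split()], prediction.split(),
                          smoothing_function=SmoothingFunction().method1), 3))
```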

MS MARCO

  • For more details about MS MARCO, please refer to this REPO

Narrative QA

  • The dataset provides two settings:
    • based on the summary: documents average 659 tokens
    • based on the full book/movie script: documents average 62528 tokens

Sequential

| Dataset | Language | Domain | #Doc | #Question | Size (Train/Dev/Test) | Year | Features |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QuAC | English | Wikipedia | | | | 2018 | |
| CoQA | English | Wikipedia | | | | 2018 | |
| DREAM | English | Daily life | 6444 | 10197 | | 2019 | dialogue-based multi-choice; multi-turn, multi-party |

Others

Chinese Datasets

Multi-Document Datasets

  • TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension.
  • SearchQA: A new Q&A dataset augmented with context from a search engine.
  • HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.

Commonsense Reasoning

FEVER: Fact Extraction and VERification

Google Natural Questions

official details: https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
github: https://github.com/google-research-datasets/natural-questions

The Natural Questions (NQ) dataset is a public corpus of naturally occurring questions (i.e., questions posed by real people seeking information):

  • a new, large-scale corpus for training and evaluating open-domain question answering systems;
  • the first corpus to replicate the end-to-end process by which people find answers to questions;
  • focused on finding answers by reading an entire page, rather than extracting them from a short paragraph;
  • sourced from Wikipedia;
  • human-annotated answers (see the sketch after this list):
    • annotators are asked to find the answer by reading the entire Wikipedia page, as if the question were their own;
    • annotators must find both a long answer, which covers all the information needed to infer the answer, and a short answer, which answers the question concisely with the name(s) of one or more entities.
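
To make the long/short answer structure concrete, here is a minimal sketch of reading both answer types from the simplified NQ JSON Lines release. The field names (`question_text`, `document_text`, `annotations`, `long_answer`, `short_answers`, the token offsets, and the `-1` null convention) reflect my reading of the simplified format and should be verified against the repository above; the file path is a placeholder.

```python
import json


def iter_answers(path: str):
    """Yield (question, long_answer_text, short_answer_texts) triples from a
    simplified-format NQ file. Field names are an assumption based on the
    simplified format; check them against the repository before relying on this."""
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            # In the simplified format, document_text is pre-tokenized and
            # space-separated, so token offsets index into this list.
            tokens = ex["document_text"].split()
            for ann in ex["annotations"]:
                la = ann["long_answer"]
                if la["start_token"] == -1:
                    continue  # this annotator found no long answer
                long_answer = " ".join(tokens[la["start_token"]:la["end_token"]])
                short_answers = [
                    " ".join(tokens[sa["start_token"]:sa["end_token"]])
                    for sa in ann["short_answers"]
                ]
                yield ex["question_text"], long_answer, short_answers


# Hypothetical placeholder path; substitute the actual simplified dev file.
# for q, long_ans, shorts in iter_answers("nq-simplified-dev.jsonl"):
#     print(q, "->", shorts or long_ans[:80])
```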

Natural language understanding challenges:

  • NQ aims to enable QA systems to read and comprehend an entire Wikipedia article, which may or may not contain the answer to the question.
  • The system first needs to determine whether the question is well-defined enough to be answerable,
    • since many questions are based on false premises or are too vague to be answered concisely.
  • The system then needs to determine whether the Wikipedia page contains all the information needed to infer the answer.
  • The authors argue that the long-answer identification task (finding all the information required to infer the answer) requires deeper language understanding than finding the short answer once the long answer is known.

human upper bound:

  • long answer selection task: 87% F1
  • short answer selection task: 76% F1
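
These upper bounds are selection F1 over long/short answer choices rather than token overlap. Below is a simplified, illustrative version of that style of metric, where each prediction and gold label is a span or None; the official evaluation script in the repository above additionally aggregates multi-way annotations, which this sketch omits.

```python
def selection_f1(predictions, golds):
    """Precision/recall/F1 for an answer-selection task: each prediction and
    gold is a (start, end) span or None. An illustrative simplification of
    the NQ-style metric, not the official evaluation script."""
    tp = sum(1 for p, g in zip(predictions, golds) if p is not None and p == g)
    n_pred = sum(1 for p in predictions if p is not None)
    n_gold = sum(1 for g in golds if g is not None)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


preds = [(10, 42), None, (7, 19)]
golds = [(10, 42), (3, 8), None]
print(selection_f1(preds, golds))  # (0.5, 0.5, 0.5)
```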

dataset statistics:

  • train: 307k
  • dev: 8k
  • test: 8k