Contributed by Luxi Xing and Yuqiang Xie. IIE, CAS.
Introduction
Based on the answer style of each dataset, we split the datasets into several categories and give details for each.
- Extractive
- Multi-Choice
- Generative
- Sequential
- Cloze-Style
- Others
Extractive
Dataset | Language | Domain | #Train (Q / Doc) | #Dev (Q / Doc) | #Test (Q / Doc) | Year | Features |
---|---|---|---|---|---|---|---|
SQuAD v1 | English | Wikipedia | 87599 / 442 | 10570 / 48 | 9533 / 46 | 2016 | |
SQuAD v2 | English | Wikipedia | 130319 / 442 | 11873 / 35 | 8862 / 28 | 2018 | includes unanswerable questions |
TriviaQA | English | Web, Wikipedia | 528979 / 61888 | 68621 / 9951 | 65059 / 9509 | 2017 | avg. document length is 2895 words |
HotPotQA | English | Wikipedia | 90564 | 7405 | 7405 | 2018 | multiple supporting documents needed to answer |
Natural Questions | English | Wikipedia | 307k | 8k | 8k | 2019 | whole Wikipedia article; long and short answers |
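To make the extractive answer style concrete, here is a minimal sketch of reading SQuAD-style examples. Field names follow the official SQuAD v1.1/v2.0 JSON layout, in which every answer is a character-offset span of the context and v2.0 flags unanswerable questions with `is_impossible`; the file path in the usage comment is only illustrative.

```python
import json

# Minimal sketch: iterate over SQuAD-style extractive examples (v1.1 / v2.0).
# Each answer is a span of the context, given as text plus a character offset.
def iter_squad(path):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    for article in data:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                # v2.0 marks unanswerable questions with is_impossible.
                if qa.get("is_impossible", False):
                    yield qa["question"], context, None
                    continue
                ans = qa["answers"][0]
                start = ans["answer_start"]
                span = context[start:start + len(ans["text"])]
                assert span == ans["text"]  # extractive: answer is a context substring
                yield qa["question"], context, span

# Usage (illustrative path):
# for question, context, answer in iter_squad("train-v2.0.json"):
#     print(question, "->", answer)
```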
Multi-Choice
Dataset | Language | Domain | #Train (Q / Doc) | #Dev (Q / Doc) | #Test (Q / Doc) | Year | Features |
---|---|---|---|---|---|---|---|
RACE | English | Multi | 87866 / 25137 | 4887 / 1389 | 4934 / 1407 | 2017 | split into middle- and high-school subsets |
MCScript | English | InScript | 9731 / 1470 | 1411 / 219 | 2797 / 430 | 2018 | |
ARC | English | Science | 7787 questions (+ 14M-sentence corpus) | | | 2018 | Challenge set is hard |
OpenBookQA | English | Science | 4957 | 500 | 500 | 2018 | multi-hop; commonsense; science facts |
MultiRC | English | 7 domains | 9872 / 871 | | | 2018 | multiple correct answers |
QAngaroo (WikiHop) | English | Wikipedia | 43738 | 5129 | 2451 | 2017 | multiple evidence pieces; multiple options |
DREAM | English | Daily life | | | | 2019 | dialogue-based multi-choice |
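As a concrete picture of the multi-choice setting, the sketch below scores one RACE-style passage. The layout (one JSON object per passage with parallel `questions` / `options` / `answers` lists, answers given as letters A-D) follows the RACE release as I understand it, and `overlap_score` is a deliberately trivial stand-in for a real model.

```python
import string

# Minimal sketch: accuracy of an option scorer on a RACE-style passage.
def accuracy_on_passage(obj, score_fn):
    correct = 0
    for q, opts, gold in zip(obj["questions"], obj["options"], obj["answers"]):
        # score_fn is a hypothetical model: higher score = more plausible option.
        scores = [score_fn(obj["article"], q, opt) for opt in opts]
        pred = string.ascii_uppercase[scores.index(max(scores))]
        correct += (pred == gold)
    return correct / len(obj["questions"])

# Trivial scorer: count word overlap between the option and the passage.
def overlap_score(article, question, option):
    return len(set(option.lower().split()) & set(article.lower().split()))

example = {
    "article": "Tom went to school on foot today.",
    "questions": ["How did Tom get to school?"],
    "options": [["By bus", "On foot", "By car", "By bike"]],
    "answers": ["B"],
}
print(accuracy_on_passage(example, overlap_score))  # 1.0
```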
Generative
Free-form answer generation
Dataset | Language | Domain | #Train | #Dev | #Test | Year | Features |
---|---|---|---|---|---|---|---|
MS MARCO v1 | English | Web | 100k | | | 2016 | |
MS MARCO v2 | English | Web | 100k | | | 2018 | |
NarrativeQA | English | books / movie scripts (Wikipedia summaries) | 32747 | 3461 | 10557 | 2018 | based on summary |
DuReader | Chinese | Web | | | | 2017 | |
- The evaluation metrics for generative-style QA datasets are ROUGE-L and BLEU.
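ROUGE-L is an F-measure over the longest common subsequence (LCS) of the candidate and reference token sequences, so it is easy to sketch directly. The version below is the plain F1 variant (the official metric weights recall via a beta parameter); it is meant as an illustration, not the reference implementation.

```python
# Minimal sketch of sentence-level ROUGE-L (plain F1 variant) via LCS length.
def lcs_len(a, b):
    # Classic O(len(a) * len(b)) dynamic program for LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))  # ~0.833
```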
MS MARCO
- For more details about MS MARCO, please refer to this REPO.
NarrativeQA
- The dataset provides two settings:
    - summary setting: documents average 659 tokens
    - full book/movie script setting: documents average 62528 tokens
Sequential
Dataset | Language | Domain | #Documents | #Questions | Size (Train/Dev/Test) | Year | Features |
---|---|---|---|---|---|---|---|
QuAC | English | Wikipedia | | | | 2018 | |
CoQA | English | Wikipedia | | | | 2018 | |
DREAM | English | Daily life | 6444 | 10197 | | 2019 | dialogue-based multi-choice; multi-turn, multi-party |
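What makes QuAC and CoQA "sequential" is that each question depends on the dialogue so far, so a common baseline trick is to prepend recent turns to the current question before running a standard reader. A minimal sketch of that input construction follows; the turn structure and the [SEP] marker are illustrative conventions, not the datasets' own schema.

```python
# Minimal sketch: build a reader input for conversational QA (QuAC/CoQA style)
# by prepending recent dialogue history to the current question.
def build_input(context, turns, current_question, max_history=2):
    history = []
    for q, a in turns[-max_history:]:  # keep only the most recent turns
        history.append(f"Q: {q} A: {a}")
    history.append(f"Q: {current_question}")
    return " ".join(history) + " [SEP] " + context

turns = [("Who wrote the book?", "Neil Gaiman"),
         ("When?", "1996")]
print(build_input("Neverwhere is a novel ...", turns, "What genre is it?"))
```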
Others
Chinese Datasets
- People Daily & Children’s Fairy Tale (PD&CFT): https://github.com/ymcui/Chinese-RC-Dataset
    - cloze-style (a toy illustration follows this list)
- DuReader: https://github.com/baidu/DuReader
    - generative
- CMRC 2018: https://github.com/ymcui/cmrc2018
    - extractive
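For readers unfamiliar with the cloze style used by PD&CFT: a word is removed from a sentence and the model must recover it from the passage. The toy below only illustrates the idea; the data layout and the positional baseline are not PD&CFT's actual format or a serious model.

```python
# Toy illustration of cloze-style reading comprehension: predict the word
# removed from the query (placeholder "XXXXX") using the passage.
# Illustrative format only, not PD&CFT's actual file layout.
passage = "小明 在 公园 里 放 风筝 。".split()
query = "小明 在 公园 里 放 XXXXX 。".split()
answer = "风筝"

# Trivial baseline: the passage token aligned with the blank position.
pred = passage[query.index("XXXXX")]
print(pred == answer)  # True
```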
Multi-Document Datasets
- TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension.
- SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine.
- HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.
Commonsense Reasoning
- Winograd Schema Challenge. https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html
- COPA: Choice of Plausible Alternatives. https://www.cs.york.ac.uk/semeval-2012/task7/index.html
- ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension
- CommonsenseQA
FEVER: Fact Extraction and VERification
- FEVER data: link
- FEVER shared task (with NAACL 2018) info: details here
- FEVER official baseline code: github repo
- My notes and analysis of FEVER: fever-note
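For orientation, FEVER frames verification as a three-way decision (SUPPORTS / REFUTES / NOT ENOUGH INFO) over claims, with human-selected evidence sentences from Wikipedia. The sketch below tallies the verdict labels in the training file; the field names follow my recollection of the shared-task jsonl release, so treat them as assumptions to verify against the data.

```python
import json
from collections import Counter

# Hedged sketch: tally FEVER verdict labels from the jsonl training file.
# Field names ("label", "claim", "evidence") are my recollection of the
# shared-task release; verify against the actual data before relying on them.
def label_distribution(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            counts[example["label"]] += 1  # SUPPORTS / REFUTES / NOT ENOUGH INFO
    return counts

# print(label_distribution("train.jsonl"))  # illustrative path
```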
Google Natural Questions
Official details: https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
GitHub: https://github.com/google-research-datasets/natural-questions
The Natural Questions (NQ) dataset is an open corpus of naturally occurring questions (i.e., questions posed by real people seeking information):
- a new, large-scale corpus for training and evaluating open-domain question answering systems;
- the first corpus to replicate the end-to-end process by which people find answers to their questions;
- focuses on finding answers by reading an entire page, rather than extracting them from a short paragraph;
- sourced from Wikipedia;
- human-annotated answers:
    - annotators are asked to find the answer by reading through the entire Wikipedia page, as if the question were their own;
    - annotators must find both a long answer and a short answer: the long answer covers all the information needed to infer the answer, while the short answer answers the question concisely with the name(s) of one or more entities.
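To make the long/short answer distinction concrete, the sketch below extracts both spans from one example in the simplified NQ jsonl, where the page text is a whitespace-tokenizable string and answers are token offsets into it. The field names follow the simplified-format release as I recall it, so treat them as assumptions to check against the data.

```python
import json

# Hedged sketch: pull long/short answers from a *simplified* NQ jsonl example.
# Field names are from my recollection of the simplified release; verify first.
def long_and_short(example):
    tokens = example["document_text"].split()  # page text, whitespace-tokenized
    ann = example["annotations"][0]
    la = ann["long_answer"]
    long_ans = None
    if la["start_token"] >= 0:  # -1 means the annotator found no long answer
        long_ans = " ".join(tokens[la["start_token"]:la["end_token"]])
    short_ans = [" ".join(tokens[sa["start_token"]:sa["end_token"]])
                 for sa in ann["short_answers"]]
    return long_ans, short_ans

# with open("simplified-nq-train.jsonl") as f:  # illustrative path
#     print(long_and_short(json.loads(next(f))))
```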
Natural language understanding challenges:
- NQ aims to make QA systems read and understand a full Wikipedia article, which may or may not contain the answer to the question.
- A system first needs to decide whether the question is well-defined enough to be answerable:
    - many questions rest on false premises or are too vague to be answered concisely.
- The system then needs to decide whether the Wikipedia page contains all the information required to infer the answer.
- The authors argue that the long-answer identification task (finding all the information required to infer the answer) demands deeper language understanding than locating the short answer once the long answer is known.
Human upper bound:
- long answer selection task: 87% F1
- short answer selection task: 76% F1
Dataset statistics:
- train: 307k
- dev: 8k
- test: 8k