Contributed by Luxi Xing and Yuqiang Xie. IIE, CAS.
Introduction
Based on the answer style of each dataset, we split the datasets into several categories and give details for each.
- Extractive
- Multi-Choice
- Generative
- Sequential
- Cloze-Style
- Others
Extractive
Dataset | Language | Domain | #Train (Q / Doc) | #Dev (Q / Doc) | #Test (Q / Doc) | Year | Features |
---|---|---|---|---|---|---|---|
SQuAD v1 | English | Wikipedia | 87599 / 442 | 10570 / 48 | 9533 / 46 | 2016 | |
SQuAD v2 | English | Wikipedia | 130319 / 442 | 11873 / 35 | 8862 / 28 | 2018 | includes unanswerable questions |
TriviaQA | English | Web, Wikipedia | 528979 / 61888 | 68621 / 9951 | 65059 / 9509 | 2017 | avg. document length is 2895 words |
HotPotQA | English | Wikipedia | 90564 | 7405 | 7405 | 2018 | multiple supporting documents needed to answer |
Natural Questions | English | Wikipedia | 307k | 8k | 8k | 2019 | whole Wikipedia article; long and short answers |
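To make the extractive answer style concrete, here is a minimal sketch of reading SQuAD-style examples. Field names follow the official SQuAD v1.1/v2.0 JSON layout, in which every answer is a character-offset span of the context and v2.0 flags unanswerable questions with `is_impossible`; the file path in the usage comment is only illustrative.

```python
import json

# Minimal sketch: iterate over SQuAD-style extractive examples (v1.1 / v2.0).
# Each answer is a span of the context, given as text plus a character offset.
def iter_squad(path):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    for article in data:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                # v2.0 marks unanswerable questions with is_impossible.
                if qa.get("is_impossible", False):
                    yield qa["question"], context, None
                    continue
                ans = qa["answers"][0]
                start = ans["answer_start"]
                span = context[start:start + len(ans["text"])]
                assert span == ans["text"]  # extractive: answer is a context substring
                yield qa["question"], context, span

# Usage (illustrative path):
# for question, context, answer in iter_squad("train-v2.0.json"):
#     print(question, "->", answer)
```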
Multi-Choice
Dataset | Language | Domain | #Train (Q / Doc) | #Dev (Q / Doc) | #Test (Q / Doc) | Year | Features |
---|---|---|---|---|---|---|---|
RACE | English | Multi | 87866 / 25137 | 4887 / 1389 | 4934 / 1407 | 2017 | split into middle- and high-school subsets |
MCScript | English | InScript | 9731 / 1470 | 1411 / 219 | 2797 / 430 | 2018 | |
ARC | English | Science | 7787 questions (+ 14M-sentence corpus) | | | 2018 | Challenge set is hard |
OpenBookQA | English | Science | 4957 | 500 | 500 | 2018 | multi-hop; commonsense; science facts |
MultiRC | English | 7 domains | 9872 / 871 | | | 2018 | multiple correct answers |
QAngaroo (WikiHop) | English | Wikipedia | 43738 | 5129 | 2451 | 2017 | multiple evidence pieces; multiple options |
DREAM | English | Daily life | | | | 2019 | dialogue-based multi-choice |
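As a concrete picture of the multi-choice setting, the sketch below scores one RACE-style passage. The layout (one JSON object per passage with parallel `questions` / `options` / `answers` lists, answers given as letters A-D) follows the RACE release as I understand it, and `overlap_score` is a deliberately trivial stand-in for a real model.

```python
import string

# Minimal sketch: accuracy of an option scorer on a RACE-style passage.
def accuracy_on_passage(obj, score_fn):
    correct = 0
    for q, opts, gold in zip(obj["questions"], obj["options"], obj["answers"]):
        # score_fn is a hypothetical model: higher score = more plausible option.
        scores = [score_fn(obj["article"], q, opt) for opt in opts]
        pred = string.ascii_uppercase[scores.index(max(scores))]
        correct += (pred == gold)
    return correct / len(obj["questions"])

# Trivial scorer: count word overlap between the option and the passage.
def overlap_score(article, question, option):
    return len(set(option.lower().split()) & set(article.lower().split()))

example = {
    "article": "Tom went to school on foot today.",
    "questions": ["How did Tom get to school?"],
    "options": [["By bus", "On foot", "By car", "By bike"]],
    "answers": ["B"],
}
print(accuracy_on_passage(example, overlap_score))  # 1.0
```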
Generative
Free-form answer generation
Dataset | Language | Domain | #Train | #Dev | #Test | Year | Features |
---|---|---|---|---|---|---|---|
MS MARCO v1 | English | Web | 100k | | | 2016 | |
MS MARCO v2 | English | Web | 100k | | | 2018 | |
NarrativeQA | English | books / movie scripts (Wikipedia summaries) | 32747 | 3461 | 10557 | 2018 | based on summary |
DuReader | Chinese | Web | | | | 2017 | |
- The evaluation metrics for generative-style QA datasets are ROUGE-L and BLEU.
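ROUGE-L is an F-measure over the longest common subsequence (LCS) of the candidate and reference token sequences, so it is easy to sketch directly. The version below is the plain F1 variant (the official metric weights recall via a beta parameter); it is meant as an illustration, not the reference implementation.

```python
# Minimal sketch of sentence-level ROUGE-L (plain F1 variant) via LCS length.
def lcs_len(a, b):
    # Classic O(len(a) * len(b)) dynamic program for LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))  # ~0.833
```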
MS MARCO
- For more details about MS MARCO, please refer to this REPO.
NarrativeQA
- The dataset provides two settings:
    - summary setting: documents average 659 tokens
    - full book/movie script setting: documents average 62528 tokens
Sequential
Dataset | Language | Domain | #Documents | #Questions | Size (Train/Dev/Test) | Year | Features |
---|---|---|---|---|---|---|---|
QuAC | English | Wikipedia | | | | 2018 | |
CoQA | English | Wikipedia | | | | 2018 | |
DREAM | English | Daily life | 6444 | 10197 | | 2019 | dialogue-based multi-choice; multi-turn, multi-party |
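What makes QuAC and CoQA "sequential" is that each question depends on the dialogue so far, so a common baseline trick is to prepend recent turns to the current question before running a standard reader. A minimal sketch of that input construction follows; the turn structure and the [SEP] marker are illustrative conventions, not the datasets' own schema.

```python
# Minimal sketch: build a reader input for conversational QA (QuAC/CoQA style)
# by prepending recent dialogue history to the current question.
def build_input(context, turns, current_question, max_history=2):
    history = []
    for q, a in turns[-max_history:]:  # keep only the most recent turns
        history.append(f"Q: {q} A: {a}")
    history.append(f"Q: {current_question}")
    return " ".join(history) + " [SEP] " + context

turns = [("Who wrote the book?", "Neil Gaiman"),
         ("When?", "1996")]
print(build_input("Neverwhere is a novel ...", turns, "What genre is it?"))
```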
Others
Chinese Datasets
- People Daily & Children’s Fairy Tale (PD&CFT): https://github.com/ymcui/Chinese-RC-Dataset
    - cloze-style (a toy illustration follows this list)
- DuReader: https://github.com/baidu/DuReader
    - generative
- CMRC 2018: https://github.com/ymcui/cmrc2018
    - extractive
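For readers unfamiliar with the cloze style used by PD&CFT: a word is removed from a sentence and the model must recover it from the passage. The toy below only illustrates the idea; the data layout and the positional baseline are not PD&CFT's actual format or a serious model.

```python
# Toy illustration of cloze-style reading comprehension: predict the word
# removed from the query (placeholder "XXXXX") using the passage.
# Illustrative format only, not PD&CFT's actual file layout.
passage = "小明 在 公园 里 放 风筝 。".split()
query = "小明 在 公园 里 放 XXXXX 。".split()
answer = "风筝"

# Trivial baseline: the passage token aligned with the blank position.
pred = passage[query.index("XXXXX")]
print(pred == answer)  # True
```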
Multi-Document Datasets
- TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension.
- SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine.
- HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.
Commonsense Reasoning
- Winograd Schema Challenge. https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html
- COPA: Choice of Plausible Alternatives. https://www.cs.york.ac.uk/semeval-2012/task7/index.html
- ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension
- CommonsenseQA
FEVER: Fact Extraction and VERification
- FEVER data: link
- FEVER shared task (with NAACL 2018) info: details here
- FEVER official baseline code: github repo
- My notes and analysis of FEVER: fever-note
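For orientation, FEVER frames verification as a three-way decision (SUPPORTS / REFUTES / NOT ENOUGH INFO) over claims, with human-selected evidence sentences from Wikipedia. The sketch below tallies the verdict labels in the training file; the field names follow my recollection of the shared-task jsonl release, so treat them as assumptions to verify against the data.

```python
import json
from collections import Counter

# Hedged sketch: tally FEVER verdict labels from the jsonl training file.
# Field names ("label", "claim", "evidence") are my recollection of the
# shared-task release; verify against the actual data before relying on them.
def label_distribution(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            counts[example["label"]] += 1  # SUPPORTS / REFUTES / NOT ENOUGH INFO
    return counts

# print(label_distribution("train.jsonl"))  # illustrative path
```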
Google Natural Questions
Official details: https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
GitHub: https://github.com/google-research-datasets/natural-questions
The Natural Questions (NQ) dataset is an open corpus of naturally occurring questions (i.e., questions posed by real people seeking information):
- a new, large-scale corpus for training and evaluating open-domain question answering systems;
- the first corpus to replicate the end-to-end process by which people find answers to their questions;
- focuses on finding answers by reading an entire page, rather than extracting them from a short paragraph;
- sourced from Wikipedia;
- human-annotated answers:
    - annotators are asked to find the answer by reading through the entire Wikipedia page, as if the question were their own;
    - annotators must find both a long answer and a short answer: the long answer covers all the information needed to infer the answer, while the short answer answers the question concisely with the name(s) of one or more entities.
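To make the long/short answer distinction concrete, the sketch below extracts both spans from one example in the simplified NQ jsonl, where the page text is a whitespace-tokenizable string and answers are token offsets into it. The field names follow the simplified-format release as I recall it, so treat them as assumptions to check against the data.

```python
import json

# Hedged sketch: pull long/short answers from a *simplified* NQ jsonl example.
# Field names are from my recollection of the simplified release; verify first.
def long_and_short(example):
    tokens = example["document_text"].split()  # page text, whitespace-tokenized
    ann = example["annotations"][0]
    la = ann["long_answer"]
    long_ans = None
    if la["start_token"] >= 0:  # -1 means the annotator found no long answer
        long_ans = " ".join(tokens[la["start_token"]:la["end_token"]])
    short_ans = [" ".join(tokens[sa["start_token"]:sa["end_token"]])
                 for sa in ann["short_answers"]]
    return long_ans, short_ans

# with open("simplified-nq-train.jsonl") as f:  # illustrative path
#     print(long_and_short(json.loads(next(f))))
```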
Natural language understanding challenges:
- NQ aims to make QA systems read and understand a full Wikipedia article, which may or may not contain the answer to the question.
- A system first needs to decide whether the question is well-defined enough to be answerable:
    - many questions rest on false premises or are too vague to be answered concisely.
- The system then needs to decide whether the Wikipedia page contains all the information required to infer the answer.
- The authors argue that the long-answer identification task (finding all the information required to infer the answer) demands deeper language understanding than locating the short answer once the long answer is known.
Human upper bound:
- long answer selection task: 87% F1
- short answer selection task: 76% F1
Dataset statistics:
- train: 307k
- dev: 8k
- test: 8k