This page lists all corpora that have resulted from or have been used in our research. Their availability to people outside Webis is as follows: (1) corpora that have been officially released by Webis and (2) corpora of the PAN series can be downloaded here; (3) internal Webis corpora (which will be officially released in the future) are supplied upon request; (4) other corpora can be downloaded from their original publisher/creator. Most of our released corpora are hosted at Zenodo and are indexed in Google Dataset Search; a few larger corpora are available in the Internet Archive; some corpora are accessible via the Hugging Face and IR datasets libraries; a dedicated symbol indicates a browsing facility for the respective corpus.
Released Webis Corpora | ||||||||
---|---|---|---|---|---|---|---|---|
Name | Publisher/Creator | Year | Size [bytes] | Size | [units] | Default Task | Access | |
Archive Query Log 2022 | Webis Group | 2023 | 44 GB | 357M | queries | Query Log Analysis | ||
Arg-Microtexts Synthesis Benchmark | Webis Group | 2018 | 4 MB | 260 | arguments | Computational Argumentation | ||
args.me corpus | Webis Group | 2019 | 876 MB | 388K | arguments | Computational Argumentation | ||
ArguAna Counterargs | Webis Group | 2018 | 106 MB | 7K | arguments | Computational Argumentation | ||
ArguAna TripAdvisor | Webis Group & FG Engels | 2014 | 283 MB | 2K | reviews | Sentiment Analysis | ||
BuzzFeed-Webis Fake News Corpus 16 | Webis Group | 2018 | 5 GB | 1K | articles | News analysis | ||
CauseNet-20 | Webis Group & Data Science Group | 2020 | 2 GB | 12M | relations | Causal Relation Analysis | ||
CommonCrawl News Articles by Political Orientation | Webis Group | 2022 | 4 GB | - | - | Media Bias detection, Social Bias detection | ||
CompArg: Comparative Sentences 2019 | Universität Hamburg | 2019 | 3 MB | - | - | Comparative Sentences Classification | ||
Dagstuhl-15512-ArgQuality | Dagstuhl-15512 Quality breakout group | 2017 | 1 MB | 304 | arguments | Computational Argumentation | ||
Genre-KI-04 | Webis Group | 2004 | 11 MB | 1K | documents | Web Genre Analysis | ||
IR Benchmarks | Webis Group | 2023 | - | 2K | runs | Leaderboards | ||
LFA-11 | Webis Group & FG Engels | 2011 | 5 MB | - | - | Genre and Sentiment Analysis | ||
Paderborn Genre Analysis Corpus 2012 | Baumann, Lettmann, Stein | 2012 | 20 MB | - | - | Web Genre Analysis | ||
SCAI-QReCC-21 | Webis Group | 2023 | 244 MB | 14K | conversations | Conversational Analysis (written) | ||
SMAuC – The Scientific Multi-Authorship Corpus | Webis Group | 2023 | 51 GB | 22K | documents | Authorship Analytics | ||
TexBiG | Tschirschwitz, Klemstein, Stein, Rodehorst | 2022 | 15 GB | 52K | images | Document Layout Analysis | ||
WDVC-15 | FG Engels & Webis Group | 2015 | 5 GB | 24M | revisions | Vandalism Detection | ||
WDVC-16 | FG Engels & Webis Group | 2016 | 30 GB | 83M | revisions | Vandalism Detection | ||
Webis Chatnoir-Copycat 2021 | Webis Group | 2021 | 91 TB | 7B | documents | Duplicate Detection | ||
Webis MS MARCO Anchor Text 2022 | Webis Group | 2022 | 4 GB | 7M | documents | Anchor Text | ||
Webis-Ambient-15 | Webis Group | 2015 | 114 MB | 6K | documents | Clustering/Cluster Labeling | ||
Webis-ArgImages-21 | Webis Group | 2021 | 1 MB | 3K | images | Computational Argumentation | ||
Webis-ArgKB-20 | Webis Group | 2020 | 1 MB | 5K | argumentative relations | Computational Argumentation | ||
Webis-ArgQuality-20 | Webis Group | 2020 | 3 MB | 1K | arguments | Computational Argumentation | ||
Webis-ArgRank-17 | Webis Group | 2017 | 13 MB | 18K | arguments | Computational Argumentation | ||
Webis-Argument-Attributes | Webis Group & DRL Potsdam | 2020 | 1 KB | 20 | attributes | Computational Argumentation | ||
Webis-Argument-Framing-19 | Webis Group | 2019 | 7 MB | 12K | arguments | Computational Argumentation and Framing | ||
Webis-ArgValues-22 | Webis Group | 2022 | 1 MB | 5K | arguments | Human Value Detection | ||
Webis-Bias-Flipper-18 | Webis Group | 2018 | 13 MB | 6K | documents | Natural Language Generation | ||
Webis-CausalQA-22 | Webis Group | 2022 | 17 GB | 1M | question-answer pairs | Causal Question Answering | ||
Webis-Clickbait-16 | Webis Group | 2016 | 255 MB | 3K | tweets | Clickbait Detection | ||
Webis-Clickbait-17 | Webis Group | 2017 | - | 20K | tweets | Clickbait Detection | ||
Webis-Clickbait-22 | Webis Group | 2022 | 10 MB | 5K | posts | Clickbait Spoiling | ||
Webis-CLS-10 | Webis Group | 2010 | 530 MB | 800K | documents | Cross-Language Text Classification | ||
Webis-CMV-20 | Webis Group | 2020 | 3 GB | - | argument pairs | Computational Argumentation | ||
Webis-CompQuestions-20 | Webis Group | 2020 | 1 MB | 15K | questions | Comparative Question Classification | ||
Webis-CompQuestions-22 | Webis Group | 2022 | 5 MB | 31K | questions | Comparative Question Classification | ||
Webis-ConcluGen-21 | Webis Group | 2021 | 225 MB | 136K | argument-conclusion pairs | Informative Conclusion Generation, Text Summarization | ||
Webis-Context-sensitive-Word-Search-Queries-2022 | Webis Group | 2022 | 489 MB | 24M | queries | Context-sensitive Word Search | ||
Webis-Conversational-Query-Reformulations-21 | Webis Group | 2021 | 193 KB | 3K | messages | Query classification | ||
Webis-CPC-11 | Webis Group | 2011 | 19 MB | 8K | paraphrases | Plagiarism Detection | ||
Webis-Dataset-Reviews-21 | Webis Group | 2021 | 43 MB | 539K | dataset mentions | Dataset Search | ||
Webis-Debate-16 | Webis Group | 2016 | 908 KB | 27K | text segments | Computational Argumentation | ||
Webis-Editorial-Quality-18 | Webis Group | 2018 | 3 MB | 1K | documents | Computational Argumentation | ||
Webis-Editorials-16 | Webis Group | 2016 | 5 MB | 300 | documents | Computational Argumentation | ||
Webis-EditorialSum-20 | Webis Group | 2020 | 10 MB | 1K | editorials | Text Summarization | ||
Webis-Exhibition-Questions-21 | Webis Group | 2021 | 34 MB | 849 | questions | Conversational Analysis (written) | ||
Webis-Generated-Game-Art-23 | Webis Group | 2023 | 117 MB | 110 | images | Image Generation | ||
Webis-Gmane-19 | Webis Group | 2019 | 160 GB | 153M | emails | Dialog Analysis | ||
Webis-Health-CauseNet-22 | Webis Group | 2022 | 1 GB | 8M | sentences | Health Causal Relation Analysis | ||
Webis-Health-Misbeliefs-21 | Webis Group | 2021 | 200 KB | - | terms | Query Analysis | ||
Webis-KIQC-13 | Webis Group | 2013 | 1 MB | 3K | questions | Known-Item Search | ||
Webis-Mnemonics-17 | Webis Group | 2017 | 2 MB | 1K | mnemonics | Password analysis | ||
Webis-News-Bias-20 | Webis Group | 2020 | 14 MB | 7K | articles | News analysis, Media Bias detection | ||
Webis-NIL-21 | Webis Group | 2021 | 392 KB | 37K | log entries | Query identification | ||
Webis-Nudged-Questions-23 | Webis Group | 2023 | 125 MB | 9K | questions | Conversational Analysis | ||
Webis-ODP-10 | Webis Group | 2010 | 113 MB | 5M | documents | Clustering/Cluster Labeling | ||
Webis-PC-08 | Webis Group | 2008 | 298 MB | - | - | Plagiarism Detection | ||
Webis-Persuasive-Debaters-on-Reddit-CMV-2022 | Webis Group | 2022 | 492 MB | 4K | debaters | Persuasiveness Analysis | ||
Webis-PRA-12 | Webis Group | 2012 | 884 KB | 14K | company names | Spelling Error Detection | ||
Webis-QInC-22 | Webis Group | 2022 | 79 MB | 13 MB | queries | Query Interpretation | ||
Webis-QSeC-10 | Webis Group | 2010 | 2 MB | - | - | Query Segmentation | ||
Webis-QSpell-17 | Webis Group | 2017 | 1 MB | - | - | Query Spelling Correction | ||
Webis-QTM-19 | Webis Group | 2019 | 2 MB | 200K | queries | Query-task mapping | ||
Webis-Revenue-10 | FG Engels & Webis Group | 2010 | 6 MB | 1K | documents | Entity and Relation Extraction | ||
Webis-SameSentiment-21 | Webis Group | 2021 | 43 MB | 704K | sentiment pair ids | Sentiment Analysis | ||
Webis-SameSide-19 | Webis Group | 2020 | 63 MB | 125K | argument pairs | Computational Argumentation | ||
Webis-SameSide-21 | Webis Group | 2021 | 150 MB | - | argument pairs | Computational Argumentation | ||
Webis-SameSideAdversarial-21 | Webis Group | 2021 | 50 KB | 175 | argument pairs | Computational Argumentation | ||
Webis-SCSmeta-21 | Webis Group | 2021 | 25 KB | 1K | turns | Conversational Analysis (spoken) | ||
Webis-SDMbridge-12 | Webis Group | 2012 | 58 MB | 15K | models | Simulation Data Mining | ||
Webis-Sentences-17 | Webis Group | 2017 | 200 GB | 3B | sentences | Text statistics | ||
Webis-SMC-12 | Webis Group | 2012 | 123 KB | - | - | Search Mission Detection | ||
Webis-Snippet-20 | Webis Group | 2020 | 11 GB | 10M | snippet-webpage pairs | Abstractive Snippet Generation, Text Summarization | ||
Webis-STEREO-21 | Webis Group | 2021 | 8 GB | 91M | cases | Text Reuse Detection | ||
Webis-TLDR-17 | Webis Group | 2017 | 2 GB | 4M | content-summary pairs | Text Summarization | ||
Webis-TRC-12 | Webis Group | 2012 | 120 MB | 150 | interaction logs | Text Reuse Detection, Paraphrasing, and Exploratory Search | ||
Webis-Trigger-Warning-Corpus-22 | Webis Group | 2023 | 54 GB | 1M | documents | Multi Label Document Classification | ||
Webis-Tripad-13-Sentiment | Webis Group | 2013 | 3 MB | 2K | reviews | Sentiment Analysis | ||
Webis-Tripad-14 | Webis Group | 2014 | 61 MB | 266K | reviews | Sentiment Analysis and Author Profiling | ||
Webis-Voice-based-and-Conversational-Argument-Search-20 | Webis Group | 2020 | 350 KB | 500 | participants | Conversational Analysis (spoken) | ||
Webis-Web-Archive-17 | Webis Group | 2017 | 94 GB | 10K | documents | Web Analysis | ||
Webis-Web-Archive-Quality-22 | Webis Group | 2022 | 18 GB | 7K | documents | Web Analysis | ||
Webis-Web-Errors-19 | Webis Group | 2019 | 1 MB | 10K | documents | Web Analysis | ||
Webis-WebSeg-20 | Webis Group | 2020 | 12 GB | 8K | documents | Web Page Segmentation | ||
Webis-WebSeg-20-Algorithm-Segmentations | Webis Group | 2021 | 7 GB | 246K | segmentations | Web Page Segmentation | ||
Webis-WikiDebate-18 | Webis Group | 2018 | 78 MB | 6M | discussions | Computational Argumentation | ||
Webis-WikiDiscussions-18 | Webis Group | 2018 | 4 GB | 6M | discussions | Computational Argumentation | ||
Webis-Wikipedia-IPC-23 | Webis Group | 2023 | 52 MB | 916K | paraphrase pairs | Paraphrasing | ||
Webis-Wikipedia-Text-Reuse-18 | Webis Group | 2018 | - | - | text segments | Text Reuse Analysis | ||
Webis-WVC-07 | Webis Group | 2007 | 12 KB | 1K | documents | Vandalism Detection | ||
Webis-YouTube8MA-18 | Webis Group | 2018 | 169 GB | 6M | documents | Video Retrieval |
PAN Corpora | ||||||||
---|---|---|---|---|---|---|---|---|
Name | Publisher/Creator | Year | Size [bytes] | Size | [units] | Default Task | Access | |
Alvi15-Text-Alignment-en-fa | Webis Group | 2015 | 2 MB | 200 | documents | Originality | ||
C10-Attribution | Webis Group | 2015 | 4 MB | - | - | Author Identification | ||
C50-Attribution | Webis Group | 2015 | 17 MB | - | - | Author Identification | ||
Cheema15-Text-Alignment-en | Webis Group | 2015 | 4 MB | - | - | Originality | ||
FIRE14-SOurce-COde-Re-use | PAN | 2014 | 16 MB | - | - | Originality | ||
Hanfi15-Text-Alignment-en-ur | Webis Group | 2015 | 3 MB | - | - | Originality | ||
Khoshnavataher15-Text-Alignment-fa | Webis Group | 2015 | 16 MB | - | - | Originality | ||
Kong15-Text-Alignment-zh | Webis Group | 2015 | 3 MB | - | - | Originality | ||
Mohtaij15-Text-Alignment-en | Webis Group | 2015 | 57 MB | - | - | Originality | ||
Palkovskii15-Text-Alignment-en | Webis Group | 2015 | 26 MB | - | - | Originality | ||
PAN-PC-09 | Webis Group | 2009 | 2 GB | 41K | documents | Plagiarism Detection | ||
PAN-PC-10 | Webis Group | 2010 | 2 GB | 27K | documents | Plagiarism Detection | ||
PAN-PC-11 | Webis Group | 2011 | 2 GB | 27K | documents | Plagiarism Detection | ||
PAN-SemEval-Hyperpartisan-News-Detection-19 | Webis & Factmata | 2018 | 1 GB | 751K | articles | Hyperpartisan News Detection | ||
PAN-WQF-12 | Webis Group | 2012 | 4 GB | 2M | documents | Quality Flaw Prediction | ||
PAN-WVC-10 | Webis Group | 2010 | 439 MB | 32K | documents | Vandalism Detection | ||
PAN-WVC-11 | Webis Group | 2011 | 371 MB | 24K | documents | Vandalism Detection | ||
PAN11-Attribution | Webis Group | 2011 | 3 MB | - | - | Author Identification | ||
PAN12-Attribution | Webis Group | 2012 | 9 MB | - | - | Author Identification | ||
PAN12-Sexual-Predator-Identification | Webis Group | 2012 | 92 MB | - | - | Deception Detection | ||
PAN12-Source-Retrieval | Webis Group | 2012 | 1 MB | - | - | Originality | ||
PAN12-Text-Alignment | Webis Group | 2012 | 783 MB | - | - | Originality | ||
PAN13-Author-Profiling | Webis Group | 2013 | 713 MB | - | - | Author Profiling | ||
PAN13-Source-Retrieval | Webis Group | 2013 | 3 MB | - | - | Originality | ||
PAN13-Text-Alignment | Webis Group | 2013 | 35 MB | - | - | Originality | ||
PAN13-Verification | Webis Group | 2013 | 1 MB | - | - | Author Identification | ||
PAN14-Author-Profiling | Webis Group | 2014 | 205 MB | - | - | Author Profiling | ||
PAN14-Source-Retrieval | Webis Group | 2014 | 7 MB | - | - | Originality | ||
PAN14-Text-Alignment | Webis Group | 2014 | 22 MB | - | - | Originality | ||
PAN14-Verification | Webis Group | 2014 | 9 MB | - | - | Author Identification | ||
PAN15-Author-Profiling | Webis Group | 2015 | 2 MB | - | - | Author Profiling | ||
PAN15-Source-Retrieval | Webis Group | 2015 | 7 MB | - | - | Originality | ||
PAN15-Verification | Webis Group | 2015 | 3 MB | - | - | Author Identification | ||
PAN16-Author-Masking | PAN | 2016 | 2 MB | 205 | cases | Author Obfuscation | ||
PAN16-Author-Profiling | Webis Group | 2016 | 2 MB | - | - | Author Profiling | ||
PAN16-Clustering | Webis Group | 2016 | 3 MB | - | - | Author Identification | ||
PAN17-Author-Profiling | Webis Group | 2017 | 254 MB | - | - | Author Profiling | ||
PAN17-Clustering | Webis Group | 2017 | 1 MB | - | - | Author Identification | ||
PAN17-Style-Change-Detection | Webis Group | 2017 | 8 MB | - | - | Multi-Author Analysis | ||
PAN18-Attribution | Webis Group | 2018 | 4 MB | 2K | cases | Author Identification | ||
PAN18-Author-Profiling | PAN | 2018 | 7 GB | 8K | cases | Author Profiling | ||
PAN18-Style-Change-Detection | Webis Group | 2018 | 8 MB | 3K | cases | Multi-Author Analysis | ||
PAN19-Attribution | Webis Group | 2019 | 13 MB | - | - | Author Identification | ||
PAN19-Bots-and-Gender-Profiling | Webis Group | 2019 | 38 MB | - | - | Author Profiling | ||
PAN19-Celebrity-Profiling | Webis Group | 2019 | 3 GB | - | - | Author Profiling | ||
PAN19-Style-Change-Detection | Webis Group | 2019 | 10 MB | - | - | Multi-Author Analysis | ||
PAN20-Authorship-Verification | Webis Group | 2020 | 838 MB | - | - | Authorship Verification | ||
PAN20-Authorship-Verification (Large) | Webis Group | 2020 | 4 GB | - | - | Authorship Verification | ||
PAN20-Celebrity-Profiling | Webis Group | 2020 | 7 GB | - | - | Author Profiling | ||
PAN20-Profiling-Fake-News-Spreaders-in-Twitter | Webis Group | 2020 | 8 MB | - | - | Author Profiling | ||
PAN20-Style-Change-Detection | Webis Group | 2020 | 98 MB | - | - | Multi-Author Analysis | ||
PAN21-Authorship-Verification | Webis Group | 2021 | 322 MB | - | - | Authorship Verification | ||
PAN21-Profiling-Hate-Speech-Spreaders-on-Twitter | Webis Group | 2021 | 3 MB | - | - | Author Profiling | ||
PAN21-Style-Change-Detection | Webis Group | 2021 | 19 MB | - | - | Multi-Author Analysis | ||
PAN22-Authorship-Verification | Webis Group | 2022 | 23 MB | - | - | Authorship Verification | ||
PAN22-Profiling-Irony-and-Stereotype-Spreaders-on-Twitter | Webis Group | 2022 | 6 MB | - | - | Author Profiling | ||
PAN22-Style-Change-Detection | Webis Group | 2022 | 28 MB | - | - | Multi-Author Analysis | ||
PAN23-Multi-Author-Writing-Style-Analysis | Webis Group | 2023 | 26 MB | - | Reddit comments | Multi-Author Analysis | ||
PAN23-Profiling-Cryptocurrency-Influencers-with-Few-shot-Learning | Symanto Research | 2023 | 202 KB | - | Tweets | Author Profiling | ||
PAN23-Trigger-Detection | Webis Group | 2023 | 2 GB | 341K | fanworks | Trigger Detection | ||
Scientific Author's Writing Style Corpus 2017 | Rexha, Kröll, Ziak, Kern | 2017 | - | 66 | cases | Authorship Attribution |
Touché Corpora | ||||||||
---|---|---|---|---|---|---|---|---|
Name | Publisher/Creator | Year | Size [bytes] | Size | [units] | Default Task | Access | |
Touché20-Argument-Retrieval-for-Comparative-Questions | Webis Group | 2020 | 3 MB | 50 | topics | Argument search | ||
Touché20-Argument-Retrieval-for-Controversial-Questions | Webis Group | 2020 | 9 MB | 50 | topics | Argument search | ||
Touché21-Argument-Retrieval-for-Comparative-Questions | Webis Group | 2021 | 200 KB | 50 | topics | Argument search | ||
Touché21-Argument-Retrieval-for-Controversial-Questions | Webis Group | 2021 | 1 MB | 50 | topics | Argument search | ||
Touché22-Argument-Retrieval-for-Comparative-Questions | Webis Group | 2022 | 700 MB | 50 | topics | Argument search | ||
Touché22-Argument-Retrieval-for-Controversial-Questions | Webis Group | 2022 | 2 GB | 50 | topics | Argument search | ||
Touché22-Image-Retrieval-for-Arguments | Webis Group | 2022 | 169 GB | 24K | images | Image search | ||
Touché23-Argument-Retrieval-for-Controversial-Questions | Webis Group | 2023 | 1 MB | 50 | topics | Argument search | ||
Touché23-Evidence-Retrieval-for-Causal-Questions | Webis Group | 2023 | 1 MB | 50 | topics | Causal retrieval | ||
Touché23-Image-Retrieval-for-Arguments | Webis Group | 2023 | 1 TB | 56K | images | Image search | ||
Touché23-ValueEval | Webis Group | 2023 | 1 MB | 9K | arguments | Human Value Detection |
Other Corpora | ||||||||
---|---|---|---|---|---|---|---|---|
Name | Publisher/Creator | Year | Size [bytes] | Size | [units] | Default Task | Access | |
20 Newsgroups | Carnegie Mellon University | 1999 | 18 MB | 20K | documents | Text Classification, Text Clustering | ||
7Sectors-WebKB | CMU World Wide Knowledge Base | 2001 | 6 MB | 5K | documents | Text Classification, Text Clustering | ||
A Corpus of Plagiarised Short Answers | University of Sheffield | 2009 | 80 KB | 100 | documents | Plagiarism Detection | ||
ABCD (Agreement By Create Debaters) | Sara Rosenthal | 2015 | 42 MB | 10K | dialogues | Conversation Analysis (written, human-human) | ||
AgreeSum | New York University | 2021 | 12 MB | 18K | multiple articles-summary pairs | Text Summarization, Multi-document | ||
All The News | Kaggle | 2020 | 3 GB | 3M | news articles | Text Summarization, Text Analysis | ||
Annotated Customer Reviews | Simon Fraser University Burnaby | 2004 | 870 KB | - | - | Sentiment Analysis | ||
Any-Aspect Summarization | Carnegie Mellon University | 2020 | 2 GB | 280K | article-summary pairs | Text Summarization | ||
AOL Query Log | AOL | 2006 | 2 GB | 112M | queries | Query Log Analysis | ||
Araucaria Argumentation Corpus | University of Dundee | 2014 | 9 MB | 664 | examples | Computational Argumentation | ||
Arguing Subjectivity Corpus | University of Pittsburgh | 2012 | 732 KB | 84 | documents | Computational Argumentation | ||
Argument Annotated Essays, v1 | TU Darmstadt | 2014 | 5 MB | 90 | essays | Computational Argumentation | ||
Argument Annotated Essays, v2 | TU Darmstadt | 2016 | 2 MB | 402 | essays | Computational Argumentation | ||
Argument Aspect Corpus | Leibniz-Institute for Media Research / Hans-Bredow Institute | 2022 | 2 MB | - | arguments, chunks | Computational Argumentation | ||
Arxiv-PubMed Corpus | Georgetown University | 2018 | 4 GB | 350K | article-abstract pairs | Text Summarization, Scientific Document Summarization | ||
AWTP (Agreement in Wikipedia Talk Pages) | Sara Rosenthal | 2012 | 235 KB | 822 | dialogues | Conversation Analysis (written, human-human) | ||
Bergsma-Wang-Corpus 2007 | S. Bergsma and Q. I. Wang | 2007 | 2 MB | 2K | queries | Web Search Analysis | ||
BigPatent Summarization Corpus | Khoury College of Computer Sciences | 2019 | 6 GB | 1M | article-summary pairs (US patents) | Text Summarization | ||
Bill Summarization Corpus | FiscalNote Research | 2019 | 64 MB | 22K | article-summary pairs (US bills) | Text Summarization | ||
BLOGS06 test collection | University of Glasgow | 2006 | - | 4M | documents | Link Analysis | ||
BNC Writing Errors | J. Wagner et al. | 2007 | 274 MB | - | - | Writing Error Detection | ||
British National Corpus (XML) | BNC Consortium | 2007 | 5 GB | 4K | texts | Text Analysis (English) | ||
Brown Corpus | Brown University | 2011 | 22 MB | 500 | documents | Text Analysis (English) | ||
Burrows Authorship Corpora | Steven Burrows, RMIT University | 2010 | 8 MB | - | - | Source Code Authorship Attribution | ||
CEEAUS 2010 Beta Edition | Kobe University | 2010 | - | 2K | documents | Cross-Language Analysis | ||
Change My View Modes | Columbia University | 2017 | - | 78 | discussion threads | Computational Argumentation | ||
CLEANEVAL 2007 | University of Trento and University of Leeds | 2007 | 15 MB | 1K | documents | Main Content Extraction | ||
CLEF-IP 2009 | Information Retrieval Facility Society (IRF) | 2009 | 14 GB | 2M | documents | Patent Retrieval | ||
CLEF-IP 2010 | Information Retrieval Facility Society (IRF) | 2010 | 9 GB | 3M | documents | Patent Retrieval | ||
ClueWeb09 | Carnegie Mellon University | 2009 | 4 TB | 1B | web pages | Web Mining | ||
ClueWeb12 | Carnegie Mellon University | 2012 | 5 TB | 733M | web pages | Web Mining | ||
CNN-DailyMail | IBM | 2016 | 1 GB | 200K | article-summary pairs | Text Summarization | ||
Common Crawl | Common Crawl organization | 2009-2021 (+) | 2 PB | 3M | WARC files | Web Analysis | ||
CoNLL-2003 | University of Antwerp | 2003 | 12 MB | - | - | Named Entity Recognition | ||
ConvoSumm Corpus | Yale University | 2021 | 650 MB | 500 | comments-summary pairs | Text Summarization, Dialogue Summarization | ||
CoPhIR | Consiglio Nazionale delle Ricerche (ISTI-CNR) | 2003 | 54 GB | 106M | images | Image Retrieval | ||
CORE | The Open University | 2018 | 330 GB | 123M | documents | Data Mining | ||
DBLP | University of Massachusetts Amherst | 2006 | 910 MB | - | - | Network Analysis | ||
Dbpedia 3.5 | DBpedia | 2010 | 8 GB | - | - | Data Mining | ||
DialogSum Corpus | Zhejiang University | 2021 | 4 MB | 13K | dialogue-summary pairs with topics | Text Summarization, Dialogue Summarization | ||
DMOZ | Open Directory Project | 2010 | 11 GB | - | - | Clustering, Cluster Labeling, and Data Mining | ||
DoQA | Ixa | 2020 | 4 MB | 2437 | dialogues | Conversation Analysis (written, human-human) | ||
ECML PKDD Discovery Challenge 2008 | ECML | 2008 | 304 MB | 17M | lines | Collaborative Filtering and Spam Detection | ||
ESL 123 Mass Noun Examples | Microsoft Corporation | 2006 | 204 KB | 123 | sentences | Cross-Language Analysis | ||
Essay Argument Strength | UT Dallas | 2015 | 30 KB | 1K | scores | Essay scoring | ||
Essay Organization | UT Dallas | 2010 | 30 KB | 1K | scores | Essay scoring | ||
Essay Prompt Adherence | UT Dallas | 2014 | 38 KB | 830 | scores | Essay scoring | ||
Essay Thesis Clarity | UT Dallas | 2013 | 6 MB | 830 | scores | Essay scoring | ||
Europarl (v1 & v3) | University of Edinburgh | 2007 | 3 GB | - | - | Machine Translation | ||
European Corpus Initiative Multilingual Corpus I | European Corpus Initiative | 1994 | 824 MB | 49M | words | Text Analysis (Multilingual) | ||
Falko Essaykorpus L2 V2 | Institut für deutsche Sprache und Linguistik | 2005 | 5 MB | 248 | documents | Interlanguage Analysis | ||
Finegrained Sentiment | Uppsala University | 2011 | 4 MB | 294 | reviews | Sentiment Analysis | ||
General Inquirer Dictionary | Harvard University | 1966 | 4 MB | 182 | categories | Sentiment Analysis | ||
Google Books N-Gram 20090715 | Google | 2009 | 898 GB | - | - | Data Mining | ||
Google Web 1T 5-gram Version 1 | Google | 2006 | 55 GB | 5B | n-grams | Text Analysis (English) | ||
IBM Debater- Claim Sentences Search | IBM | 2018 | 600 MB | 2M | topic conclusion pairs | Argument Search | ||
IBM Debater- Claim Stance Dataset | IBM | 2017 | 8 MB | 2K | topic conclusion | Stance Classification | ||
IBM Debater- Claims and Evidence, ACL-14 | IBM | 2014 | 3 MB | 1K | topic argument pairs | Argument Mining | ||
IBM Debater- Claims and Evidence, EMNLP-2015 | IBM | 2015 | 8 MB | 5K | topic argument pairs | Argument Mining | ||
IBM Debater- Evidence Sentences | IBM | 2018 | 3 MB | 6K | topic premise pairs | Argument Search | ||
IBM Debater- Mention Detection Benchmark | IBM | 2018 | 2 MB | 3K | sentences | Mention Detection | ||
IBM Debater- Recorded Debating Dataset | IBM | 2018 | 2 MB | 60 | discussions | Computational Argumentation | ||
IBM Debater- Sentiment Composition Lexicon | IBM | 2018 | 10 MB | 66K | words | Sentiment Analysis | ||
IBM Debater- Sentiment Lexicon of Idiomatic Expressions | IBM | 2018 | 3 MB | 5K | phrases | Sentiment Analysis | ||
IBM Debater- TR9856 | IBM | 2015 | 2 MB | 10K | phrase pairs | Semantic Relatedness | ||
IBM Debater- Wikipedia Category Stance | IBM | 2018 | 1 MB | 5K | wikipedia category | Stance Classification | ||
IBM Debater- Word | IBM | 2018 | 4 MB | 19K | wikipedia concept pairs | Semantic Relatedness | ||
ICWSM 2009 Data Challenge | ICWSM | 2009 | 37 GB | - | - | Network Analysis | ||
imat2009 dataset | Yandex | 2009 | 650 MB | - | - | Machine-learned Ranking | ||
Intelligence Squared Debates (IQ2) | Zhang et al. | 2016 | 4 MB | 108 | dialogues | Conversation Analysis (spoken, human-human) | ||
International Corpus of Learner English v2 | Center for English Corpus Linguistics | 2009 | 92 MB | 6K | documents | Language Analysis | ||
Internet Archive | Internet Archive organization | - | 350 TB | 800K | WARC files | Web Analysis | ||
Internet Argument Corpus v2 | NLDS@UC Santa Cruz | 2016 | 3 GB | 11K | dialogues | Conversation Analysis (written, human-human) | ||
IP2Location LITE databases 2016-20 | IP2Location | 2016-2019 | 5 GB | 5 | years | IP-geolocation and proxies | ||
Key-value Retrieval Dataset | Stanford University | 2017 | 1 MB | 3K | dialogues | Conversation Analysis (written, human-wizard) | ||
Koppel Authorship Corpus | M. Koppel and J. Schler | 2004 | 4 MB | - | - | Authorship Verification | ||
Learning To Rank 3 | Microsoft | 2008 | 8 GB | - | - | Machine-learned Ranking | ||
Lee 50 Documents | M. D. Lee et al. | 2005 | 130 KB | 50 | documents | Text Similarity Analysis | ||
Maluuba Frames | Maluuba (Microsoft) | 2017 | 4 MB | 1K | dialogues | Conversation Analysis (written, human-wizard) | ||
MANtIS | Lambda-Lab at TU Delft | 2019 | 6 GB | 80K | dialogues | Conversation Analysis (written, human-human) | ||
MediaSum Corpus | Microsoft Cognitive Services Research Group | 2021 | 2 GB | 463K | interview transcript-summary pairs | Text Summarization, Dialogue Summarization | ||
MEDLINE-PubMed Corpus | University of Zürich | 2018 | 7 GB | 5M | article-abstract & abstract-title pairs | Text Summarization, Scientific Document Summarization | ||
METER Corpus | Department of Journalism and Department of Computer Science at Sheffield University | 2002 | 10 MB | - | - | Text Reuse | ||
MIR Flickr 2008 | LIACS Medialab at Leiden University, Netherlands | 2008 | 3 GB | 25K | documents | Image Retrieval | ||
MISC | Microsoft | 2017 | 23 GB | 110 | dialogues | Conversation Analysis (spoken, human-human) | ||
Montclair Electronic Language Database | Montclair State University | 2001 | 56 KB | 33 | documents | Cross-Language Analysis | ||
Movie Review Data | Cornell University | 2004-2005 | 219 MB | 12K | reviews | Sentiment Analysis | ||
Movielens | University of Minnesota | 1998-2009 | 74 MB | 11M | ratings | Collaborative Filtering | ||
MPC (Multi-Party Chat) | Shaikh et al. | 2010 | 2 MB | 14 | dialogues | Conversation Analysis (written, human-human) | ||
MSMARCO Conversational Search | Microsoft | 2019 | 1 GB | 2M | synthetic search sessions | Next Query Prediction | ||
Multi Domain Sentiment Dataset (Processed ACL) | Johns Hopkins University | 2007 | 29 MB | - | - | Sentiment Analysis | ||
Multi-Aspect Summarization | Amazon Research | 2019 | 946 MB | 280K | article-summary pairs | Text Summarization | ||
Multi-News | Yale University | 2019 | 676 MB | 54K | multiple articles-summary pairs | Text Summarization, Multi-document | ||
Multi-XScience | Mila | 2020 | 61 MB | 40K | article-summary pairs | Text Summarization, Scientific Document Summarization | ||
Multilingual Amazon Reviews | P. Keung et al. | 2020 | 640 MB | 1M | reviews | Text Classification (Multilingual) | ||
MultiWOZ 2.1 | M. Eric et al. | 2020 | 19 MB | 10K | dialogues | Conversation Analysis (written, human-wizard) | ||
NBC 2016 Russian Troll Tweets | NBC | 2018 | 34 MB | 267K | tweets | Propaganda detection | ||
Netflix Challenge (Partial) | Netflix | 2006 | 2 GB | - | - | Collaborative Filtering | ||
New York Times Corpus | New York Times | 2008 | 3 GB | 2M | articles | Text Mining | ||
Newsroom | Cornell University | 2018 | 5 GB | 1M | article-summary pairs | Text Summarization | ||
ODP239 | C. Carpineto and G. Romano | 2009 | 5 MB | - | - | Subtopic Information Retrieval | ||
OHSUMED Test Collection | Oregon Health & Science University | 1994 | 461 MB | - | - | Text Clustering | ||
OpenWebText Corpus | Brown University | 2019 | 40 GB | 8M | documents | Language Modeling, Text Synthesis | ||
OPUS (Europarl3_0b and EMEA0) | Jörg Tiedemann | 2009 | 9 GB | 22 | languages | Machine Translation | ||
OR-QuAC | C. Qu et al. | 2020 | 10 GB | 6K | dialogues | Conversation Analysis (written, human-wizard), Question Answering | ||
PRESTO | Google | 2022 | 397 MB | 550K | dialogues | Conversation Analysis (written, human-system) | ||
QuAC | E. Choi et al. | 2018 | 75 MB | 14K | dialogues | Conversation Analysis (written, human-wizard), Question Answering | ||
RadioTalk | Laboratory for Social Machines, MIT Media Lab | 2019 | 9 GB | 3B | words | Language Analysis | ||
Reason Identification and Classification Dataset | UT Dallas | 2014 | 4 MB | - | - | Computational Argumentation | ||
Reddit TIFU corpus | Seoul National University | 2019 | 640 MB | 123K | content-summary pairs | Text Summarization | ||
Request For Comments Collections (to 4501) | RFC Editor | 2008 | 55 MB | 4K | documents | Data Mining | ||
Reuters 21578 (22173) | Reuters, David D. Lewis | 1996 | 8 MB | 22K | articles | Text Clustering | ||
Reuters RCV1 | Reuters, David D. Lewis | 2000 | 1 GB | 365 | documents | Text Clustering | ||
Reuters RCV1 - CCAT split | Reuters, David D. Lewis | 2002 | 2 GB | - | - | Machine Learning | ||
Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection | National Research Council of Canada | 2009 | 166 MB | - | - | Cross-Language Categorization | ||
Rovereto Twitter N-Gram Corpus | University of Trento, Italy | 2011 | 5 GB | 75M | tweets | Social Network Analysis | ||
ScisummNet Corpus | Yale University | 2019 | 15 MB | 1000 | scientific paper-summary pairs (with citation networks) | Text Summarization, Scientific Document Summarization | ||
SILS Learner Corpus of English | Waseda University | 2007 | 16 MB | - | - | Cross-Language Analysis | ||
SMS Spam Collection v | T. A. Almeida and J. M. G. Hidalgo | 2011 | 210 KB | 6K | messages | Spam Identification | ||
Spoken Conversational Search Data Set | J.R. Trippas et al. | 2017 | 260 KB | 101 | dialogues | Conversation Analysis (written, human-human) | ||
Spotify Podcasts Dataset | Clifton et al. | 2020 | 2 TB | 50K | hours | Conversation Analysis (spoken, human-human) | ||
SumPubMed Corpus | University of Utah | 2021 | 608 MB | 33K | scientific paper-summary pairs | Text Summarization, Scientific Document Summarization | ||
TED-LIUM Release 3 | Ubiqus and LIUM | 2018 | 50 GB | 452 | hours | Speech Recognition | ||
The JRC-Acquis Multilingual Parallel Corpus (3) | European Commission's Office for Official Publications (OPOCE) | 2009 | 2 GB | - | - | Cross-Language Research | ||
TIPSTER Complete | Advanced Research Projects Agency | 1993 | 1 MB | - | - | Information Retrieval | ||
Topical Chat Dataset | Amazon | 2019 | 76 MB | 11K | dialogues | Conversation Analysis (written, human-human) | ||
TREC vol4 | National Institute of Standards and Technology (NIST) | 1996 | 436 MB | 295K | documents | Data Mining | ||
TREC vol5 | National Institute of Standards and Technology (NIST) | 1997 | 389 MB | 260K | documents | Data Mining | ||
TREC web | National Institute of Standards and Technology (NIST) | 1999-2004 | 90 GB | - | - | Data Mining | ||
TripAdvisor Data Set | University of Illinois at Urbana-Champaign | 2010 | 220 MB | - | - | Opinion Mining | ||
Tswana Learner English Corpus | Center for Text Technology | 2006 | 2 MB | - | - | Cross-Language Analysis | ||
Twitter tweets | Yang and Leskovec | 2011 | 26 GB | 467M | tweets | Social Network Analysis | ||
Twitter tweets (RecSys Challenge) | | 2020 | 76 GB | 160M | tweets | Social Network Analysis | ||
UKPConvArg1 | TU Darmstadt | 2016 | 21 MB | 16K | argument pairs | Computational Argumentation | ||
UKPConvArg2 | TU Darmstadt | 2016 | 23 MB | 9K | argument pairs | Computational Argumentation | ||
Uppsala Student English | Uppsala University | 2001 | 3 MB | 2K | documents | Cross-Language Analysis | ||
USPTO Patents from 2001 to 2010 | U.S. Patent & Trademark Office | 2010 | 10 TB | - | - | Patent Analysis | ||
VQuAnDa | Kacupaj et al. | 2020 | 2 MB | 5K | question-answer-SPARQL query triplets | Answer Verbalization | ||
WaCKy: deWaC | Web-As-Corpus Kool Yinitiative | 2009 | 26 GB | 2B | words | Text Analysis (German) | ||
WaCKy: frWaC | Web-As-Corpus Kool Yinitiative | 2009 | 5 GB | 2B | words | Text Analysis (French) | ||
WaCKy: itWaC | Web-As-Corpus Kool Yinitiative | 2009 | 31 GB | 2B | words | Text Analysis (Italian) | ||
WaCKy: sdeWaC | Web-As-Corpus Kool Yinitiative | 2009 | 20 GB | 1B | words | Text Analysis (German) | ||
WaCKy: ukWaC | Web-As-Corpus Kool Yinitiative | 2009 | 15 GB | 2B | words | Text Analysis (English) | ||
WaCKy: WaCkypedia_EN | Web-As-Corpus Kool Yinitiative | 2009 | 6 GB | 1B | words | Text Analysis (English) | ||
WCEP MDS Dataset: Wikipedia Current Events Portal | Aylien Ltd., Dublin, Ireland | 2020 | 2 GB | 2M | document clusters with one human-written summary per cluster | Text Summarization, Multi-document Summarization | ||
Web People Search Corpus (WePS-1) | NLP Group (UNED), Proteus Project (NYU) | 2007 | 295 MB | 2K | web pages | Person Disambiguation, Text Clustering | ||
Web People Search Corpus (WePS-2) | NLP Group (UNED), Proteus Project (NYU) | 2009 | 328 MB | 3K | web pages | Person Disambiguation, Text Clustering | ||
Web People Search Corpus (WePS-3) | NLP Group (UNED), Proteus Project (NYU) | 2010 | 571 MB | 50K | web pages | Person Disambiguation, Text Clustering | ||
WikiHow Summarization Corpus | University of California | 2018 | 2 GB | 230K | article-summary, paragraph-summary pairs | Text Summarization | ||
Wikipedia Full Dump | Wikimedia Foundation | 2011 | 5 TB | - | - | Data Mining | ||
Wikipedia History Snapshots | Wikimedia Foundation | 2006-2012 | 32 GB | - | - | Data Mining | ||
Wikipedia Participation Challenge | Wikimedia Foundation | 2011 | 976 MB | - | - | User Behaviour Prediction | ||
Wikipedia Revision Dump | Wikimedia Foundation | 2006 | 46 GB | - | - | Data Mining | ||
Wikipedia Revision Dump | Wikimedia Foundation | 2008 | 133 GB | - | - | Data Mining | ||
Wikipedia Snapshots | Wikimedia Foundation | 2006-2012 | 280 GB | - | - | Data Mining | ||
WikiSum Corpus | Amazon | 2021 | 115 MB | 40K | article-summary pairs | Text Summarization | ||
Wordsim353 | L. Finkelstein et al. | 2002 | 60 KB | 353 | word pairs | Word Similarities | ||
Wortschatz Leipzig | Universität Leipzig | 2006 | 8 GB | 15 | languages | Text Analysis (Multilingual) | ||
XL-Sum Corpus | Bangladesh University of Engineering and Technology | 2021 | 1 GB | 1M | article-summary pairs | Text Summarization, Multilingual Text Summarization | ||
XSum Corpus | University of Edinburgh | 2018 | 240 MB | 214K | article-summary pairs | Text Summarization | ||
Yahoo Learning To Rank Challenge 2010 | Yahoo | 2010 | 421 MB | - | - | Document Ranking | ||
Yahoo N-Grams | Yahoo | 2006 | 13 GB | - | - | Text Analysis (English) |