This repository contains a data set of papers in the natural language processing domain, which is a subset of the ACL Anthology. Specifically, it includes 40,367 papers whose copyright belongs to ACL, up to October 2018. Using OCR software, we analyzed the papers to extract text, citations, figures, and tables. The following sections describe the different files in this repository. (In general, the names of the files in this repository correspond to the paper identifiers used in the ACL Anthology.)
This data was curated and preprocessed by Saar Kuzi, a Ph.D. student at the University of Illinois Urbana-Champaign. For any questions, please feel free to contact me at skuzi2@illinois.edu.
The original PDF files of the papers.
We used the Grobid toolkit to extract the text from the PDF files. Each xml file corresponds to a single paper. The output of this toolkit also provides the logical structure of the paper, including the paper sections, title, abstract, figure/table locations, and citations.
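As an illustration of reading one of these files, here is a minimal Python sketch that pulls the title and abstract out of a Grobid-style TEI document. Grobid writes TEI XML in the `http://www.tei-c.org/ns/1.0` namespace; the function name `read_title_and_abstract` and the small `SAMPLE` document are made up for this example, and the exact element layout in the files of this repository may differ, so treat the paths as assumptions to verify against a real file.

```python
import xml.etree.ElementTree as ET

# Grobid output is TEI XML; its elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_title_and_abstract(xml_text):
    """Return (title, abstract) from a TEI document given as a string.

    Assumes the title sits under <titleStmt> and the abstract under
    <profileDesc>, as in typical Grobid output.
    """
    root = ET.fromstring(xml_text)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)

    # Collect all nested text and collapse runs of whitespace.
    def flatten(element):
        if element is None:
            return ""
        return " ".join("".join(element.itertext()).split())

    return flatten(title), flatten(abstract)


# Hand-written stand-in for one paper's xml file (not real data).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">A Sample Title</title>
      </titleStmt>
    </fileDesc>
    <profileDesc>
      <abstract>
        <p>First sentence.</p>
        <p>Second sentence.</p>
      </abstract>
    </profileDesc>
  </teiHeader>
</TEI>"""

title, abstract = read_title_and_abstract(SAMPLE)
print(title)
print(abstract)
```

To apply this to a file from the repository, read the file contents into a string and pass it to the function; papers for which Grobid found no abstract would yield an empty string under this sketch.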
For each paper, we provide the title, abstract, and introduction.
We used the PDFfigures toolkit to extract figures and tables from the papers. Each json file corresponds to a single paper and contains information about the different figures/tables in the paper, including the caption and the name of the corresponding image file (the image file itself can be found in a separate folder in this repository - see below).
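A rough Python sketch of listing the figures and tables recorded in one such json file is shown below. The key names (`figType`, `name`, `caption`, `renderURL`) are assumptions modeled on pdffigures2-style output and may not match the files here, and the sample entries, including the image file names, are invented for illustration only. Inspect a real file to confirm the actual keys before relying on this.

```python
import json

# Hand-written stand-in for one paper's json file (not real data).
# Key names and file names below are assumptions, not confirmed
# against the files in this repository.
SAMPLE = """[
  {"figType": "Figure", "name": "1",
   "caption": "Figure 1: Model overview.",
   "renderURL": "example-paper-Figure1-1.png"},
  {"figType": "Table", "name": "2",
   "caption": "Table 2: Main results.",
   "renderURL": "example-paper-Table2-1.png"}
]"""


def list_figures(json_text):
    """Return one (type, number, caption, image file) tuple per entry."""
    entries = json.loads(json_text)
    return [
        (e.get("figType"), e.get("name"), e.get("caption"), e.get("renderURL"))
        for e in entries
    ]


for fig_type, number, caption, image_file in list_figures(SAMPLE):
    print(fig_type, number, image_file, caption)
```

Using `dict.get` means a missing key yields `None` instead of raising an error, which is convenient while exploring files whose exact schema is not yet known.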
The image files of tables/figures of the papers. The file name specifies the paper id and the figure/table number.
For each paper, we provide an xml file which contains the paper's figures and tables. A 'figure' ('table') tag marks the beginning of the data related to the figure (table), as follows: