ACL Anthology UIUC Corpus

This repository contains a data set of papers in the natural language processing domain, which is a subset of the ACL Anthology. Specifically, it includes 40,367 papers whose copyright belongs to the ACL, published up to October 2018. Using OCR software, we analyzed the papers to extract text, citations, figures, and tables. The following sections describe the files in this repository. (In general, file names correspond to the paper identifiers used in the ACL Anthology.)

This data was curated and preprocessed by Saar Kuzi, a Ph.D. student at the University of Illinois at Urbana-Champaign. For any questions, please feel free to contact me at skuzi2@illinois.edu.

PDF files

The original PDF files of the papers.

Paper text and logical structure

We used the Grobid toolkit to extract the text from the PDF files. Each XML file corresponds to a single paper. The output of this toolkit also provides the logical structure of the paper, including its sections, title, abstract, figure/table locations, and citations.
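
As an illustration, the sketch below reads a title, an abstract, and section headings from a Grobid-style XML document using only the Python standard library. It assumes the usual Grobid output format (TEI XML in the http://www.tei-c.org/ns/1.0 namespace); the sample document and the helper names (extract_fields, _text) are made up for this example and are not part of the corpus, so the element paths may need adjusting for the actual files.

```python
import xml.etree.ElementTree as ET

# Grobid writes TEI XML; elements are assumed to live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Small hand-written stand-in for a Grobid output file (not taken from the corpus).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
    </fileDesc>
    <profileDesc>
      <abstract><p>We study a sample problem.</p></abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head n="1">Introduction</head><p>Parsing is hard.</p></div>
      <div><head n="2">Method</head><p>We parse.</p></div>
    </body>
  </text>
</TEI>"""


def _text(element):
    """Join all text inside an element and collapse whitespace."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_fields(tei_xml):
    """Return the title, abstract, and section headings of one paper."""
    root = ET.fromstring(tei_xml)
    title = _text(root.find(".//tei:titleStmt/tei:title", TEI_NS))
    abstract = _text(root.find(".//tei:profileDesc/tei:abstract", TEI_NS))
    sections = [_text(head)
                for head in root.findall(".//tei:body/tei:div/tei:head", TEI_NS)]
    return {"title": title, "abstract": abstract, "sections": sections}


fields = extract_fields(SAMPLE_TEI)
print(fields["title"])
print(fields["sections"])
```

To process a file from the repository instead of the sample string, the file contents can be read and passed to extract_fields in the same way.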

Selected paper fields for easy indexing

For each paper, we provide the title, abstract, and introduction.

Figures and Tables

Metadata

We used the PDFfigures toolkit to extract figures and tables from the papers. Each JSON file corresponds to a single paper and contains information about the figures and tables in that paper, including the caption and the name of the corresponding image file (the image file itself can be found in a separate folder in this repository; see below).
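
A minimal sketch of reading such a file is shown below. The key names (figType, name, caption, renderURL) are an assumption based on the output of the pdffigures2 tool; the files in this repository may use different keys, so please check one JSON file before relying on them. The sample records and image file names are invented for the example.

```python
import json

# Hand-written sample in an assumed pdffigures2-like layout (not taken from the corpus).
SAMPLE_JSON = """[
  {"figType": "Figure", "name": "1",
   "caption": "Figure 1: System overview.", "renderURL": "paper-Figure1-1.png"},
  {"figType": "Table", "name": "2",
   "caption": "Table 2: Results.", "renderURL": "paper-Table2-1.png"}
]"""


def list_items(json_text):
    """Return (type, number, caption, image file) for every figure/table record."""
    records = json.loads(json_text)
    return [
        (
            record.get("figType", ""),    # assumed key: "Figure" or "Table"
            record.get("name", ""),       # assumed key: figure/table number
            record.get("caption", ""),    # assumed key: caption text
            record.get("renderURL", ""),  # assumed key: image file name
        )
        for record in records
    ]


items = list_items(SAMPLE_JSON)
for kind, number, caption, image in items:
    print(kind, number, image, caption)
```

Using record.get with a default keeps the sketch from failing when a key is missing, which is helpful while the actual layout is still being confirmed.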

Image files

The image files of the tables and figures in the papers. Each file name specifies the paper ID and the figure/table number.

Data for easy indexing

For each paper, we provide an XML file that contains the paper's figures and tables. A 'figure' ('table') tag marks the beginning of the data related to that figure (table), which consists of the following fields:

mentionX - X words surrounding an explicit mention of the figure in the paper (multiple mentions are merged).
linesX - X lines surrounding an explicit mention of the figure in the paper (multiple mentions are merged).
snippetX - A snippet of X lines that can be presented in a search engine.
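
To make the mentionX field more concrete, the sketch below shows one way such a window could be computed. It is an approximation for illustration only, not the code used to build the corpus: it assumes that "X words surrounding" means X words on each side of the mention, that a mention looks like "Figure 1" or "Table 2", and that overlapping windows are merged into one. The function name mention_window is hypothetical.

```python
import string


def mention_window(text, kind, number, x):
    """Collect x words on each side of every '<kind> <number>' mention.

    Overlapping or touching windows are merged; separate windows are
    joined with ' ... '.
    """
    words = text.split()
    spans = []  # list of [start, end) word-index ranges
    for i in range(len(words) - 1):
        # Ignore surrounding punctuation, e.g. "(Figure" or "1)."
        label = words[i].strip(string.punctuation).lower()
        num = words[i + 1].strip(string.punctuation)
        if label == kind.lower() and num == str(number):
            start = max(0, i - x)
            end = min(len(words), i + 2 + x)
            if spans and start <= spans[-1][1]:
                # This window overlaps the previous one: extend it.
                spans[-1][1] = max(spans[-1][1], end)
            else:
                spans.append([start, end])
    return " ... ".join(" ".join(words[s:e]) for s, e in spans)


text = ("We present the model in Figure 1 and evaluate it later. "
        "As Figure 1 shows, accuracy improves. Table 2 lists the data.")
print(mention_window(text, "Figure", 1, 2))
print(mention_window(text, "Figure", 1, 3))
```

With a window of 2 the two mentions stay separate, while with a window of 3 their windows overlap and are merged into a single passage.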

We also provide, for each figure (table), the fields of its paper (title, abstract, and introduction). This data can be used, for example, to build the index of a search engine, as was done in this project.