unarXive

Access

Data Set on Zenodo: full / permissively licensed subset
Data Sample
ML Data on Hugging Face: citation recommendation / IMRaD classification

Documentation

Publications
- Scientometrics (author copy) (2020)
- JCDL 2023 (author copy) (2023)
Data Format
Usage
Development
Cite

Data

unarXive contains

1.9 M structured paper full-texts, containing
- 63 M references (28 M linked to OpenAlex)
- 134 M in-text citation markers (65 M linked)
- 9 M figure captions
- 2 M table captions
- 742 M pieces of mathematical notation preserved as LaTeX
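
As a rough illustration of what these counts imply, the following sketch derives link coverage and average references per paper from the rounded figures listed above. The variable and function names are made up for this example, and the results are only as precise as the rounded inputs.

```python
# Counts as stated in the list above, in millions (rounded figures).
PAPERS_M = 1.9
REFERENCES_M = 63
REFERENCES_LINKED_M = 28
MARKERS_M = 134
MARKERS_LINKED_M = 65


def share(part: float, whole: float) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


reference_link_rate = share(REFERENCES_LINKED_M, REFERENCES_M)
marker_link_rate = share(MARKERS_LINKED_M, MARKERS_M)
references_per_paper = round(REFERENCES_M / PAPERS_M, 1)

print(f"references linked to OpenAlex: {reference_link_rate}%")
print(f"in-text citation markers linked: {marker_link_rate}%")
print(f"average references per paper: {references_per_paper}")
```

Under these rounded inputs, a bit under half of all references and in-text citation markers are linked.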

Comprehensive documentation of the data format can be found here.

You can find a data sample here.

Usage

Hugging Face Datasets

If you want to use unarXive for citation recommendation or IMRaD classification, you can use our Hugging Face datasets (linked in the Access section above).

For example, in the case of citation recommendation:

from datasets import load_dataset

citrec_data = load_dataset('saier/unarxive_citrec')
citrec_data = citrec_data.class_encode_column('label')  # assign target label column
citrec_data = citrec_data.remove_columns('_id')         # remove sample ID column
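
To make the two preprocessing calls above more concrete, here is a plain-Python sketch of roughly what they do, run on made-up records rather than the real dataset. The column names `_id` and `label` come from the snippet above; the `text` column, all values, and the helper functions are assumptions for illustration only. The sorted label order is also an assumption about how the library assigns class ids.

```python
# Toy records standing in for dataset rows; values are invented.
# "_id" and "label" are the column names used above, "text" is assumed.
records = [
    {"_id": "a1", "text": "... as shown in [CIT] ...", "label": "paper-B"},
    {"_id": "a2", "text": "... following [CIT] we ...", "label": "paper-A"},
    {"_id": "a3", "text": "... unlike [CIT], our ...", "label": "paper-B"},
]


def encode_labels(rows, column="label"):
    """Replace string labels with integer class ids (sorted label order)."""
    names = sorted({row[column] for row in rows})
    name_to_id = {name: idx for idx, name in enumerate(names)}
    encoded = [{**row, column: name_to_id[row[column]]} for row in rows]
    return encoded, names


def drop_column(rows, column):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


encoded, label_names = encode_labels(records)
prepared = drop_column(encoded, "_id")

print(label_names)
print(prepared[0])
```

Encoding the label column as integer classes is what lets a classifier treat each cited paper as a target class, and dropping the sample ID keeps it from being passed to a model as a feature.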

Development

For instructions on how to re-create or extend unarXive, see src/.

Versions

Current release (1991–2022): see Access section above
Previous releases (old format):
- 1991–Jul 2020
- 1991–2019

Development Status

See issues.

Cite as

Current version

@inproceedings{Saier2023unarXive,
  author        = {Saier, Tarek and Krause, Johan and F\"{a}rber, Michael},
  title         = {{unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network}},
  booktitle     = {2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  year          = {2023},
  pages         = {66--70},
  month         = jun,
  doi           = {10.1109/JCDL57899.2023.00020},
  publisher     = {IEEE Computer Society},
  address       = {Los Alamitos, CA, USA},
}

Initial publication

@article{Saier2020unarXive,
  author        = {Saier, Tarek and F{\"{a}}rber, Michael},
  title         = {{unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata}},
  journal       = {Scientometrics},
  year          = {2020},
  volume        = {125},
  number        = {3},
  pages         = {3085--3108},
  month         = dec,
  issn          = {1588-2861},
  doi           = {10.1007/s11192-020-03382-z}
}
