Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/allofphysicsgraph/latex-in-arxiv
extract math latex from content in arxiv
- Host: GitHub
- URL: https://github.com/allofphysicsgraph/latex-in-arxiv
- Owner: allofphysicsgraph
- Created: 2020-05-27T12:09:22.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-09-07T00:24:23.000Z (4 months ago)
- Last Synced: 2024-09-07T03:10:22.732Z (4 months ago)
- Topics: latex
- Language: C++
- Size: 370 MB
- Stars: 4
- Watchers: 3
- Forks: 1
- Open Issues: 21
Metadata Files:
- Readme: README.md
README
# Overview
_Goal_: extract math LaTeX from `.tex` content available from arXiv.

_Caveat when cloning this repo_: Total download size is 640 MB.
## quick start
Read `latex-in-arxiv/postings_list/README.md`
Everything is containerized, so in this repo (`latex-in-arxiv/`) use either `make docker` (for Linux) or `make docmac` (for Mac).

To run the application, within the Docker image run `/opt/scanner.out .`

To recompile the scanner, within the Docker image run
```bash
cd latex-in-arxiv/src/postings_list/query
make scanner
make read_tf_idf
./scanner.out .
./scanner.out . offsets
./read_tf_idf.out tf_idf # the vocabulary for TF-IDF uses the tokens from parsed Latex
# TF-IDF is used to identify the most relevant variable in a paper, i.e. the one whose definition to look for
```
## so what?
Suppose you have a `.tex` file that contains math, like
```latex
\documentclass{article}
\title{test}
\begin{document}
\maketitle
\section{Introduction}
This is a great paper.
\begin{equation}
a+b = c
\end{equation}
Where $c$ is some variable.
\end{document}
```
There's an expression, `a+b=c` and an in-line variable `c`.
How can the expression and the variables be extracted?

There are a few options for parsing LaTeX; see

The options that give decent-quality results are also slow.

This repo uses [`ragel`](https://www.colm.net/open-source/ragel/) to quickly parse LaTeX and find math.
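To make the extraction task concrete, here is a minimal Python sketch that pulls the display equation and the in-line variable out of the sample `.tex` file above. It is a regex stand-in for illustration only, not the Ragel scanner this repo uses; the names `MATH_PATTERN`, `extract_math`, and `SAMPLE_TEX` are hypothetical, and the pattern deliberately ignores `$$...$$`, `\[...\]`, `\(...\)`, comments, and environments such as `align`.

```python
import re

# The sample document from this README (raw string so backslashes stay literal).
SAMPLE_TEX = r"""\documentclass{article}
\title{test}
\begin{document}
\maketitle
\section{Introduction}
This is a great paper.
\begin{equation}
a+b = c
\end{equation}
Where $c$ is some variable.
\end{document}
"""

# Assumption: only equation/equation* environments and single-dollar inline
# math are handled. A "$" preceded by a backslash is treated as escaped.
MATH_PATTERN = re.compile(
    r"\\begin\{(equation\*?)\}(?P<display>.*?)\\end\{\1\}"
    r"|(?<!\\)\$(?P<inline>[^$]+?)(?<!\\)\$",
    re.DOTALL,
)


def extract_math(tex: str) -> list[tuple[str, str]]:
    """Return (kind, latex) pairs in document order; kind is 'display' or 'inline'."""
    found = []
    for match in MATH_PATTERN.finditer(tex):
        if match.group("display") is not None:
            found.append(("display", match.group("display").strip()))
        else:
            found.append(("inline", match.group("inline").strip()))
    return found


print(extract_math(SAMPLE_TEX))
# prints [('display', 'a+b = c'), ('inline', 'c')]
```

A regex like this is easy to write but breaks on nested or unusual markup, which is one reason a generated state machine is attractive for scanning many papers.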
## get data
### an option that's free is a few years of arxiv data
In the directory `latex-in-arxiv/get_sample_data` use
```bash
make get_sample_data
```
### arXiv API calls
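The query in this section can also be built and parsed from a script. The following is a minimal Python sketch, assuming the API returns an Atom feed; the helper names `build_query_url` and `entry_titles` are hypothetical, and `SAMPLE_FEED` is a made-up stand-in for a real response so the example runs without network access.

```python
import urllib.parse
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up response body for illustration; a real feed has many more fields.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>A hypothetical paper title</title></entry>
</feed>"""


def build_query_url(search_query: str) -> str:
    """Percent-encode the query (keeping ':') and append it to the API endpoint."""
    params = {"search_query": search_query}
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote, safe=":")
    return f"{API_URL}?{query}"


def entry_titles(feed_xml: str) -> list[str]:
    """Collect the <title> text of every <entry> in an Atom feed."""
    root = ET.fromstring(feed_xml)
    return [
        entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip()
        for entry in root.findall("atom:entry", ATOM_NS)
    ]


url = build_query_url("all:rigorous derivation")
print(url)
# To fetch for real (not run here): urllib.request.urlopen(url).read().decode()
print(entry_titles(SAMPLE_FEED))
```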
```bash
# curl 'http://export.arxiv.org/api/query?search_query=all:rigorous%20derivation'
```
### bulk processing: another option is the full arXiv data available from an S3 bucket
for details, see
```bash
# s3cmd get s3://arxiv/src/arXiv_src_manifest.xml . --requester-pays
# s3cmd get s3://arxiv/src/arXiv_src_9912_001.tar . --requester-pays
```
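Because the bucket is requester-pays, it can help to read the manifest first and download only the tar files you need. The sketch below is a hedged Python example: it assumes the manifest is a list of `<file>` elements with `<filename>`, `<size>`, and `<yymm>` children, which should be checked against the actual `arXiv_src_manifest.xml`. The function name `tar_files_for_month` is hypothetical and the sizes in `SAMPLE_MANIFEST` are invented.

```python
import xml.etree.ElementTree as ET

# Invented manifest excerpt; element names are an assumption about the real file.
SAMPLE_MANIFEST = """<arXivSRC>
  <file>
    <filename>src/arXiv_src_9912_001.tar</filename>
    <size>1000</size>
    <yymm>9912</yymm>
  </file>
  <file>
    <filename>src/arXiv_src_0001_001.tar</filename>
    <size>2000</size>
    <yymm>0001</yymm>
  </file>
</arXivSRC>"""


def tar_files_for_month(manifest_xml: str, yymm: str) -> list[tuple[str, int]]:
    """Return (filename, size in bytes) for every tar file listed under one month."""
    root = ET.fromstring(manifest_xml)
    return [
        (entry.findtext("filename"), int(entry.findtext("size")))
        for entry in root.iter("file")
        if entry.findtext("yymm") == yymm
    ]


for name, size in tar_files_for_month(SAMPLE_MANIFEST, "9912"):
    # Each name could then be passed to: s3cmd get s3://arxiv/<name> . --requester-pays
    print(name, size)
```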