https://github.com/allofphysicsgraph/latex-in-arxiv

extract math latex from content in arxiv

Host: GitHub
URL: https://github.com/allofphysicsgraph/latex-in-arxiv
Owner: allofphysicsgraph
Created: 2020-05-27T12:09:22.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2024-09-07T00:24:23.000Z (4 months ago)
Last Synced: 2024-09-07T03:10:22.732Z (4 months ago)
Topics: latex
Language: C++
Size: 370 MB
Stars: 4
Watchers: 3
Forks: 1
Open Issues: 21
Metadata Files:
- Readme: README.md

README

# Overview
_Goal_: extract math LaTeX from the `.tex` content available from arXiv.

_Caveat when cloning this repo_: Total download size is 640 MB.

## quick start

Read `latex-in-arxiv/postings_list/README.md`

Everything is containerized, so in the root of this repo (`latex-in-arxiv/`) run
either `make docker` (on Linux) or `make docmac` (on Mac).

To run the application, inside the Docker container run `/opt/scanner.out .`

To recompile the scanner, inside the Docker container run
```bash
cd latex-in-arxiv/src/postings_list/query
make scanner
make read_tf_idf
./scanner.out .
./scanner.out . offsets
./read_tf_idf.out tf_idf # the TF-IDF vocabulary is built from the tokens of the parsed LaTeX
# TF-IDF is used to identify the most relevant variable in a paper, i.e. the one whose definition to look for
```
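
To illustrate the idea behind the TF-IDF step, here is a minimal Python sketch. It is not the repo's `read_tf_idf` implementation; the function name `tf_idf`, the plain `tf * log(N / df)` weighting, and the token lists are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: score} dict per document.

    docs is a list of token lists, e.g. tokens taken from parsed LaTeX.
    Score = (count in doc / doc length) * log(number of docs / docs containing token).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter(token for doc in docs for token in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            token: (count / total) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return scores


# Hypothetical token lists for three papers (not real arXiv data).
papers = [
    ["a", "+", "b", "=", "c", "c"],
    ["a", "=", "b"],
    ["x", "=", "a", "+", "b"],
]
scores = tf_idf(papers)
best = max(scores[0], key=scores[0].get)
print(best)  # c
```

In the first token list, `c` is both frequent and absent from the other lists, so it gets the highest score; tokens that appear in every list, such as `a` and `=`, score zero.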

## so what?

Suppose you have a `.tex` file that contains math, like
```latex
\documentclass{article}
\title{test}
\begin{document}
\maketitle
\section{Introduction}
This is a great paper.
\begin{equation}
a+b = c
\end{equation}
Where $c$ is some variable.
\end{document}
```
This file contains one expression, `a+b = c`, and one inline variable, `c`.
How can the expression and the variable be extracted?

There are a few options for parsing LaTeX; see
The options that produce good-quality results are also slow.

This repo uses [`ragel`](https://www.colm.net/open-source/ragel/) to parse LaTeX quickly and find the math.
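
To make the extraction task concrete, here is a naive Python sketch that pulls the expression and the inline variable out of the sample file above. It uses regular expressions, not the repo's `ragel` scanner, and it only handles `equation` environments and single-dollar inline math; the names `DISPLAY`, `INLINE`, and `extract_math` are made up for this example.

```python
import re

SAMPLE = r"""\documentclass{article}
\title{test}
\begin{document}
\maketitle
\section{Introduction}
This is a great paper.
\begin{equation}
a+b = c
\end{equation}
Where $c$ is some variable.
\end{document}
"""

# Display math: \begin{equation} ... \end{equation} (also the starred form).
DISPLAY = re.compile(r"\\begin\{(equation\*?)\}(.*?)\\end\{\1\}", re.DOTALL)
# Inline math: $...$, skipping escaped dollars (\$) and double dollars ($$).
INLINE = re.compile(r"(?<![\\$])\$([^$]+)\$(?!\$)")


def extract_math(tex):
    """Return (display_expressions, inline_expressions) found in tex."""
    display = [m.group(2).strip() for m in DISPLAY.finditer(tex)]
    inline = [m.group(1).strip() for m in INLINE.finditer(tex)]
    return display, inline


display, inline = extract_math(SAMPLE)
print(display)  # ['a+b = c']
print(inline)   # ['c']
```

A regex approach like this breaks down on nested environments, user-defined macros, and comments, which is part of why a dedicated scanner is worth having.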

## get data

### a free option: a few years of arXiv data

In the directory `latex-in-arxiv/get_sample_data` use
```bash
make get_sample_data
```
### arXiv API calls
```bash
# curl 'http://export.arxiv.org/api/query?search_query=all:rigorous%20derivation'
```
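
The same query can be built and its Atom response parsed from Python. This is a sketch using only the standard library; the helper names (`build_query_url`, `fetch`, `parse_entries`) and the sample feed are hypothetical, and `fetch` is defined but not called here so the example runs offline.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # the API returns an Atom feed


def build_query_url(terms, max_results=10):
    """Build a search URL equivalent to the curl call above."""
    params = {"search_query": f"all:{terms}", "max_results": max_results}
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote, safe=":")
    return f"{API_URL}?{query}"


def fetch(url):
    """Download the feed as text (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_entries(feed_xml):
    """Return (id, title) pairs, with whitespace in titles collapsed."""
    root = ET.fromstring(feed_xml)
    return [
        (entry.findtext(ATOM + "id"),
         " ".join(entry.findtext(ATOM + "title", "").split()))
        for entry in root.iter(ATOM + "entry")
    ]


# Made-up feed for an offline demonstration; not real arXiv output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://example.org/abs/placeholder</id>
    <title>A made-up
      title</title>
  </entry>
</feed>"""

url = build_query_url("rigorous derivation")
print(url)
print(parse_entries(SAMPLE_FEED))
```

With network access, `parse_entries(fetch(url))` would list the entries returned for the search.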

### bulk processing: another option is the full arXiv data available from an S3 bucket
for details, see
```bash
# s3cmd get s3://arxiv/src/arXiv_src_manifest.xml . --requester-pays
# s3cmd get s3://arxiv/src/arXiv_src_9912_001.tar . --requester-pays
```