https://github.com/valaydave/arxiv-miner
arxiv_miner is a toolkit for mining research papers on CS ArXiv.
- Host: GitHub
- URL: https://github.com/valaydave/arxiv-miner
- Owner: valayDave
- License: mit
- Created: 2020-06-23T06:20:10.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-03-23T20:13:39.000Z (8 months ago)
- Last Synced: 2024-10-14T16:07:56.753Z (about 1 month ago)
- Topics: arxiv-papers, parsing, scientific-research, scraping, search
- Language: Python
- Homepage: https://arxiv-miner.turing-bot.com/
- Size: 159 KB
- Stars: 124
- Watchers: 4
- Forks: 8
- Open Issues: 5
Metadata Files:
- Readme: Readme.md
- Contributing: docs/contributing.md
- License: LICENSE.txt
# ArXiv-Miner
> ArXiv Miner is a toolkit for mining research papers on CS ArXiv.
## What is ArXiv-Miner
`arxiv-miner` is a small, handy library that helped power Sci-Genie, a search engine for quickly searching the full text of papers on CS ArXiv. (Sci-Genie is no longer hosted; parts of it will be open-sourced in the future.)
`arxiv-miner` helps extract and parse LaTeX documents from CS ArXiv. It also supports storage and search of those parsed documents using **Elasticsearch**. The library can also be applied to other ArXiv domains such as Math, Physics, and Biology.
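To make the parse-then-index idea concrete, here is a minimal sketch using only the Python standard library. It is not `arxiv-miner`'s API: the helper names (`split_sections`, `to_index_document`, `full_text_query`), the document shape, and the placeholder paper id are all assumptions made for illustration. The query body follows the Elasticsearch `multi_match` query DSL and assumes `sections` is indexed with the default object mapping (not `nested`).

```python
import re
from typing import Dict, List

# Hypothetical helper, not part of arxiv-miner.
# Matches \section{...} and \section*{...} with a brace-free title.
SECTION_RE = re.compile(r"\\section\*?\{([^{}]*)\}")


def split_sections(latex: str) -> List[Dict[str, str]]:
    """Split a LaTeX body into a list of {"title", "text"} dicts."""
    matches = list(SECTION_RE.finditer(latex))
    sections = []
    for i, m in enumerate(matches):
        # A section's text runs until the next \section or the end of input.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(latex)
        sections.append(
            {"title": m.group(1).strip(), "text": latex[m.end():end].strip()}
        )
    return sections


def to_index_document(paper_id: str, latex: str) -> Dict[str, object]:
    """Shape a parsed paper as a JSON-serializable document (assumed layout)."""
    return {"paper_id": paper_id, "sections": split_sections(latex)}


def full_text_query(text: str) -> Dict[str, object]:
    """Build a query-DSL body that searches section titles and text."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["sections.title", "sections.text"],
            }
        }
    }


sample = r"""
\section{Introduction}
We study attention.
\section*{Related Work}
Prior work exists.
"""

# "example-id" is a placeholder, not a real ArXiv identifier.
doc = to_index_document("example-id", sample)
query = full_text_query("attention")
```

A document like `doc` could be sent to an Elasticsearch index and `query` used as a search body. Real papers need far more robust parsing (nested braces, `\input` files, macros), which is the part the library is meant to handle.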
## Documentation
All documentation on how to install and use `arxiv-miner` is available on the [documentation website](https://arxiv-miner.turing-bot.com/) or in the [docs folder](docs). Contribution guidelines are also provided there.
## Why was ArXiv-Miner created?
ArXiv Miner was created to make it easy to scrape, parse, and search research content on ArXiv. The library was built by stitching together solutions from the code of tools such as [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver), [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo), [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper), [tex2py](https://github.com/alvinwan/tex2py), [cso-classifier](https://github.com/angelosalatino/cso-classifier/) and [axcell](https://github.com/paperswithcode/axcell). The parsed structure of the content can serve as a heuristic baseline for search or for any scientific research mining/AI application.
## Core Components of ArXiv-Miner
- Scraping
- Parsing
- Indexing/Storage
## Family Of Projects With ArXiv-Miner
- `arxiv-table-miner` : Coming Soon.
- `arxiv-table-ml-models` : Coming Soon.
- `semantic-scholar-data-pipeline` : https://github.com/valayDave/semantic-scholar-data-pipeline
## Disclaimer
This project was developed like a [Cowboy coder](https://en.wikipedia.org/wiki/Cowboy_coding) over the [COVID-19 pandemic](https://en.wikipedia.org/wiki/COVID-19_pandemic). Hence, it **may have bugs, and the code is not the most well optimized**. The primary reason for development was to aid CS and Machine Learning/AI research, but this tool can be extended to all 3M+ documents on ArXiv.
## Call For Contributors
Contributions that improve the project or fix bugs are very welcome. Please read the contribution guide in the documentation.
## Credits and Appreciation
This project, like all others, has been built on the shoulders of giants. A big thanks to the creators of the following libraries and open source projects that aided the development of `arxiv-miner` and its family of projects:
- [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver)
- [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo)
- [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper)
- [tex2py](https://github.com/alvinwan/tex2py)
- [cso-classifier](https://github.com/angelosalatino/cso-classifier/)
- [axcell](https://github.com/paperswithcode/axcell)
- [elasticsearch](https://github.com/elastic/elasticsearch)
- [Semantic Scholar Open Research corpus](https://github.com/allenai/s2orc)
- [metaflow](https://metaflow.org)
- [docsify](https://docsify.js.org/#/)
## Licence
MIT