https://github.com/sushantnair/arxiv_extractor
This code can effectively convert PDF Research Papers to clean Text files, avoiding images and tables.
- Host: GitHub
- URL: https://github.com/sushantnair/arxiv_extractor
- Owner: sushantnair
- Created: 2025-01-17T17:54:49.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-01-18T16:05:22.000Z (4 months ago)
- Last Synced: 2025-01-30T11:16:27.299Z (4 months ago)
- Topics: arxiv, arxiv-papers, good-project, mozilla-firefox, pdf, pdf-to-text, research-paper, text
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# arXiv Extractor
#### This code (for now) can effectively convert PDF Research Papers to clean Text files, avoiding images and tables.
# Requirements
#### It requires Playwright.
```pip install playwright```
#### It also requires Firefox to be installed in the Playwright environment.
```playwright install firefox```
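A quick way to confirm both requirements are met is to try launching Firefox through Playwright. The following is a minimal sketch, not part of this repository; the function name `firefox_available` is hypothetical.

```python
def firefox_available() -> bool:
    """Return True if Playwright can launch its Firefox build, else False."""
    try:
        # Imported lazily so the check does not crash when Playwright is missing.
        from playwright.sync_api import sync_playwright
    except ImportError:
        return False
    try:
        with sync_playwright() as p:
            browser = p.firefox.launch()  # headless by default
            browser.close()
        return True
    except Exception:
        # Usually means `playwright install firefox` has not been run yet.
        return False
```

Calling `firefox_available()` after the two install commands should return `True`; a `False` result suggests one of the steps above is still missing.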
# Compatible Software
#### Currently, it works only with the Mozilla Firefox browser.
This is because Firefox renders PDFs quite differently from other browsers, which made extracting the text easier.
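To illustrate the idea, here is a rough sketch of how text could be read from a PDF opened in Playwright's Firefox. It is not the code in this repository. It assumes that Firefox's built-in viewer (PDF.js) exposes page text as `<span>` elements under a `.textLayer` container, and that the viewer has to be enabled in Playwright's Firefox through the `pdfjs.disabled` preference. The function names `spans_to_text` and `extract_pdf_text` are hypothetical.

```python
import re


def spans_to_text(spans):
    """Join text fragments into one string separated by single spaces."""
    joined = " ".join(s.strip() for s in spans if s.strip())
    return re.sub(r"\s+", " ", joined)


def extract_pdf_text(pdf_url, timeout_ms=30_000):
    """Open pdf_url in Playwright's Firefox and read the rendered text layer."""
    from playwright.sync_api import sync_playwright  # lazy import

    with sync_playwright() as p:
        # Assumption: the built-in PDF viewer must be switched on explicitly.
        browser = p.firefox.launch(firefox_user_prefs={"pdfjs.disabled": False})
        try:
            page = browser.new_page()
            page.goto(pdf_url)
            # Assumption: rendered text lives in <span> elements under ".textLayer".
            page.wait_for_selector(
                ".textLayer span", state="attached", timeout=timeout_ms
            )
            spans = page.locator(".textLayer span").all_text_contents()
        finally:
            browser.close()
    return spans_to_text(spans)
```

The viewer may only build the text layer for pages that have been rendered, so a sketch like this might capture just the first pages unless the document is scrolled through first.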
# Features to Add
1. Conversion from PDF to other formats as well
2. More field-specific cleaning (not just CS papers)
# FAQ
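#### 1. Why did I not use the arXiv API?
The official documentation says: "This url calls the api, which returns the results in the Atom 1.0 format." In other words, the API returns a feed of paper metadata rather than the text of the paper itself.
To know more, see the official arXiv API documentation.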
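To make the quoted sentence concrete, the sketch below builds an arXiv API query URL and parses an Atom 1.0 feed with the Python standard library. The helper names and the sample feed are made up for illustration; a real response carries more fields, but it is still metadata (title, summary, links) and not the body of the paper.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample in the Atom 1.0 shape (not a real API response).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A Hypothetical
      Paper Title</title>
    <summary>Only the abstract is here, not the full text.</summary>
  </entry>
</feed>"""


def build_query_url(search_query, max_results=1):
    """Build a query URL for the arXiv API endpoint."""
    params = urlencode(
        {"search_query": search_query, "start": 0, "max_results": max_results}
    )
    return "http://export.arxiv.org/api/query?" + params


def parse_atom_titles(atom_xml):
    """Return the title of every <entry> in an Atom feed, whitespace-normalized."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(raw.split()))
    return titles
```

Running `parse_atom_titles(SAMPLE_FEED)` yields the entry titles only, which shows why a separate step is still needed to get the paper text out of the PDF.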
#### 2. Will additional browser support be added?
Well, I think it is easier to just install Firefox! Open the same research paper PDF in Firefox and in another browser, and you will notice the difference in how the information is presented.
#### 3. Does it work on all operating systems?
At the time of release, it worked fine on Windows. I am not sure about other operating systems.
#### 4. Is there any scope for contributions?
Absolutely! If you can find a way to extract information from the less helpful way PDFs are displayed in other browsers, extend support to other operating systems, or further improve the quality of the extracted text, you are welcome to contribute. Just submit a PR.
# Gratitude
#### I am grateful for the help I got from Mistral's Le Chat. It helped me overcome significant challenges.