https://github.com/nlpatvcu/pdf2txt
Converts a pdf document to text.
- Host: GitHub
- URL: https://github.com/nlpatvcu/pdf2txt
- Owner: NLPatVCU
- License: apache-2.0
- Created: 2021-01-15T17:38:07.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2022-04-15T19:06:03.000Z (about 4 years ago)
- Last Synced: 2025-01-17T10:24:46.345Z (over 1 year ago)
- Language: Java
- Size: 65.4 KB
- Stars: 1
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# PDF2TXT
PDF2TXT converts either a single .pdf file, or every .pdf file in a given directory, to .txt.

Installation
============
From within a Python 3 virtual environment, clone the repository to install PDF2TXT:
```shell
git clone https://github.com/NLPatVCU/PDF2TXT.git
```
You also need to install the Haystack framework and the Milvus client, pymilvus:
```shell
pip3 install pymilvus==1.0.0
pip3 install farm-haystack==1.0.0
```
```
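To confirm that the virtual environment picked up the pinned versions, a quick check along these lines may help. This is a minimal sketch and not part of PDF2TXT: the `check_pins` helper and the `EXPECTED` table are hypothetical names, and the version numbers are copied from the `pip3` commands above.

```python
from importlib import metadata

# Pins copied from the install commands above.
EXPECTED = {"pymilvus": "1.0.0", "farm-haystack": "1.0.0"}


def check_pins(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    # Print one line per package that is missing or at a different version.
    for name, (wanted, found) in check_pins(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```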
If you run into difficulties installing Haystack, see its repository: https://github.com/deepset-ai/haystack

Use
===
To convert a single file, run:
```shell
python3 pdf2txt.py -f
```
To convert an entire directory, run:
```shell
python3 pdf2txt.py -d
```
To write the output files to a specific directory, append:
```shell
-o
```
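The relationship between the single-file mode, the directory mode, and the output directory can be sketched as a mapping from input .pdf paths to output .txt paths. This is an illustration only, not the code in `pdf2txt.py`: `plan_outputs` is a hypothetical helper, and writing next to the input file when no output directory is given is an assumption, since the README does not state the default.

```python
from pathlib import Path


def plan_outputs(source, output_dir=None):
    """Pair each .pdf under `source` with the .txt path it would be written to.

    `source` may be a single .pdf file or a directory of .pdf files.
    Hypothetical helper: it only computes paths and converts nothing.
    """
    source = Path(source)
    if source.is_dir():
        # Directory mode: every .pdf directly inside the directory.
        pdfs = sorted(p for p in source.iterdir() if p.suffix.lower() == ".pdf")
    else:
        # Single-file mode.
        pdfs = [source]
    pairs = []
    for pdf in pdfs:
        # Assumed default: write next to the input unless an output directory is given.
        target_dir = Path(output_dir) if output_dir is not None else pdf.parent
        pairs.append((pdf, target_dir / (pdf.stem + ".txt")))
    return pairs
```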
License
=======
This package is licensed under the GNU General Public License.

Acknowledgments
===============
- [VCU Natural Language Processing Lab](https://nlp.cs.vcu.edu/) 
- [Nanoinformatics Vertically Integrated Projects](https://rampages.us/nanoinformatics/)