https://github.com/py-pdf/pdf-crawler
The goal of this project is to collect a large dataset of PDF documents.
- Host: GitHub
- URL: https://github.com/py-pdf/pdf-crawler
- Owner: py-pdf
- License: bsd-3-clause
- Created: 2022-06-29T16:37:34.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-07-24T05:26:29.000Z (over 3 years ago)
- Last Synced: 2025-04-22T18:25:52.205Z (9 months ago)
- Language: Python
- Size: 17.6 KB
- Stars: 3
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pdf-crawler
The goal of pdf-crawler is to download PDF files from web pages for testing
PyPDF2.
## Install
```
pip install -r requirements.txt
```
## Usage
It is organized as a set of mostly isolated scripts, e.g.
```
python crawl.py
```
starts downloading PDF documents.
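
The README does not describe how `crawl.py` finds PDF files, so the following is only a minimal sketch of one possible approach, not the project's actual implementation. It assumes PDF documents are discovered by scanning a page's `<a href>` links for paths ending in `.pdf`. The names `PdfLinkParser` and `find_pdf_links` are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Hypothetical helper: collect links in <a> tags that point at .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check the URL path only, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return absolute URLs of all PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links
```

Each returned URL could then be downloaded with an HTTP client of your choice. Checking the file extension is a heuristic: it misses PDFs served from URLs without a `.pdf` suffix.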