Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/simongarisch/pdf_hunter
Download PDF links from a webpage
- Host: GitHub
- URL: https://github.com/simongarisch/pdf_hunter
- Owner: simongarisch
- License: mit
- Created: 2019-12-08T18:12:23.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2021-04-28T01:28:46.000Z (over 3 years ago)
- Last Synced: 2023-03-10T00:08:59.844Z (almost 2 years ago)
- Language: Python
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# pdf_hunter
Search for and download PDF file links from a webpage.
## Installation
This has been tested using Python 3 and Python 2.7.
```bash
pip install pdf_hunter
```

## Usage
```python
import pdf_hunter

url = "https://github.com/EbookFoundation/free-programming-books/blob/master/free-programming-books.md"
```

```python
pdf_urls = pdf_hunter.get_pdf_urls(url)
pdf_urls[:10]
```

    ['https://people.gnome.org/~swilmet/glib-gtk-dev-platform.pdf',
     'https://www.math.upenn.edu/~wilf/AlgoComp.pdf',
     'http://cslibrary.stanford.edu/110/BinaryTrees.pdf',
     'http://www-inst.eecs.berkeley.edu/~cs61b/fa14/book2/data-structures.pdf',
     'http://lib.mdp.ac.id/ebook/Karya%20Umum/Dsa.pdf',
     'http://cslibrary.stanford.edu/103/LinkedListBasics.pdf',
     'http://cslibrary.stanford.edu/105/LinkedListProblems.pdf',
     'http://www.jjj.de/fxt/fxtbook.pdf',
     'http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf',
     'http://igm.univ-mlv.fr/~mac/REC/text-algorithms.pdf']

## We can download a single PDF file from a given url
```python
pdf_url = pdf_urls[0]
pdf_url
```

    'https://people.gnome.org/~swilmet/glib-gtk-dev-platform.pdf'

```python
file_name = pdf_hunter.get_pdf_name(pdf_url)
file_name
```

    'glib-gtk-dev-platform.pdf'

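To illustrate how a file name can be derived from a PDF URL, here is a minimal sketch using only the Python 3 standard library. It is not necessarily how `pdf_hunter.get_pdf_name` works internally; `pdf_name_from_url` is a hypothetical helper, and the choice to drop query strings and decode percent-escapes is an assumption.

```python
import posixpath
from urllib.parse import unquote, urlparse


def pdf_name_from_url(pdf_url):
    """Return the last path segment of a URL (hypothetical helper).

    The query string and fragment are ignored, and percent-escapes
    such as %20 are decoded. Both choices are assumptions.
    """
    path = urlparse(pdf_url).path
    return unquote(posixpath.basename(path))


print(pdf_name_from_url(
    "https://people.gnome.org/~swilmet/glib-gtk-dev-platform.pdf"
))
```

Taking the name from the parsed path rather than splitting the raw URL on `/` keeps a trailing `?download=1` or `#page=2` out of the file name.
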
```python
import os

os.path.isfile(file_name)
```

    False

```python
pdf_hunter.download_file(pdf_url, folder_path=os.getcwd())
os.path.isfile(file_name)
```

    True

## Or download all PDF files from the page
```python
pdf_hunter.download_pdf_files(url, folder_path=os.getcwd())
```

***
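
For readers curious how finding PDF links on a page could work, the sketch below collects `.pdf` links from an HTML string using only the Python 3 standard library. It is an illustration under stated assumptions, not pdf_hunter's actual implementation: `PdfLinkParser` and `find_pdf_links` are hypothetical names, and treating any `<a href>` whose path ends in `.pdf` as a PDF link is an assumed rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        # Keep first-seen order and skip duplicates.
        if is_pdf and absolute not in self.pdf_urls:
            self.pdf_urls.append(absolute)


def find_pdf_links(html, base_url):
    """Return the PDF links found in an HTML string (hypothetical helper)."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


sample = """
<a href="/docs/guide.pdf">Guide</a>
<a href="https://example.com/index.html">Home</a>
<a href="paper.PDF?download=1">Paper</a>
"""
print(find_pdf_links(sample, "https://example.com/books/"))
```

The sample HTML and the `example.com` URLs are made up for the demonstration. In practice the HTML would come from fetching the page first, and a check on the response's content type would be more reliable than the file extension alone.
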