https://github.com/simongarisch/pdf_hunter

Download PDF links from a webpage

Host: GitHub
URL: https://github.com/simongarisch/pdf_hunter
Owner: simongarisch
License: mit
Created: 2019-12-08T18:12:23.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2021-04-28T01:28:46.000Z (over 3 years ago)
Last Synced: 2023-03-10T00:08:59.844Z (almost 2 years ago)
Language: Python
Size: 20.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# pdf_hunter

Search a webpage for links to PDF files and download them.

## Installation

This has been tested using Python 3 and Python 2.7.

```bash
pip install pdf_hunter
```

## Usage

```python
import pdf_hunter

url = "https://github.com/EbookFoundation/free-programming-books/blob/master/free-programming-books.md"
```

```python
pdf_urls = pdf_hunter.get_pdf_urls(url)
pdf_urls[:10]
```

    ['https://people.gnome.org/~swilmet/glib-gtk-dev-platform.pdf',
     'https://www.math.upenn.edu/~wilf/AlgoComp.pdf',
     'http://cslibrary.stanford.edu/110/BinaryTrees.pdf',
     'http://www-inst.eecs.berkeley.edu/~cs61b/fa14/book2/data-structures.pdf',
     'http://lib.mdp.ac.id/ebook/Karya%20Umum/Dsa.pdf',
     'http://cslibrary.stanford.edu/103/LinkedListBasics.pdf',
     'http://cslibrary.stanford.edu/105/LinkedListProblems.pdf',
     'http://www.jjj.de/fxt/fxtbook.pdf',
     'http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf',
     'http://igm.univ-mlv.fr/~mac/REC/text-algorithms.pdf']
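The README does not show how `get_pdf_urls` finds these links. As a rough illustration of the general technique only, the sketch below collects `<a href>` values from an HTML string with the standard library and keeps those whose path ends in `.pdf`. The names `extract_pdf_links` and `_LinkCollector` are hypothetical and are not part of pdf_hunter, which may work quite differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_pdf_links(html, base_url):
    """Return absolute, de-duplicated links whose path ends in .pdf.

    Hypothetical function for illustration; not the pdf_hunter API.
    """
    collector = _LinkCollector()
    collector.feed(html)
    seen, pdf_links = set(), []
    for href in collector.hrefs:
        # Resolve relative links against the page URL.
        absolute = urljoin(base_url, href)
        # Test the path only, so query strings do not hide the extension.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in seen:
            seen.add(absolute)
            pdf_links.append(absolute)
    return pdf_links
```

Matching on the file extension is a heuristic: it misses PDFs served from extension-less URLs and accepts any link that merely ends in `.pdf`.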

## Download a single PDF file from a given URL

```python
pdf_url = pdf_urls[0]
pdf_url
```

    'https://people.gnome.org/~swilmet/glib-gtk-dev-platform.pdf'

```python
file_name = pdf_hunter.get_pdf_name(pdf_url)
file_name
```

    'glib-gtk-dev-platform.pdf'
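The result above is the last path segment of the URL. A minimal sketch of deriving such a name with the standard library follows; `pdf_name_from_url` is a hypothetical name, and the percent-decoding step is an assumption that `get_pdf_name` may or may not share.

```python
import posixpath
from urllib.parse import unquote, urlparse


def pdf_name_from_url(url):
    """Return the last path segment of `url`, percent-decoded.

    Hypothetical helper for illustration; not the pdf_hunter API.
    """
    # urlparse separates the path from any query string or fragment.
    path = urlparse(url).path
    return unquote(posixpath.basename(path))


print(pdf_name_from_url("https://people.gnome.org/~swilmet/glib-gtk-dev-platform.pdf"))
# prints: glib-gtk-dev-platform.pdf
```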

```python
import os

os.path.isfile(file_name)
```

    False

```python
pdf_hunter.download_file(pdf_url, folder_path=os.getcwd())
os.path.isfile(file_name)
```

    True
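For readers curious what a download step like this involves, here is a minimal Python 3 sketch that streams a URL into a folder using only the standard library. `save_url`, its `timeout` parameter, and the `download.pdf` fallback name are all assumptions made for illustration; they do not describe how `download_file` is implemented.

```python
import os
import posixpath
import shutil
from urllib.parse import unquote, urlparse
from urllib.request import urlopen


def save_url(url, folder_path, timeout=30):
    """Stream `url` into `folder_path` and return the saved file path.

    Hypothetical helper for illustration; not the pdf_hunter API.
    """
    # Name the file after the last URL path segment, with a fallback
    # for URLs whose path ends in a slash.
    name = unquote(posixpath.basename(urlparse(url).path)) or "download.pdf"
    target = os.path.join(folder_path, name)
    # copyfileobj copies in chunks, so large PDFs are not held in memory.
    with urlopen(url, timeout=timeout) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

A call such as `save_url(pdf_url, os.getcwd())` would write the file into the current directory, overwriting any existing file of the same name.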

## Download all PDF files from a page

```python
pdf_hunter.download_pdf_files(url, folder_path=os.getcwd())
```
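Pages that aggregate many PDF links often contain some dead ones. The README does not say how `download_pdf_files` treats failures, so the sketch below only shows one possible pattern: try each URL, record the errors, and carry on. `download_all` and its `fetch` parameter are hypothetical; `fetch` can be any callable that takes `(url, folder_path)` and returns the saved path.

```python
def download_all(urls, folder_path, fetch):
    """Call fetch(url, folder_path) for each URL; return (saved, failed).

    Hypothetical helper for illustration; not the pdf_hunter API.
    """
    saved, failed = [], []
    for url in urls:
        try:
            saved.append(fetch(url, folder_path))
        except OSError as error:
            # urllib.error.URLError and HTTPError are OSError subclasses,
            # so network failures land here and do not stop the loop.
            failed.append((url, error))
    return saved, failed
```

Returning the failures instead of raising lets the caller decide whether to retry them or just report them.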

***