https://github.com/therealaj/bulk-pdf
Downloads all PDFs on a webpage (for lazy people)
- Host: GitHub
- URL: https://github.com/therealaj/bulk-pdf
- Owner: therealAJ
- License: MIT
- Created: 2017-04-23T22:35:34.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-01-20T15:35:05.000Z (over 3 years ago)
- Last Synced: 2025-04-02T04:23:25.815Z (6 months ago)
- Topics: scraper, wget
- Language: Python
- Size: 2.2 MB
- Stars: 23
- Watchers: 3
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Bulk PDF downloader
*works with both locally hosted and externally hosted PDFs*
> a CLI for downloading external PDFs (for lazy people like me)
## Demo

## Motivation
One day I was downloading what felt like millions of PDF packages of CS notes. A couple of minutes in, I got really tired of right-clicking `Save Link As`, so I decided to build this :)

## Requirements
- `argparse`
- `urllib`
- `requests`
- `wget`
- `python >= 3.5`

## Install
```sh
$ git clone https://github.com/therealAJ/bulk-pdf
$ cd bulk-pdf
$ pip install -r reqs.txt
```

## Usage
```sh
$ python pdf.py <url> <path> [OPTIONAL-BASE-URL-FOR-HOSTED-PDFS]
```

Downloads all discovered PDFs to the specified `<path>`.
For example:
```sh
$ python pdf.py https://www.cs.ubc.ca/~schmidtm/Courses/340-F16/ ~/Desktop/Lectures
# downloads all PDFs to your `~/Desktop/Lectures` folder
```

### Notes
Webpages host PDFs in different ways: some use an absolute URL like `https://hosting-site.com/hello.pdf`, while others host them locally with `href` tags that look something like `/courses/cs101/hello.pdf`. The second case is why I added the ability to pass an optional URL for the parent hosting site.
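To make the two cases concrete, here is a minimal sketch of the idea (not this repo's actual code; the function names and regex are purely illustrative): collect every `.pdf` href, pass absolute URLs through untouched, and resolve relative ones against the optional base URL with `urllib.parse.urljoin` before handing them to `wget`.

```python
# Illustrative sketch only -- not pdf.py itself; function names are hypothetical.
import os
import re
from urllib.parse import urljoin

import requests
import wget


def find_pdf_links(page_url):
    """Collect every href on the page that ends in .pdf (absolute or relative)."""
    html = requests.get(page_url).text
    return re.findall(r'href=["\']([^"\']+\.pdf)["\']', html, flags=re.IGNORECASE)


def download_pdfs(page_url, out_dir, base_url=None):
    """Download every discovered PDF into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for href in find_pdf_links(page_url):
        if href.lower().startswith(("http://", "https://")):
            pdf_url = href  # case 1: already an absolute URL
        else:
            # case 2: relative href -- resolve against the optional base URL,
            # falling back to the page's own URL
            pdf_url = urljoin(base_url or page_url, href)
        wget.download(pdf_url, out=out_dir)


if __name__ == "__main__":
    download_pdfs(
        "https://www.cs.ubc.ca/~schmidtm/Courses/340-F16/",
        os.path.expanduser("~/Desktop/Lectures"),
    )
```

The explicit branch mirrors the two hosting styles above; in practice `urljoin` alone would also leave an absolute href unchanged.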
You may also run into a URL that looks like `https://site.com/lectures.html`. More often than not, you want to pass that full URL so the entire webpage gets parsed, and then pass `https://site.com` as the optional parameter for the `wget` requests. So an example would look like:
```sh
$ python pdf.py https://site.com/lectures.html ~/Desktop/Test https://site.com
```
Hopefully I haven't confused you :) File an issue if you run into anything of interest. PRs welcome :)