https://github.com/zaw4rud0/patreon-scraper

# PatreonScraper
A Selenium setup for scraping Patreon subscriptions

## Important Information

- This program requires a valid Patreon account with active subscriptions.
- Neither your data nor your credentials are shared or exposed while using this application. Everything
happens locally while a Selenium client scrapes the specified artists on Patreon.

### Data Structure

The output is a JSON file that has the following structure:
```json
[
  {
    "id": "",
    "title": "",
    "date": "",
    "content": "",
    "images": [
      "",
      ...
    ],
    "tags": [
      "",
      ...
    ],
    "url": ""
  }
]
```

| Field     | Type       | Description                                                                                                         |
|-----------|------------|---------------------------------------------------------------------------------------------------------------------|
| `id`      | `string`   | The Patreon ID of the post. Guaranteed to be unique.                                                                |
| `title`   | `string`   | The title of the post.                                                                                              |
| `date`    | `string`   | The publish date of the post, always in the format `YYYY-MM-DD`.                                                    |
| `content` | `string`   | The body text of the post. Can be empty.                                                                            |
| `images`  | `string[]` | The images of the post, as paths relative to the folder that contains the output JSON file. Zero or more entries.   |
| `tags`    | `string[]` | The tags of the post. Can be used to group or search posts. Zero or more entries.                                   |
| `url`     | `string`   | The Patreon URL of the post.                                                                                        |
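
As an illustration of how the output described above might be consumed, here is a minimal sketch that loads the JSON file, parses the `date` field, and filters posts by tag. The function names `load_posts` and `posts_with_tag` are hypothetical and not part of this project, and the sketch assumes image paths are relative to the folder containing the JSON file.

```python
import json
from datetime import date
from pathlib import Path


def load_posts(json_path):
    """Load scraped posts and convert `date` and `images` into richer types."""
    path = Path(json_path)
    posts = json.loads(path.read_text(encoding="utf-8"))
    for post in posts:
        # Dates are documented as YYYY-MM-DD, which date.fromisoformat accepts.
        post["date"] = date.fromisoformat(post["date"])
        # Assumption: image paths are relative to the folder holding the JSON file.
        post["images"] = [path.parent / image for image in post["images"]]
    return posts


def posts_with_tag(posts, tag):
    """Return the posts whose `tags` list contains `tag` (case-insensitive)."""
    wanted = tag.lower()
    return [post for post in posts if wanted in (t.lower() for t in post["tags"])]
```

Calling `posts_with_tag(load_posts("posts.json"), "sketch")` would then return only the posts tagged `sketch`, where `posts.json` stands in for whatever output file the scraper produced.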

## Setup

### Requirements

To run this project, you need the following installed on your machine:
- Python (tested with 3.12)
- pip

### Installation

1. Clone this repository using:
```
git clone https://github.com/zaw4rud0/PatreonScraper.git
cd PatreonScraper
```
2. Install the required dependencies using:
```
pip install -r requirements.txt
```

### Configurations

#### Program Configuration

Make a copy of `.env.example` by running the following command:
```
cp .env.example .env
```
Replace the placeholder values in the `.env` file with the actual values.

#### Artists Configuration

Make a copy of `artists.example.json` by running the following command:
```
cp artists.example.json artists.json
```
In this file you can set the artists you want to scrape and define a tag mapping in case the artist has inconsistent tags on their posts.
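
To show what a tag mapping is for, the sketch below normalizes inconsistent tags to one canonical name. It does not reflect the actual schema of `artists.json` or the scraper's own code: the `normalize_tags` function and the shape of the mapping (lowercase raw tag to canonical tag) are assumptions made for illustration only.

```python
def normalize_tags(tags, tag_mapping):
    """Map each raw tag to its canonical name, dropping duplicates but keeping order."""
    normalized = []
    for tag in tags:
        cleaned = tag.strip()
        # Look up the lowercase form; fall back to the cleaned raw tag if unmapped.
        canonical = tag_mapping.get(cleaned.lower(), cleaned)
        if canonical not in normalized:
            normalized.append(canonical)
    return normalized


# Hypothetical mapping: keys are lowercase raw tags, values are canonical tags.
mapping = {"wip": "Work in Progress", "work in progress": "Work in Progress"}
print(normalize_tags(["WIP", "Sketch", "work in progress"], mapping))
# prints ['Work in Progress', 'Sketch']
```

Check `artists.example.json` for the format the scraper actually expects.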

### Running

1. Start the scraper:
```
python -m src.main
```
2. Run unit tests:
```
pytest
```

## Roadmap

- [x] Ability to scrape different artists in one run
- [x] Download images of scraped posts and place them in `/{OUTPUT_FOLDER}/{ARTIST}/{IMAGES}/{YEAR}/{MONTH}/`
- [ ] Ability to store scraped posts in a database out of the box
- [ ] More control over the scraping process, e.g. when the user wants to change the Patreon filters
- [ ] GUI window to show scraped posts, including their images. Use scraped tags to filter and search.
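
The image layout `/{OUTPUT_FOLDER}/{ARTIST}/{IMAGES}/{YEAR}/{MONTH}/` listed in the roadmap above can be sketched as a small path helper. This is an illustration rather than the project's implementation: `image_dir` is a hypothetical name, and the zero-padded month is an assumption.

```python
from datetime import date
from pathlib import Path


def image_dir(output_folder, artist, images_folder, post_date):
    """Build the image folder for a post from its YYYY-MM-DD publish date."""
    published = date.fromisoformat(post_date)
    # Assumption: four-digit year and zero-padded two-digit month as folder names.
    year = f"{published.year:04d}"
    month = f"{published.month:02d}"
    return Path(output_folder) / artist / images_folder / year / month


print(image_dir("output", "some-artist", "images", "2024-03-05").as_posix())
# prints output/some-artist/images/2024/03
```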