https://github.com/zaw4rud0/patreon-scraper

A Selenium setup for scraping Patreon subscriptions
patreon patreon-scraper scraper selenium

Host: GitHub
URL: https://github.com/zaw4rud0/patreon-scraper
Owner: zaw4rud0
License: mit
Created: 2024-11-16T18:03:46.000Z
Default Branch: main
Last Pushed: 2025-01-25T17:26:45.000Z
Last Synced: 2025-02-13T21:37:20.054Z
Topics: patreon, patreon-scraper, scraper, selenium
Language: Python
Homepage:
Size: 83 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# PatreonScraper
A Selenium setup for scraping Patreon subscriptions

## Important Information

- This program requires a valid Patreon account with active subscriptions.
- No data or credentials are shared or exposed while using this application. Everything
happens locally while a Selenium client scrapes the specified artists on Patreon.

### Data Structure

The output is a JSON file that has the following structure:
```json
[
  {
    "id": "",
    "title": "",
    "date": "",
    "content": "",
    "images": [
      "",
      ...
    ],
    "tags": [
      "",
      ...
    ],
    "url": ""
  }
]
```

| Field     | Type       | Description                                                                                                          |
|-----------|------------|----------------------------------------------------------------------------------------------------------------------|
| `id`      | `string`   | The Patreon id of the post. Guaranteed to be unique.                                                                 |
| `title`   | `string`   | The title of the post.                                                                                               |
| `date`    | `string`   | The publish date of the post. The format is always `YYYY-MM-DD`.                                                     |
| `content` | `string`   | The body text of the post. Can be empty.                                                                             |
| `images`  | `string[]` | The images of the post, as paths relative to the folder that contains the output JSON file. Contains 0 to `N` items. |
| `tags`    | `string[]` | The tags of the post. Can be used to group or search posts. Contains 0 to `M` items.                                 |
| `url`     | `string`   | The Patreon URL of the post.                                                                                         |
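
To make the structure above concrete, here is a minimal sketch of how the output could be consumed from Python. It is not part of this project: the `Post` class, `parse_posts`, `group_by_tag`, and the sample data are hypothetical, and it assumes every field is a string or a list of strings as in the JSON example.

```python
import json
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Post:
    """One scraped post, mirroring the documented fields (hypothetical helper)."""
    id: str
    title: str
    date: str  # documented as always YYYY-MM-DD
    content: str = ""
    images: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    url: str = ""


def parse_posts(raw: str) -> list[Post]:
    """Parse the JSON text of an output file into Post objects."""
    return [Post(**item) for item in json.loads(raw)]


def group_by_tag(posts: list[Post]) -> dict[str, list[Post]]:
    """Index posts by tag; a post with several tags appears under each one."""
    index: dict[str, list[Post]] = defaultdict(list)
    for post in posts:
        for tag in post.tags:
            index[tag].append(post)
    return dict(index)


# Made-up sample data for illustration only.
SAMPLE = """
[
  {"id": "1", "title": "Sketch", "date": "2024-11-16", "content": "",
   "images": ["images/2024/11/a.png"], "tags": ["sketch", "wip"],
   "url": "https://www.patreon.com/posts/1"},
  {"id": "2", "title": "Final", "date": "2024-12-01", "content": "Done.",
   "images": [], "tags": ["sketch"],
   "url": "https://www.patreon.com/posts/2"}
]
"""

posts = parse_posts(SAMPLE)
by_tag = group_by_tag(posts)
```

With a real output file, `parse_posts(Path("...").read_text(encoding="utf-8"))` would do the same job; the actual file name depends on your configuration.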

## Setup

### Requirements

To run this project, you need the following installed on your machine:
- Python (tested with 3.12)
- pip

### Installation

1. Clone this repository using:
```
git clone https://github.com/zaw4rud0/patreon-scraper.git
cd patreon-scraper
```
2. Install the required dependencies using:
```
pip install -r requirements.txt
```

### Configurations

#### Program Configuration

Make a copy of `.env.example` by running the following command:
```
cp .env.example .env
```
Replace the placeholder values in the `.env` file with the actual values.
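
As a quick sanity check before running the scraper, a sketch like the following could confirm that no value in `.env` was left empty. It is an illustration only: `parse_env` and `empty_keys` are hypothetical helpers, the key names in the sample are invented, and the parser handles only simple `KEY=VALUE` lines, not the full dotenv syntax.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def empty_keys(values: dict[str, str]) -> list[str]:
    """Return the keys whose value is still empty."""
    return [key for key, value in values.items() if not value]


# Invented key names; use the ones listed in .env.example.
SAMPLE_ENV = """
# credentials
EXAMPLE_EMAIL="me@example.com"
EXAMPLE_PASSWORD=
EXAMPLE_OUTPUT=output
"""

config = parse_env(SAMPLE_ENV)
missing = empty_keys(config)
```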

#### Artists Configuration

Make a copy of `artists.example.json` by running the following command:
```
cp artists.example.json artists.json
```
In this file, you list the artists you want to scrape and can define a tag mapping for artists who tag their posts inconsistently.
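
The idea behind a tag mapping can be sketched as follows. This is not the project's implementation, and the real schema of `artists.json` is defined by `artists.example.json`. The sketch assumes a plain dictionary from lowercase tag variants to one canonical tag, and `normalize_tags` is a hypothetical helper.

```python
def normalize_tags(tags: list[str], mapping: dict[str, str]) -> list[str]:
    """Replace known tag variants with their canonical form and drop duplicates.

    Lookups are case-insensitive (mapping keys are assumed to be lowercase);
    tags without a mapping entry are kept as written. Order is preserved.
    """
    seen: set[str] = set()
    result: list[str] = []
    for tag in tags:
        cleaned = tag.strip()
        canonical = mapping.get(cleaned.lower(), cleaned)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


# Made-up mapping: two spellings collapse into one canonical tag.
TAG_MAPPING = {"wip": "work in progress", "w.i.p.": "work in progress"}

normalized = normalize_tags(["WIP", "w.i.p.", "Sketch"], TAG_MAPPING)
# normalized == ["work in progress", "Sketch"]
```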

### Running

1. Start the scraper:
```
python -m src.main
```
2. Run unit tests:
```
pytest
```

## Roadmap

- [x] Ability to scrape different artists in one run
- [x] Download images of scraped posts and place them in `/{OUTPUT_FOLDER}/{ARTIST}/{IMAGES}/{YEAR}/{MONTH}/`
- [ ] Ability to store scraped posts in a database out of the box
- [ ] More control over the scraping process, e.g. when the user wants to change the Patreon filters
- [ ] GUI window to show scraped posts, including their images. Use scraped tags to filter and search.
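
The image folder layout named in the roadmap, `/{OUTPUT_FOLDER}/{ARTIST}/{IMAGES}/{YEAR}/{MONTH}/`, can be derived from a post's `date` field. The sketch below shows one way to do that. `image_dir` is a hypothetical helper, and the zero-padded month is an assumption, since the exact folder naming is not stated here.

```python
from datetime import date
from pathlib import PurePosixPath


def image_dir(output_folder: str, artist: str, images_folder: str,
              post_date: str) -> PurePosixPath:
    """Build {OUTPUT_FOLDER}/{ARTIST}/{IMAGES}/{YEAR}/{MONTH} for a post.

    post_date must be in YYYY-MM-DD format; date.fromisoformat raises
    ValueError for anything else.
    """
    published = date.fromisoformat(post_date)
    return PurePosixPath(
        output_folder,
        artist,
        images_folder,
        f"{published.year:04d}",
        f"{published.month:02d}",  # zero-padding is an assumption
    )


# Made-up folder and artist names.
target = image_dir("output", "some_artist", "images", "2024-11-16")
# str(target) == "output/some_artist/images/2024/11"
```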
