https://github.com/zaw4rud0/patreon-scraper
A Selenium setup for scraping Patreon subscriptions
patreon patreon-scraper scraper selenium
- Host: GitHub
- URL: https://github.com/zaw4rud0/patreon-scraper
- Owner: zaw4rud0
- License: mit
- Created: 2024-11-16T18:03:46.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-01-25T17:26:45.000Z (4 months ago)
- Last Synced: 2025-02-13T21:37:20.054Z (4 months ago)
- Topics: patreon, patreon-scraper, scraper, selenium
- Language: Python
- Homepage:
- Size: 83 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PatreonScraper
A Selenium setup for scraping Patreon subscriptions

## Important Information
- This program requires a valid Patreon account with active subscriptions.
- No data or credentials are shared or exposed while using this application. Everything
happens locally while a Selenium client scrapes the specified artists on Patreon.

### Data Structure
The output is a JSON file that has the following structure:
```json
[
  {
    "id": "",
    "title": "",
    "date": "",
    "content": "",
    "images": [
      "",
      ...
    ],
    "tags": [
      "",
      ...
    ],
    "url": ""
  }
]
```

| Field     | Type       | Description                                                                                                       |
|-----------|------------|-------------------------------------------------------------------------------------------------------------------|
| `id`      | `string`   | The Patreon id of this post. Guaranteed to be unique.                                                             |
| `title`   | `string`   | The title of the post.                                                                                            |
| `date`    | `string`   | The publish date of the post. The format is always `YYYY-MM-DD`.                                                  |
| `content` | `string`   | The body text of the post. Can be empty.                                                                          |
| `images`  | `string[]` | The images of the post. Paths are always relative to the parent folder of the output JSON file. Between 0 and `N`. |
| `tags`    | `string[]` | The tags of the post. Can be used to group or search posts. Between 0 and `M`.                                    |
| `url`     | `string`   | The Patreon URL of the post.                                                                                      |

## Setup
### Requirements
To run this project, you need to have the following installed on your machine:
- Python (tested with 3.12)
- pip
### Installation
1. Clone this repository using:
```
git clone https://github.com/zaw4rud0/PatreonScraper.git
cd PatreonScraper
```
2. Install the required dependencies using:
```
pip install -r requirements.txt
```

### Configurations
#### Program Configuration
Make a copy of `.env.example` by running the following command:
```
cp .env.example .env
```
Replace the placeholder values in the `.env` file with the actual values.

#### Artists Configuration
Make a copy of `artists.example.json` by running the following command:
```
cp artists.example.json artists.json
```
In this file, you can set the artists you want to scrape and define a tag mapping in case an artist uses inconsistent tags on their posts.

### Running
1. Start the scraper:
```
python -m src.main
```
2. Run unit tests:
```
pytest
```

## Roadmap
- [x] Ability to scrape different artists in one run
- [x] Download images of scraped posts and place them in `/{OUTPUT_FOLDER}/{ARTIST}/{IMAGES}/{YEAR}/{MONTH}/`
- [ ] Ability to store scraped posts in a database out of the box
- [ ] More control over the scraping process, e.g. when the user wants to change the Patreon filters
- [ ] GUI window to show scraped posts, including their images. Use scraped tags to filter and search.
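
Until the GUI exists, the JSON output described under Data Structure can be queried with a few lines of Python. The following is a minimal sketch, not part of this project: the helper names (`load_posts`, `filter_by_tag`, `group_by_month`) are hypothetical, and it assumes only the documented fields `tags` and `date` (in `YYYY-MM-DD` format).

```python
import json
from collections import defaultdict
from pathlib import Path


def load_posts(json_path):
    """Load the list of post objects from a scraper output file."""
    with Path(json_path).open(encoding="utf-8") as f:
        return json.load(f)


def filter_by_tag(posts, tag):
    """Return the posts whose `tags` list contains `tag` (case-insensitive)."""
    wanted = tag.casefold()
    return [
        post for post in posts
        if wanted in (t.casefold() for t in post.get("tags", []))
    ]


def group_by_month(posts):
    """Group posts by the `YYYY-MM` prefix of their `date` field."""
    groups = defaultdict(list)
    for post in posts:
        groups[post["date"][:7]].append(post)
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample data in the documented shape; real data would come
    # from load_posts("<path to the output JSON file>").
    sample = [
        {"id": "1", "title": "First", "date": "2024-11-16", "content": "",
         "images": [], "tags": ["Sketch"], "url": ""},
        {"id": "2", "title": "Second", "date": "2024-12-01", "content": "",
         "images": [], "tags": [], "url": ""},
    ]
    print([p["title"] for p in filter_by_tag(sample, "sketch")])
    print(sorted(group_by_month(sample)))
```

Matching tags case-insensitively is a choice made for this sketch; if the tag mapping in `artists.json` already normalizes tags, an exact comparison would work just as well.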