https://github.com/purarue/old_forums
Parses posts/achievements from random forums I used in the past
forum minecraft selenium webscraping
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/purarue/old_forums
- Owner: purarue
- License: apache-2.0
- Created: 2020-09-04T01:17:30.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-10-25T17:35:28.000Z (4 months ago)
- Last Synced: 2024-11-01T14:43:20.343Z (3 months ago)
- Topics: forum, minecraft, selenium, webscraping
- Language: Python
- Homepage:
- Size: 22.5 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# old_forums
Parses posts/achievements from random forums I used in the past. I don't use any of these anymore, but they contain random thoughts I had back then, so I'm parsing them to keep access to them.
The bit of lib code here pulls CSS selectors from a config file to detect/parse achievement pages. I use this in my personal [HPI](https://github.com/purarue/HPI-personal) modules.
The forum posts are loaded from JSON files created by `./selenium_scripts`, while forum achievements are parsed from the raw HTML pages (i.e., pages saved by right-clicking and choosing `save as`, so that it's possible to update them later).
This is quite a personal library, as generalizing it to an arbitrary number of sites isn't trivial, though the [`achievements` portion](./old_forums/achievements.py) of the library could possibly be reused if you have some webscraping know-how.
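To illustrate the idea of driving the achievement parsing from a config of selectors, here is a minimal sketch that uses only the standard library. It is not the code in `achievements.py`: the `SELECTORS` mapping, the `example_forum` key, the `ClassTextExtractor` class and the `parse_achievements` function are all hypothetical names, and it only understands selectors of the form `tag.class`, whereas the library presumably relies on a full CSS selector engine.

```python
from html.parser import HTMLParser
from typing import Dict, List

# Hypothetical config: site name -> selector of the form "tag.class".
# The real config format and selectors are not shown in this README.
SELECTORS: Dict[str, str] = {
    "example_forum": "div.achievement-title",
}


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element matching a 'tag.class' selector."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.tag, self.cls = selector.split(".", 1)
        self.depth = 0  # greater than 0 while inside a matching element
        self.buffer: List[str] = []
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # The matching element just closed, so store its text
                self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def parse_achievements(html: str, site: str) -> List[str]:
    """Return the matched text for a saved page, or [] if nothing matches."""
    parser = ClassTextExtractor(SELECTORS[site])
    parser.feed(html)
    parser.close()
    return parser.results
```

An empty result can double as the "detect" step: if the selector for a site matches nothing, the saved page is probably not an achievement page for that site.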
## Installation
Requires `python3.7+`
To install with pip, run `pip install git+https://github.com/purarue/old_forums`
### selenium_scripts
Putting these up here as a reference. I have so few posts on some of these that I didn't have to worry about pagination.
All the posts get pulled out into a common schema:
```
forum_name: str
post_title: str (name/title of the post)
post_url: str (url to the post)
post_contents: str (what I actually said)
dt: epoch datetime
```
Based on code from [`steamscraper`](https://github.com/purarue/steamscraper).
As an example, `minecraft_forum.py`:
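The common schema listed above could be modeled in Python roughly as follows. This is a sketch under assumptions, not the library's own types: the `Post` and `load_posts` names are hypothetical, and it assumes each JSON file holds a list of objects with those five keys and that `dt` is epoch seconds interpreted as UTC.

```python
import json
from datetime import datetime, timezone
from typing import Iterator, NamedTuple


class Post(NamedTuple):
    forum_name: str
    post_title: str     # name/title of the post
    post_url: str       # url to the post
    post_contents: str  # what was actually said
    dt: datetime        # parsed from the epoch timestamp


def load_posts(raw_json: str) -> Iterator[Post]:
    """Yield a Post for each object in a JSON list (assumed file layout)."""
    for item in json.loads(raw_json):
        yield Post(
            forum_name=item["forum_name"],
            post_title=item["post_title"],
            post_url=item["post_url"],
            post_contents=item["post_contents"],
            # Assumption: dt is seconds since the epoch, read as UTC
            dt=datetime.fromtimestamp(item["dt"], tz=timezone.utc),
        )
```

Converting `dt` to a timezone-aware `datetime` at load time keeps posts from different forums sortable on one timeline.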
```
python3 ./minecraft_forum.py --to-file ./minecraft_forum.json
Hit enter when the page is ready >
[D 200903 17:18:15 minecraft_forum:49] getting next page...
[D 200903 17:18:24 minecraft_forum:49] getting next page...
[D 200903 17:18:32 minecraft_forum:49] getting next page...
[D 200903 17:18:39 minecraft_forum:49] getting next page...
[D 200903 17:18:46 minecraft_forum:49] getting next page...
[D 200903 17:18:54 minecraft_forum:49] getting next page...
[D 200903 17:19:01 minecraft_forum:49] getting next page...
[D 200903 17:19:08 minecraft_forum:49] getting next page...
[D 200903 17:19:16 minecraft_forum:49] getting next page...
[D 200903 17:19:23 minecraft_forum:49] getting next page...
[D 200903 17:19:30 minecraft_forum:49] getting next page...
[D 200903 17:19:39 minecraft_forum:52] done, writing to file...
```
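The log above suggests a simple loop: collect the posts on the current page, move to the next page, and write everything to a file once no pages are left. A browser-independent sketch of that control flow might look like the following. The `scrape_all` and `write_posts` functions and the `fetch_page` callback are hypothetical; in a real script the callback would wrap the Selenium driver, which is deliberately left out here.

```python
import json
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("forum_scraper")

Page = List[Dict[str, object]]


def scrape_all(fetch_page: Callable[[int], Optional[Page]]) -> Page:
    """Call fetch_page(1), fetch_page(2), ... until it returns None."""
    posts: Page = []
    page_number = 1
    while True:
        page = fetch_page(page_number)
        if page is None:  # no more pages to read
            break
        posts.extend(page)
        logger.debug("getting next page...")
        page_number += 1
    return posts


def write_posts(posts: Page, to_file: str) -> None:
    """Dump the collected posts to a JSON file."""
    logger.debug("done, writing to file...")
    with open(to_file, "w") as f:
        json.dump(posts, f, indent=2)
```

Keeping the page-fetching step behind a callback means the loop can be exercised with canned data, without opening a browser.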