Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/purarue/old_forums

Parses posts/achievements from random forums I used in the past

forum minecraft selenium webscraping

Last synced: about 2 months ago

Host: GitHub
URL: https://github.com/purarue/old_forums
Owner: purarue
License: apache-2.0
Created: 2020-09-04T01:17:30.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2024-10-25T17:35:28.000Z (4 months ago)
Last Synced: 2024-11-01T14:43:20.343Z (3 months ago)
Topics: forum, minecraft, selenium, webscraping
Language: Python
Homepage:
Size: 22.5 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# old_forums

Parses posts/achievements from random forums I used in the past. I don't use any of these anymore, but they contain random thoughts I had back then, so I parse them to keep access to them

The bit of lib code here pulls CSS selectors from a config file to detect/parse achievement pages. I use this in my personal [HPI](https://github.com/purarue/HPI-personal) modules
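To illustrate the config-driven idea, here is a minimal sketch, not the library's actual code. The `SELECTORS` mapping, the site name `example_forum`, the class name `achievement-title`, and the `parse_achievements` helper are all hypothetical. To stay dependency-free it only understands selectors of the form `tag.class`; a real setup would more likely hand full CSS selectors to an HTML parsing library.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Hypothetical config: one selector per site, restricted to the form "tag.class".
SELECTORS: Dict[str, str] = {
    "example_forum": "div.achievement-title",
}


class SelectorTextParser(HTMLParser):
    """Collects the text of every element matching a 'tag.class' selector."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.tag, self.cls = selector.split(".", 1)
        self.depth = 0  # open same-named tags inside the current match
        self.current: Optional[List[str]] = None  # text pieces of the current match
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.current is not None:
            # Already inside a match: only track nesting of the same tag name.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self.current = []
            self.depth = 1

    def handle_endtag(self, tag):
        if self.current is not None and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current).strip())
                self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)


def parse_achievements(site: str, html: str) -> List[str]:
    """Look up the site's selector in the config and extract matching text."""
    parser = SelectorTextParser(SELECTORS[site])
    parser.feed(html)
    parser.close()
    return parser.results
```

Supporting another site in this sketch means adding one entry to `SELECTORS`, which is the appeal of keeping selectors in config rather than in code.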

The forum posts are loaded from JSON files created by `./selenium_scripts`, while forum achievements are parsed from the raw HTML pages (i.e., by right-clicking and `save as`ing a page, so that it's possible to update)

This is quite a personal library, as generalizing this to any number of sites isn't trivial, though the [`achievements` portion](./old_forums/achievements.py) of the library could possibly be re-used, if you have some webscraping know-how

## Installation

Requires `python3.7+`

To install with pip, run:

    pip install git+https://github.com/purarue/old_forums

### selenium_scripts

Putting these up here as reference. I have so few posts on some of these that I didn't have to worry about pagination.

All the posts get pulled out into a common schema:

```
forum_name: str
post_title: str (name/title of the post)
post_url: str (url to the post)
post_contents: str (what I actually said)
dt: epoch datetime
```
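As a rough sketch of how that schema could be represented when reading the JSON back in: the `Post` dataclass and `parse_posts` helper below are illustrative names, not the library's API. The sketch assumes each file holds a top-level JSON array of objects with exactly the fields above, and that `dt` is stored as seconds since the Unix epoch.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Post:
    forum_name: str
    post_title: str     # name/title of the post
    post_url: str       # url to the post
    post_contents: str  # what was actually said
    dt: datetime        # converted from the epoch value in the JSON


def parse_posts(raw: str) -> List[Post]:
    """Parse a JSON array of post objects into Post records."""
    posts = []
    for item in json.loads(raw):
        posts.append(
            Post(
                forum_name=item["forum_name"],
                post_title=item["post_title"],
                post_url=item["post_url"],
                post_contents=item["post_contents"],
                # Assumption: dt is seconds since the Unix epoch (UTC).
                dt=datetime.fromtimestamp(item["dt"], tz=timezone.utc),
            )
        )
    return posts
```

With a file such as `./minecraft_forum.json`, this would be called as `parse_posts(open("./minecraft_forum.json").read())`, provided the file matches the assumed layout.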

Based on code from [`steamscraper`](https://github.com/purarue/steamscraper)

As an example, running `minecraft_forum.py`:

```
python3 ./minecraft_forum.py --to-file ./minecraft_forum.json
Hit enter when the page is ready >
[D 200903 17:18:15 minecraft_forum:49] getting next page...
[D 200903 17:18:24 minecraft_forum:49] getting next page...
[D 200903 17:18:32 minecraft_forum:49] getting next page...
[D 200903 17:18:39 minecraft_forum:49] getting next page...
[D 200903 17:18:46 minecraft_forum:49] getting next page...
[D 200903 17:18:54 minecraft_forum:49] getting next page...
[D 200903 17:19:01 minecraft_forum:49] getting next page...
[D 200903 17:19:08 minecraft_forum:49] getting next page...
[D 200903 17:19:16 minecraft_forum:49] getting next page...
[D 200903 17:19:23 minecraft_forum:49] getting next page...
[D 200903 17:19:30 minecraft_forum:49] getting next page...
[D 200903 17:19:39 minecraft_forum:52] done, writing to file...
```