Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/evilsh3ll/datahoarder-website-to-markdown
- Host: GitHub
- URL: https://github.com/evilsh3ll/datahoarder-website-to-markdown
- Owner: evilsh3ll
- License: gpl-2.0
- Created: 2023-03-23T09:21:06.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-08-07T08:02:44.000Z (over 1 year ago)
- Last Synced: 2024-08-01T20:48:29.942Z (4 months ago)
- Language: Shell
- Size: 24.4 KB
- Stars: 33
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- project-awesome - evilsh3ll/datahoarder-website-to-markdown - (Shell)
README
# 💾 datahoarder-website-to-markdown 🏴☠️
### Description ⚡
The script takes a cookie and a list of forum/webpage index URLs as input, scrapes every URL found on those indexes, and downloads the associated pages (HTML). Each HTML file is converted to a **lightweight** Markdown page (~15-20 KB), trimmed with **sed** (the trimming expressions differ from website to website and must be edited accordingly), and saved in a folder named after its index (see the list at the top of the script).
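A minimal sketch of that loop, with a placeholder cookie, index URL, and sed expression; the exact crawley and html2md invocations are assumptions, so check each tool's help output:

```sh
#!/bin/sh
# Hypothetical values: take the real cookie from your browser session.
COOKIE='session=xxxxxxxx'
INDEX='https://forum.example.com/some-index'
OUTDIR=$(basename "$INDEX")            # folder named after the index
mkdir -p "$OUTDIR"

crawley "$INDEX" > urls.txt            # crawley prints discovered URLs to stdout

while read -r url; do
    page="$OUTDIR/$(basename "$url")"
    curl -s -b "$COOKIE" -o "$page.html" "$url"
    html2md "$page.html" > "$page.md"  # invocation is an assumption; see html2md --help
    # Site-specific trimming: adapt this sed expression per website.
    sed -i '/^Share this page/,$d' "$page.md"
done < urls.txt
```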
All the scraped content is then uploaded to a remote git repository; by configuring git to store your credentials, the whole process can run unattended.
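For example, a hedged sketch of the upload step (the repo path and branch are placeholders):

```sh
# Store credentials once so later pushes run unattended.
git config --global credential.helper store

# Sync the scraped folders into a local clone of the remote repo, then push.
rsync -a "$OUTDIR/" ~/scrape-repo/"$OUTDIR/"
cd ~/scrape-repo
git add -A
git commit -m "scrape $(date -u +%F)"
git push origin main
```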
- forums that require clicking "Like" to reveal the thread content are supported
- if the connection drops or the website blocks the scraper, the script can be resumed without losing the previously scraped files (see the sketch after this list)
- deleted files are moved to the trash (**gio trash** is used instead of **rm**)
- the script must be edited before it will run correctly
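A sketch of the resume check and the trash-based delete, assuming a hypothetical `fetch_and_convert` helper that wraps the download/convert/trim steps above:

```sh
# Skip URLs whose markdown output already exists from a previous run.
while read -r url; do
    out="$OUTDIR/$(basename "$url").md"
    [ -f "$out" ] && continue          # already scraped: resume past it
    fetch_and_convert "$url" "$out"    # hypothetical helper wrapping curl/html2md/sed
done < urls.txt

# Recoverable delete: files go to the desktop trash instead of being rm'd.
gio trash "$OUTDIR/obsolete-page.md"
```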
### Screens 🖼️
![image](https://i.imgur.com/gDKXN9T.png)
### Dependencies 📜
- [html2md](https://github.com/suntong/html2md)
- [crawley](https://github.com/s0rg/crawley)
- curl
- rsync
- git
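The two Go tools can likely be installed with `go install`; the module paths below are assumptions, so verify them against each project's README:

```sh
go install github.com/suntong/html2md@latest
go install github.com/s0rg/crawley/cmd/crawley@latest
```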