https://github.com/jakesteam/stumbleupon-extract
Extracted & parsed StumbleUpon data
csv html python stumbleupon wayback-machine
- Host: GitHub
- URL: https://github.com/jakesteam/stumbleupon-extract
- Owner: JakeSteam
- License: MIT
- Created: 2024-10-17T14:21:17.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-27T19:12:50.000Z (4 months ago)
- Last Synced: 2024-12-17T22:33:28.750Z (about 2 months ago)
- Topics: csv, html, python, stumbleupon, wayback-machine
- Language: HTML
- Homepage: https://blog.jakelee.co.uk/bulk-downloading-website-history-and-parsing/
- Size: 5.97 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# StumbleUpon extract tools & data
Fully processed StumbleUpon data extracted from the Wayback Machine, for [an article](https://blog.jakelee.co.uk/bulk-downloading-website-history-and-parsing/).
## What's in this repo?
- `/data-parsed/`
  - `parsed-cleaned.csv`: Final deduplicated extracted data.
  - `parsed.csv`: Data before deduplication.
- `/data-raw/`: Output of `waybackpack`, organised by timestamp and URL.
- `/samples/`: Examples of the downloaded HTML, an individual StumbleUpon link, and the resulting CSV data.
- `/url-analysis/`: The raw URLs from `parsed-cleaned.csv`, plus their status codes as checked with `vl`.
- `clean_stumbleupon_metadata.py`: Tool to deduplicate a CSV by `id` field (convert `parsed.csv` into `parsed-cleaned.csv`).
- `extract_stumbleupon_metadata.py`: Tool to extract contents of downloaded StumbleUpon pages (convert `data-raw` contents into `parsed.csv`).
- `analyse_stumbleupon_metadata.py`: Miscellaneous code to analyse the parsed data. This changes as required; the full scripts are available in the original article.
## How to recreate results?
To recreate the final output ([`parsed-cleaned.csv`](/data-parsed/parsed-cleaned.csv)):
1. Install Python dependencies (`pip install beautifulsoup4 lxml pandas`)
2. Run Wayback Machine download script (`waybackpack http://www.stumbleupon.com/discover/toprated/ -d "/Projects/StumbleUpon-extract/data-raw"`)
3. Run parsing script (`python extract_stumbleupon_metadata.py`)
4. Run deduping script (`python clean_stumbleupon_metadata.py`)
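As a rough illustration of what the deduping step does (collapse rows that share an `id`), here is a minimal standard-library sketch. It is not the code from `clean_stumbleupon_metadata.py`, which may well use `pandas` instead. The function names and the `id,url` sample columns are assumptions made for the example.

```python
import csv
import io


def dedupe_rows(rows, key="id"):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def dedupe_csv(src, dst, key="id"):
    """Read CSV from the file object `src`, write deduplicated CSV to `dst`.

    Returns the number of rows kept. Both names are hypothetical helpers,
    not functions from this repo.
    """
    reader = csv.DictReader(src)
    fieldnames = reader.fieldnames or []  # reads the header row
    rows = dedupe_rows(reader, key)
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    # Tiny in-memory sample; the real parsed.csv columns may differ.
    sample = io.StringIO(
        "id,url\n1,http://a.example\n2,http://b.example\n1,http://a.example\n"
    )
    out = io.StringIO()
    print(dedupe_csv(sample, out), "rows kept")
```

To try it on the real files, open `data-parsed/parsed.csv` for reading and a new output file for writing (both with `newline=""`, as the `csv` module recommends) and pass them to `dedupe_csv`.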