Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dmuhs/pastebin-scraper
Live-scraping pastebin to fight boredom.
multithreading mysql pastebin python scraper sqlite
- Host: GitHub
- URL: https://github.com/dmuhs/pastebin-scraper
- Owner: dmuhs
- Created: 2016-06-01T23:54:20.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-06-18T16:27:26.000Z (over 2 years ago)
- Last Synced: 2023-02-26T18:01:52.553Z (almost 2 years ago)
- Topics: multithreading, mysql, pastebin, python, scraper, sqlite
- Language: Python
- Homepage:
- Size: 47.9 KB
- Stars: 95
- Watchers: 3
- Forks: 70
- Open Issues: 7
- Metadata Files:
  - Readme: README.md
Awesome Lists containing this project
README
## pastebin-scraper
This is a multithreaded scraping script for [Pastebin](http://pastebin.com/). It scrapes the main site for new pastes, downloads their raw content and processes them according to a user-defined output format.
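Conceptually, the flow is: poll the Pastebin front page for fresh paste links, queue them, and let worker threads fetch the raw content. Below is a minimal sketch of that idea, not the project's actual code; the archive URL, the paste-ID regex and the print-based output step are assumptions.

```
# Sketch of the scrape -> queue -> download-worker flow (illustrative only).
import queue
import re
import threading

import requests

paste_queue = queue.Queue()

def find_new_pastes():
    """Fetch the archive page and pull out paste IDs (URL and regex are assumptions)."""
    html = requests.get("https://pastebin.com/archive", timeout=10).text
    return re.findall(r'href="/(\w{8})"', html)

def download_worker():
    """Fetch the raw content of each queued paste and hand it to an output step."""
    while True:
        paste_id = paste_queue.get()
        raw = requests.get(f"https://pastebin.com/raw/{paste_id}", timeout=10).text
        print(paste_id, len(raw), "bytes")  # stand-in for the user-defined output format
        paste_queue.task_done()

for _ in range(4):  # e.g. DownloadWorkers = 4
    threading.Thread(target=download_worker, daemon=True).start()

for pid in find_new_pastes():
    paste_queue.put(pid)
paste_queue.join()
```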
### WHY?
Fun.

### Installation
The usual dance.
```
pip install -r requirements.txt
```

Define all required specs in `settings.ini`. Should you decide to go with a database output, make sure the respective connector is installed. At the moment, MySQL (via `pymysql`) and SQLite (via Python 3's built-in `sqlite3` module) are supported.
Also note that the file output creates a subdirectory `output` and dumps every paste as a separate file into it.
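As a rough illustration of that file output mode, each paste could be dumped into the `output` subdirectory like this (a sketch only; the `.txt` suffix and the function name are assumptions):

```
import os

def write_paste_to_file(paste_id, raw_content, encoding="utf-8"):
    """Dump one paste into the output/ subdirectory as its own file."""
    os.makedirs("output", exist_ok=True)
    path = os.path.join("output", f"{paste_id}.txt")  # .txt suffix is an assumption
    with open(path, "w", encoding=encoding) as fh:
        fh.write(raw_content)
    return path
```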
### Settings
`ini` is a highly underrated file format. Here are some definitions of what the settings parameters actually do.

#### GENERAL
- `PasteLimit` Stop after having scraped n pastes. Set to 0 for indefinite scraping
- `PBLink` URL to Pastebin or another equivalent site
- `DownloadWorkers` Number of workers that download the raw paste content and further process it
- `NewPasteCheckInterval` Time to wait before checking the main site for new pastes again
- `IPBlockedWaitTime` Time to wait until checking the main site again after the scraper's IP has been blocked
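To show how these GENERAL values could drive the main loop, here is a hedged sketch that reads them with `configparser`. The section name, treating the intervals as seconds, and the `IPBlocked` exception are assumptions; the project's actual loop may differ.

```
import configparser
import time

config = configparser.ConfigParser()
config.read("settings.ini")

# Section/key names mirror the list above; treating the intervals as seconds is an assumption.
paste_limit = config.getint("GENERAL", "PasteLimit")
check_interval = config.getint("GENERAL", "NewPasteCheckInterval")
blocked_wait = config.getint("GENERAL", "IPBlockedWaitTime")

class IPBlocked(Exception):
    """Raised by the fetch step when Pastebin blocks the scraper's IP (illustrative)."""

def find_new_pastes():
    """Placeholder for the fetch step sketched near the top of this README."""
    return []

scraped = 0
while paste_limit == 0 or scraped < paste_limit:
    try:
        scraped += len(find_new_pastes())
        time.sleep(check_interval)
    except IPBlocked:
        time.sleep(blocked_wait)
```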
#### LOGGING

- `RotationLog` Location of log file that contains debug output
- `MaxRotationSize` Size in bytes before another log file is created
- `RotationBackupCount` Maximum number of log files to keep
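These three LOGGING values map naturally onto Python's `logging.handlers.RotatingFileHandler`; a sketch of how they might be wired up (the logger name, format string and section name are assumptions):

```
import configparser
import logging
from logging.handlers import RotatingFileHandler

config = configparser.ConfigParser()
config.read("settings.ini")

handler = RotatingFileHandler(
    filename=config.get("LOGGING", "RotationLog"),
    maxBytes=config.getint("LOGGING", "MaxRotationSize"),
    backupCount=config.getint("LOGGING", "RotationBackupCount"),
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("pastebin-scraper")  # logger name is an assumption
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.debug("logging initialised")
```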
#### STDOUT / FILE

- `Enable` Enable formatted stdout output of paste data
- `ContentDisplayLimit` Maximum amount of characters to show before content is cut off (0 to display all)
- `ShowName` Display the paste name
- `ShowLang` Display the paste language
- `ShowLink` Display the complete paste link
- `ShowData` Display the raw paste content
- `DataEncoding` Encoding of the raw paste data
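To make the STDOUT options concrete, here is a hedged sketch of the kind of formatting they control; the function name, field names and defaults are assumptions beyond what the list states.

```
def format_paste_for_stdout(name, lang, link, data, *,
                            content_limit=200, show_name=True, show_lang=True,
                            show_link=True, show_data=True):
    """Build a printable summary of one paste, honouring the STDOUT settings."""
    lines = []
    if show_name:
        lines.append(f"Name: {name}")
    if show_lang:
        lines.append(f"Lang: {lang}")
    if show_link:
        lines.append(f"Link: {link}")
    if show_data:
        # ContentDisplayLimit = 0 means "display everything"
        lines.append(data if content_limit == 0 else data[:content_limit])
    return "\n".join(lines)

print(format_paste_for_stdout("example", "python", "https://pastebin.com/raw/XXXXXXXX",
                              "print('hello')", content_limit=50))
```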
#### MYSQL

- `Enable` Enable MySQL output
- `TableName` Main table name to insert data into
- `Host` MySQL server host
- `Port` MySQL server port
- `Username` MySQL server user
- `Password` User password
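A sketch of how the MYSQL values might translate into a `pymysql` insert; the database name, the `pastes` table standing in for `TableName`, and the column layout are assumptions, not the project's actual schema.

```
import pymysql

conn = pymysql.connect(
    host="localhost", port=3306,          # Host / Port
    user="scraper", password="secret",    # Username / Password
    database="pastebin",                  # database name is an assumption
)
with conn.cursor() as cur:
    # "pastes" stands in for TableName; the column layout is an assumption.
    cur.execute(
        "INSERT INTO pastes (paste_id, name, lang, content) VALUES (%s, %s, %s, %s)",
        ("XXXXXXXX", "example", "python", "print('hello')"),
    )
conn.commit()
conn.close()
```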
#### SQLITE

- `Enable` Enable SQLite output
- `Filename` Filename the db should be saved as (usually ends with .db)
- `TableName` Main table name to insert data into
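And the SQLite counterpart, using only Python's built-in `sqlite3` module; again a sketch, with the `pastes` table name and column layout as assumptions.

```
import sqlite3

conn = sqlite3.connect("pastes.db")  # Filename
conn.execute(
    # "pastes" stands in for TableName; the column layout is an assumption.
    "CREATE TABLE IF NOT EXISTS pastes (paste_id TEXT, name TEXT, lang TEXT, content TEXT)"
)
conn.execute(
    "INSERT INTO pastes (paste_id, name, lang, content) VALUES (?, ?, ?, ?)",
    ("XXXXXXXX", "example", "python", "print('hello')"),
)
conn.commit()
conn.close()
```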
---

If you use this thing for some cool data analysis or even research, let me know if I can help!
Inspiration for this scraper was taken from [here](http://www.michielovertoom.com/python/pastebin-abused/).