https://github.com/deavid/scradaway
Scraper in Python Deavid's way
https://github.com/deavid/scradaway
Last synced: 9 months ago
JSON representation
Scraper in Python Deavid's way
- Host: GitHub
- URL: https://github.com/deavid/scradaway
- Owner: deavid
- License: mit
- Created: 2017-05-22T17:01:59.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-05-22T17:04:00.000Z (about 9 years ago)
- Last Synced: 2025-03-06T11:30:28.927Z (over 1 year ago)
- Language: Python
- Size: 6.84 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# scradaway
Scraper in Python Deavid's way
This is a fairly simple scraper that uses PostgreSQL as a download queue.
It reads config.xml, connects to the database and from there resumes the last
download queue.
Features
----------
- Multiple sites at once.
- Allows for including/excluding url's using regular expression. Multiple
expressions are allowed, one per line.
- Extracts data as JSON, using CSS selectors. You decide how to name your
json object keys (Properties) without having to modify the underlying table
structure.
- Multithreading: by default, up to 16 threads are used to download at once.
- Cloud-ready: as Scradaway uses PostgreSQL internal
locking as means of correct queuing, any instance of scradaway that connects
to the same database will colaborate with the other instances.
- Different instances of Scradaway can work with different sites and connect
to the same database.
- Download is always resumed from the last state.
Quickstart
-----------------
To begin doing some work, simply adapt config.xml on scradaway/ folder to your
needs and do:
$ seq 2 | xargs -P8 -n1 -t python3 scradaway.py
This will start two workers with up to 16 threads each one. You can launch as
many instances as cores you have, but beware each thread uses a separate
connection to PostgreSQL, so you will have to modify postgresql.conf:
max_connections = 2000 # by default PostgreSQL comes with 100
And after that you will have to restart PostgreSQL (reload isn't enough).