https://github.com/tinram/login-spider
Spider through a website login and process the pages behind it.
- Host: GitHub
- URL: https://github.com/tinram/login-spider
- Owner: Tinram
- License: gpl-3.0
- Created: 2018-02-08T14:21:08.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2019-11-22T09:12:51.000Z (over 6 years ago)
- Last Synced: 2025-03-20T12:48:34.380Z (over 1 year ago)
- Topics: log-in, login, login-spider, python, scraper, spider, website, website-scraper
- Language: Python
- Homepage:
- Size: 15.6 KB
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Login Spider
#### Spider website pages protected by a login.
## Purpose
Log in to a website to reach the area reserved for a registered user, then spider the page links and process the page content.
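
The spider-then-process step can be sketched as a breadth-first crawl. This is a minimal illustration, not the code in *login_spider.py*: it assumes Python 3 and the standard library only, and the names `LinkParser`, `crawl`, `fetch` and `process` are hypothetical. `fetch` stands in for whatever function returns a page's HTML using the logged-in session.

```python
# Python 3, standard library only. All names are hypothetical,
# not taken from login_spider.py.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, process):
    """Breadth-first crawl of pages on the same host as start_url.

    fetch(url) returns the page HTML (assumed to carry the login session);
    process(url, html) is called once per page.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        process(url, html)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# Usage with an in-memory stand-in for a website (no network needed).
site = {
    "http://localhost/": '<a href="/a">A</a> <a href="http://other.example/">x</a>',
    "http://localhost/a": '<a href="/">home</a> <a href="b#top">B</a>',
    "http://localhost/b": "no links",
}
visited = []
crawl("http://localhost/", site.__getitem__, lambda url, html: visited.append(url))
print(visited)
```

Passing `fetch` in as a parameter keeps the crawl logic independent of the HTTP client, so the same loop could be driven by pycurl or tested offline as above.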
## Requirements
+ Python 2.6+
+ pycurl
## Background
The spider was written on-site on Windows, where the Python dependencies available for building it were somewhat restricted: `pip` would install only some packages, and *Beautiful Soup* was not one of them.
## Usage
Configure the website access details in the *CONFIG* section of *login_spider.py*.
(You will need to view the HTML source of the website's login form to configure the *FORM_POST* string, as each site uses something different.)
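
As an illustration of building such a string, the sketch below assumes that *FORM_POST* is an ordinary URL-encoded form body and that Python 3 is available. The field names `username`, `password` and `submit` are hypothetical; the real ones come from the `name` attributes of the `<input>` elements in the login form.

```python
from urllib.parse import urlencode

# Hypothetical field names and values: replace them with the name="..."
# attributes found in the login form's HTML source.
form_fields = [
    ("username", "alice@example.com"),
    ("password", "p&ss word"),
    ("submit", "Login"),
]

# urlencode percent-encodes reserved characters and joins the pairs with "&".
FORM_POST = urlencode(form_fields)
print(FORM_POST)
# username=alice%40example.com&password=p%26ss+word&submit=Login
```

Encoding the values this way matters when a password contains characters such as `&` or `=`, which would otherwise be read as field separators.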
Execute:

    python login_spider.py
## Speed
Approximately 35 seconds to process a 200-page website over a localhost connection (zero network overhead), depending on CPU and OS.
## Credits
jfs and philshem for threading pools in Python.
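
For readers unfamiliar with thread pools, here is a minimal standard-library sketch in Python 3. It is not necessarily the approach used in *login_spider.py*; `process_page` and `process_all` are hypothetical names, and the placeholder does no real fetching.

```python
from multiprocessing.pool import ThreadPool


def process_page(url):
    """Placeholder for fetching and processing one page (hypothetical)."""
    return url, len(url)


def process_all(urls, workers=4):
    # Threads suit I/O-bound work such as waiting on HTTP responses.
    with ThreadPool(workers) as pool:
        # map() returns results in the same order as the input list.
        return pool.map(process_page, urls)


urls = ["http://localhost/page%d" % n for n in range(5)]
results = process_all(urls)
print(len(results))
```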
## License
Login Spider is released under the [GPL v.3](https://www.gnu.org/licenses/gpl-3.0.html).