https://github.com/tinram/login-spider

Spider through a website login and process the pages behind it.

log-in login login-spider python scraper spider website website-scraper

Last synced: 10 months ago

Host: GitHub
URL: https://github.com/tinram/login-spider
Owner: Tinram
License: gpl-3.0
Created: 2018-02-08T14:21:08.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2019-11-22T09:12:51.000Z (over 6 years ago)
Last Synced: 2025-03-20T12:48:34.380Z (over 1 year ago)
Topics: log-in, login, login-spider, python, scraper, spider, website, website-scraper
Language: Python
Homepage:
Size: 15.6 KB
Stars: 1
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Login Spider

#### Spider website pages protected by a login.

## Purpose

Log in to a website to reach a registered user's area, then spider the page links and process the page content.
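
The link-spidering step can be illustrated with the standard library alone. The following is a minimal sketch, not the code in *login_spider.py*: the names `LinkCollector` and `same_site_links` are hypothetical, and restricting the crawl to links on the same host is an assumption.

```python
try:
    from html.parser import HTMLParser            # Python 3
    from urllib.parse import urljoin, urlparse
except ImportError:
    from HTMLParser import HTMLParser             # Python 2
    from urlparse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, html):
    """Return absolute, de-duplicated links on the same host as base_url."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links
```

Each returned URL would then be fetched with the logged-in session and passed to whatever page processing is required.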

## Requirements

+ Python 2.6+
+ pycurl

## Background

The spider was built on-site on Windows, where the available Python dependencies were somewhat restricted: `pip` would install only some packages, and *Beautiful Soup* was not one of them.

## Usage

Configure the website access details in the *CONFIG* section of *login_spider.py*.

(You will need to view the HTML source of the website's login form to configure the *FORM_POST* string, as each site's form is different.)
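
As an aid to working out a *FORM_POST* value, the sketch below reads the named `<input>` fields from a login form's HTML and URL-encodes them together with the credentials. It is an illustration under assumptions: `FormFieldCollector`, `build_form_post`, and the field names `username`, `password`, and `token` are hypothetical, and the exact format *login_spider.py* expects should be checked against its *CONFIG* section.

```python
try:
    from html.parser import HTMLParser     # Python 3
    from urllib.parse import urlencode
except ImportError:
    from HTMLParser import HTMLParser      # Python 2
    from urllib import urlencode


class FormFieldCollector(HTMLParser):
    """Record (name, value) for each named <input> (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if attrs.get("name"):
                self.fields.append((attrs["name"], attrs.get("value") or ""))


def build_form_post(login_html, credentials):
    """Return a URL-encoded POST body for the form in login_html.

    Values in credentials override the form's defaults; hidden fields
    keep the value found in the HTML.
    """
    collector = FormFieldCollector()
    collector.feed(login_html)
    pairs = [(name, credentials.get(name, value))
             for name, value in collector.fields]
    return urlencode(pairs)


if __name__ == "__main__":
    # Example form; the field names here are assumptions for illustration.
    form = ('<form method="post" action="/login">'
            '<input type="text" name="username">'
            '<input type="password" name="password">'
            '<input type="hidden" name="token" value="abc 123">'
            '<input type="submit" value="Log in">'
            '</form>')
    print(build_form_post(form, {"username": "alice", "password": "s3cret&x"}))
```

Inputs without a `name` attribute, such as the submit button above, are skipped because browsers do not send them.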

Execute:

    python login_spider.py

## Speed

Approximately 35 seconds to process a 200-page website over a localhost connection (zero network overhead), depending on CPU and OS.

## Credits

jfs and philshem for threading pools in Python.
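
For readers unfamiliar with the thread-pool technique credited above, here is a minimal sketch using the standard library's `multiprocessing.pool.ThreadPool`. It is not taken from *login_spider.py*: `process_page`, `process_all`, and the default of four workers are assumptions, and the placeholder worker only measures the URL's length where a real spider would fetch and process the page.

```python
from multiprocessing.pool import ThreadPool


def process_page(url):
    """Placeholder per-page work (hypothetical): return the URL and its length."""
    return (url, len(url))


def process_all(urls, workers=4):
    """Run process_page over urls on a pool of worker threads.

    pool.map returns results in the same order as the input list.
    """
    pool = ThreadPool(workers)
    try:
        return pool.map(process_page, urls)
    finally:
        pool.close()
        pool.join()
```

Threads suit this kind of work because each worker spends most of its time waiting on the network rather than on the CPU.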

## License

GPL-3.0 (see the LICENSE file).
