Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/debugtalk/webcrawler
A web crawler based on requests-html, mainly targets for url validation test.
https://github.com/debugtalk/webcrawler
crawler requests-html web-crawler weblink
Last synced: about 1 month ago
JSON representation
A web crawler based on requests-html, mainly targets for url validation test.
- Host: GitHub
- URL: https://github.com/debugtalk/webcrawler
- Owner: debugtalk
- License: mit
- Created: 2017-03-24T13:34:26.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2020-03-23T08:00:39.000Z (almost 5 years ago)
- Last Synced: 2024-11-01T10:42:31.435Z (about 2 months ago)
- Topics: crawler, requests-html, web-crawler, weblink
- Language: Python
- Homepage:
- Size: 81.1 KB
- Stars: 32
- Watchers: 4
- Forks: 12
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# WebCrawler
A simple web crawler, mainly targets for link validation test.
## Features
- running in BFS or DFS mode
- specify concurrent running workers in BFS mode
- crawl seeds can be set to more than one urls
- support crawl with cookies
- configure hyper links regex, including match type and ignore type
- group visited urls by HTTP status code
- flexible configuration in YAML
- send test result by mail, through SMTP protocol or mailgun service
- cancel jobs## Installation/Upgrade
```bash
$ pip install -U git+https://github.com/debugtalk/WebCrawler.git#egg=requests-crawler --process-dependency-links
```To ensure the installation or upgrade is successful, you can execute command `webcrawler -V` to see if you can get the correct version number.
```bash
$ webcrawler -V
jenkins-mail-py version: 0.2.4
WebCrawler version: 0.3.0
```## Usage
```text
$ webcrawler -h
usage: webcrawler [-h] [-V] [--log-level LOG_LEVEL]
[--config-file CONFIG_FILE] [--seeds SEEDS]
[--include-hosts INCLUDE_HOSTS] [--cookies COOKIES]
[--crawl-mode CRAWL_MODE] [--max-depth MAX_DEPTH]
[--concurrency CONCURRENCY] [--save-results SAVE_RESULTS]
[--grey-user-agent GREY_USER_AGENT]
[--grey-traceid GREY_TRACEID]
[--grey-view-grey GREY_VIEW_GREY]
[--mailgun-api-id MAILGUN_API_ID]
[--mailgun-api-key MAILGUN_API_KEY]
[--mail-sender MAIL_SENDER]
[--mail-recepients [MAIL_RECEPIENTS [MAIL_RECEPIENTS ...]]]
[--mail-subject MAIL_SUBJECT] [--mail-content MAIL_CONTENT]
[--jenkins-job-name JENKINS_JOB_NAME]
[--jenkins-job-url JENKINS_JOB_URL]
[--jenkins-build-number JENKINS_BUILD_NUMBER]A web crawler for testing website links validation.
optional arguments:
-h, --help show this help message and exit
-V, --version show version
--log-level LOG_LEVEL
Specify logging level, default is INFO.
--config-file CONFIG_FILE
Specify config file path.
--seeds SEEDS Specify crawl seed url(s), several urls can be
specified with pipe; if auth needed, seeds can be
specified like user1:pwd1@url1|user2:pwd2@url2
--include-hosts INCLUDE_HOSTS
Specify extra hosts to be crawled.
--cookies COOKIES Specify cookies, several cookies can be joined by '|'.
e.g. 'lang:en,country:us|lang:zh,country:cn'
--crawl-mode CRAWL_MODE
Specify crawl mode, BFS or DFS.
--max-depth MAX_DEPTH
Specify max crawl depth.
--concurrency CONCURRENCY
Specify concurrent workers number.
--save-results SAVE_RESULTS
Specify if save results, default is NO.
--grey-user-agent GREY_USER_AGENT
Specify grey environment header User-Agent.
--grey-traceid GREY_TRACEID
Specify grey environment cookie traceid.
--grey-view-grey GREY_VIEW_GREY
Specify grey environment cookie view_gray.
--mailgun-api-id MAILGUN_API_ID
Specify mailgun api id.
--mailgun-api-key MAILGUN_API_KEY
Specify mailgun api key.
--mail-sender MAIL_SENDER
Specify email sender.
--mail-recepients [MAIL_RECEPIENTS [MAIL_RECEPIENTS ...]]
Specify email recepients.
--mail-subject MAIL_SUBJECT
Specify email subject.
--mail-content MAIL_CONTENT
Specify email content.
--jenkins-job-name JENKINS_JOB_NAME
Specify jenkins job name.
--jenkins-job-url JENKINS_JOB_URL
Specify jenkins job url.
--jenkins-build-number JENKINS_BUILD_NUMBER
Specify jenkins build number.
```## Examples
Specify config file.
```bash
$ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --config-file path/to/config.yml
```Crawl in BFS mode with 20 concurrent workers, and set maximum depth to 5.
```bash
$ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --concurrency 20
```Crawl in DFS mode, and set maximum depth to 10.
```bash
$ webcrawler --seeds http://debugtalk.com --crawl-mode dfs --max-depth 10
```Crawl several websites in BFS mode with 20 concurrent workers, and set maximum depth to 10.
```bash
$ webcrawler --seeds http://debugtalk.com,http://blog.debugtalk.com --crawl-mode bfs --max-depth 10 --concurrency 20
```Crawl with different cookies.
```text
$ webcrawler --seeds http://debugtalk.com --crawl-mode BFS --max-depth 10 --concurrency 50 --cookies 'lang:en,country:us|lang:zh,country:cn'
```## Supported Python Versions
WebCrawler supports Python 2.7, 3.3, 3.4, 3.5, and 3.6.
## License
Open source licensed under the MIT license (see LICENSE file for details).