Spider
https://github.com/manning23/MSpider
- Host: GitHub
- URL: https://github.com/manning23/MSpider
- Owner: manning23
- License: gpl-2.0
- Created: 2014-11-15T05:59:08.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2022-07-11T12:03:58.000Z (over 3 years ago)
- Last Synced: 2024-10-29T17:51:15.315Z (about 1 year ago)
- Topics: mspider
- Language: Python
- Size: 1.2 MB
- Stars: 348
- Watchers: 55
- Forks: 192
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-crawler - MSpider - A simple, easy spider using gevent and JS rendering. (Python)
- awesome-crawler-cn - MSpider - A Python spider based on gevent (a coroutine networking library). (Python)
README
# MSpider
## Talk
The information security department of 360 is hiring on an ongoing basis; if you are interested, contact zhangxin1[at]360.cn.
## Installation
On Ubuntu, you need to install a few dependencies. You can use pip, easy_install, or apt-get to do this (an example install command follows the list).
- lxml
- chardet
- splinter
- gevent
- phantomjs
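As a rough sketch (assuming pip and apt-get are available on the system; the project does not pin versions), the installation could look like this:
```
# Python libraries used by MSpider
pip install lxml chardet splinter gevent

# PhantomJS headless browser, used by the dynamic (JS-rendering) crawl mode
sudo apt-get install phantomjs
```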
## Example
1. Use MSpider to collect vulnerability reports from wooyun.org.
```
python mspider.py -u "http://www.wooyun.org/bugs/" --focus-domain "wooyun.org" --filter-keyword "xxx" --focus-keyword "bugs" -t 15 --random-agent true
```
2. Use MSpider to collect news articles from news.sina.com.cn.
```
python mspider.py -u "http://news.sina.com.cn/c/2015-12-20/doc-ifxmszek7395594.shtml" --focus-domain "news.sina.com.cn" -t 15 --random-agent true
```
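The crawl mode and strategy can also be chosen explicitly. As a hypothetical example based on the option values shown in the help output below, a depth-limited, depth-first crawl using the dynamic (PhantomJS-backed) spider might be launched like this:
```
python mspider.py -u "http://www.site.com/" --spider-model 1 --spider-policy 1 --depth 3 -t 10 --random-agent true
```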
## ToDo
1. Storage of crawled information.
2. Distributed crawling.
## MSpider's help
```
Usage:
__ __ _____ _ _
| \/ |/ ____| (_) | |
| \ / | (___ _ __ _ __| | ___ _ __
| |\/| |\___ \| '_ \| |/ _` |/ _ \ '__|
| | | |____) | |_) | | (_| | __/ |
|_| |_|_____/| .__/|_|\__,_|\___|_|
| |
|_|
Author: Manning23
Options:
-h, --help show this help message and exit
-u MSPIDER_URL, --url=MSPIDER_URL
Target URL (e.g. "http://www.site.com/")
-t MSPIDER_THREADS_NUM, --threads=MSPIDER_THREADS_NUM
Max number of concurrent HTTP(s) requests (default 10)
--depth=MSPIDER_DEPTH
Crawling depth
--count=MSPIDER_COUNT
Crawling number
--time=MSPIDER_TIME Crawl time
--referer=MSPIDER_REFERER
HTTP Referer header value
--cookies=MSPIDER_COOKIES
HTTP Cookie header value
--spider-model=MSPIDER_MODEL
Crawling mode: Static_Spider: 0 Dynamic_Spider: 1
Mixed_Spider: 2
--spider-policy=MSPIDER_POLICY
Crawling strategy: Breadth-first 0 Depth-first 1
Random-first 2
--focus-keyword=MSPIDER_FOCUS_KEYWORD
Focus keyword in URL
--filter-keyword=MSPIDER_FILTER_KEYWORD
Filter keyword in URL
--filter-domain=MSPIDER_FILTER_DOMAIN
Filter domain
--focus-domain=MSPIDER_FOCUS_DOMAIN
Focus domain
--random-agent=MSPIDER_AGENT
Use randomly selected HTTP User-Agent header value
--print-all=MSPIDER_PRINT_ALL
Will show more information
```