Spider
https://github.com/manning23/MSpider
- Host: GitHub
- URL: https://github.com/manning23/MSpider
- Owner: manning23
- License: gpl-2.0
- Created: 2014-11-15T05:59:08.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2022-07-11T12:03:58.000Z (over 3 years ago)
- Last Synced: 2024-10-29T17:51:15.315Z (about 1 year ago)
- Topics: mspider
- Language: Python
- Size: 1.2 MB
- Stars: 348
- Watchers: 55
- Forks: 192
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-crawler - MSpider - A simple, easy spider using gevent and JS rendering. (Python)
- awesome-crawler-cn - MSpider - A Python spider based on gevent (a coroutine networking library). (Python)
README
# MSpider
## Talk
The information security department of 360 is hiring on an ongoing basis; if you are interested, contact zhangxin1[at]360.cn.
## Installation
On Ubuntu, you need to install a few dependencies. You can use pip, easy_install, or apt-get to do this (an example install command follows the list).
- lxml
- chardet
- splinter
- gevent
- phantomjs
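As a rough sketch (assuming pip and apt-get are available on the system; the project does not pin versions), the installation could look like this:
```
# Python libraries used by MSpider
pip install lxml chardet splinter gevent

# PhantomJS headless browser, used by the dynamic (JS-rendering) crawl mode
sudo apt-get install phantomjs
```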
## Example
1. Use MSpider to collect vulnerability reports from wooyun.org.
```
python mspider.py -u "http://www.wooyun.org/bugs/" --focus-domain "wooyun.org" --filter-keyword "xxx" --focus-keyword "bugs" -t 15 --random-agent true
```
2. Use MSpider to collect news articles from news.sina.com.cn.
```
python mspider.py -u "http://news.sina.com.cn/c/2015-12-20/doc-ifxmszek7395594.shtml" --focus-domain "news.sina.com.cn" -t 15 --random-agent true
```
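The crawl mode and strategy can also be chosen explicitly. As a hypothetical example based on the option values shown in the help output below, a depth-limited, depth-first crawl using the dynamic (PhantomJS-backed) spider might be launched like this:
```
python mspider.py -u "http://www.site.com/" --spider-model 1 --spider-policy 1 --depth 3 -t 10 --random-agent true
```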
## ToDo
1. Storage of crawled information.
2. Distributed crawling.
## MSpider's help
```
Usage:
__ __ _____ _ _
| \/ |/ ____| (_) | |
| \ / | (___ _ __ _ __| | ___ _ __
| |\/| |\___ \| '_ \| |/ _` |/ _ \ '__|
| | | |____) | |_) | | (_| | __/ |
|_| |_|_____/| .__/|_|\__,_|\___|_|
| |
|_|
Author: Manning23
Options:
-h, --help show this help message and exit
-u MSPIDER_URL, --url=MSPIDER_URL
Target URL (e.g. "http://www.site.com/")
-t MSPIDER_THREADS_NUM, --threads=MSPIDER_THREADS_NUM
Max number of concurrent HTTP(s) requests (default 10)
--depth=MSPIDER_DEPTH
Crawling depth
--count=MSPIDER_COUNT
Crawling number
--time=MSPIDER_TIME Crawl time
--referer=MSPIDER_REFERER
HTTP Referer header value
--cookies=MSPIDER_COOKIES
HTTP Cookie header value
--spider-model=MSPIDER_MODEL
Crawling mode: Static_Spider: 0 Dynamic_Spider: 1
Mixed_Spider: 2
--spider-policy=MSPIDER_POLICY
Crawling strategy: Breadth-first 0 Depth-first 1
Random-first 2
--focus-keyword=MSPIDER_FOCUS_KEYWORD
Focus keyword in URL
--filter-keyword=MSPIDER_FILTER_KEYWORD
Filter keyword in URL
--filter-domain=MSPIDER_FILTER_DOMAIN
Filter domain
--focus-domain=MSPIDER_FOCUS_DOMAIN
Focus domain
--random-agent=MSPIDER_AGENT
Use randomly selected HTTP User-Agent header value
--print-all=MSPIDER_PRINT_ALL
Will show more information
```