https://github.com/hightman/pspider
Parallel web crawler written in pure PHP
Last synced: 8 months ago
- Host: GitHub
- URL: https://github.com/hightman/pspider
- Owner: hightman
- Created: 2013-03-08T08:47:47.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2015-09-16T09:21:38.000Z (about 10 years ago)
- Last Synced: 2024-10-29T17:51:40.053Z (about 1 year ago)
- Language: PHP
- Homepage:
- Size: 191 KB
- Stars: 265
- Watchers: 41
- Forks: 110
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-crawler - pspider - Parallel web crawler written in PHP. (PHP)
- awesome-crawler-cn - pspider - Concurrent web crawler based on PHP. (PHP)
README
PHP - spider framework
===================
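This is a parallel crawling (spider) framework recently developed in pure `php`, built on the [hightman\httpclient](https://github.com/hightman/httpclient) component.
You need [composer](http://getcomposer.org) installed first; then run the following command in the project directory to download the component: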
~~~
composer install
~~~
Using pspider
--------------
URL table management requires the MySQLi extension. For the table structure and the parts you can customize, see the custom file.
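Because the URL table depends on MySQLi, it can be worth confirming that the extension is loaded in the command-line PHP before going further. The following is a minimal sketch, not part of pspider: `has_module` is a hypothetical helper name, and it assumes the module list comes from `php -m`, which prints one loaded module per line.

```shell
# has_module NAME
# Reads a module list from stdin (one name per line, as printed by `php -m`)
# and succeeds only if NAME matches a whole line, ignoring case.
has_module() {
    grep -qixF -- "$1"
}
```

A possible use is `php -m | has_module mysqli || echo "mysqli extension not loaded" >&2`. Matching the whole line keeps `mysqlnd` from being mistaken for `mysqli`.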
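1. Copy `custom/skel.inc.php` to `custom/your.inc.php`
2. Edit `custom/your.inc.php` following the instructions in it
3. Create the MySQL URL table as described in the comments in `custom/your.inc.php`
4. Run `spider.php -u http://...` to start crawling in a loop
5. The UrlTable implementation is very simple and serves only as an example; you can reimplement it to suit your needs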
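The command-line parts of the steps above can be sketched as a small shell helper. This is an illustration under stated assumptions rather than a script shipped with pspider: the function names, the `DRY_RUN` variable, and invoking the script through the `php` binary are all hypothetical choices, and the commands are expected to run from the root of the pspider checkout. By default each command is only printed so it can be reviewed first.

```shell
# run CMD...: print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# pspider_prepare NAME: fetch dependencies and create custom/NAME.inc.php (step 1).
# NAME is a placeholder such as "your".
pspider_prepare() {
    run composer install &&
    run cp custom/skel.inc.php "custom/$1.inc.php"
}

# pspider_crawl SEED_URL: start the crawl loop (step 4).
# Call this only after steps 2 and 3 have been done by hand: editing the copied
# file and creating the MySQL URL table described in its comments.
pspider_crawl() {
    run php spider.php -u "$1"
}
```

Steps 2 and 3 stay manual on purpose, since the table definition lives in the comments of the custom file. Setting `DRY_RUN=0` executes the commands instead of printing them.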