https://github.com/liron-li/livaspider

An asynchronous I/O crawler written in Python; a small amount of code is enough to crawl the target pages.
asyncio python-3-6 spider

Last synced: 4 months ago

Host: GitHub
URL: https://github.com/liron-li/livaspider
Owner: liron-li
Created: 2017-05-05T07:26:15.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2018-05-25T07:15:13.000Z (about 8 years ago)
Last Synced: 2025-08-25T08:06:33.379Z (10 months ago)
Topics: asyncio, python-3-6, spider
Language: Python
Homepage:
Size: 10.7 KB
Stars: 6
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

> I recently learned about Python's asyncio, so I used my spare time to write this small crawler.

##### Environment
- Python 3.6

##### Dependencies
- beautifulsoup4==4.5.3
- requests==2.13.0
- alembic==0.9.1
- SQLAlchemy==1.1.9

##### Directory structure
```text
├── alembic                      # alembic directory
│   ├── env.py                   # alembic env configuration file
│   ├── README
│   ├── script.py.mako           # alembic template file
│   └── versions                 # table migration files
│       └── 404fa70bcf2c_create_tables.py
├── alembic.ini                  # alembic configuration file
├── core
│   ├── crawling.py              # crawler base class
│   ├── __init__.py
│   └── models.py                # SQLAlchemy models
├── example_crawl_baike.py       # example: crawling Baidu Baike
└── README.md
```

##### How to use

- Database configuration
Edit `sqlalchemy.url` in the `alembic.ini` file:
```ini
sqlalchemy.url = driver://user:pass@localhost/dbname
```
- Generate the table migration file
```shell
alembic revision --autogenerate -m "your desc"
```

- Run the migration
```shell
alembic upgrade head
```
- Crawler configuration
```python
config = {
    # request headers
    "headers": headers,
    # cookies
    "cookies": cookies,
    # root URL
    "base_url": "http://baike.baidu.com/",
    # start URL
    "start_url": "http://baike.baidu.com/item/%E9%93%81%E6%A0%91/110475",
    # regex that URLs must match to be crawled
    "url_rule": r'^http://baike.baidu.com/item/',
}
```
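When the crawler runs, it collects every URL on the `start_url` page that matches the `url_rule` regex and stores it in the database as a crawl target.
The crawler stops once every record in the `url_pool` table has been crawled.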
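
The crawl loop described above can be sketched roughly as follows. This is only an illustration, not the code in `core/crawling.py`: it uses the standard library's `html.parser` in place of BeautifulSoup, an in-memory queue in place of the `url_pool` table, and the names `LinkCollector`, `extract_targets`, `crawl`, and the `fetch` callback are all hypothetical.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_targets(html, base_url, url_rule):
    """Return the absolute URLs in html that match url_rule, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    pattern = re.compile(url_rule)
    targets = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if pattern.match(absolute) and absolute not in targets:
            targets.append(absolute)
    return targets


def crawl(start_url, base_url, url_rule, fetch):
    """Crawl until the pool is empty; fetch(url) must return the page HTML."""
    pool = deque([start_url])  # in-memory stand-in for the url_pool table
    seen = {start_url}
    visited = []
    while pool:
        url = pool.popleft()
        visited.append(url)
        for target in extract_targets(fetch(url), base_url, url_rule):
            if target not in seen:  # never queue the same URL twice
                seen.add(target)
                pool.append(target)
    return visited
```

Passing `fetch` in as a callback keeps the sketch independent of any HTTP library; a real run would presumably wrap `requests.get` with the `headers` and `cookies` from the config.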

- Run the crawler
```shell
python example_crawl_baike.py
```
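
The README does not show how the crawler combines asyncio with the blocking `requests` library. One common pattern, sketched here purely as an assumption about how such a crawler might work, is to hand blocking calls to the event loop's thread pool with `run_in_executor`. The names `blocking_fetch`, `fetch_all`, and `run` are hypothetical, and the fetch function only simulates a network call.

```python
import asyncio
import time


def blocking_fetch(url):
    """Hypothetical stand-in for a blocking HTTP call such as requests.get."""
    time.sleep(0.05)  # simulate network latency
    return "<html>{}</html>".format(url)


async def fetch_all(urls, fetch=blocking_fetch):
    """Fetch all urls concurrently and return a {url: body} mapping."""
    loop = asyncio.get_event_loop()
    # Passing None selects the loop's default thread pool executor.
    futures = [loop.run_in_executor(None, fetch, url) for url in urls]
    bodies = await asyncio.gather(*futures)  # results keep the input order
    return dict(zip(urls, bodies))


def run(urls):
    """Drive fetch_all on a fresh event loop (style that also works on Python 3.6)."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(fetch_all(urls))
    finally:
        loop.close()
```

Because the fetches run in worker threads, several pages are downloaded at the same time even though each individual call blocks.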
