https://github.com/liron-li/livaspider
An asynchronous I/O crawler written in Python. With only a small amount of code you can easily crawl target pages.
asyncio python-3-6 spider
Last synced: 4 months ago
- Host: GitHub
- URL: https://github.com/liron-li/livaspider
- Owner: liron-li
- Created: 2017-05-05T07:26:15.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2018-05-25T07:15:13.000Z (about 8 years ago)
- Last Synced: 2025-08-25T08:06:33.379Z (10 months ago)
- Topics: asyncio, python-3-6, spider
- Language: Python
- Homepage:
- Size: 10.7 KB
- Stars: 6
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
> I recently learned about Python's asyncio, so I used my spare time to write this small crawler.
##### Environment
- Python 3.6
##### Dependencies
- beautifulsoup4==4.5.3
- requests==2.13.0
- alembic==0.9.1
- SQLAlchemy==1.1.9
##### Directory structure
```text
├── alembic                     # alembic directory
│   ├── env.py                  # alembic env configuration file
│   ├── README
│   ├── script.py.mako          # alembic template file
│   └── versions                # table migration files
│       └── 404fa70bcf2c_create_tables.py
├── alembic.ini                 # alembic configuration file
├── core
│   ├── crawling.py             # crawler base class
│   ├── __init__.py
│   └── models.py               # SQLAlchemy models
├── example_crawl_baike.py      # example: crawl Baidu Baike
└── README.md
```
##### Usage
- Database configuration
Edit `sqlalchemy.url` in the `alembic.ini` file:
```ini
sqlalchemy.url = driver://user:pass@localhost/dbname
```
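To make the parts of that URL concrete, here is a minimal sketch that splits a database URL into its components using only the standard library. `describe_db_url` is a hypothetical helper for illustration and is not part of this project, and `mysql+pymysql` and `sqlite` are example drivers chosen as assumptions; substitute whichever database you actually use.

```python
from urllib.parse import urlsplit


def describe_db_url(url):
    """Split a driver://user:pass@host/dbname URL into named parts.

    Hypothetical helper for illustration; SQLAlchemy does its own parsing.
    """
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,             # e.g. "mysql+pymysql" (assumed driver)
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "dbname": parts.path.lstrip("/"),   # path comes back as "/dbname"
    }


info = describe_db_url("mysql+pymysql://user:pass@localhost/dbname")
sqlite_info = describe_db_url("sqlite:///livaspider.db")
```

For a file-based SQLite URL such as `sqlite:///livaspider.db` (an example file name), the user, password, and host fields come back as `None` and only the database name is filled in.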
- Generate the table migration file
```shell
alembic revision --autogenerate -m "your desc"
```
- Run the migration
```shell
alembic upgrade head
```
- Crawler configuration
```python
# `headers` and `cookies` must be defined earlier in your script.
config = {
    # Request headers
    "headers": headers,
    # Cookies
    "cookies": cookies,
    # Base URL
    "base_url": "http://baike.baidu.com/",
    # Start URL
    "start_url": "http://baike.baidu.com/item/%E9%93%81%E6%A0%91/110475",
    # Regex that a URL must match to be crawled
    "url_rule": r'^http://baike.baidu.com/item/',
}
```
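When the crawler runs, it collects every URL in `start_url` that matches the `url_rule` regex and stores them in the database as crawl targets.
It keeps going until every record in the `url_pool` table has been crawled, then exits.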
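The loop described above can be sketched roughly as follows. This is not the code in `core/crawling.py`; it is a self-contained illustration built on several assumptions: an in-memory dict stands in for the `url_pool` table, `FAKE_PAGES` and `fetch` replace real HTTP requests, links found on later pages are also added to the pool, and `LinkParser`, `extract_targets`, and `crawl` are hypothetical names. It uses `asyncio.run`, which needs Python 3.7+; on Python 3.6 use `loop.run_until_complete` instead.

```python
import asyncio
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_PAGES = {
    "http://baike.baidu.com/item/a": '<a href="/item/b">b</a> <a href="/view/x">x</a>',
    "http://baike.baidu.com/item/b": '<a href="/item/a">a</a> <a href="http://example.com/">ext</a>',
}

config = {
    "base_url": "http://baike.baidu.com/",
    "start_url": "http://baike.baidu.com/item/a",
    "url_rule": r"^http://baike.baidu.com/item/",
}


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def extract_targets(html, config):
    """Return absolute URLs found in `html` that match `url_rule`."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urljoin(config["base_url"], href) for href in parser.hrefs)
    return [url for url in absolute if re.match(config["url_rule"], url)]


async def fetch(url):
    """Stand-in for a real HTTP request (assumption: returns the page HTML)."""
    await asyncio.sleep(0)
    return FAKE_PAGES.get(url, "")


async def crawl(config):
    # Maps URL -> "already crawled" flag, standing in for the `url_pool` table.
    url_pool = {config["start_url"]: False}
    while True:
        pending = [url for url, done in url_pool.items() if not done]
        if not pending:
            break  # every record has been crawled, so the crawler ends
        pages = await asyncio.gather(*(fetch(url) for url in pending))
        for url, html in zip(pending, pages):
            url_pool[url] = True
            for target in extract_targets(html, config):
                url_pool.setdefault(target, False)  # keep existing flags
    return sorted(url_pool)


crawled = asyncio.run(crawl(config))
```

Here `/view/x` and the external link are dropped because they do not match `url_rule`, so only the two `/item/` pages end up in the pool.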
- Run the crawler
```shell
python example_crawl_baike.py
```