Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/elliotgao2/gain
Web crawling framework based on asyncio.
aiohttp asyncio crawler python spider uvloop
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/elliotgao2/gain
- Owner: elliotgao2
- License: gpl-3.0
- Created: 2017-05-31T08:56:04.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-06-01T22:54:09.000Z (over 5 years ago)
- Last Synced: 2024-12-07T08:32:26.281Z (5 days ago)
- Topics: aiohttp, asyncio, crawler, python, spider, uvloop
- Language: Python
- Homepage:
- Size: 214 KB
- Stars: 2,036
- Watchers: 75
- Forks: 208
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-hacking-lists - elliotgao2/gain - Web crawling framework based on asyncio. (Python)
README
# gain
[![Build](https://travis-ci.org/gaojiuli/gain.svg?branch=master)](https://travis-ci.org/gaojiuli/gain)
[![Python](https://img.shields.io/pypi/pyversions/gain.svg)](https://pypi.python.org/pypi/gain/)
[![Version](https://img.shields.io/pypi/v/gain.svg)](https://pypi.python.org/pypi/gain/)
[![License](https://img.shields.io/pypi/l/gain.svg)](https://pypi.python.org/pypi/gain/)

Web crawling framework for everyone. Written with `asyncio`, `uvloop` and `aiohttp`.
![](img/architecture.png)
## Requirements
- Python 3.5+
## Installation
`pip install gain`
`pip install uvloop` (Linux only)
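Because `uvloop` is an optional install, code that builds on the same stack may want to fall back to the default event loop when it is missing. The following is a minimal sketch of that pattern; the helper name `enable_uvloop_if_available` is hypothetical, and whether gain performs an equivalent check internally is not stated here.

```python
import asyncio
import sys


def enable_uvloop_if_available() -> bool:
    """Switch asyncio to uvloop when possible; return True if it was enabled."""
    if sys.platform.startswith("win"):
        # uvloop does not support Windows, so keep the default loop there.
        return False
    try:
        import uvloop  # optional dependency, may not be installed
    except ImportError:
        return False
    # Make every event loop created from now on a uvloop loop.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    return True


if __name__ == "__main__":
    print("uvloop enabled:", enable_uvloop_if_available())
```

Call the helper once, before any event loop is created, so the policy applies to the loop the crawler runs on.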
## Usage
1. Write spider.py:
```python
from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    # CSS selectors for the fields to extract from a post page.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a local file.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep the regex escapes (\d, \-) intact.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
```

Or use XPathParser:
```python
from gain import Css, Item, Parser, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
    ]
    proxy = 'https://localhost:1234'


MySpider.run()
```
You can add a proxy setting to the spider, as shown above.

2. Run `python spider.py`
3. Result:
![](img/sample.png)
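The strings passed to `Parser` in the first spider are regular expressions over URLs: one for listing pages and one for individual posts. The sketch below uses only the standard `re` module to show which URLs each pattern accepts. The `classify` helper and the sample URLs are hypothetical, and the use of `fullmatch` is an assumption; how gain applies the patterns internally is not shown in this README.

```python
import re

# Patterns copied from the spider above, written as raw strings.
PAGE_PATTERN = re.compile(r'https://blog.scrapinghub.com/page/\d+/')
POST_PATTERN = re.compile(
    r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/'
)


def classify(url: str) -> str:
    """Label a URL as 'post', 'page', or 'other' (labels are illustrative)."""
    if POST_PATTERN.fullmatch(url):
        return 'post'
    if PAGE_PATTERN.fullmatch(url):
        return 'page'
    return 'other'


if __name__ == "__main__":
    # Sample URLs invented for illustration only.
    for url in (
        'https://blog.scrapinghub.com/page/2/',
        'https://blog.scrapinghub.com/2017/05/31/some-post/',
        'https://blog.scrapinghub.com/about/',
    ):
        print(url, '->', classify(url))
```

Checking patterns in isolation like this can help catch a regex that is too loose or too strict before starting a crawl.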
## Example
The examples are in the `/example/` directory.
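Both spiders in the Usage section set `concurrency = 5`, which suggests a cap on simultaneous requests. A common way to get that behaviour in `asyncio` code is a semaphore. The sketch below illustrates the general technique only and is not taken from gain's source; `fetch_all` is a hypothetical name, the fetch is simulated with `asyncio.sleep`, and `asyncio.run` needs Python 3.7 or later.

```python
import asyncio


async def fetch_all(urls, concurrency=5):
    """Run a stand-in fetch for every URL, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0  # number of fetches currently holding the semaphore
    peak = 0       # highest value in_flight reached

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for a real HTTP request
            in_flight -= 1
            return url

    results = await asyncio.gather(*(fetch(u) for u in urls))
    return results, peak


if __name__ == "__main__":
    urls = ['https://example.com/{}'.format(i) for i in range(20)]
    results, peak = asyncio.run(fetch_all(urls))
    print(len(results), 'fetched; peak concurrency', peak)
```

The `peak` counter exists only to make the limit observable; a real crawler would drop it and perform the request inside the `async with` block.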
## Contribution
- Open a pull request.
- Open an issue.