https://github.com/hubertroy/seen

A lightweight crawling/spider framework for everyone (supports JavaScript!).

# Seen

Seen is a lightweight web crawling framework for everyone.
It is written with `asyncio` and uses either `aiohttp` or `requests` for HTTP.

It is useful for writing a web crawler quickly, with **full JavaScript support**.

**Working Process:**
![workingProcess](https://github.com/HuberTRoy/seen/blob/master/img/process.png)

## Requirements:
* Python 3.5+
* aiohttp or requests
* pyquery

## Installation:
```
pip install seen
```

To get JavaScript support, also install `pyppeteer`:
```
pip install pyppeteer
```

## Usage:

1. Write spider.py
```python
from seen import Spider, Parser, Item, Css


class Post(Item):
    title = Css('title')
    img = Css('img', 'src')

    def save(self):
        print(self.result['title'])
        print(self.result['img'])


class MySpider(Spider):
    roots = 'https://www.v2ex.com'
    url_limit = ('www.v2ex.com')
    concurrency = 1
    # To load JavaScript, set use_browser = True.
    # The default is False.
    use_browser = False

    parsers = [Parser(Post)]


if __name__ == '__main__':
    spider = MySpider()
    spider.start()
```

2. Run `python spider.py`.
3. Check the result.
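
The `Post` item above pairs each field name with a CSS selector: `Css('title')` and `Css('img', 'src')`. As a rough illustration of the kind of data such fields would pick out of a page, here is a standard-library sketch. It is not how Seen or `pyquery` work internally, and `TitleAndImages`, `extract`, and `SAMPLE_HTML` are hypothetical names introduced only for this example. This README does not say whether Seen returns a single value or a list for a field, so the list of `src` values below is an assumption.

```python
from html.parser import HTMLParser


class TitleAndImages(HTMLParser):
    """Collect the <title> text and every <img src> value from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'img':
            # attrs is a list of (name, value) pairs.
            src = dict(attrs).get('src')
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract(html):
    """Return a dict shaped like the fields of the hypothetical Post item."""
    parser = TitleAndImages()
    parser.feed(html)
    parser.close()
    return {'title': parser.title.strip(), 'img': parser.images}


# Hypothetical sample page, used instead of a live request.
SAMPLE_HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <img src="/static/a.png">
    <img src="/static/b.png" alt="b">
    <img alt="no source">
  </body>
</html>
"""

result = extract(SAMPLE_HTML)
print(result['title'])
print(result['img'])
```

In Seen itself the selector fields do this work for you. The sketch only shows what "the text of `title`" and "the `src` attribute of `img`" mean for a concrete page.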

## Contribution

* Open a pull request.
* Open an issue.