https://github.com/hubertroy/seen
A lightweight crawling/spider framework for everyone (supports JavaScript!). :sparkles:
easy-to-use javasciprt lightweight-framework python3 spider-framework support-javascript web-crawling
Last synced: 4 months ago
- Host: GitHub
- URL: https://github.com/hubertroy/seen
- Owner: HuberTRoy
- Created: 2017-11-20T03:46:35.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-07-19T05:34:30.000Z (almost 7 years ago)
- Last Synced: 2025-03-20T17:38:11.280Z (4 months ago)
- Topics: easy-to-use, javasciprt, lightweight-framework, python3, spider-framework, support-javascript, web-crawling
- Language: Python
- Homepage:
- Size: 82 KB
- Stars: 13
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Seen
Seen is a lightweight web crawling framework for everyone.
It is written with `asyncio` and `aiohttp`/`requests`, and is useful for writing a web crawler quickly with **full JavaScript support**.
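To give a feel for what an `asyncio`-based crawler does, here is a minimal, standard-library-only sketch of a concurrency-limited crawl loop. It is not Seen's actual implementation: `FAKE_SITE`, `fetch_links`, and `crawl` are hypothetical names invented for this illustration, the "pages" are an in-memory dictionary rather than real HTTP responses, and `asyncio.run` needs Python 3.7 or later.

```python
import asyncio
from urllib.parse import urlparse

# Hypothetical in-memory "site": URL -> links found on that page.
# A real crawler would fetch pages over HTTP (for example with aiohttp).
FAKE_SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.test/x",
    ],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/"],
}


async def fetch_links(url):
    """Stand-in for an HTTP request plus link extraction."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return FAKE_SITE.get(url, [])


async def crawl(root, allowed_host, concurrency=2):
    """Visit every reachable URL on allowed_host, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    visited = {root}

    async def visit(url):
        async with semaphore:  # cap the number of in-flight fetches
            links = await fetch_links(url)
        new_urls = []
        for link in links:
            # Stay on the allowed host and skip pages already scheduled.
            if urlparse(link).netloc == allowed_host and link not in visited:
                visited.add(link)
                new_urls.append(link)
        # Follow newly discovered pages concurrently.
        await asyncio.gather(*(visit(u) for u in new_urls))

    await visit(root)
    return sorted(visited)


if __name__ == "__main__":
    print(asyncio.run(crawl("https://example.com/", "example.com")))
```

The semaphore is held only while a page is being fetched, so a page waiting on its children never blocks other fetches.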
**Working Process:**
## Requirements:
* Python 3.5+
* aiohttp or requests
* pyquery

## Installation:
```
pip install seen
```

Get JavaScript support!
```
pip install pyppeteer
```

## Usage:
1. Write spider.py
```python
from seen import Spider, Parser, Item, Css


class Post(Item):
    title = Css('title')
    img = Css('img', 'src')

    def save(self):
        print(self.result['title'])
        print(self.result['img'])


class MySpider(Spider):
    roots = 'https://www.v2ex.com'
    url_limit = ('www.v2ex.com')
    concurrency = 1
    # if you want to load JavaScript, set use_browser = True
    # by default is False.
    use_browser = False

    parsers = [Parser(Post)]


if __name__ == '__main__':
    spider = MySpider()
    spider.start()
```

2. Run `python spider.py`.
3. Check result.

## Contribution
* Pull request.
* Open an issue.