Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/weaming/simple-crawler
my simple crawler
- Host: GitHub
- URL: https://github.com/weaming/simple-crawler
- Owner: weaming
- Created: 2019-04-12T03:07:57.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T05:55:06.000Z (almost 2 years ago)
- Last Synced: 2024-11-11T01:51:47.225Z (6 days ago)
- Topics: crawler
- Language: Python
- Homepage:
- Size: 31.3 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
## Install
`pip3 install simple-crawler`
Set the environment variable `AUTO_CHARSET=1` to pass raw `bytes` to beautifulsoup4 and let it detect the charset.
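For intuition, here is a stdlib-only sketch of the fallback idea behind charset detection (the library itself delegates this to beautifulsoup4; the candidate charsets below are illustrative):

```
# Sketch of why handing bytes to the parser helps: raw bytes can be sniffed
# against candidate charsets, while a pre-decoded str may already be mangled.
raw = "<p>héllo</p>".encode("latin-1")

text = None
for enc in ("utf-8", "windows-1252"):  # candidate charsets, utf-8 first
    try:
        text = raw.decode(enc)
        break
    except UnicodeDecodeError:
        continue

print(text)  # <p>héllo</p>
```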
## Classes
* `URL`: defines a URL
* `URLExt`: helper class for handling a `URL`
* `Page`: the result of requesting a `URL`
    * `url`: type `URL`
    * `content`, `text`, `json`: response content properties from the `requests` library
    * `type`: the response body type, an enum with values `BYTES`, `TEXT`, `HTML`, `JSON`
    * `is_html`: whether the response is HTML, according to the response headers' `Content-Type`
    * `soup`: a `BeautifulSoup` of the body if `is_html`
* `Crawler`: schedules the crawl by calling `handler_page()` recursively

## Example
```
from simple_crawler import *

class MyCrawler(Crawler):
    name = 'output.txt'

    async def custom_handle_page(self, page):
        print(page.url)
        tags = page.soup.select("#container")
        tag = tags and tags[0]
        with open(self.name, 'a') as f:
            f.write(tag.text)
        # do some async call

    def filter_url(self, url: URL) -> bool:
        return url.url.startswith("https://xxx.com/xxx")

loop = get_event_loop(True)
c = MyCrawler("https://xxx.com/xxx", loop, concurrency=10)
schedule_future_in_loop(c.start(), loop=loop)
```

## TODO
* [x] Speed up using async or threading
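As a rough illustration of the `Page` behavior described under Classes, here is a stdlib-only sketch; the names mirror the README's description, but none of this is the library's actual code:

```
import json as _json
from enum import Enum

# Illustrative sketch only: mirrors the Page described above, not the
# library's real implementation.
class Type(Enum):
    BYTES = "bytes"
    TEXT = "text"
    HTML = "html"
    JSON = "json"

class Page:
    def __init__(self, url: str, content: bytes, headers: dict):
        self.url = url
        self.content = content  # raw body, like requests' .content
        self.headers = headers

    @property
    def is_html(self) -> bool:
        # Decided from the response headers' Content-Type, as the README says
        return "text/html" in self.headers.get("Content-Type", "")

    @property
    def text(self) -> str:
        return self.content.decode("utf-8", errors="replace")

    @property
    def json(self):
        return _json.loads(self.text)

    @property
    def type(self) -> Type:
        ct = self.headers.get("Content-Type", "")
        if "text/html" in ct:
            return Type.HTML
        if "application/json" in ct:
            return Type.JSON
        if ct.startswith("text/"):
            return Type.TEXT
        return Type.BYTES

page = Page("https://example.com", b"<p>hi</p>", {"Content-Type": "text/html"})
print(page.is_html, page.type)  # True Type.HTML
```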