Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/weaming/simple-crawler
my simple crawler
- Host: GitHub
- URL: https://github.com/weaming/simple-crawler
- Owner: weaming
- Created: 2019-04-12T03:07:57.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T05:55:06.000Z (almost 2 years ago)
- Last Synced: 2024-11-11T01:51:47.225Z (6 days ago)
- Topics: crawler
- Language: Python
- Homepage:
- Size: 31.3 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
## Install
`pip3 install simple-crawler`
Set the environment variable `AUTO_CHARSET=1` to pass raw `bytes` to beautifulsoup4 and let it detect the charset.
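For intuition, here is a stdlib-only sketch of the fallback idea behind charset detection (the library itself delegates this to beautifulsoup4; the candidate charsets below are illustrative):

```
# Sketch of why handing bytes to the parser helps: raw bytes can be sniffed
# against candidate charsets, while a pre-decoded str may already be mangled.
raw = "<p>héllo</p>".encode("latin-1")

text = None
for enc in ("utf-8", "windows-1252"):  # candidate charsets, utf-8 first
    try:
        text = raw.decode(enc)
        break
    except UnicodeDecodeError:
        continue

print(text)  # <p>héllo</p>
```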
## Classes
* `URL`: defines a URL
* `URLExt`: helper class for handling a `URL`
* `Page`: the result of requesting a `URL`
    * `url`: type `URL`
    * `content`, `text`, `json`: response content properties from the `requests` library
    * `type`: the response body type, an enum with values `BYTES`, `TEXT`, `HTML`, `JSON`
    * `is_html`: whether the response is HTML, according to the response headers' `Content-Type`
    * `soup`: a `BeautifulSoup` of the body if `is_html`
* `Crawler`: schedules the crawl by calling `handler_page()` recursively

## Example
```
from simple_crawler import *

class MyCrawler(Crawler):
    name = 'output.txt'

    async def custom_handle_page(self, page):
        print(page.url)
        tags = page.soup.select("#container")
        tag = tags and tags[0]
        with open(self.name, 'a') as f:
            f.write(tag.text)
        # do some async call

    def filter_url(self, url: URL) -> bool:
        return url.url.startswith("https://xxx.com/xxx")

loop = get_event_loop(True)
c = MyCrawler("https://xxx.com/xxx", loop, concurrency=10)
schedule_future_in_loop(c.start(), loop=loop)
```

## TODO
* [x] Speed up using async or threading
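As a rough illustration of the `Page` behavior described under Classes, here is a stdlib-only sketch; the names mirror the README's description, but none of this is the library's actual code:

```
import json as _json
from enum import Enum

# Illustrative sketch only: mirrors the Page described above, not the
# library's real implementation.
class Type(Enum):
    BYTES = "bytes"
    TEXT = "text"
    HTML = "html"
    JSON = "json"

class Page:
    def __init__(self, url: str, content: bytes, headers: dict):
        self.url = url
        self.content = content  # raw body, like requests' .content
        self.headers = headers

    @property
    def is_html(self) -> bool:
        # Decided from the response headers' Content-Type, as the README says
        return "text/html" in self.headers.get("Content-Type", "")

    @property
    def text(self) -> str:
        return self.content.decode("utf-8", errors="replace")

    @property
    def json(self):
        return _json.loads(self.text)

    @property
    def type(self) -> Type:
        ct = self.headers.get("Content-Type", "")
        if "text/html" in ct:
            return Type.HTML
        if "application/json" in ct:
            return Type.JSON
        if ct.startswith("text/"):
            return Type.TEXT
        return Type.BYTES

page = Page("https://example.com", b"<p>hi</p>", {"Content-Type": "text/html"})
print(page.is_html, page.type)  # True Type.HTML
```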