https://github.com/chipscoco/oceanmonkey
OceanMonkey is a high-level distributed web crawling and web scraping framework based on multiple processes and coroutines. It is used to crawl websites and extract structured data from their pages, like the classic Scrapy framework.
coroutines crawler multiprocessing python python3 scraper scraping spider
- Host: GitHub
- URL: https://github.com/chipscoco/oceanmonkey
- Owner: chipscoco
- License: apache-2.0
- Created: 2021-11-29T07:04:14.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-03-27T23:37:02.000Z (about 4 years ago)
- Last Synced: 2025-09-07T21:06:59.254Z (10 months ago)
- Topics: coroutines, crawler, multiprocessing, python, python3, scraper, scraping, spider
- Language: Python
- Homepage:
- Size: 86.9 KB
- Stars: 7
- Watchers: 1
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Overview
========
OceanMonkey is a high-level distributed web crawling and web scraping framework based on multiple processes and coroutines.
It is used to crawl websites and extract structured data from their pages, like the classic Scrapy framework.
## Installation guide
### Supported OS
OceanMonkey works on Linux, Windows and macOS.
### Supported Python versions
OceanMonkey requires Python 3.5 or later, using the CPython implementation.
### Installing
If you're already familiar with installing Python packages, you can install OceanMonkey and its dependencies from PyPI with:
pip install oceanmonkey
You can also install OceanMonkey by downloading the project's source code and running its setup.py:
python setup.py install
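To confirm the requirements above before or after installing, a small check like the following may help. It is only a sketch: the helper names `python_is_supported` and `package_is_installed` are hypothetical and not part of OceanMonkey; the module name `oceanmonkey` is taken from the import lines in the sample code.

```python
import importlib.util
import sys

# The README states Python 3.5+ as the minimum supported version.
MIN_PYTHON = (3, 5)


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter meets the documented minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def package_is_installed(name="oceanmonkey"):
    """Return True if `name` can be found on the import path (it is not imported)."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("oceanmonkey installed:", package_is_installed())
```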
## Quick Start
### Create a Monkey Project
Use the monkeys command to create an OceanMonkey project:
monkeys startproject BeBe
or:
monkeys startproject D:\BeBe
### Write the scraping logic
When you execute the startproject command, it generates two Python script files under the monkeys directory,
namely **gibbons.py** and **orangutans.py**. Write your scraping logic in **gibbons.py**.
### Write the store logic
Write the logic for cleaning and storing the items extracted from the page source in **orangutans.py**.
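The kind of cleaning and storing logic that belongs in this step can be sketched in plain Python. The functions below (`clean_item`, `to_json_line`, `store_item`) and the JSON Lines output format are assumptions made for illustration; they do not show OceanMonkey's own orangutan API, which is described in the project docs.

```python
import json


def clean_item(item):
    """Strip whitespace from string fields and drop fields that are None or empty."""
    cleaned = {}
    for key, value in item.items():
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            continue
        cleaned[key] = value
    return cleaned


def to_json_line(item):
    """Serialize one item as a single JSON Lines record."""
    return json.dumps(item, ensure_ascii=False) + "\n"


def store_item(item, path):
    """Append one item to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(to_json_line(item))
```

For example, `store_item(clean_item(item), "items.jsonl")` would append a cleaned item to a local file; a database write could replace `store_item` without changing the cleaning step.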
### Run the project
To run the project, execute the run command from the project's directory:
cd BeBe
monkeys run
## Sample code
```
from oceanmonkey import Gibbon
from oceanmonkey import Request
from oceanmonkey import Signal, SignalValue


class WuKong(Gibbon):
    handle_httpstatus_list = [404, 500]
    allowed_domains = ['www.chipscoco.com']
    start_id = 9

    def parse(self, response):
        if response.status_code in self.handle_httpstatus_list or response.repeated:
            # Error status or repeated response: move on to the next id.
            self.start_id += 1
            next_url = "http://www.chipscoco.com/?id={}".format(self.start_id)
            yield Request(url=next_url, callback=self.parse)
        else:
            # Extract the fields of interest and emit them as an item.
            item = {}
            item['author'] = response.xpath('//span[@class="mr20"]/text()').extract_first()
            item['title'] = response.xpath('//h1[@class="f-22 mb15"]/text()').extract_first()
            yield item

            # Then schedule the page with the next id.
            self.start_id += 1
            next_url = "http://www.chipscoco.com/?id={}".format(self.start_id)
            yield Request(url=next_url, callback=self.parse)

        yield Signal(value=SignalValue.SAY_GOODBYE)
```
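The extraction step in `parse` can be tried without the framework or a network connection. The sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()` step, so the element's `.text` is read instead); it is not OceanMonkey's selector. `SAMPLE_PAGE`, `extract_first_text`, and `parse_page` are hypothetical names, and the markup is an invented fragment that merely reuses the class names from the sample code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; the real site's markup may differ.
SAMPLE_PAGE = """
<html><body>
  <h1 class="f-22 mb15">Hello OceanMonkey</h1>
  <span class="mr20">chipscoco</span>
</body></html>
"""


def extract_first_text(root, path):
    """Return the text of the first element matching `path`, or None if absent."""
    element = root.find(path)
    return element.text if element is not None else None


def parse_page(page_source):
    """Build an item dict with the same keys as the sample code's item."""
    root = ET.fromstring(page_source.strip())
    return {
        "author": extract_first_text(root, './/span[@class="mr20"]'),
        "title": extract_first_text(root, './/h1[@class="f-22 mb15"]'),
    }


if __name__ == "__main__":
    print(parse_page(SAMPLE_PAGE))
```

ElementTree requires well-formed XML, so this approach only suits tidy test fixtures; real-world HTML generally needs a tolerant parser.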
For detailed usage of OceanMonkey, see [https://github.com/chipscoco/OceanMonkey/tree/main/docs](https://github.com/chipscoco/OceanMonkey/tree/main/docs).
## Contact
|Author | Email | Wechat |
| ---------------|:----------------:| -----------:|
| chenzhengqiang | chenzhengqiang@chipscoco.com | Pretty-Style |
**Notice: Any comments and suggestions are welcome.**