Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cocrawler/cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
- Host: GitHub
- URL: https://github.com/cocrawler/cocrawler
- Owner: cocrawler
- License: apache-2.0
- Created: 2016-07-15T23:46:31.000Z (over 8 years ago)
- Default Branch: main
- Last Pushed: 2022-04-29T15:49:14.000Z (over 2 years ago)
- Last Synced: 2024-07-12T08:35:18.780Z (6 months ago)
- Topics: aiohttp, aiohttp-client, async-python, concurrency, crawler, pluggable-modules, python3, screenshot, warc
- Language: Python
- Homepage:
- Size: 911 KB
- Stars: 178
- Watchers: 20
- Forks: 25
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
Awesome Lists containing this project
- awesome-crawler - CoCrawler - A versatile web crawler built using modern tools and concurrency. (Python)
README
# CoCrawler
[![Build Status](https://github.com/cocrawler/cocrawler/actions/workflows/test-all.yml/badge.svg)](https://github.com/cocrawler/cocrawler/actions/workflows/test-all.yml) [![Coverage Status](https://coveralls.io/repos/github/cocrawler/cocrawler/badge.svg?branch=main)](https://coveralls.io/github/cocrawler/cocrawler?branch=main) [![Apache License 2.0](https://img.shields.io/github/license/cocrawler/cocrawler.svg)](LICENSE)
CoCrawler is a versatile web crawler built using modern tools and
concurrency.

Crawling the web can be easy or hard, depending upon the details.
Mature crawlers like Nutch and Heritrix work great in many situations,
and fall short in others. Some of the most demanding crawl situations
include open-ended crawling of the whole web.

The object of this project is to create a modular crawler with
pluggable modules, capable of working well for a large variety of
crawl tasks. The core of the crawler is written in Python 3.7+ using
coroutines.
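The snippet below is a minimal sketch of that coroutine-based fetch pattern using `asyncio` and `aiohttp` (one of the project's listed topics). It is illustrative only and does not reproduce CoCrawler's internal API; the seed URLs and worker count are placeholders.

```python
# Illustrative only: a minimal coroutine-based fetch loop with asyncio and
# aiohttp. This is NOT CoCrawler's internal API; seed URLs and the worker
# count are placeholders.
import asyncio

import aiohttp

SEEDS = ["https://example.com/", "https://example.org/"]  # placeholder seeds
NUM_WORKERS = 10  # placeholder concurrency limit


async def fetch(session: aiohttp.ClientSession, url: str):
    """Fetch one URL and return (url, HTTP status, body size in bytes)."""
    async with session.get(url) as resp:
        body = await resp.read()
        return url, resp.status, len(body)


async def worker(session, queue, results):
    """Pull URLs off the shared queue until the crawl is finished."""
    while True:
        url = await queue.get()
        try:
            results.append(await fetch(session, url))
        except aiohttp.ClientError as exc:
            results.append((url, None, str(exc)))
        finally:
            queue.task_done()


async def crawl(seeds):
    queue: asyncio.Queue = asyncio.Queue()
    for url in seeds:
        queue.put_nowait(url)
    results = []
    async with aiohttp.ClientSession() as session:
        workers = [
            asyncio.create_task(worker(session, queue, results))
            for _ in range(NUM_WORKERS)
        ]
        await queue.join()  # wait until every queued URL has been processed
        for w in workers:   # the workers loop forever, so cancel them
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    for url, status, info in asyncio.run(crawl(SEEDS)):
        print(status, url, info)
```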
## Status

CoCrawler is pre-release, with major restructuring going on. It is
currently able to crawl at around 170 megabits / 170 pages/sec on a
4-core machine.

Screenshot: ![Screenshot](https://cloud.githubusercontent.com/assets/2142266/19621581/92e83044-9849-11e6-825d-66b674cc59f0.png "Screenshot")
## Installing
We recommend that you use pyenv / virtualenv to separate the Python
executables and packages used by cocrawler from everything else.

You can install cocrawler from PyPI using `pip install cocrawler`.
For a fresher version, clone the repo and install it like this:
```
git clone https://github.com/cocrawler/cocrawler.git
cd cocrawler
pip install . .[test]
make pytest
make test_coverage
```

The CI for this repo uses the latest versions of everything. To see
exactly what worked last, click on the "Build Status" link above.
Alternately, you can look at `requirements.txt` for a test combination
that I probably ran before checking in.
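After either install route, a quick way to confirm the package is visible to your interpreter is to import it. This is a sketch that assumes the importable package name matches the PyPI distribution name `cocrawler`.

```python
# Post-install smoke test (a sketch; assumes the importable package name
# matches the PyPI distribution name "cocrawler").
import cocrawler

print("cocrawler imported from", cocrawler.__file__)
```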
## Credits

CoCrawler draws on ideas from the Python 3.4 code in "500 Lines or
Less", which can be found at https://github.com/aosabook/500lines.
It is also heavily influenced by the experiences that Greg acquired
while working at blekko and the Internet Archive.

## License
Apache 2.0