{"id":13464667,"url":"https://github.com/cocrawler/cocrawler","last_synced_at":"2025-12-14T03:04:53.767Z","repository":{"id":9932350,"uuid":"63456910","full_name":"cocrawler/cocrawler","owner":"cocrawler","description":"CoCrawler is a versatile web crawler built using modern tools and concurrency.","archived":false,"fork":false,"pushed_at":"2022-04-29T15:49:14.000Z","size":933,"stargazers_count":190,"open_issues_count":0,"forks_count":25,"subscribers_count":19,"default_branch":"main","last_synced_at":"2025-03-16T09:09:14.942Z","etag":null,"topics":["aiohttp","aiohttp-client","async-python","concurrency","crawler","pluggable-modules","python3","screenshot","warc"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cocrawler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-07-15T23:46:31.000Z","updated_at":"2025-03-03T04:17:07.000Z","dependencies_parsed_at":"2022-08-08T03:00:06.106Z","dependency_job_id":null,"html_url":"https://github.com/cocrawler/cocrawler","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocrawler%2Fcocrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocrawler%2Fcocrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocrawler%2Fcocrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocrawler%2Fcocrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cocrawler","download_url":"http
s://codeload.github.com/cocrawler/cocrawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245454098,"owners_count":20617976,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aiohttp","aiohttp-client","async-python","concurrency","crawler","pluggable-modules","python3","screenshot","warc"],"created_at":"2024-07-31T14:00:48.331Z","updated_at":"2025-12-14T03:04:48.715Z","avatar_url":"https://github.com/cocrawler.png","language":"Python","readme":"# CoCrawler\n\n[![Build Status](https://github.com/cocrawler/cocrawler/actions/workflows/test-all.yml/badge.svg)](https://github.com/cocrawler/cocrawler/actions/workflows/test-all.yml) [![Coverage Status](https://coveralls.io/repos/github/cocrawler/cocrawler/badge.svg?branch=main)](https://coveralls.io/github/cocrawler/cocrawler?branch=main) [![Apache License 2.0](https://img.shields.io/github/license/cocrawler/cocrawler.svg)](LICENSE)\n\nCoCrawler is a versatile web crawler built using modern tools and\nconcurrency.\n\nCrawling the web can be easy or hard, depending upon the details.\nMature crawlers like Nutch and Heritrix work great in many situations,\nand fall short in others. Some of the most demanding crawl situations\ninclude open-ended crawling of the whole web.\n\nThe object of this project is to create a modular crawler with\npluggable modules, capable of working well for a large variety of\ncrawl tasks. The core of the crawler is written in Python 3.7+ using\ncoroutines.\n\n## Status\n\nCoCrawler is pre-release, with major restructuring going on. 
It is\ncurrently able to crawl at around 170 megabits/sec (170 pages/sec) on a\n4-core machine.\n\nScreenshot: ![Screenshot](https://cloud.githubusercontent.com/assets/2142266/19621581/92e83044-9849-11e6-825d-66b674cc59f0.png \"Screenshot\")\n\n## Installing\n\nWe recommend that you use pyenv / virtualenv to separate the Python\nexecutables and packages used by cocrawler from everything else.\n\nYou can install cocrawler from PyPI using \"pip install cocrawler\".\n\nFor a fresher version, clone the repo and install like this:\n\n```\ngit clone https://github.com/cocrawler/cocrawler.git\ncd cocrawler\npip install . .[test]\nmake pytest\nmake test_coverage\n```\n\nThe CI for this repo uses the latest versions of everything.  To see\nexactly what worked last, click on the \"Build Status\" link above.\nAlternatively, you can look at `requirements.txt` for a test combination\nthat I probably ran before checking in.\n\n## Credits\n\nCoCrawler draws on ideas from the Python 3.4 code in \"500 Lines or\nLess\", which can be found at https://github.com/aosabook/500lines.\nIt is also heavily influenced by the experience that Greg acquired\nwhile working at blekko and the Internet Archive.\n\n## License\n\nApache 2.0\n","funding_links":[],"categories":["All","Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcocrawler%2Fcocrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcocrawler%2Fcocrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcocrawler%2Fcocrawler/lists"}