{"id":15570903,"url":"https://github.com/nicolaslm/crawler","last_synced_at":"2026-04-27T02:33:13.113Z","repository":{"id":82950410,"uuid":"50541177","full_name":"NicolasLM/crawler","owner":"NicolasLM","description":"Crawl the web in Python for fun","archived":false,"fork":false,"pushed_at":"2016-01-31T17:38:18.000Z","size":8,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-19T06:43:04.366Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NicolasLM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-01-27T22:04:42.000Z","updated_at":"2016-01-27T22:58:49.000Z","dependencies_parsed_at":"2023-04-23T20:27:14.251Z","dependency_job_id":null,"html_url":"https://github.com/NicolasLM/crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NicolasLM%2Fcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NicolasLM%2Fcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NicolasLM%2Fcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NicolasLM%2Fcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NicolasLM","download_url":"https://codeload.github.com/NicolasLM/crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243223425,"owners_count":2025
6504,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-02T17:49:15.678Z","updated_at":"2025-12-26T02:53:33.730Z","avatar_url":"https://github.com/NicolasLM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Crawl the web for fun\n=====================\n\nHave you ever heard statistics like \"half of the websites in the world run\nApache\" or \"the number one hosting company in the US is xxx\"? Have you ever\nwondered how these figures were calculated? Well, I have, and I was a bit\nskeptical, so I decided to write my own crawler in Python to check for\nmyself.\n\nFortunately, Python makes this super easy. Basically, the whole program is:\n\n* fetch the homepage of a domain with requests\n* search for all the links to external domains with Beautiful Soup\n* schedule a Celery job for these domains\n* repeat\n\nThe crawler only checks the homepage of each domain. Why? Because hitting\nevery website in the world once sounds possible, whereas hitting every\npage of every website once would be quite costly. The downside is that it will\nprobably miss a few domains.\n\nGetting information about networks\n----------------------------------\n\nIn order to display useful information, this program needs to fetch data about\nthe network hosting a website. This is usually done with the Maxmind GeoIP\ndatabase. 
However, it is not freely available, so instead the crawler uses two different\ndatabases:\n\n* GeoLite2 Country from Maxmind\n* An ASN database generated from routeviews.org (more on that later)\n\nInstallation\n------------\n\nThis program is written in Python 3. Start by cloning the repository:\n\n    git clone https://github.com/NicolasLM/crawler.git\n    cd crawler\n\nCreate a new virtualenv:\n\n    pyvenv venv\n    source venv/bin/activate\n\nInstall the package and its requirements:\n\n    pip install --editable .\n\nRun Redis, which Celery uses as broker and result backend:\n\n    docker run -d redis\n\nRun RethinkDB, a document store to save data about domains:\n\n    docker run -d rethinkdb rethinkdb --bind all\n\nDownload GeoLite2 Country from http://dev.maxmind.com/geoip/geoip2/geolite2/\n\nDownload and format the ASN database used by pyasn:\n\n    pyasn_util_download.py --latest\n    pyasn_util_convert.py --single rib.2016[...].bz2 ipasn.dat\n\nYou might want to tweak `crawler/conf.py` before initializing RethinkDB:\n\n    crawler rethinkdb\n\nUsage\n-----\n\nPut a single domain in the Celery task list:\n\n    crawler insert www.python.org\n\nRun 10 Celery workers in parallel:\n\n    celery worker -A crawler.crawler.app -c 10 -P threads -Ofair --loglevel INFO\n\nExplore the command line and get statistics:\n\n    $ crawler countries --count 5\n    Top 5 countries\n             France  711\n      United States  698\n              Japan  367\n        Netherlands  175\n            Germany  73\n\n\nLicense\n-------\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicolaslm%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnicolaslm%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicolaslm%2Fcrawler/lists"}