https://github.com/nicolaslm/crawler

Crawl the web in Python for fun

Host: GitHub
URL: https://github.com/nicolaslm/crawler
Owner: NicolasLM
License: mit
Created: 2016-01-27T22:04:42.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2016-01-31T17:38:18.000Z (over 10 years ago)
Last Synced: 2025-01-19T06:43:04.366Z (over 1 year ago)
Language: Python
Size: 7.81 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Crawl the web for fun
=====================

Have you ever heard statistics like "half of the websites in the world run
Apache" or "the number one hosting company in the US is xxx"? Have you ever
wondered how these figures were calculated? Well, I have, and I was a bit
skeptical, so I decided to write my own crawler in Python to check for
myself.

Fortunately Python makes this super easy. Basically the whole program is:

* fetch the homepage of a domain with requests
* search for all the links to external domains with Beautiful Soup
* schedule a Celery job for these domains
* repeat
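
The loop above can be sketched in a few lines. This is a minimal illustration, not the project's code: it uses the standard library's `html.parser` in place of Beautiful Soup and an in-memory queue in place of Celery so that it runs without dependencies. The names `LinkCollector`, `external_domains`, `crawl` and the `fetch_homepage` callback are hypothetical; in the real program the fetch step is an HTTP request made with requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_domains(html, own_domain):
    """Return the set of domains linked from `html`, excluding `own_domain`."""
    parser = LinkCollector()
    parser.feed(html)
    domains = set()
    for href in parser.hrefs:
        host = urlparse(href).hostname  # None for relative links
        if host and host != own_domain:
            domains.add(host)
    return domains


def crawl(seed, fetch_homepage, max_domains=100):
    """Visit each domain's homepage once, breadth-first, starting from `seed`.

    `fetch_homepage(domain)` is a caller-supplied function (an assumption of
    this sketch) that returns the homepage HTML as a string.
    """
    seen = {seed}
    queue = deque([seed])  # stands in for the Celery task queue
    visited = []
    while queue and len(visited) < max_domains:
        domain = queue.popleft()
        visited.append(domain)
        for found in sorted(external_domains(fetch_homepage(domain), domain)):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return visited
```

Passing the fetch step in as a callback keeps the sketch testable offline; a dictionary of canned pages is enough to exercise it.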

The crawler only checks the homepage of each domain. Why? Because hitting
every website in the world once sounds feasible, whereas hitting every page
of every website would be far more costly. The downside is that it will
probably miss domains that are linked only from inner pages.

Getting information about networks
----------------------------------

To display useful information, this program needs to fetch data about the
network hosting a website. This is usually done with the MaxMind GeoIP
database. However, that database is not freely available, so the crawler uses
two other databases instead:

* GeoLite2 Country from MaxMind
* An ASN database generated from routeviews.org (more on that later)
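
To give an idea of what an ASN lookup involves: the database maps IP prefixes to autonomous system numbers, and a lookup returns the most specific prefix containing the address. The sketch below is only an illustration using the standard library's `ipaddress` module, not the project's lookup code. The table is made up from documentation prefixes (RFC 5737) and private-use ASNs (RFC 6996), and `lookup_asn` is a hypothetical helper.

```python
import ipaddress

# Illustrative data only: documentation prefixes mapped to private-use ASNs.
PREFIX_TO_ASN = {
    "192.0.2.0/24": 64512,
    "198.51.100.0/24": 64513,
    "198.51.100.128/25": 64514,
}


def lookup_asn(ip, table=PREFIX_TO_ASN):
    """Return (asn, prefix) for the longest prefix containing `ip`.

    Returns (None, None) when no prefix in the table matches.
    """
    address = ipaddress.ip_address(ip)
    best = None
    for prefix, asn in table.items():
        network = ipaddress.ip_network(prefix)
        # Keep the match with the longest (most specific) prefix.
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, asn)
    if best is None:
        return None, None
    return best[1], str(best[0])
```

A linear scan is fine for three prefixes; a real routing table has hundreds of thousands of entries, which is why dedicated libraries use a prefix tree instead.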

Installation
------------

This program is written in Python 3. Start by cloning the repository:

    git clone https://github.com/NicolasLM/crawler.git
    cd crawler

Create a new virtual environment:

    python3 -m venv venv
    source venv/bin/activate

Install the package and its requirements:

    pip install --editable .

Run Redis, which Celery uses as its broker and result backend:

    docker run -d redis

Run RethinkDB, a document store to save data about domains:

    docker run -d rethinkdb rethinkdb --bind all

Download GeoLite2 Country from http://dev.maxmind.com/geoip/geoip2/geolite2/

Download and format the ASN db used by pyasn:

    pyasn_util_download.py --latest
    pyasn_util_convert.py --single rib.2016[...].bz2 ipasn.dat

You might want to tweak `crawler/conf.py` before initializing RethinkDB:

    crawler rethinkdb

Usage
-----

Put a single domain in the Celery task list:

    crawler insert www.python.org

Run a Celery worker with 10 concurrent threads:

    celery worker -A crawler.crawler.app -c 10 -P threads -Ofair --loglevel INFO

Explore the command line and get statistics:

    $ crawler countries --count 5
    Top 5 countries
    France 711
    United States 698
    Japan 367
    Netherlands 175
    Germany 73
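
A ranking like this one is essentially a frequency count over the stored domain records. Here is a minimal sketch, assuming each record is a dict with a `country` key. The record shape and the `top_countries` helper are hypothetical; the project stores its data in RethinkDB and may well compute this in the database instead.

```python
from collections import Counter


def top_countries(records, count=5):
    """Return the `count` most common countries as (name, total) pairs.

    `records` is an iterable of dicts; records without a country are skipped.
    """
    counts = Counter(r["country"] for r in records if r.get("country"))
    return counts.most_common(count)
```

For example, calling `top_countries(records, count=2)` on three records for France, two for Japan and one with no country would return the France and Japan totals, in that order.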

License
-------

MIT
