https://github.com/diegojromerolopez/relwrac
A basic crawler developed with python and asyncio
- Host: GitHub
- URL: https://github.com/diegojromerolopez/relwrac
- Owner: diegojromerolopez
- License: MIT
- Created: 2019-11-03T22:09:35.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T06:49:05.000Z (almost 3 years ago)
- Last Synced: 2023-05-11T04:01:16.933Z (over 2 years ago)
- Topics: asyncio, crawler, page-rank, python
- Language: Python
- Size: 6.2 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
# relwrac
A basic crawler developed with Python 3.7+
and [asyncio](https://docs.python.org/3/library/asyncio.html).

**WARNING: this project is in alpha stage and you should not be using it in production.**
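As a rough sketch of the pattern relwrac is built on, a minimal asyncio crawl loop could look like this (aiohttp and a naive regex link extractor are assumed here, not necessarily what the project actually uses):

```python
from __future__ import annotations

import asyncio
import re
from urllib.parse import urljoin

import aiohttp

HREF_RE = re.compile(r'href="([^"]+)"')


async def fetch_links(session: aiohttp.ClientSession, url: str) -> list[str]:
    # Download a page and return the absolute URLs of its links.
    async with session.get(url) as response:
        html = await response.text()
    return [urljoin(url, href) for href in HREF_RE.findall(html)]


async def crawl(start_url: str) -> dict[str, list[str]]:
    # Fetch the start page, then fetch every linked page concurrently.
    async with aiohttp.ClientSession() as session:
        links = await fetch_links(session, start_url)
        results = await asyncio.gather(
            *(fetch_links(session, link) for link in links),
            return_exceptions=True,  # a failing page should not abort the crawl
        )
    return {
        link: result for link, result in zip(links, results)
        if not isinstance(result, Exception)
    }


if __name__ == "__main__":
    print(asyncio.run(crawl("http://old.reddit.com")))
```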
## Installation
### System requirements
```bash
$ sudo apt install libcairo2 libxml2-dev
```
### Virtualenv requirements
Install packages specified by [requirements.txt](/requirements.txt).
## Crawler
[link_extractor.py](/src/link_extractor.py) is a Python program that recursively extracts
all the links found in pages, starting from a given URL.
### Example
Get all Reddit links (starting from http://old.reddit.com).
Note that no depth is specified, so the default value of 2 is used.
Results are saved in two files: a pickle file and a CSV file.
The pickle file stores the `Crawl.links_by_url` attribute, and the CSV file
stores one line per crawled URL, with a column for each link found
on that page.
```bash
$ pwd
/home/my_user/relwrac
$ python3 src/link_extractor.py \
-u http://old.reddit.com \
-ureg '^https:\/\/(\w+\.)?reddit\.com$' \
-op reddit_domain.pickle \
-csv reddit_domain.csv
```
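To inspect the pickle output programmatically, something along these lines should work, assuming the file stores the `Crawl.links_by_url` mapping directly as a dict (an assumption based on the description above):

```python
import pickle

# Load the crawl results produced by link_extractor.py.
with open("reddit_domain.pickle", "rb") as handle:
    links_by_url = pickle.load(handle)

# One entry per crawled URL; each value lists the links found on that page.
for url, links in links_by_url.items():
    print(f"{url}: {len(links)} links")
```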
## Grapher
[grapher.py](/src/grapher.py) is a Python program that reads a
pickle file and creates a PNG image of the graph.
### Example
Plot the graph of the Reddit links (starting from http://old.reddit.com).
```bash
$ pwd
/home/my_user/relwrac
$ python3 src/grapher.py \
-p reddit_domain.pickle \
-o reddit_domain.png
```
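grapher.py's actual implementation is the reference; as an illustration, drawing such a link graph from the pickled mapping could be done with networkx and matplotlib (both assumed here):

```python
import pickle

import matplotlib.pyplot as plt
import networkx as nx

# Load the crawled link mapping (URL -> links found on that page).
with open("reddit_domain.pickle", "rb") as handle:
    links_by_url = pickle.load(handle)

# Build a directed graph: one edge for every link from one page to another.
graph = nx.DiGraph()
for url, links in links_by_url.items():
    for link in links:
        graph.add_edge(url, link)

# Render the graph and write it out as a PNG image.
nx.draw(graph, with_labels=True, node_size=50, font_size=6)
plt.savefig("reddit_domain.png", dpi=300)
```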
## Page rank computer
[page_ranker.py](/src/page_ranker.py) reads the link adjacency matrix
and computes the page rank.
### Example
Compute the page rank of the Reddit links
and store the page rank matrix in a `reddit_domain-pr.pickle` file.
```bash
$ pwd
/home/my_user/relwrac
$ python3 src/page_ranker.py \
-p reddit_domain.pickle \
-o reddit_domain-pr.pickle
```
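The computation itself is the classic PageRank power iteration. Below is a self-contained numpy sketch of it, assuming a column-stochastic adjacency matrix (the project's actual implementation may differ in details):

```python
import numpy as np


def page_rank(adjacency: np.ndarray, damping: float = 0.85,
              tolerance: float = 1e-8, max_iterations: int = 100) -> np.ndarray:
    """Power iteration for PageRank on a column-stochastic link matrix.

    adjacency[i, j] is the probability of following a link from page j
    to page i, so every column sums to 1.
    """
    n = adjacency.shape[0]
    rank = np.full(n, 1.0 / n)  # start from the uniform distribution
    teleport = np.full(n, (1.0 - damping) / n)
    for _ in range(max_iterations):
        new_rank = teleport + damping * adjacency @ rank
        if np.abs(new_rank - rank).sum() < tolerance:
            return new_rank
        rank = new_rank
    return rank


# Toy example: pages 2 and 3 link to page 1, page 1 links to page 2.
matrix = np.array([[0.0, 1.0, 1.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
print(page_rank(matrix))
```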
## Link page rank
[link_page_rank.py](/src/link_page_rank.py) reads the page rank
and returns the page rank value of the URL passed as a parameter.
### Example
Get the page rank of a Reddit URL.
```bash
$ pwd
/home/my_user/relwrac
$ python3 src/link_page_rank.py \
-u https://old.reddit.com \
-pr reddit_domain-pr.pickle
```
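Reading a single URL's score back out of the page-rank pickle might look like this (it assumes the file stores a plain URL-to-score dict, which is a guess at the actual format):

```python
import pickle

# Load the page-rank results produced by page_ranker.py.
with open("reddit_domain-pr.pickle", "rb") as handle:
    rank_by_url = pickle.load(handle)

# Print the PageRank score of a single crawled URL.
print(rank_by_url.get("https://old.reddit.com", "URL was not crawled"))
```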
## TODO
- Show a graph of the crawled URLs.
- Filter URLs to crawl by regex.
- Compute page rank of the URLs.
- Limit crawling by time.
- Add multiprocessing to crawling process.
- Check if timeout works.
- Store links in a database (optional).
- Some documentation and examples.
- Measure performance.