# webGraph
**webGraph** recursively analyzes a web page and builds a graph showing how it links to other web pages.

This project is **async**, except for database queries, each of which runs in its own thread. It is built using [Trio](https://github.com/python-trio/trio).
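The pattern looks roughly like the following minimal sketch (illustrative only, not the project's actual code): `blocking_db_query` is a made-up stand-in for a synchronous Neo4j or Redis call, and Trio's `trio.to_thread.run_sync` pushes it onto a worker thread so the event loop stays free.

```python
import trio

def blocking_db_query(url):
    """Stand-in for a synchronous Neo4j or Redis call (illustrative only)."""
    return f"stored {url}"

async def dump_link(url):
    # Run the blocking call on a worker thread so the Trio scheduler can
    # keep running the other coroutines in the meantime.
    return await trio.to_thread.run_sync(blocking_db_query, url)

async def main():
    print(await dump_link("https://example.com"))

trio.run(main)
```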

These are the system's main components:
* *Downloader*: downloads the web pages' HTML.
* *Crawler*: parses the HTML and extracts, filters, and transforms the URLs it links to.
* *Dumper*: dumps the relationships between the web page and its links into a graph database. After dumping the data, it also prepares the extracted URLs and passes them back to the *Downloader*.

These three parts are independent and run in separate coroutines, which makes it possible to assign a different number of workers to each component. For example, there can be 4 *Downloaders*, 1 *Crawler*, and 2 *Dumpers*.

**webGraph** uses a **MapReduce**-style model: the *Downloader* performs the map step and the *Crawler* reduces it. The three components are connected by queues in a circular architecture, so messages flow in this order: *Downloader* → *Crawler* → *Dumper* → *Downloader* → ... (see the sketch below).
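As an illustration of this circular queue architecture, here is a minimal Trio sketch (not the project's code): three stages connected by memory channels, with a configurable worker count per stage. `FAKE_WEB` and all function bodies are stand-ins; the real components do HTTP fetching, HTML parsing, and database writes, and the de-duplication set lives in Redis.

```python
import trio

# A tiny fake web so the demo terminates; the real Downloader fetches HTML.
FAKE_WEB = {"a": "b c", "b": "c", "c": "a"}

async def downloader(urls_in, pages_out):
    async for url in urls_in:                      # map step: "fetch" the page
        await pages_out.send((url, FAKE_WEB.get(url, "")))

async def crawler(pages_in, links_out):
    async for url, page in pages_in:               # reduce step: extract the links
        await links_out.send((url, page.split()))

async def dumper(links_in, urls_out, seen):
    async for url, links in links_in:
        print(url, "->", links)                    # stand-in for the graph-DB write
        for link in links:
            if link not in seen:                   # webGraph keeps this set in Redis
                seen.add(link)
                await urls_out.send(link)          # close the circle

async def main(n_downloaders=4, n_crawlers=1, n_dumpers=2):
    urls_send, urls_recv = trio.open_memory_channel(100)
    pages_send, pages_recv = trio.open_memory_channel(100)
    links_send, links_recv = trio.open_memory_channel(100)
    seen = {"a"}
    await urls_send.send("a")                      # seed with the initial URL
    with trio.move_on_after(1):                    # demo only: stop after one second
        async with trio.open_nursery() as nursery:
            for _ in range(n_downloaders):
                nursery.start_soon(downloader, urls_recv, pages_send)
            for _ in range(n_crawlers):
                nursery.start_soon(crawler, pages_recv, links_send)
            for _ in range(n_dumpers):
                nursery.start_soon(dumper, links_recv, urls_send, seen)

trio.run(main)
```

Because every worker in a stage reads from the same channel, scaling a stage is just starting more copies of its coroutine, which is what the per-component worker counts adjust.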

## Dependencies
Python requirements are specified in *requirements.txt*.

There are two dependencies outside Python:
1. [Neo4j](https://neo4j.com/download/)
2. [Redis](https://redis.io/download)

Apart from installing and starting Neo4j and Redis, you also need to make the following environment variables available to the Python process (see the sketch after this list):
* NEO4J_USER=
* NEO4J_PASSWORD=
* NEO4J_URL=
* REDIS_URL=
* REDIS_PORT=
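The project presumably reads these settings from the process environment; a minimal sketch of what that might look like (illustrative only, not the project's actual configuration code):

```python
import os

# Illustrative only: how the settings might be read from the environment.
NEO4J_USER = os.environ["NEO4J_USER"]
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]
NEO4J_URL = os.environ["NEO4J_URL"]
REDIS_URL = os.environ["REDIS_URL"]
REDIS_PORT = int(os.environ["REDIS_PORT"])  # the port is numeric
```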

Either database can be swapped for any other graph database and/or set database. The database abstract classes live in *webGraph/utils/_abc.py*; to swap in a different database, you only need to implement those abstract classes, as in the sketch below.
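For illustration, here is a rough sketch of what such abstract classes and a swap-in implementation might look like. The class and method names here are hypothetical; check *webGraph/utils/_abc.py* for the real interfaces.

```python
from abc import ABC, abstractmethod

# Hypothetical interfaces: the real abstract classes live in
# webGraph/utils/_abc.py and their names and signatures may differ.
class GraphDatabase(ABC):
    @abstractmethod
    def add_link(self, source: str, target: str) -> None:
        """Store a 'source links to target' edge."""

class SetDatabase(ABC):
    @abstractmethod
    def add(self, url: str) -> bool:
        """Remember url; return True if it had not been seen before."""

# A swap-in replacement only has to implement the abstract methods,
# for example this in-memory version in place of Neo4j/Redis:
class InMemoryGraphDatabase(GraphDatabase):
    def __init__(self):
        self.edges = set()

    def add_link(self, source, target):
        self.edges.add((source, target))

class InMemorySetDatabase(SetDatabase):
    def __init__(self):
        self.seen = set()

    def add(self, url):
        if url in self.seen:
            return False
        self.seen.add(url)
        return True
```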

## Usage
To analyze a web page with **webGraph**, run:
```bash
python -m webGraph.start_web_graph {web page}
```
or
```bash
python -m webGraph.start_web_graph {web page} {downloaders amount} {crawler amount} {dumpers amount}
```
Omitting the amounts is the same as running:
```bash
python -m webGraph.start_web_graph {web page} 100 2 2
```
## Real Examples
### Wikipedia
```bash
python -m webGraph.start_web_graph https://en.wikipedia.org/wiki/Wikipedia
```
![Wikipedia graph](images/WikipediaGraph.png)
### Amazon
```bash
python -m webGraph.start_web_graph https://www.amazon.com/
```
![Amazon graph](images/AmazonGraph.png)

## Testing
To run the tests, you must first install the dependencies in *dev_requirements.txt*, e.g. with `pip install -r dev_requirements.txt`.