Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Given an initial url, it produces a graph showing how it links to other pages, and those pages to others, and so on...
- Host: GitHub
- URL: https://github.com/aratz-lasa/webgraph
- Owner: aratz-lasa
- Created: 2019-02-16T21:58:14.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2021-04-20T17:51:42.000Z (over 3 years ago)
- Last Synced: 2024-11-07T15:53:47.495Z (2 months ago)
- Topics: async, beautifulsoup4, graph, neo4j, python, python3, redis, set, trio
- Language: Python
- Homepage:
- Size: 849 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 3
- Metadata Files:
  - Readme: README.md
README
# webGraph
**webGraph** recursively analyzes a web page and creates a graph showing how it connects to other web pages. The project is **async**, except when querying databases (each database access runs in its own thread). It is built using [Trio](https://github.com/python-trio/trio).
These are the system's main components:
* *Downloader*: downloads each web page's HTML.
* *Crawler*: analyzes the HTML and extracts, filters, and transforms the URLs it links to.
* *Dumper*: dumps the relationships between the web page and its links into a graph database. After dumping the data, it also analyzes the URLs before passing them back to the *Downloader*.

These three components are independent and run in separate coroutines, which gives the flexibility to adjust the number of workers for each one. For example, there can be 4 *Downloaders*, 1 *Crawler*, and 2 *Dumpers*.
**webGraph** uses a **MapReduce** model: the *Downloader* performs the mapping and the *Crawler* reduces it. The three components are connected by queues in a circular architecture, and the messages flowing between them follow this order: *Downloader*-*Crawler*-*Dumper*-*Downloader*...
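As a rough illustration of this circular wiring (not the project's actual code), the sketch below connects the three components with Trio memory channels; the function bodies, channel capacities, and the example URL are placeholders:
```python
# Minimal sketch of the circular Downloader -> Crawler -> Dumper -> Downloader
# pipeline using Trio memory channels. Bodies are placeholders, not webGraph's
# real implementation.
import trio


async def downloader(urls_in, pages_out):
    async for url in urls_in:
        html = "<html>...</html>"  # placeholder for the real HTTP download
        await pages_out.send((url, html))


async def crawler(pages_in, parsed_out):
    async for url, html in pages_in:
        links = []  # placeholder for BeautifulSoup link extraction/filtering
        await parsed_out.send((url, links))


async def dumper(parsed_in, urls_out):
    async for url, links in parsed_in:
        # placeholder for writing (url)-[:LINKS_TO]->(link) edges to the graph DB
        for link in links:
            await urls_out.send(link)  # feed new URLs back to the Downloaders


async def main(start_url, downloaders=4, crawlers=1, dumpers=2):
    url_send, url_recv = trio.open_memory_channel(100)
    page_send, page_recv = trio.open_memory_channel(100)
    parsed_send, parsed_recv = trio.open_memory_channel(100)
    await url_send.send(start_url)
    async with trio.open_nursery() as nursery:
        for _ in range(downloaders):
            nursery.start_soon(downloader, url_recv.clone(), page_send.clone())
        for _ in range(crawlers):
            nursery.start_soon(crawler, page_recv.clone(), parsed_send.clone())
        for _ in range(dumpers):
            nursery.start_soon(dumper, parsed_recv.clone(), url_send.clone())


# trio.run(main, "https://example.com")  # runs until cancelled (Ctrl+C)
```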
## Dependencies
Python requirements are specified in *requirements.txt*. There are two dependencies external to Python:
1. [Neo4j](https://neo4j.com/download/)
2. [Redis](https://redis.io/download)
* Apart from installing and starting Neo4j and Redis, you also need to set the following environment variables so that Python can read them:
* NEO4J_USER=
* NEO4J_PASSWORD=
* NEO4J_URL=
* REDIS_URL=
* REDIS_PORT=
These databases can be swapped for any other graph database and/or set database. The database abstract classes are in *webGraph/utils/_abc.py*, so in order to swap a database you only need to implement those abstract classes.
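For illustration only, here is a hypothetical sketch of such a swap. The class and method names (`GraphDBInterface`, `dump_links`) are invented for this example and do not necessarily match the abstract classes in *webGraph/utils/_abc.py*; the idea is simply to hide a concrete store (here Neo4j, configured through the environment variables above) behind a common interface:
```python
# Hypothetical sketch of implementing a graph-database adapter.
# "GraphDBInterface" and "dump_links" are invented names for illustration;
# the real abstract classes live in webGraph/utils/_abc.py.
import os
from abc import ABC, abstractmethod

from neo4j import GraphDatabase  # pip install neo4j


class GraphDBInterface(ABC):
    """Stand-in for the project's graph-database abstract class."""

    @abstractmethod
    def dump_links(self, url, links):
        """Persist (url)-[:LINKS_TO]->(link) relationships."""


class Neo4jGraphDB(GraphDBInterface):
    def __init__(self):
        # Credentials come from the environment variables listed above.
        self._driver = GraphDatabase.driver(
            os.environ["NEO4J_URL"],
            auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"]),
        )

    def dump_links(self, url, links):
        with self._driver.session() as session:
            for link in links:
                session.run(
                    "MERGE (a:Page {url: $src}) "
                    "MERGE (b:Page {url: $dst}) "
                    "MERGE (a)-[:LINKS_TO]->(b)",
                    src=url, dst=link,
                )
```
Swapping in another store would mean writing a second class that implements the same interface, without touching the rest of the pipeline.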
## Usage
This is the way to analyze a web page using **webGraph**:
```bash
python -m webGraph.start_web_graph {web page}
```
or
```bash
python -m webGraph.start_web_graph {web page} {downloaders amount} {crawler amount} {dumpers amount}
```
If you do not specify the amounts, it is the same as doing:
```bash
python -m webGraph.start_web_graph {web page} 100 2 2
```
## Real Example
### Wikipedia
```bash
python -m webGraph.start_web_graph https://en.wikipedia.org/wiki/Wikipedia
```
![Wikipedia graph](images/WikipediaGraph.png)
### Amazon
```bash
python -m webGraph.start_web_graph https://www.amazon.com/
```
![Amazon graph](images/AmazonGraph.png)
## Testing
To execute the tests, you must install the packages listed in *dev_requirements.txt*.