Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/jamesjarvis/web-graph

Experiment with web scraping
https://github.com/jamesjarvis/web-graph

colly crawler database golang web-graph

Last synced: 23 days ago
JSON representation

Experiment with web scraping

Awesome Lists containing this project

README

        

# Web Graph

> Experiment with web scraping

View it live!

If you want to start from a different url, you can change the query string!
(Note that you can only look at urls that are indirectly discoverable from the root jamesjarvis.io).

Example:

The basic idea of this is that I wanted to be able to crawl from a single URL, and scrape the entire tree of links it can traverse.

Rough overview:

Crawler is given a url.
It first checks that this url has not been crawled already, if it has, then it just moves on.
Then it checks that the url is accessible, it'll do some small exponential backoof, but then returns PageDeadError
If it can, it will download the page source, and scrape all 'a' elements, and the href attribute from that.
Then it sends all these scraped URL's to the back of a list, and the process repeats.

Essentially, this is a breadth first crawl of the whole internet, or at least until either my 1TB hard drive runs out of space, or virgin media cuts me off.

## The API

If you want to mess about with the API directly, you need to know that the "id" of each page is calculated as the following:

> SHA1(hostname + pathname).hex()

If you want to find out the id's of pages found on a particular host, you can use:

If you want to find info of a page, along with the id's of pages linked *from* this page, use:

If you want to find the links *to* a page (v useful for discovering backlinks), use:

## To run

```bash
docker-compose up --build -d && docker-compose logs -f link-processor
```

Then open and enter your credentials from [Your database environment file](./database.env.example)
Note, if running this on an rpi, stop the pgadmin service with `docker compose stop pgadmin` as it is not compiled for ARM.

To see the UI, open the `frontend/index.html` file in a browser.
## DB Schema

### Page

| Page ID (PK) (generated as hash of host+path) | Host | Path | Url |
| --------------------------------------------- | ---------------- | --------------- | ------------------------------------ |
| 1 (hash of host+path) | jamesjarvis.io | / | https://jamesjarvis.io/ |
| 2 (hash of host+path) | en.wikipedia.com | /united-kingdom | https://wikipedia.com/united-kingdom |

### Link

| FromPageID (FK) | ToPageID (FK) | Link text |
| --------------- | ------------- | ---------------- |
| 1 | 2 | I live in the UK |