Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Experiment with web scraping
- Host: GitHub
- URL: https://github.com/jamesjarvis/web-graph
- Owner: jamesjarvis
- License: agpl-3.0
- Created: 2020-09-04T18:43:05.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-11-10T23:25:43.000Z (almost 2 years ago)
- Last Synced: 2024-06-20T22:37:50.819Z (5 months ago)
- Topics: colly, crawler, database, golang, web-graph
- Language: Go
- Homepage: https://jamesjarvis.github.io/web-graph/
- Size: 351 KB
- Stars: 0
- Watchers: 2
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Web Graph
> Experiment with web scraping
[View it live!](https://jamesjarvis.github.io/web-graph/)

If you want to start from a different URL, you can change the query string! (Note that you can only look at URLs that are indirectly discoverable from the root jamesjarvis.io.)
The basic idea is that I wanted to be able to crawl from a single URL and scrape the entire tree of links it can traverse.

Rough overview:

1. The crawler is given a URL.
2. It first checks whether that URL has already been crawled; if it has, it just moves on.
3. It then checks that the URL is accessible, retrying with a small exponential backoff; if the page still cannot be reached, it returns a `PageDeadError`.
4. If it can, it downloads the page source and scrapes the `href` attribute of every `<a>` element.
5. It then sends all of these scraped URLs to the back of the queue, and the process repeats (see the sketch below).

Essentially, this is a breadth-first crawl of the whole internet, or at least until either my 1TB hard drive runs out of space or Virgin Media cuts me off.
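For illustration, here is a minimal sketch of that loop in plain Go (standard library plus `golang.org/x/net/html`). It is not the project's actual implementation, which is built on colly; names such as `errPageDead` and the three-attempt backoff are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"time"

	"golang.org/x/net/html"
)

// errPageDead is a stand-in for the PageDeadError mentioned above.
var errPageDead = errors.New("page dead")

// fetch GETs a page, retrying with a small exponential backoff before
// giving up on it.
func fetch(pageURL string) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; attempt < 3; attempt++ {
		resp, err := http.Get(pageURL)
		if err == nil && resp.StatusCode < 400 {
			return resp, nil
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return nil, errPageDead
}

// extractLinks walks the parsed document and collects the href
// attribute of every <a> element, resolved against the page URL.
func extractLinks(base *url.URL, doc *html.Node) []string {
	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					if u, err := base.Parse(attr.Val); err == nil {
						links = append(links, u.String())
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links
}

func main() {
	queue := []string{"https://jamesjarvis.io/"} // root URL
	visited := map[string]bool{}

	for len(queue) > 0 {
		pageURL := queue[0]
		queue = queue[1:]

		// Already crawled? Just move on.
		if visited[pageURL] {
			continue
		}
		visited[pageURL] = true

		resp, err := fetch(pageURL)
		if err != nil {
			continue // page is dead, skip it
		}
		doc, err := html.Parse(resp.Body)
		resp.Body.Close()
		if err != nil {
			continue
		}

		base, _ := url.Parse(pageURL)
		links := extractLinks(base, doc)
		fmt.Printf("crawled %s (%d links)\n", pageURL, len(links))

		// Scraped URLs go to the back of the queue: breadth-first.
		for _, link := range links {
			if !visited[link] {
				queue = append(queue, link)
			}
		}
	}
}
```

The single in-memory queue and `visited` map are the whole trick: pushing newly scraped links to the back of the queue is what makes the traversal breadth-first rather than depth-first.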
## The API
If you want to mess about with the API directly, you need to know that the "id" of each page is calculated as follows:
> SHA1(hostname + pathname).hex()
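A quick Go sketch of that calculation is below; the real service may normalise the host and path (trailing slashes, casing) slightly differently, so treat it as an approximation rather than the exact implementation.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"net/url"
)

// pageID computes the hex-encoded SHA1 of hostname + pathname,
// matching the formula quoted above.
func pageID(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	sum := sha1.Sum([]byte(u.Hostname() + u.Path))
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	id, err := pageID("https://jamesjarvis.io/")
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // the "id" the API expects for this page
}
```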
If you want to find out the IDs of pages found on a particular host, you can use:

If you want to find info about a page, along with the IDs of pages linked *from* that page, use:

If you want to find the links *to* a page (very useful for discovering backlinks), use:
## To run
```bash
docker-compose up --build -d && docker-compose logs -f link-processor
```

Then open the pgAdmin UI and enter your credentials from [your database environment file](./database.env.example).

Note: if running this on a Raspberry Pi, stop the pgAdmin service with `docker compose stop pgadmin`, as it is not compiled for ARM.

To see the UI, open the `frontend/index.html` file in a browser.
## DB Schema

### Page

| Page ID (PK) (generated as hash of host+path) | Host             | Path            | URL                                  |
| --------------------------------------------- | ---------------- | --------------- | ------------------------------------ |
| 1 (hash of host+path)                         | jamesjarvis.io   | /               | https://jamesjarvis.io/              |
| 2 (hash of host+path)                         | en.wikipedia.com | /united-kingdom | https://wikipedia.com/united-kingdom |

### Link
| FromPageID (FK) | ToPageID (FK) | Link text |
| --------------- | ------------- | ---------------- |
| 1 | 2 | I live in the UK |
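For reference, the two tables above might look roughly like this as Go structs; the field names and types are illustrative guesses, not the models actually used in this repo.

```go
package main

import "fmt"

// Page mirrors the Page table: one row per distinct host + path.
type Page struct {
	ID   string // PK: hex SHA1 of host + path
	Host string // e.g. "jamesjarvis.io"
	Path string // e.g. "/"
	URL  string // full URL as discovered
}

// Link mirrors the Link table: one row per <a> element found.
type Link struct {
	FromPageID string // FK -> Page.ID of the page containing the link
	ToPageID   string // FK -> Page.ID of the page being linked to
	LinkText   string // anchor text, e.g. "I live in the UK"
}

func main() {
	home := Page{ID: "1", Host: "jamesjarvis.io", Path: "/", URL: "https://jamesjarvis.io/"}
	uk := Page{ID: "2", Host: "en.wikipedia.com", Path: "/united-kingdom", URL: "https://wikipedia.com/united-kingdom"}
	fmt.Println(Link{FromPageID: home.ID, ToPageID: uk.ID, LinkText: "I live in the UK"})
}
```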