Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Experiment with web scraping
- Host: GitHub
- URL: https://github.com/jamesjarvis/web-graph
- Owner: jamesjarvis
- License: agpl-3.0
- Created: 2020-09-04T18:43:05.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-11-10T23:25:43.000Z (almost 2 years ago)
- Last Synced: 2024-06-20T22:37:50.819Z (5 months ago)
- Topics: colly, crawler, database, golang, web-graph
- Language: Go
- Homepage: https://jamesjarvis.github.io/web-graph/
- Size: 351 KB
- Stars: 0
- Watchers: 2
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Web Graph
> Experiment with web scraping
[View it live!](https://jamesjarvis.github.io/web-graph/)

If you want to start from a different URL, you can change the query string! (Note that you can only look at URLs that are indirectly discoverable from the root jamesjarvis.io.)
The basic idea is that I wanted to be able to crawl from a single URL and scrape the entire tree of links it can traverse.

Rough overview:

1. The crawler is given a URL.
2. It first checks whether that URL has already been crawled; if it has, it just moves on.
3. It then checks that the URL is accessible, retrying with a small exponential backoff; if the page still cannot be reached, it returns a `PageDeadError`.
4. If it can, it downloads the page source and scrapes the `href` attribute of every `<a>` element.
5. It then sends all of these scraped URLs to the back of the queue, and the process repeats (see the sketch below).

Essentially, this is a breadth-first crawl of the whole internet, or at least until either my 1TB hard drive runs out of space or Virgin Media cuts me off.
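For illustration, here is a minimal sketch of that loop in plain Go (standard library plus `golang.org/x/net/html`). It is not the project's actual implementation, which is built on colly; names such as `errPageDead` and the three-attempt backoff are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"time"

	"golang.org/x/net/html"
)

// errPageDead is a stand-in for the PageDeadError mentioned above.
var errPageDead = errors.New("page dead")

// fetch GETs a page, retrying with a small exponential backoff before
// giving up on it.
func fetch(pageURL string) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; attempt < 3; attempt++ {
		resp, err := http.Get(pageURL)
		if err == nil && resp.StatusCode < 400 {
			return resp, nil
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return nil, errPageDead
}

// extractLinks walks the parsed document and collects the href
// attribute of every <a> element, resolved against the page URL.
func extractLinks(base *url.URL, doc *html.Node) []string {
	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					if u, err := base.Parse(attr.Val); err == nil {
						links = append(links, u.String())
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links
}

func main() {
	queue := []string{"https://jamesjarvis.io/"} // root URL
	visited := map[string]bool{}

	for len(queue) > 0 {
		pageURL := queue[0]
		queue = queue[1:]

		// Already crawled? Just move on.
		if visited[pageURL] {
			continue
		}
		visited[pageURL] = true

		resp, err := fetch(pageURL)
		if err != nil {
			continue // page is dead, skip it
		}
		doc, err := html.Parse(resp.Body)
		resp.Body.Close()
		if err != nil {
			continue
		}

		base, _ := url.Parse(pageURL)
		links := extractLinks(base, doc)
		fmt.Printf("crawled %s (%d links)\n", pageURL, len(links))

		// Scraped URLs go to the back of the queue: breadth-first.
		for _, link := range links {
			if !visited[link] {
				queue = append(queue, link)
			}
		}
	}
}
```

The single in-memory queue and `visited` map are the whole trick: pushing newly scraped links to the back of the queue is what makes the traversal breadth-first rather than depth-first.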
## The API
If you want to mess about with the API directly, you need to know that the "id" of each page is calculated as follows:
> SHA1(hostname + pathname).hex()
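A quick Go sketch of that calculation is below; the real service may normalise the host and path (trailing slashes, casing) slightly differently, so treat it as an approximation rather than the exact implementation.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"net/url"
)

// pageID computes the hex-encoded SHA1 of hostname + pathname,
// matching the formula quoted above.
func pageID(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	sum := sha1.Sum([]byte(u.Hostname() + u.Path))
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	id, err := pageID("https://jamesjarvis.io/")
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // the "id" the API expects for this page
}
```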
If you want to find out the IDs of pages found on a particular host, you can use:

If you want to find info about a page, along with the IDs of pages linked *from* that page, use:

If you want to find the links *to* a page (very useful for discovering backlinks), use:
## To run
```bash
docker-compose up --build -d && docker-compose logs -f link-processor
```

Then open the pgAdmin UI and enter your credentials from [your database environment file](./database.env.example).

Note: if running this on a Raspberry Pi, stop the pgAdmin service with `docker compose stop pgadmin`, as it is not compiled for ARM.

To see the UI, open the `frontend/index.html` file in a browser.
## DB Schema

### Page

| Page ID (PK) (generated as hash of host+path) | Host             | Path            | URL                                  |
| --------------------------------------------- | ---------------- | --------------- | ------------------------------------ |
| 1 (hash of host+path)                         | jamesjarvis.io   | /               | https://jamesjarvis.io/              |
| 2 (hash of host+path)                         | en.wikipedia.com | /united-kingdom | https://wikipedia.com/united-kingdom |

### Link
| FromPageID (FK) | ToPageID (FK) | Link text |
| --------------- | ------------- | ---------------- |
| 1 | 2 | I live in the UK |
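For reference, the two tables above might look roughly like this as Go structs; the field names and types are illustrative guesses, not the models actually used in this repo.

```go
package main

import "fmt"

// Page mirrors the Page table: one row per distinct host + path.
type Page struct {
	ID   string // PK: hex SHA1 of host + path
	Host string // e.g. "jamesjarvis.io"
	Path string // e.g. "/"
	URL  string // full URL as discovered
}

// Link mirrors the Link table: one row per <a> element found.
type Link struct {
	FromPageID string // FK -> Page.ID of the page containing the link
	ToPageID   string // FK -> Page.ID of the page being linked to
	LinkText   string // anchor text, e.g. "I live in the UK"
}

func main() {
	home := Page{ID: "1", Host: "jamesjarvis.io", Path: "/", URL: "https://jamesjarvis.io/"}
	uk := Page{ID: "2", Host: "en.wikipedia.com", Path: "/united-kingdom", URL: "https://wikipedia.com/united-kingdom"}
	fmt.Println(Link{FromPageID: home.ID, ToPageID: uk.ID, LinkText: "I live in the UK"})
}
```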