https://github.com/shirokovnv/webcrawler
A service for crawling websites.
- Host: GitHub
- URL: https://github.com/shirokovnv/webcrawler
- Owner: shirokovnv
- License: mit
- Created: 2022-09-11T10:42:34.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-02-21T09:16:06.000Z (almost 3 years ago)
- Last Synced: 2025-05-07T09:38:18.376Z (9 months ago)
- Topics: cassandra, elixir-phoenix, parser, webcrawler
- Language: Elixir
- Size: 110 KB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE.md
# Webcrawler
![ci.yml][link-ci]
**A service for crawling websites (experimental)**
## Dependencies
- [Docker][link-docker]
- [Make][link-make]
- [Phoenix framework][link-phx]
- [Redis][link-redis] for job processing
- [Cassandra][link-cassandra] for persistent storage
## Project setup
**From the project root, in a shell, run:**
- `make pull` to pull the latest images
- `make init` to install fresh dependencies
- `make up` to start the app containers
Now you can visit [`localhost:4000`](http://localhost:4000) from your browser.
- `make down` to stop the running containers
- `make help` for additional commands
## How it works
1. The user adds a new source URL -> a new async job is started
2. Inside the job:
- Normalize the URL (validate the scheme, remove the trailing slash, etc.; see the sketch after this list)
- Store the link in the DB; if the link already exists, exit
- Parse the HTML for links and metadata (see the parsing sketch at the end of this section)
- Store them in separate tables
- Normalize the links and check whether each one is relative or absolute
- Check whether the links are external
- For each non-external link -> schedule a new async job after a random interval
3. That's it
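As a rough illustration of the link-handling steps above, here is a minimal, self-contained sketch using only the Elixir standard library. The module name `LinkUtils` and its functions are hypothetical, not the project's actual API:

```elixir
# Illustrative sketch of the link-handling steps; names are hypothetical.
defmodule LinkUtils do
  @doc "Validate the URL scheme and strip a trailing slash."
  def normalize(url) do
    case URI.parse(url) do
      %URI{scheme: scheme} when scheme in ["http", "https"] ->
        {:ok, String.trim_trailing(url, "/")}

      _ ->
        {:error, :invalid_url}
    end
  end

  @doc "Resolve a possibly relative link against the page it was found on."
  def resolve(link, base_url) do
    base_url |> URI.merge(link) |> URI.to_string()
  end

  @doc "A link is external when its host differs from the source host."
  def external?(link, source_url) do
    URI.parse(link).host != URI.parse(source_url).host
  end

  @doc """
  Schedule a re-crawl message to the current process after a random delay.
  A plain-process stand-in for the Redis-backed jobs the project uses.
  """
  def schedule(url, max_delay_ms \\ 5_000) do
    Process.send_after(self(), {:crawl, url}, :rand.uniform(max_delay_ms))
  end
end
```

For example, `LinkUtils.resolve("/about", "https://example.com")` returns `"https://example.com/about"`, and `LinkUtils.external?("https://other.org/x", "https://example.com")` returns `true`.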
**To see it in action, go to** [localhost:4000/crawl](http://localhost:4000/crawl) **and enter any URL.**
**To see search results, visit** [localhost:4000/search](http://localhost:4000/search).
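The README does not show the parsing step itself. As an assumption, the sketch below uses [Floki](https://github.com/philss/floki), a common HTML parser in the Elixir ecosystem, which may not be what this project actually uses; `html` is assumed to hold the fetched page body:

```elixir
# Hypothetical parsing sketch using Floki; the project's actual parser may differ.
{:ok, document} = Floki.parse_document(html)

# Extract all link targets from anchor tags.
links = Floki.attribute(document, "a", "href")

# Extract simple metadata: the page title and the meta description.
title = document |> Floki.find("title") |> Floki.text()

description =
  document
  |> Floki.find("meta[name=description]")
  |> Floki.attribute("content")
  |> List.first()
```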
## Database schema
The default keyspace is `storage`.
**Tables:**
- `site_statistics` contains source URLs and counts of parsed links
- `sites` contains URLs and the parsed HTML
- `sites_by_meta` contains URLs and parsed metadata
For `LIKE`-style search queries, a [SASI][link-sasi] index needs to be configured.
See `schema.cql` and `cassandra.yaml` for more details.
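As a hedged sketch of what that configuration might look like, the snippet below creates a SASI index through [Xandra](https://github.com/whatyouhide/xandra), a common Elixir driver for Cassandra; the indexed column name `meta` is an assumption, and the authoritative schema lives in `schema.cql`:

```elixir
# Hypothetical: create a SASI index for LIKE queries via the Xandra driver.
# The indexed column name (`meta`) is an assumption.
{:ok, conn} = Xandra.start_link(nodes: ["127.0.0.1:9042"])

Xandra.execute!(conn, """
CREATE CUSTOM INDEX IF NOT EXISTS sites_by_meta_idx
ON storage.sites_by_meta (meta)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'CONTAINS'}
""")

# A LIKE-style query then becomes possible:
Xandra.execute!(conn, "SELECT url FROM storage.sites_by_meta WHERE meta LIKE '%elixir%'")
```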
## Useful links
- Visit [localhost:4000/jobs](http://localhost:4000/jobs) to see crawling jobs in action
- Visit [localhost:4000/dashboard](http://localhost:4000/dashboard) to see core metrics of the system
## License
MIT. Please see the [license file](LICENSE.md) for more information.
[link-ci]: https://github.com/shirokovnv/webcrawler/actions/workflows/ci.yml/badge.svg
[link-cassandra]: https://cassandra.apache.org/
[link-sasi]: https://cassandra.apache.org/doc/4.1/cassandra/cql/SASI.html
[link-docker]: https://www.docker.com/
[link-make]: https://www.gnu.org/software/make/manual/make.html
[link-redis]: https://redis.io/
[link-phx]: https://www.phoenixframework.org/