https://github.com/shirokovnv/webcrawler
A service for crawling websites.
- Host: GitHub
- URL: https://github.com/shirokovnv/webcrawler
- Owner: shirokovnv
- License: mit
- Created: 2022-09-11T10:42:34.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-02-21T09:16:06.000Z (almost 3 years ago)
- Last Synced: 2025-05-07T09:38:18.376Z (9 months ago)
- Topics: cassandra, elixir-phoenix, parser, webcrawler
- Language: Elixir
- Size: 110 KB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE.md
# Webcrawler
![ci.yml][link-ci]
**A service for crawling websites (experimental)**
## Dependencies
- [Docker][link-docker]
- [Make][link-make]
- [Phoenix framework][link-phx]
- [Redis][link-redis] for job processing
- [Cassandra][link-cassandra] for persistent storage
## Project setup
**From the project root, in a shell, run:**
- `make pull` to pull the latest images
- `make init` to install fresh dependencies
- `make up` to start the app containers
Now you can visit [`localhost:4000`](http://localhost:4000) from your browser.
- `make down` to stop the running containers
- `make help` for additional commands
## How it works
1. The user adds a new source URL -> a new async job is started
2. Inside the job:
- Normalize the URL (validate the scheme, remove the trailing slash, etc.; see the sketch after this list)
- Store the link in the DB; if the link already exists, exit
- Parse the HTML for links and metadata (see the parsing sketch at the end of this section)
- Store them in separate tables
- Normalize the links and check whether each one is relative or absolute
- Check whether the links are external
- For each non-external link -> schedule a new async job after a random interval
3. That's it
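As a rough illustration of the link-handling steps above, here is a minimal, self-contained sketch using only the Elixir standard library. The module name `LinkUtils` and its functions are hypothetical, not the project's actual API:

```elixir
# Illustrative sketch of the link-handling steps; names are hypothetical.
defmodule LinkUtils do
  @doc "Validate the URL scheme and strip a trailing slash."
  def normalize(url) do
    case URI.parse(url) do
      %URI{scheme: scheme} when scheme in ["http", "https"] ->
        {:ok, String.trim_trailing(url, "/")}

      _ ->
        {:error, :invalid_url}
    end
  end

  @doc "Resolve a possibly relative link against the page it was found on."
  def resolve(link, base_url) do
    base_url |> URI.merge(link) |> URI.to_string()
  end

  @doc "A link is external when its host differs from the source host."
  def external?(link, source_url) do
    URI.parse(link).host != URI.parse(source_url).host
  end

  @doc """
  Schedule a re-crawl message to the current process after a random delay.
  A plain-process stand-in for the Redis-backed jobs the project uses.
  """
  def schedule(url, max_delay_ms \\ 5_000) do
    Process.send_after(self(), {:crawl, url}, :rand.uniform(max_delay_ms))
  end
end
```

For example, `LinkUtils.resolve("/about", "https://example.com")` returns `"https://example.com/about"`, and `LinkUtils.external?("https://other.org/x", "https://example.com")` returns `true`.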
**To see it in action, go to** [localhost:4000/crawl](http://localhost:4000/crawl) **and enter any URL.**
**To see search results, visit** [localhost:4000/search](http://localhost:4000/search).
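The README does not show the parsing step itself. As an assumption, the sketch below uses [Floki](https://github.com/philss/floki), a common HTML parser in the Elixir ecosystem, which may not be what this project actually uses; `html` is assumed to hold the fetched page body:

```elixir
# Hypothetical parsing sketch using Floki; the project's actual parser may differ.
{:ok, document} = Floki.parse_document(html)

# Extract all link targets from anchor tags.
links = Floki.attribute(document, "a", "href")

# Extract simple metadata: the page title and the meta description.
title = document |> Floki.find("title") |> Floki.text()

description =
  document
  |> Floki.find("meta[name=description]")
  |> Floki.attribute("content")
  |> List.first()
```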
## Database schema
The default keyspace is `storage`.
**Tables:**
- `site_statistics` contains source URLs and counts of parsed links
- `sites` contains URLs and the parsed HTML
- `sites_by_meta` contains URLs and parsed metadata
For `LIKE`-style search queries, a [SASI][link-sasi] index needs to be configured.
See `schema.cql` and `cassandra.yaml` for more details.
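As a hedged sketch of what that configuration might look like, the snippet below creates a SASI index through [Xandra](https://github.com/whatyouhide/xandra), a common Elixir driver for Cassandra; the indexed column name `meta` is an assumption, and the authoritative schema lives in `schema.cql`:

```elixir
# Hypothetical: create a SASI index for LIKE queries via the Xandra driver.
# The indexed column name (`meta`) is an assumption.
{:ok, conn} = Xandra.start_link(nodes: ["127.0.0.1:9042"])

Xandra.execute!(conn, """
CREATE CUSTOM INDEX IF NOT EXISTS sites_by_meta_idx
ON storage.sites_by_meta (meta)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'CONTAINS'}
""")

# A LIKE-style query then becomes possible:
Xandra.execute!(conn, "SELECT url FROM storage.sites_by_meta WHERE meta LIKE '%elixir%'")
```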
## Useful links
- Visit [localhost:4000/jobs](http://localhost:4000/jobs) to see crawling jobs in action
- Visit [localhost:4000/dashboard](http://localhost:4000/dashboard) to see core metrics of the system
## License
MIT. Please see the [license file](LICENSE.md) for more information.
[link-ci]: https://github.com/shirokovnv/webcrawler/actions/workflows/ci.yml/badge.svg
[link-cassandra]: https://cassandra.apache.org/
[link-sasi]: https://cassandra.apache.org/doc/4.1/cassandra/cql/SASI.html
[link-docker]: https://www.docker.com/
[link-make]: https://www.gnu.org/software/make/manual/make.html
[link-redis]: https://redis.io/
[link-phx]: https://www.phoenixframework.org/