{"id":20811225,"url":"https://github.com/shirokovnv/webcrawler","last_synced_at":"2025-07-21T04:06:16.617Z","repository":{"id":103423502,"uuid":"535265970","full_name":"shirokovnv/webcrawler","owner":"shirokovnv","description":"The service for crawling websites.","archived":false,"fork":false,"pushed_at":"2023-02-21T09:16:06.000Z","size":113,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-05-07T09:38:18.376Z","etag":null,"topics":["cassandra","elixir-phoenix","parser","webcrawler"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shirokovnv.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-11T10:42:34.000Z","updated_at":"2023-08-21T10:53:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"e4c6e4c9-9045-4d4b-a393-5d836ac1e99c","html_url":"https://github.com/shirokovnv/webcrawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/shirokovnv/webcrawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shirokovnv%2Fwebcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shirokovnv%2Fwebcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shirokovnv%2Fwebcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shirokovnv%2Fwebcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/shirokovnv","download_url":"https://codeload.github.com/shirokovnv/webcrawler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shirokovnv%2Fwebcrawler/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266236669,"owners_count":23897222,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cassandra","elixir-phoenix","parser","webcrawler"],"created_at":"2024-11-17T20:38:11.443Z","updated_at":"2025-07-21T04:06:16.583Z","avatar_url":"https://github.com/shirokovnv.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Webcrawler\n\n![ci.yml][link-ci]\n\n**The service for crawling websites (experimental)**\n\n## Dependencies\n\n- [Docker][link-docker]\n- [Make][link-make]\n- [Phoenix framework][link-phx]\n- [Redis][link-redis] for job processing\n- [Cassandra][link-cassandra] for persistent storage\n\n## Project setup\n\n**From the project root, in a shell, run:**\n\n- `make pull` to pull the latest images\n- `make init` to install fresh dependencies\n- `make up` to run the app containers\n\nNow you can visit [`localhost:4000`](http://localhost:4000) from your browser.\n\n- `make down` - to stop the running containers\n- `make help` - for additional commands\n\n## How it works\n\n1. The user adds a new source URL -\u003e a new async job is started\n2. 
Inside the job:\n\n- Normalize the URL (validate the scheme, remove the trailing slash, etc.)\n- Store the link in the DB; if the link already exists, then exit\n- Parse HTML links and metadata\n- Store them in separate tables\n- Normalize the links and check whether each one is relative or not\n- Check whether the links are external\n- For each non-external link -\u003e schedule a new async job with a random interval\n\n3. That's literally it\n\n**To see it in action, go to** [localhost:4000/crawl](http://localhost:4000/crawl) **and type any URL.**\n\n**To see some search results, visit** [localhost:4000/search](http://localhost:4000/search).\n\n## Database schema\n\nThe default keyspace is `storage`.\n\n**Tables:**\n\n- `site_statistics` contains source URLs and counts of parsed links\n- `sites` contains URLs and parsed HTML\n- `sites_by_meta` contains URLs and parsed metadata\n\nFor `LIKE`-style search queries, a [SASI][link-sasi] index needs to be configured.\n\nSee `schema.cql` and `cassandra.yaml` for more details.\n\n## Useful links\n\n- Visit [localhost:4000/jobs](http://localhost:4000/jobs) to see crawling jobs in action\n- Visit [localhost:4000/dashboard](http://localhost:4000/dashboard) to see core metrics of the system\n\n## License\n\nMIT. Please see the [license file](LICENSE.md) for more information.\n\n[link-ci]: https://github.com/shirokovnv/webcrawler/actions/workflows/ci.yml/badge.svg\n[link-cassandra]: https://cassandra.apache.org/\n[link-sasi]: https://cassandra.apache.org/doc/4.1/cassandra/cql/SASI.html\n[link-docker]: https://www.docker.com/\n[link-make]: https://www.gnu.org/software/make/manual/make.html\n[link-redis]: https://redis.io/\n[link-phx]: https://www.phoenixframework.org/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshirokovnv%2Fwebcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshirokovnv%2Fwebcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshirokovnv%2Fwebcrawler/lists"}