{"id":21001606,"url":"https://github.com/moj124/web_crawler","last_synced_at":"2025-03-13T13:44:26.448Z","repository":{"id":115926176,"uuid":"417099685","full_name":"moj124/web_crawler","owner":"moj124","description":"The web_crawler is a asynchoronous gevent link crawler that maps all the associated local links constrained by the input webpage url.","archived":false,"fork":false,"pushed_at":"2021-10-22T10:01:36.000Z","size":776,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-20T09:46:51.997Z","etag":null,"topics":["crawler","crawler-python","links-spider"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moj124.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-14T11:15:41.000Z","updated_at":"2021-11-03T18:25:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"11578214-b0ee-4893-b40c-f7c8405f078b","html_url":"https://github.com/moj124/web_crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moj124%2Fweb_crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moj124%2Fweb_crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moj124%2Fweb_crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moj124%2Fweb_crawler/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/moj124","download_url":"https://codeload.github.com/moj124/web_crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243419051,"owners_count":20287803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawler-python","links-spider"],"created_at":"2024-11-19T08:15:53.052Z","updated_at":"2025-03-13T13:44:26.435Z","avatar_url":"https://github.com/moj124.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# web_crawler\nThe web_crawler is an asynchronous gevent link crawler that maps all the associated local links constrained by the input webpage URL.\n\nPLEASE MAKE SURE YOU RUN THE FOLLOWING COMMAND FOR CORRECT EXECUTION AND SAVING OF THE LOCAL LINK RELATIONS JSON FILE TO /DATA FOLDER:\n```cmd\npython3 crawl_website.py -l \u003curl\u003e -s True \n```\n\n- [Requirements](#requirements)\n- [Setup](#setup)\n    -   [Windows](#windows)\n    -   [Linux](#linux)\n- [Run Script](#run-script-(linux))\n    -   [Run Default with 'bbc.co.uk'](#run-default-settings-with-'bbc.co.uk')\n    -   [Run with custom options](#run-with-custom-options)\n- [Testing](#testing-1)\n- [Notes](#notes)\n    -   [Team Work \u0026 Planning](#team-work--planning)\n    -   [Deployment Testing](#deployment-testing)\n- [Issues](#issues)\n## Requirements\n- **Dependencies** (included in requirements.txt)\n    - bs4\n    - requests\n    - gevent\n  \n- **Python Version Tested**\n  - 3.7.10\n\n## Setup\n\n### Windows\n```cmd\npython -m venv 
venv\nvenv\\Scripts\\activate\npip install -r requirements.txt\n```\n\n### Linux\n```cmd\npython3 -m venv venv\nsource venv/bin/activate\npip3 install -r requirements.txt\n```\n## Run Script (Linux)\n\n### Run default settings with 'bbc.co.uk'\n```cmd\npython3 crawl_website.py\n```\n### Run with custom options\n```cmd\npython3 crawl_website.py -l https://webscrapethissite.org -n 10\n```\n## Testing\n```cmd\npytest test/\n```\nOr, for a detailed view:\n```cmd\npytest -v test/\n```\n# Notes\n## Team Work \u0026 Planning\nProject's [Kanban Board](https://solstice-ceres-14f.notion.site/Web-Crawler-20699892940c46fa990d76079a0dd897)\n-   Create a Kanban Board to structure project management, breaking tasks into bite-sized, actionable tickets.\n-   To work with others in a team, I would have held a meeting to discuss the tasks required to complete the project.\n-   Assign each person tickets that can be worked on simultaneously without conflict, and set deadlines.\n-   Create a system of accountability to review each other's code via Kanban Board columns and fix any blocked tasks.\n-   Schedule team meetings at important milestones, aligned with the deadlines set.\n-   Create branches in version control so that multiple methods of implementing or fixing a feature can be tried.\n-   Peer review branches to decide what code goes into the main branch and into deployment.\n\n## Deployment Testing\n- Create tests for development usage to ensure correct functionality.\n- Create secret tests that haven't been used in development for a final test of the deployed code, ideally written by someone on the team who hasn't coded the functionality.\n\n## Issues\n- The web crawler is unable to handle erroneous URL links that contain no body.\n- HTTP GET requests can fail because of unauthorised permissions, partly due to request headers.\n- Asynchronous gevent threads are causing the queue within the Crawler to be empty while 
spawning.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoj124%2Fweb_crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoj124%2Fweb_crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoj124%2Fweb_crawler/lists"}