{"id":21180664,"url":"https://github.com/okwilkins/web-crawler","last_synced_at":"2025-03-14T19:19:23.336Z","repository":{"id":95118985,"uuid":"125913366","full_name":"okwilkins/Web-Crawler","owner":"okwilkins","description":"This program will crawl through entire domains, exporting every link it can find into a txt file.","archived":false,"fork":false,"pushed_at":"2018-03-20T17:51:36.000Z","size":238,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-21T12:10:32.161Z","etag":null,"topics":["crawler","crawling","files","html","htmlparser","python","python3","reader","scraper","threading","threads","web","writer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/okwilkins.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-19T20:01:36.000Z","updated_at":"2024-05-18T05:48:13.000Z","dependencies_parsed_at":"2023-04-01T06:48:27.139Z","dependency_job_id":null,"html_url":"https://github.com/okwilkins/Web-Crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/okwilkins%2FWeb-Crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/okwilkins%2FWeb-Crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/okwilkins%2FWeb-Crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/okwilkins%2F
Web-Crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/okwilkins","download_url":"https://codeload.github.com/okwilkins/Web-Crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243632407,"owners_count":20322382,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling","files","html","htmlparser","python","python3","reader","scraper","threading","threads","web","writer"],"created_at":"2024-11-20T17:45:50.010Z","updated_at":"2025-03-14T19:19:23.308Z","avatar_url":"https://github.com/okwilkins.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Python Web Crawler\n## Created by Oliver Wilkins\n### 19/03/2018\n\nThis program will crawl through entire domains, exporting every link it can find into a txt file.\n\n## Installing/Running the Project\n\nYou will not need to download any libraries; plug and play by: \n* Downloading or cloning the repository\n* Running the main.py file\n* The links the program saves are found in the *queued.txt* and *crawled.txt* files in the [projects folder](https://github.com/HomelessSandwich/Web-Crawler/tree/master/projects) - the folder has example projects with *queued.txt* and *crawled.txt* \n\n## Important\n\n* This program works by reading a webpage and writing the links it finds to the *queued.txt* file. It then works through the links in *queued.txt* in turn, and once a webpage has been crawled it is moved to the *crawled.txt* file\n* You can try to trawl 
through massive domains with many links - however, this will take a *VERY* long time\n* Also note that you may need to change the NUMBER_OF_THREADS variable in the [main.py](https://github.com/HomelessSandwich/Web-Crawler/blob/master/main.py) file (line 12) - this will depend on your operating system\n```python\nNUMBER_OF_THREADS = 8\n```\n\n## Updates for the Future\n* Add a tree view for all the links found\n* Reduce the number of decoding errors\n* Fix the issue where some URLs completely shut down threads and, ultimately, the whole program. This issue is described in detail [here](https://github.com/HomelessSandwich/Web-Crawler/issues/1)\n* Create nicer console output and a GUI\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fokwilkins%2Fweb-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fokwilkins%2Fweb-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fokwilkins%2Fweb-crawler/lists"}