## Description

A super simple multi-threaded website URL crawler. Returns a Python list of all found URLs.
It can be configured to return either internal URLs, external URLs, or both.

## Dependencies

- pip install requests
- pip install beautifulsoup4

## Features

- Super simple; two lines of code to get a list of URLs on a website.
- Multi-threaded.
- Enable or disable logging.
- Can return internal, external, or both URLs.
- Can provide an optional callback method for live URL finds.
- Not much else.

## Usage

The following code sample will scan the site "strongscot.com" using 5 threads, with all logging hidden.

### Find Internal and External URLs

```python
crawler = SiteUrlCrawler("https://strongscot.com", 5, False)

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.ALL):
    print("Found: " + url)
```

Will output something similar to this:

```
Found: https://strongscot.com/
Found: https://strongscot.com/projects/
Found: https://strongscot.com/cv/
Found: https://strongscot.com/contact/
Found: https://github.com/strongscot
Found: https://strongscot.com/blog/20/03/03/simple-site-crawler.html
Found: https://strongscot.com/blog/20/02/19/birthday.html
Found: https://strongscot.com/blog/19/12/09/new-site.html
Found: https://strongscot.com/blog/19/09/09/body-goals.html
Found: https://strongscot.com/blog/19/09/09/cool-dropdown-ui.html
Found: https://strongscot.com/blog/19/09/09/flying-in-a-flight-machine.html
Found: https://github.com/strongscot/simple-python-url-crawler
```

### Find Only Internal URLs

```python
crawler = SiteUrlCrawler("https://strongscot.com")

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.INTERNAL):
    print("Found: " + url)
```

### Find Only External URLs

```python
crawler = SiteUrlCrawler("https://strongscot.com")

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.EXTERNAL):
    print("Found: " + url)
```

Will output:

```
Found: https://github.com/strongscot
Found: https://twitter.com/thestrongscot
```

## Using Callback (getting live URL finds as they happen)

If you wish to get each URL as it is found, rather than all at once in a list at the end, you can pass an optional callback argument to the ``crawl()`` method. For example:

```python
crawler = SiteUrlCrawler("https://strongscot.com")

def callback(url):
    print("Found: " + url)

# Get ALL URLs and print them as they are found
crawler.crawl(SiteUrlCrawler.Mode.ALL, callback)
```

## Bad-Tip

Want to make it a small Google Bot? Comment out lines ``134``-``136`` in ``SiteUrlCrawler.py`` and it will crawl even external links.

## Author

@strongscot

## License

MIT
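## How Internal vs. External Is Decided

Telling internal URLs from external ones comes down to resolving each link against the base URL and comparing hosts. A minimal sketch of that check using only the standard library (the function name is illustrative, not the repository's actual code):

```python
from urllib.parse import urljoin, urlparse

def classify_url(base_url, link):
    """Resolve a (possibly relative) link against base_url and report
    whether it points at the same site ("internal") or elsewhere ("external")."""
    absolute = urljoin(base_url, link)
    if urlparse(absolute).netloc == urlparse(base_url).netloc:
        return absolute, "internal"
    return absolute, "external"

# Relative links resolve against the base site, so they are internal.
print(classify_url("https://strongscot.com", "/cv/"))
# → ('https://strongscot.com/cv/', 'internal')

# Absolute links to another host are external.
print(classify_url("https://strongscot.com", "https://github.com/strongscot"))
# → ('https://github.com/strongscot', 'external')
```

A crawler in `Mode.INTERNAL` would keep only the first kind, `Mode.EXTERNAL` only the second, and `Mode.ALL` both.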
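## How the Threaded Crawl Works

The second constructor argument presumably sizes the worker pool. Below is a rough sketch of how a thread pool can drain a crawl frontier and fire a callback per live discovery. The page-fetching function is injected so the example runs offline; the real crawler fetches pages with requests and parses them with BeautifulSoup, and all names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(start_url, fetch_links, workers=5, callback=None):
    """Breadth-first crawl. fetch_links(url) must return the URLs linked
    from that page; each frontier batch is fetched by `workers` threads."""
    seen = {start_url}
    frontier = [start_url]
    found = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Fetch the whole frontier concurrently; map preserves order,
            # so deduplication below stays deterministic.
            batches = pool.map(fetch_links, frontier)
            frontier = []
            for links in batches:
                for url in links:
                    if url in seen:
                        continue
                    seen.add(url)
                    found.append(url)
                    if callback:
                        callback(url)  # live find, as in the callback example
                    frontier.append(url)
    return found

# Offline demo: a tiny fake site graph stands in for real HTTP fetches.
site = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}
print(crawl("a", lambda u: site.get(u, []), workers=2))
# → ['b', 'c', 'd']
```

With a real fetcher, `fetch_links` would download the page, extract its anchors, and filter them by the chosen mode before returning.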