{"id":16466872,"url":"https://github.com/1970mr/link-crawler","last_synced_at":"2026-02-06T18:12:29.001Z","repository":{"id":182118805,"uuid":"667982298","full_name":"1970Mr/link-crawler","owner":"1970Mr","description":"Web Link Crawler: A Python script to crawl websites and collect links based on a regex pattern. Efficient and customizable.","archived":false,"fork":false,"pushed_at":"2024-05-30T14:44:05.000Z","size":33,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-18T07:41:14.862Z","etag":null,"topics":["clawler","crawler","crawler-python","link-crawler","link-crawler-python","link-scraper","link-scraper-python","links","python","scraper","scraper-python","website-crawler","website-scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/1970Mr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-18T18:37:31.000Z","updated_at":"2024-10-29T20:30:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"4cf8c735-a5fc-4429-82d1-e9782470991e","html_url":"https://github.com/1970Mr/link-crawler","commit_stats":null,"previous_names":["github-1970/link-crawler","1970mr/link-crawler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/1970Mr/link-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1970Mr%2Flink-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1970Mr%2Flink-crawler/tags","releases_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1970Mr%2Flink-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1970Mr%2Flink-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/1970Mr","download_url":"https://codeload.github.com/1970Mr/link-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1970Mr%2Flink-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29171307,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T16:33:35.550Z","status":"ssl_error","status_checked_at":"2026-02-06T16:33:30.716Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clawler","crawler","crawler-python","link-crawler","link-crawler-python","link-scraper","link-scraper-python","links","python","scraper","scraper-python","website-crawler","website-scraper"],"created_at":"2024-10-11T11:45:04.679Z","updated_at":"2026-02-06T18:12:28.985Z","avatar_url":"https://github.com/1970Mr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Link Crawler\n\nThis script allows you to crawl a website and collect links from its webpages based on a specified regex pattern. 
It can be useful for extracting links from websites for various purposes such as data scraping or analysis.\n\n## Prerequisites\n\nBefore running the script, make sure you have the following installed:\n\n- Python 3.x\n- `requests` library\n- `bs4` (BeautifulSoup) library\n\nThe script also uses the `argparse`, `re`, `os`, `sys`, `base64`, `urllib.parse`, and `shutil` modules, which are part of the Python standard library and need no separate installation.\n\nYou can install the third-party dependencies using `pip`:\n\n```shell\npip install requests bs4\n```\n\n## Usage\n\nTo use the script, follow these steps:\n\n1. Clone or download the script file to your local machine.\n2. Open a terminal or command prompt.\n3. Navigate to the directory where the script is located.\n4. Run the following command:\n\n   ```shell\n   python link_crawler.py -u \u003curl\u003e -p \u003cpattern\u003e [-d] [-c]\n   ```\n\n   Replace `\u003curl\u003e` with the URL of the website you want to crawl, and `\u003cpattern\u003e` with the regex pattern to match the links.\n\n   Optional flags:\n   - `-d` or `--domain`: Include the website domain in internal links. By default, the script removes the domain name from internal links before searching for the pattern.\n   - `-c` or `--clear-directory`: Clear the directory if it already exists for this command. By default, if the command is entered with a pattern and domain that were already used, the search is not performed.\n\n5. The script will crawl the website, collect links from its webpages, and display the results.\n\n   - If links matching the regex pattern are found, the script will save them to a `links.txt` file in the corresponding directory.\n   - If no links are found, the script will display a message accordingly.\n\nNote: The script crawls webpages within the specified website by following links found in HTML tags such as `\u003ca\u003e`, `\u003clink\u003e`, `\u003cscript\u003e`, `\u003cbase\u003e`, `\u003cform\u003e`, and more (in all tags that contain links). 
It searches for `href`, `src`, and `data-src` attributes in these tags to extract the links.\n\nNote: The script also finds links anywhere on the webpage, even outside tag attributes.\n\n## Examples\n\nHere are a few examples of how you can use the script:\n\n- Crawl a website and collect all links from its webpages:\n\n  ```shell\n  python link_crawler.py -u https://example.com -p \".*\"\n  ```\n\n  This will crawl the `example.com` website, collect all links from its webpages, and save them to `links.txt` in the `data/\u003chost\u003e/\u003cpattern\u003e/` directory.\n\n- Crawl a website and collect only specific links matching a pattern:\n\n  ```shell\n  python link_crawler.py -u https://example.com -p \"https://example.com/downloads/.*\"\n  ```\n\n  This will crawl the `example.com` website and collect only the links that match the pattern `https://example.com/downloads/.*`.\n\n- Crawl a website and keep the domain in internal links:\n\n  ```shell\n  python link_crawler.py -u https://example.com -p \".*\" -d\n  ```\n\n  This will crawl the `example.com` website, collect all links from its webpages with the domain kept in internal links, and save them to `links.txt`.\n\n- Clear the directory and crawl the website to collect fresh links:\n\n  ```shell\n  python link_crawler.py -u https://example.com -p \".*\" -c\n  ```\n\n  This will clear the existing directory (if any) for the specified command and crawl the `example.com` website to collect fresh links.\n\n## License\n\nThis script is licensed under the [MIT License](LICENSE). Feel free to modify and use it according to your needs.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1970mr%2Flink-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F1970mr%2Flink-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1970mr%2Flink-crawler/lists"}