{"id":25132005,"url":"https://github.com/taleblou/brokenlinkchecker_python","last_synced_at":"2025-04-03T00:11:50.733Z","repository":{"id":272566081,"uuid":"917033206","full_name":"taleblou/BrokenLinkChecker_Python","owner":"taleblou","description":"This Python web crawler traverses a website, verifies resource links (CSS, JS, images, videos, iframes), and identifies broken links with HTTP errors (400-599)","archived":false,"fork":false,"pushed_at":"2025-01-15T09:04:37.000Z","size":13,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-08T14:16:12.592Z","etag":null,"topics":["crawler","http","links","python","resources","website"],"latest_commit_sha":null,"homepage":"https://taleblou.ir/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/taleblou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-15T08:20:24.000Z","updated_at":"2025-01-15T09:04:39.000Z","dependencies_parsed_at":"2025-01-15T10:24:59.249Z","dependency_job_id":"922902c9-e05f-4065-9f33-58dd2e9ecf50","html_url":"https://github.com/taleblou/BrokenLinkChecker_Python","commit_stats":null,"previous_names":["taleblou/brokenlinkchecker_python"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taleblou%2FBrokenLinkChecker_Python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taleblou%2FBrokenLinkChecker_Python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/taleblou%2FBrokenLinkChecker_Python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taleblou%2FBrokenLinkChecker_Python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/taleblou","download_url":"https://codeload.github.com/taleblou/BrokenLinkChecker_Python/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246911470,"owners_count":20853657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","http","links","python","resources","website"],"created_at":"2025-02-08T14:16:15.417Z","updated_at":"2025-04-03T00:11:50.714Z","avatar_url":"https://github.com/taleblou.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Web Crawler Script**\n\n## **Overview**\n\nThis script is a Python-based web crawler that traverses a website and identifies broken resource links such as CSS, JavaScript, images, videos, and iframes. It only follows pages on the same domain as the starting URL and saves a detailed report of broken resources to a CSV file.\n\n---\n\n## **Features**\n\n* **Domain Filtering:** Ensures the crawler stays within the target domain.  \n* **Resource Checking:** Verifies the availability of various resource links (e.g., CSS, JS, images, videos, iframes).  \n* **Error Reporting:** Logs details of broken resources (HTTP status codes 400–599).  \n* **Queue-Based Crawling:** Uses a queue to manage page visits.  \n* **Progress Tracking:** Displays a progress bar using `tqdm`.  
\n* **CSV Export:** Saves error details in a CSV file for easy review.\n\n---\n\n## **Requirements**\n\nTo run this script, you need the following Python libraries installed:\n\n* `requests`: For making HTTP requests.  \n* `BeautifulSoup` (from `bs4`): For parsing HTML content.  \n* `tldextract`: For extracting domain and suffix information.  \n* `tqdm`: For displaying a progress bar.  \n* `pandas`: For saving error reports to a CSV file.\n\nYou can install the required packages with:\n\n`pip install requests beautifulsoup4 tldextract tqdm pandas`\n\n---\n\n## **How to Use**\n\n**Set the Starting URL:** Replace `https://www.example.com` with the URL of the website you want to crawl:  \n`start_url = 'https://www.example.com'`\n\n**Configure Maximum Pages:** Update the `max_pages` variable to limit the number of pages to crawl (default is 10,000):  \n`max_pages = 10000`\n\n**Run the Script:** Execute the script in your Python environment:  \n`python main.py`\n\n**View Results:**  \n   * If broken resource links are found, they will be saved to a file named `error_details.csv` in the script's directory.  \n   * If no errors are detected, a message will indicate no error pages were saved.\n\n---\n\n## **Output**\n\nThe output CSV file (`error_details.csv`) contains the following columns:\n\n* **Page\\_URL:** The page where the broken resource was found.  \n* **Resource\\_URL:** The URL of the broken resource.  \n* **Error\\_Code:** The HTTP status code indicating the error.\n\n---\n\n## **Notes**\n\n* **Politeness:** Consider adding a delay (`time.sleep(1)`) between requests to avoid overloading the target server.  \n* **Error Handling:** The script handles HTTP errors gracefully but logs other exceptions to the console.  
\n* **Scalability:** This script is single-threaded and may need optimization for crawling large websites.\n\n---\n\n## **Example Output**\n\nSample `error_details.csv`:\n\n| Page\\_URL | Resource\\_URL | Error\\_Code |\n| ----- | ----- | ----- |\n| [https://example.com](https://example.com) | https://example.com/style.css | 404 |\n| [https://example.com](https://example.com) | https://example.com/script.js | 403 |\n\n---\n\n## **License**\n\nThis script is open-source and available for personal and educational use. Feel free to modify it to suit your needs.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaleblou%2Fbrokenlinkchecker_python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftaleblou%2Fbrokenlinkchecker_python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaleblou%2Fbrokenlinkchecker_python/lists"}