{"id":15442887,"url":"https://github.com/benderscript/top_100","last_synced_at":"2025-06-30T12:34:50.811Z","repository":{"id":197688318,"uuid":"699102614","full_name":"BenderScript/top_100","owner":"BenderScript","description":"Simple web scraper that fetches data from a Wikipedia's Top 100 Sites","archived":false,"fork":false,"pushed_at":"2023-12-19T02:18:25.000Z","size":10,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-18T13:16:42.004Z","etag":null,"topics":["automation","filtering","firewall","python3","scraper","top","top100","visited","websites","wikipedia"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BenderScript.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-01T23:13:11.000Z","updated_at":"2023-10-02T01:34:29.000Z","dependencies_parsed_at":"2023-10-02T01:49:57.296Z","dependency_job_id":"92decd01-93fc-44c2-bcf1-dfbd40b10538","html_url":"https://github.com/BenderScript/top_100","commit_stats":{"total_commits":5,"total_committers":2,"mean_commits":2.5,"dds":"0.19999999999999996","last_synced_commit":"472684ebf6d7afe4f438d05a6c460ee1f9560410"},"previous_names":["repenno/top_100","benderscript/top_100"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BenderScript/top_100","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BenderScript%2Ftop_100","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
BenderScript%2Ftop_100/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BenderScript%2Ftop_100/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BenderScript%2Ftop_100/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BenderScript","download_url":"https://codeload.github.com/BenderScript/top_100/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BenderScript%2Ftop_100/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262774124,"owners_count":23362249,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","filtering","firewall","python3","scraper","top","top100","visited","websites","wikipedia"],"created_at":"2024-10-01T19:31:13.080Z","updated_at":"2025-06-30T12:34:50.788Z","avatar_url":"https://github.com/BenderScript.png","language":"Python","readme":"# Top 100 Visited Sites Web Scraper\n\n## Description\nThis Python script is a simple web scraper that fetches an HTML table from Wikipedia's list of the Top 100 Sites (https://en.wikipedia.org/wiki/List_of_most-visited_websites) and converts it into a list of dictionaries. It then accesses each domain in the list and prints its response code.\n\nI created this script to automate testing of firewall, content filtering, and NAT-related services. It can be used to test whether a domain is accessible. 
\nIt can also be used to test whether a domain is being redirected to another domain.\n\nUsing this script helps me debug whether a rule (code) is working as expected. For example, if I have a rule that blocks access to a domain, I can use this script to confirm that the domain is actually blocked. At the same time, each run gives me useful Syslog messages.\n\n## Dependencies\n- Python 3.6 or higher\n- `requests` library\n- `BeautifulSoup` from the `bs4` library\n- `urlparse` from `urllib.parse` (part of the Python standard library)\n\n## Functions\n- `is_valid_url(url: str) -\u003e bool`: Checks if the URL is valid.\n- `html_table_to_list(url: str, num_columns: int) -\u003e list`: Converts an HTML table into a list of dictionaries.\n- `access_domains(top_100: list) -\u003e None`: Accesses each domain in a list and prints its response code.\n\n## Usage\n1. Install the required dependencies.\n2. Run the script with Python 3.6 or higher.\n3. The script will fetch data from the specified URL, convert the HTML table into a list of dictionaries, and print the response code for each domain.\n\nThe `html_table_to_list` function takes two parameters: the URL of the webpage containing the table and the number of columns in the table. The number of columns defaults to 5.\n\nThe `access_domains` function takes a list of dictionaries representing a table of domains and prints the response code for each domain.\n\n## Example\n```python\n# Use the function\ntop_100_reference = \"https://en.wikipedia.org/wiki/List_of_most-visited_websites\"  # Replace with your URL\ntop_100_list = html_table_to_list(top_100_reference)\naccess_domains(top_100_list)\n```\nIn this example, the script fetches data from a Wikipedia page that lists the most visited websites, converts the HTML table into a list of dictionaries, and prints the response code for each domain.\n\n## Disclaimer\nPlease use this script responsibly and ensure that you are allowed to scrape the websites you target. 
Some websites may prohibit scraping in their terms of service. Always respect others' intellectual property rights.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenderscript%2Ftop_100","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbenderscript%2Ftop_100","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenderscript%2Ftop_100/lists"}