{"id":15008794,"url":"https://github.com/rohan-bhautoo/python-web-scraper","last_synced_at":"2025-04-09T16:22:29.524Z","repository":{"id":136006838,"uuid":"612307362","full_name":"rohan-bhautoo/Python-Web-Scraper","owner":"rohan-bhautoo","description":"A python web scaper to extract content and data from a website.","archived":false,"fork":false,"pushed_at":"2023-12-25T13:50:12.000Z","size":48,"stargazers_count":3,"open_issues_count":4,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-23T18:38:41.402Z","etag":null,"topics":["beautifulsoup","python","python2","scraping","webscraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rohan-bhautoo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-10T16:42:42.000Z","updated_at":"2025-01-01T07:41:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"61b7c2cb-6fb1-4d19-9b9d-2638ab7831ca","html_url":"https://github.com/rohan-bhautoo/Python-Web-Scraper","commit_stats":{"total_commits":27,"total_committers":2,"mean_commits":13.5,"dds":0.2962962962962963,"last_synced_commit":"345423f2157d688429b09c0c263c9b8531f21ef5"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rohan-bhautoo%2FPython-Web-Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rohan-bhautoo%2FPython-Web-Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/rohan-bhautoo%2FPython-Web-Scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rohan-bhautoo%2FPython-Web-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rohan-bhautoo","download_url":"https://codeload.github.com/rohan-bhautoo/Python-Web-Scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248065576,"owners_count":21041921,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","python","python2","scraping","webscraper"],"created_at":"2024-09-24T19:20:34.701Z","updated_at":"2025-04-09T16:22:29.499Z","avatar_url":"https://github.com/rohan-bhautoo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg width=\"500px\" src=\"https://i.ibb.co/t2QNh4R/Web-Scraping.png\" alt=\"Web-Scraping\" border=\"0\"\u003e\n\u003c/p\u003e\n\u003cp\u003e\n  \u003cimg alt=\"Version\" src=\"https://img.shields.io/badge/version-2.7.8-brightgreen.svg\" /\u003e\n  \u003cimg alt=\"Python\" src=\"https://img.shields.io/badge/Python-3776AB?logo=python\u0026logoColor=white\" /\u003e\n\u003c/p\u003e\n\nPython Web Scraper is a simple web scraping tool built with Python. It allows you to scrape data from web pages, extract information from HTML elements, save data in a text file, download all images, and store table data in a CSV file. 
The tool provides a user-friendly interface using the Tkinter library.\n\n## Prerequisites\n\n### Python 2.x\n```bash\npython --version\n```\n\n#### Library\n\n##### Requests\nRequests allows you to send HTTP/1.1 requests extremely easily.\n```bash\npip install requests\n```\n\n##### BeautifulSoup\nBeautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.\n```shell\npip install beautifulsoup4\n```\n\n## Installation\n\n### Clone the repository\n```bash\ngit clone https://github.com/rohan-bhautoo/Python-Web-Scraper.git\n```\n\n## Usage\nTo run the Python Web Scraper, execute the following command:\n```bash\npython main.py\n```\n\nThe application will open a GUI window where you can enter the URL of the web page you want to scrape. You can select various options such as extracting links, headings, images, paragraphs, metadata, CSS files, and scripts. You can also choose to download images and store the data in a CSV file.\n\n## Code Examples\n\n### Scrape Data from Web Page\n```python\nimport requests\nfrom bs4 import BeautifulSoup\n\n# url holds the address of the page to scrape (defined elsewhere)\n# Make request to website\nresponse = requests.get(url)\nhtml_content = response.content\n\n# Parse HTML with BeautifulSoup\nsoup = BeautifulSoup(html_content, 'html.parser')\n\n# Find elements and extract data\n# ...\n\n# Store data in text file\n# ...\n```\n\n### Download Images from URL\n```python\nimport os\n\nimport requests\n\n# image, directory, and count are defined by the surrounding code\nurl = image.get('src')\n\n# send a GET request to the URL to download the image\nresponse = requests.get(url)\n\n# construct the file name to save the image as\nfilename = os.path.join(directory, 'image{}'.format(count))\n\n# use os.path.splitext to split the URL into base name and extension\n_, extension = os.path.splitext(url)\n\nprint(filename)\n\n# save the image to the chosen file path\nwith open(f'{filename}{extension}', 'wb') as f:\n    f.write(response.content)\n    count += 
1\n```\n\n### Extract Table Data from Web Page\n```python\nimport requests\nfrom bs4 import BeautifulSoup\n\n# get URL from entry field\nurl = self.url_entry.get()\n\n# make request to website\nresponse = requests.get(url)\nhtml_content = response.content\n\n# parse HTML with BeautifulSoup\nsoup = BeautifulSoup(html_content, 'html.parser')\n\n# find table element\ntable = soup.find('table')\n\n# create table header\ntable_header = []\nfor th in table.find_all('th'):\n    table_header.append(th.text.strip())\n\n# create table rows, skipping rows without td cells (such as the header row)\ntable_rows = []\nfor tr in table.find_all('tr'):\n    table_row = []\n    for td in tr.find_all('td'):\n        table_row.append(td.text.strip())\n    if table_row:\n        table_rows.append(table_row)\n```\n\n### Save Table Data in CSV File\n```python\nimport csv\nfrom datetime import datetime\n\n# build a timestamp for the file name\nnow = datetime.utcnow()\ntimestamp = now.strftime(\"%Y%m%d%H%M\")\n\n# newline=\"\" lets the csv module control line endings\nwith open(f\"csv/csv_{timestamp}.csv\", \"w\", newline=\"\") as f:\n    csvwriter = csv.writer(f, delimiter=\",\")\n\n    if includeHeader == 1:\n        print(\"save header:\", table_header)\n        csvwriter.writerow(table_header)\n\n    for row_id in self.treeview.get_children():\n        row = self.treeview.item(row_id)[\"values\"]\n        if row != \"\":\n            print(\"save row:\", row)\n            csvwriter.writerow(row)\n```\n\n## Limitations\n- Pages that render their content with JavaScript may not scrape correctly, because the tool only parses the HTML returned by the server.\n- Some websites may have terms of service or a robots.txt file that prohibit scraping. 
Make sure to comply with any legal and ethical requirements.\n\n## Author\n\n👤 **Rohan Bhautoo**\n\n* GitHub: [@rohan-bhautoo](https://github.com/rohan-bhautoo)\n* LinkedIn: [@rohan-bhautoo](https://linkedin.com/in/rohan-bhautoo)\n\n## Show your support\n\nGive a ⭐️ if this project helped you!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frohan-bhautoo%2Fpython-web-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frohan-bhautoo%2Fpython-web-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frohan-bhautoo%2Fpython-web-scraper/lists"}