{"id":18993518,"url":"https://github.com/infinitode/pywebscrapr","last_synced_at":"2026-02-14T02:02:14.717Z","repository":{"id":220512112,"uuid":"751763250","full_name":"Infinitode/PyWebScrapr","owner":"Infinitode","description":"An open-source Python web scraping tool. Supports both image scraping and text scraping.","archived":false,"fork":false,"pushed_at":"2025-02-24T08:20:37.000Z","size":31,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-17T01:35:50.322Z","etag":null,"topics":["data","data-collection","data-science","open-source","pip","scraping","web-scraper"],"latest_commit_sha":null,"homepage":"https://infinitode.netlify.app","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Infinitode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-02-02T09:21:52.000Z","updated_at":"2025-02-24T08:16:48.000Z","dependencies_parsed_at":"2025-04-16T19:04:33.349Z","dependency_job_id":"05e8d009-ee1f-4a16-8691-5dd837323ef3","html_url":"https://github.com/Infinitode/PyWebScrapr","commit_stats":null,"previous_names":["infinitode/pywebscrapr"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Infinitode%2FPyWebScrapr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Infinitode%2FPyWebScrapr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Infinitode%2FPyWebScrapr/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Infinitode%2FPyWebScrapr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Infinitode","download_url":"https://codeload.github.com/Infinitode/PyWebScrapr/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250242912,"owners_count":21398228,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-collection","data-science","open-source","pip","scraping","web-scraper"],"created_at":"2024-11-08T17:21:43.852Z","updated_at":"2026-02-14T02:02:14.711Z","avatar_url":"https://github.com/Infinitode.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PyWebScrapr\n![Python Version](https://img.shields.io/badge/python-3.13-blue.svg)\n[![Code Size](https://img.shields.io/github/languages/code-size/infinitode/pywebscrapr)](https://github.com/infinitode/pywebscrapr)\n![Downloads](https://pepy.tech/badge/pywebscrapr)\n![License Compliance](https://img.shields.io/badge/license-compliance-brightgreen.svg)\n![PyPI Version](https://img.shields.io/pypi/v/pywebscrapr)\n\nAn open-source Python library for web scraping tasks. 
Includes support for both text and image scraping.\n\n## Changes in 0.1.6:\n- Added progress indicators to both `scrape_images` and `scrape_text` to provide real-time feedback on scraping progress.\n- Implemented multithreading to improve performance by scraping multiple pages concurrently.\n- Added a `rate_limit` parameter to both scraping functions to control the request frequency and prevent server overload.\n- Refactored the concurrency model to ensure that child links are also scraped concurrently.\n\n## Changes in 0.1.5:\n- Added new parameters to both `scrape_images` and `scrape_text` for following child links and for setting the maximum number of child links to follow.\n- Added a `json` export format for text scraping, with improvements to exporting.\n\n\u003e [!TIP]\n\u003e We recommend disabling `remove_duplicates` on large sites to speed up text scraping (this can improve speed by 4x). `remove_duplicates` also may not work well with `follow_child_links` enabled, as it may remove similar content from scraped child links.\n\n## Changes in 0.1.4:\n- Added new parameters to `scrape_text`: one to automatically remove duplicate or similar text, and another to specify the type of textual content to scrape (`text`, `content`, `unseen`, `links`).\n\n## Changes in 0.1.3:\n- Added support for handling different types of images on websites. 
Invalid images are now also detected, with added error handling.\n\n## Changes in 0.1.2:\n- `min` and `max` width and height parameters can now be specified when scraping images, allowing you to quickly exclude low-resolution images or images that are very large and take up too much space.\n- PyWebScrapr now uses BeautifulSoup4's `SoupStrainer`, which makes extracting content from webpages much faster.\n\n## Installation\n\nYou can install PyWebScrapr using pip:\n\n```bash\npip install pywebscrapr\n```\n\n## Supported Python Versions\n\nPyWebScrapr supports the following Python versions:\n\n- Python 3.6\n- Python 3.7\n- Python 3.8\n- Python 3.9\n- Python 3.10\n- Python 3.11\n- Python 3.12 or later (preferred)\n\nPlease ensure that you have one of these Python versions installed before using PyWebScrapr. PyWebScrapr may not work as expected on Python versions older than those listed.\n\n## Features\n\n- **Text Scraping**: Extract textual content from specified URLs.\n- **Image Scraping**: Download images from specified URLs.\n\n\u003csub\u003e*For a full list, check out the [PyWebScrapr official documentation](https://infinitode-docs.gitbook.io/documentation/package-documentation/pywebscrapr-package-documentation).\u003c/sub\u003e\n\n## Usage\n\n### Text Scraping\n\n```python\nfrom pywebscrapr import scrape_text\n\n# Specify links in a file or list\nlinks_file = 'links.txt'\nlinks_array = ['https://example.com/page1', 'https://example.com/page2']\n\n# Scrape text and save to the 'output.txt' file\nscrape_text(links_file=links_file, links_array=links_array, output_file='output.txt')\n```\n\n### Image Scraping\n\n```python\nfrom pywebscrapr import scrape_images\n\n# Specify links in a file or list\nlinks_file = 'image_links.txt'\nlinks_array = ['https://example.com/image1.jpg', 'https://example.com/image2.png']\n\n# Scrape images and save to the 'images' folder\nscrape_images(links_file=links_file, links_array=links_array, 
save_folder='images')\n```\n\n## Contributing\n\nContributions are welcome! If you encounter any issues, have suggestions, or want to contribute to PyWebScrapr, please open an issue or submit a pull request on [GitHub](https://github.com/infinitode/pywebscrapr).\n\n## License\n\nPyWebScrapr is released under the terms of the **MIT License (Modified)**. Please see the [LICENSE](https://github.com/infinitode/pywebscrapr/blob/main/LICENSE) file for the full text.\n\n**Modified License Clause**\n\nThe modified license clause grants users permission to create derivative works based on the PyWebScrapr software. However, it requires that any substantial changes to the software be clearly distinguished from the original work and distributed under a different name.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finfinitode%2Fpywebscrapr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finfinitode%2Fpywebscrapr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finfinitode%2Fpywebscrapr/lists"}