{"id":21880456,"url":"https://github.com/definetlynotai/web_scraper","last_synced_at":"2025-10-08T19:31:23.934Z","repository":{"id":247601894,"uuid":"826320463","full_name":"DefinetlyNotAI/Web_Scraper","owner":"DefinetlyNotAI","description":"Super basic web scraper cli","archived":false,"fork":false,"pushed_at":"2024-08-26T10:22:00.000Z","size":22,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-28T09:13:43.356Z","etag":null,"topics":["html-download","python","scraper","side-project","simple","web","web-download","web-scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DefinetlyNotAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-09T13:31:42.000Z","updated_at":"2024-09-30T21:49:55.000Z","dependencies_parsed_at":"2024-07-09T17:23:13.298Z","dependency_job_id":"0fcf05ad-44ed-425f-a677-274b3689688d","html_url":"https://github.com/DefinetlyNotAI/Web_Scraper","commit_stats":null,"previous_names":["definetlynotai/web_scraper"],"tags_count":1,"template":false,"template_full_name":"DefinetlyNotAI/Repo_Template","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FWeb_Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FWeb_Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FWeb_Scraper/releases","manifests_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FWeb_Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DefinetlyNotAI","download_url":"https://codeload.github.com/DefinetlyNotAI/Web_Scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235749079,"owners_count":19039343,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html-download","python","scraper","side-project","simple","web","web-download","web-scraper"],"created_at":"2024-11-28T09:13:49.925Z","updated_at":"2025-10-08T19:31:18.617Z","avatar_url":"https://github.com/DefinetlyNotAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web_Scraper 📎\n\nWelcome to Web_Scraper 🌐,\na simple command-line tool\ndesigned to scrape webpages in a very neat fashion.\nIt is written in Python.\n\nThis guide is here to equip you with everything you need\nto use Web_Scraper effectively.\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca href=\"https://github.com/DefinetlyNotAI/Web_Scraper/issues\"\u003e\u003cimg src=\"https://img.shields.io/github/issues/DefinetlyNotAI/Web_Scraper\" alt=\"GitHub Issues\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/DefinetlyNotAI/Web_Scraper/tags\"\u003e\u003cimg src=\"https://img.shields.io/github/v/tag/DefinetlyNotAI/Web_Scraper\" alt=\"GitHub Tag\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/DefinetlyNotAI/Web_Scraper/graphs/commit-activity\"\u003e\u003cimg 
src=\"https://img.shields.io/github/commit-activity/t/DefinetlyNotAI/Web_Scraper\" alt=\"GitHub Commit Activity\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/DefinetlyNotAI/Web_Scraper/languages\"\u003e\u003cimg src=\"https://img.shields.io/github/languages/count/DefinetlyNotAI/Web_Scraper\" alt=\"GitHub Language Count\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/DefinetlyNotAI/Web_Scraper/actions\"\u003e\u003cimg src=\"https://img.shields.io/github/check-runs/DefinetlyNotAI/Web_Scraper/main\" alt=\"GitHub Branch Check Runs\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/DefinetlyNotAI/Web_Scraper\"\u003e\u003cimg src=\"https://img.shields.io/github/repo-size/DefinetlyNotAI/Web_Scraper\" alt=\"GitHub Repo Size\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n## Table of Contents\n\n- [Installation](#-installation-and-setup-)\n- [Usage](#basic-usage)\n- [Functions Overview](#functions-overview)\n- [Dependencies](#dependencies)\n- [Contributing](#contributing)\n- [License](#license)\n\n## 🛠️ Installation and Setup 🛠️\n\n### Prerequisites\n\nEnsure your system meets these requirements:\n\n- Python 3.8 or higher is installed.\n- All required dependencies are installed (see step 3 below).\n\n### Step-by-Step Installation\n\n1. **Clone the Repository**: Use Git to clone Web_Scraper to your local machine. Open Command Prompt as an administrator and run:\n\n   ```powershell\n   git clone https://github.com/DefinetlyNotAI/Web_Scraper.git\n   ```\n\n2. **Navigate to the Project Directory**: Change your current directory to the cloned Web_Scraper folder:\n\n   ```powershell\n   cd Web_Scraper\n   ```\n\n3. **Install Dependencies**: Run `pip install -r requirements.txt`.\n\n4. **Run the Web Scraper**: Run `scrape.py` with Python, as described under Basic Usage below.\n\n### Basic Usage\n\nThe utility is executed from the command line. 
Here's a basic example of how to use it:\n\n```bash\npython scrape.py --url \"https://example.com\" --name \"ExampleSite\" --zip --full -y\n```\n\nYou may use `secrets.scrape.py` to try beta functionality.\n\n### Options\n\n- `--url`: Required. The URL of the website you wish to scrape.\n- `--name`: Optional. A custom name for the scraped website. If not provided, the domain name will be used.\n- `--zip`: Optional. If set, the utility will compress the downloaded files into a ZIP archive.\n- `--full`: Optional. If set, the utility will download the full HTML content along with associated resources. Otherwise, it downloads only the basic HTML content.\n- `-y`: Optional. Automatically proceeds with the download without asking for confirmation.\n\n## Functions Overview\n\n### `download_basic_html(url)`\n\nDownloads the basic HTML content from a given URL and saves it to a file.\n\n### `download_with_resources(url)`\n\nDownloads the HTML content and associated resources from a given URL, saves them to a file, and returns the filename.\n\n### `download_images(base_url, url)`\n\nDownloads images from a given URL after resolving their sources to absolute image URLs.\n\n### `zip_files(zip_filename, files, delete_after=False)`\n\nZips the files given in the `files` list into a ZIP file named `zip_filename`. Optionally deletes the files after zipping.\n\n### `parse()`\n\nMain function that serves as the entry point for the web scraping application. Parses command-line arguments to scrape a given URL, download content based on the arguments provided, and optionally zip the downloaded files.\n\n## Dependencies\n\n- `argparse`: For parsing command-line options and arguments.\n- `os`, `shutil`: For file and directory operations.\n- `requests`: For making HTTP requests.\n- `BeautifulSoup`: For parsing HTML content.\n- `zipfile`: For creating ZIP archives.\n- `tqdm`: For displaying progress bars during downloads.\n\n## Contributing\n\nContributions are welcome! 
Please feel free to submit a pull request or open an issue on GitHub.\n\n- [Source Code](https://github.com/DefinetlyNotAI/Web_Scraper)\n\nRead the [CONTRIBUTING](CONTRIBUTING.md) file for more information.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefinetlynotai%2Fweb_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdefinetlynotai%2Fweb_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefinetlynotai%2Fweb_scraper/lists"}