{"id":27775823,"url":"https://github.com/aboodseada1/ultimate-scraper","last_synced_at":"2026-04-26T22:31:28.098Z","repository":{"id":289835069,"uuid":"972554607","full_name":"Aboodseada1/Ultimate-Scraper","owner":"Aboodseada1","description":"A standalone Python script designed to scrape web pages using a multi-layered approach with fallbacks. It attempts faster methods first and progressively uses browser automation if needed, increasing the likelihood of successfully retrieving content from various websites, including those with JavaScript rendering or basic anti-bot measures.","archived":false,"fork":false,"pushed_at":"2025-04-25T09:13:57.000Z","size":25,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-30T04:57:18.142Z","etag":null,"topics":["automation","bots","python","scraping","selenium","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Aboodseada1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-25T09:12:24.000Z","updated_at":"2025-04-25T09:14:11.000Z","dependencies_parsed_at":"2025-04-25T10:28:53.059Z","dependency_job_id":"77ab873a-37fc-4c7c-a2c0-f3b0e61e0154","html_url":"https://github.com/Aboodseada1/Ultimate-Scraper","commit_stats":null,"previous_names":["aboodseada1/ultimate-scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aboodseada1%2FUltimate-Scraper
","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aboodseada1%2FUltimate-Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aboodseada1%2FUltimate-Scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aboodseada1%2FUltimate-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Aboodseada1","download_url":"https://codeload.github.com/Aboodseada1/Ultimate-Scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251644826,"owners_count":21620630,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","bots","python","scraping","selenium","webscraping"],"created_at":"2025-04-30T04:57:16.506Z","updated_at":"2026-04-26T22:31:28.067Z","avatar_url":"https://github.com/Aboodseada1.png","language":"Python","funding_links":["http://paypal.me/aboodseada1999"],"categories":[],"sub_categories":[],"readme":"# Ultimate Web Scraper\r\n\r\nA standalone Python script designed to scrape web pages using a multi-layered approach with fallbacks. 
It attempts faster methods first and progressively uses browser automation if needed, increasing the likelihood of successfully retrieving content from various websites, including those with JavaScript rendering or basic anti-bot measures.\r\n\r\n## Features\r\n\r\n* **Multiple Scraping Methods:** Uses `httpx`, `cloudscraper`, `playwright` (Chromium), `selenium` (Firefox), and `selenium` (Chrome) in a specific fallback order.\r\n* **Resilient Scraping:** Automatically tries the next method if a previous one fails.\r\n* **JavaScript Rendering:** Handles dynamic content loaded by JavaScript via browser automation (Playwright/Selenium).\r\n* **Anti-Bot Evasion:** Incorporates `cloudscraper` and options in browser automation to bypass some basic protections.\r\n* **Optional Browser Profiles:** Supports using pre-configured profiles for Playwright, Firefox, and Chrome (useful for logged-in sessions or specific configurations). *See Profile Setup section.*\r\n* **Flexible Output:** Saves scraped content to console or a file.\r\n* **Output Formats:** Outputs either raw HTML or cleaned/beautified text content (suitable for LLM processing) in `txt` or `json` format.\r\n* **JSON Output Details:** Includes the scraped content, the successful method used, the original URL, and beautification status.\r\n* **Configurable Logging:** Adjustable log levels for detailed debugging.\r\n* **Standalone CLI Tool:** Designed for easy command-line execution.\r\n\r\n## Scraping Method Order\r\n\r\n1. `httpx` / `cloudscraper` (Fastest, direct HTTP requests)\r\n2. `playwright` (Headless Chromium automation)\r\n3. `selenium` with Firefox (Headless automation)\r\n4. 
`selenium` with Chrome (Headless automation)\r\n\r\n## Prerequisites\r\n\r\n* **Python:** Python 3.8+ recommended.\r\n* **Pip:** Python package installer.\r\n* **Core Libraries:** `requests`, `httpx`, `cloudscraper`, `beautifulsoup4`, `lxml` (Install via `requirements.txt`).\r\n* **Optional Browser Automation Libraries:** (Install based on methods you want enabled)\r\n  * **For Playwright:** `playwright` library (`pip install playwright`) AND browser binaries (`playwright install chromium`)\r\n  * **For Selenium (Firefox/Chrome):** `selenium` library (`pip install selenium`)\r\n  * **For Automatic Driver Management (Recommended for Selenium):** `webdriver-manager` (`pip install webdriver-manager`). If not installed, `geckodriver` (for Firefox) and `chromedriver` (for Chrome) must be manually installed and available in your system's PATH.\r\n* **Optional Browser Profiles:** Pre-configured browser profile folders if you intend to use the profile arguments (see Profile Setup).\r\n\r\n## Installation\r\n\r\n1. **Clone the repository:** It contains `ultimate_scraper.py` and the optional profile creation scripts.\r\n   ```bash\r\n   git clone https://github.com/Aboodseada1/Ultimate-Scraper\r\n   cd Ultimate-Scraper\r\n   ```\r\n   Alternatively, download only the `ultimate_scraper.py` and `requirements.txt` files.\r\n2. **(Recommended)** Create and activate a Python virtual environment:\r\n   ```bash\r\n   python -m venv venv\r\n   source venv/bin/activate # On Windows use `venv\\Scripts\\activate`\r\n   ```\r\n3. **Install dependencies from `requirements.txt`:** This file includes core and optional libraries.\r\n   ```bash\r\n   pip install -r requirements.txt\r\n   ```\r\n4. **(Optional, but required for the Playwright method)** Install the Playwright browser binaries:\r\n   
```bash\r\n   playwright install chromium # Or playwright install to install all\r\n   ```\r\n\r\n## Profile Setup (Optional)\r\n\r\nThis scraper supports using existing browser profiles, which is useful for sites requiring logins or having specific cookie/storage states. You can create these profiles manually or use the provided helper scripts (`create_chrome_profile.py`, `create_firefox_profile.py`, `create_playwright_profile.py`).\r\n\r\n* Run the desired `create_*.py` script (e.g., `python create_chrome_profile.py`).\r\n* A browser window will open.\r\n* **Log in to any websites** you need sessions for (e.g., Apollo, LinkedIn, etc.).\r\n* Browse a bit to ensure cookies/local storage are saved.\r\n* Close the browser window (or press Ctrl+C in the terminal for the Chrome/Playwright scripts).\r\n* The script will create a profile folder (e.g., `Chrome-Profile`) in the same directory.\r\n* Use the path to this folder with the corresponding command-line argument (`--cp`, `--fp`, `--pp`) when running `ultimate_scraper.py`.\r\n\r\n## Usage\r\n\r\nRun the script from your terminal:\r\n\r\n```bash\r\npython ultimate_scraper.py \u003curl\u003e [options]\r\n```\r\n\r\n**Arguments:**\r\n\r\n* `url`: (Required) The full URL to scrape (must start with `http://` or `https://`).\r\n* `-o`, `--output-file`: (Optional) Path to save the output file. If omitted, output goes to the console.\r\n* `-f`, `--output-format`: (Optional) Output format: `txt` (content only) or `json` (full result dictionary). Default: `txt`.\r\n* `-r`, `--raw`: (Optional) Output raw HTML content instead of cleaned/beautified text.\r\n* `--pp`, `--playwright-profile`: (Optional) Path to the Playwright profile directory.\r\n* `--fp`, `--firefox-profile`: (Optional) Path to the Firefox profile directory.\r\n* `--cp`, `--chrome-profile`: (Optional) Path to the Chrome profile directory.\r\n* `-l`, `--log-level`: (Optional) Set logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. 
Default: `INFO`.\r\n\r\n## Examples\r\n\r\n*(Replace `https://example.com` and profile paths)*\r\n\r\n```bash\r\n# Scrape and print cleaned text content to console\r\npython ultimate_scraper.py https://example.com\r\n\r\n# Scrape raw HTML and save to file\r\npython ultimate_scraper.py https://quotes.toscrape.com/js/ -r -o raw_output.html\r\n\r\n# Scrape and save full JSON result (including method used) to file\r\npython ultimate_scraper.py https://httpbin.org/get -f json -o result.json\r\n\r\n# Scrape using a specific Chrome profile and save cleaned text\r\npython ultimate_scraper.py https://github.com/login --cp ./Chrome-Profile -o github_content.txt\r\n\r\n# Scrape with debug logging\r\npython ultimate_scraper.py https://news.ycombinator.com -l DEBUG\r\n```\r\n\r\n## Output Formats\r\n\r\n**TXT Format (`-f txt`, default):** Outputs only the scraped content (either raw HTML if `-r` is used, or cleaned text otherwise). If scraping fails, it outputs an error message.\r\n\r\n**JSON Format (`-f json`):** Outputs a JSON object containing:\r\n\r\n* `url`: The original URL requested.\r\n* `content`: The scraped content (string, raw or cleaned) or `null` if failed.\r\n* `method`: The name of the successful scraping method (string, e.g., `\"httpx\"`, `\"playwright\"`) or `null`.\r\n* `beautified`: Boolean indicating if the content was cleaned (`true`) or raw (`false`).\r\n* `error`: An error message (string) if all methods failed.\r\n\r\n### Example JSON Output (Success):\r\n\r\n```json\r\n{\r\n  \"url\": \"https://example.com\",\r\n  \"content\": \"Example Domain\\n\\nThis domain is for use in illustrative examples in documents...\",\r\n  \"method\": \"httpx\",\r\n  \"beautified\": true\r\n}\r\n```\r\n\r\n### Example JSON Output (Failure):\r\n\r\n```json\r\n{\r\n  \"url\": \"https://nonexistent.example.com\",\r\n  \"content\": null,\r\n  \"method\": null,\r\n  \"beautified\": true,\r\n  \"error\": \"All scraping methods failed\"\r\n}\r\n```\r\n\r\n## 
Dependencies\r\n\r\nSee `requirements.txt` file. Core dependencies are `requests`, `httpx`, `cloudscraper`, `beautifulsoup4`, `lxml`. Optional dependencies for browser automation are `playwright`, `selenium`, `webdriver-manager`.\r\n\r\n## Contributing\r\n\r\nContributions are welcome! Please feel free to open an issue for bugs or suggestions, or submit a pull request on [GitHub](https://github.com/Aboodseada1?tab=repositories).\r\n\r\n## Support Me\r\n\r\nIf you find this tool useful, consider supporting its development via [PayPal](http://paypal.me/aboodseada1999). Thank you!\r\n\r\n## License\r\n\r\nThis project is licensed under the MIT License.\r\n\r\n```\r\nMIT License\r\n\r\nCopyright (c) 2025 Abood\r\n\r\nPermission is hereby granted, free of charge, to any person obtaining a copy\r\nof this software and associated documentation files (the \"Software\"), to deal\r\nin the Software without restriction, including without limitation the rights\r\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\r\ncopies of the Software, and to permit persons to whom the Software is\r\nfurnished to do so, subject to the following conditions:\r\n\r\nThe above copyright notice and this permission notice shall be included in all\r\ncopies or substantial portions of the Software.\r\n\r\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\r\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\r\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE\r\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\r\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\r\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\r\nSOFTWARE.\r\n```\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faboodseada1%2Fultimate-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faboodseada1%2Fultimate-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faboodseada1%2Fultimate-scraper/lists"}