{"id":28718716,"url":"https://github.com/scrapesome/scrapesome","last_synced_at":"2025-06-15T05:03:25.234Z","repository":{"id":295286011,"uuid":"989541282","full_name":"scrapesome/scrapesome","owner":"scrapesome","description":"A Powerful Web Scraper with dynamic rendering support.","archived":false,"fork":false,"pushed_at":"2025-06-08T04:05:17.000Z","size":178,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-08T05:39:04.109Z","etag":null,"topics":["asyncio","markdown","parser","playwright","python","scraper","scraping"],"latest_commit_sha":null,"homepage":"http://scrapesome.github.io/scrapesome/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapesome.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-24T10:03:26.000Z","updated_at":"2025-06-08T04:05:22.000Z","dependencies_parsed_at":"2025-06-08T05:34:02.644Z","dependency_job_id":null,"html_url":"https://github.com/scrapesome/scrapesome","commit_stats":null,"previous_names":["scrapesome/scrapesome"],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/scrapesome/scrapesome","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapesome%2Fscrapesome","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapesome%2Fscrapesome/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapesome%2Fscrapesome/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/scrapesome%2Fscrapesome/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapesome","download_url":"https://codeload.github.com/scrapesome/scrapesome/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapesome%2Fscrapesome/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259924680,"owners_count":22932782,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","markdown","parser","playwright","python","scraper","scraping"],"created_at":"2025-06-15T05:03:24.620Z","updated_at":"2025-06-15T05:03:25.223Z","avatar_url":"https://github.com/scrapesome.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ScrapeSome\n\n![Scrapesome Logo](https://raw.githubusercontent.com/scrapesome/scrapesome/refs/heads/main/docs/assets/images/favicon.png)\n\n\n![PyPI](https://img.shields.io/pypi/v/scrapesome)\n![Python](https://img.shields.io/pypi/pyversions/scrapesome)\n![Downloads](https://img.shields.io/pypi/dm/scrapesome)\n![License](https://img.shields.io/github/license/scrapesome/scrapesome)\n![Issues](https://img.shields.io/github/issues/scrapesome/scrapesome)\n![Discussions](https://img.shields.io/github/discussions/scrapesome/scrapesome)\n![Contributors](https://img.shields.io/github/contributors/scrapesome/scrapesome)\n![Forks](https://img.shields.io/github/forks/scrapesome/scrapesome)\n![Stars](https://img.shields.io/github/stars/scrapesome/scrapesome)\n\n\n\n**ScrapeSome** is a lightweight, 
flexible web scraping library with both **synchronous** and **asynchronous** support. It includes intelligent fallbacks, JavaScript page rendering, response formatting (HTML → Text/JSON/Markdown), and retry mechanisms. Ideal for developers who need robust scraping utilities with minimal setup.\n\n---\n\n## Table of Contents\n\n- [💡 Why Use ScrapeSome?](#-why-use-scrapesome)\n- [🚀 Features](#-features)\n- [⚖ Comparison with Alternatives](#-comparison-with-alternatives)\n- [📦 Installation](#-installation)\n- [Playwright Setup](#playwright-setup)\n  - [Windows](#windows)\n  - [Linux (Ubuntu/Debian)](#linux-ubuntudebian)\n  - [macOS](#macos)\n- [⚡ Quick Start](#-quick-start)\n- [🖥️ CLI Usage](#-cli-usage)\n- [🧰 Advanced Usage](#-advanced-usage)\n- [🧪 Testing](#-testing)\n- [⚙️ Environment Configuration](#️-environment-configuration)\n- [📄 Output Formats](#-output-formats)\n- [📁 Project Structure](#-project-structure)\n- [🔒 License](#-license)\n- [🤝 Contributions](#-contributions)\n\n\n## 💡 Why Use ScrapeSome?\n\n- Handles both static and JS-heavy pages out of the box\n- Supports both sync and async scraping\n- Converts raw HTML into clean text, JSON, or Markdown\n- Works with minimal configuration (`pip install scrapesome`)\n- Handles timeouts, retries, redirects, user agents\n\n\n## 🚀 Features\n\n- 🔁 Sync + Async scraping support\n- 🔄 Automatic retries and intelligent fallbacks\n- 🧪 Playwright rendering fallback for JS-heavy pages\n- 📝 Format responses as raw HTML, plain **text**, **Markdown**, or structured **JSON**\n- ⚙️ Configurable: timeouts, redirects, user agents, and logging\n- 🧪 Test coverage with `pytest` and `pytest-asyncio`\n\n---\n\n## ⚖ Comparison with Alternatives\n\n| Feature                          | ScrapeSome ✅                         | Playwright (Python)        | Selenium + UC               | Requests-HTML              | Scrapy + Playwright         
|\n|----------------------------------|--------------------------------------|-----------------------------|------------------------------|-----------------------------|------------------------------|\n| 🧠 JS Rendering Support          | ✅ Auto fallback on 403/JS content    | ✅ Always (manual control)  | ✅ Always (manual control)   | ✅ Partial (via Pyppeteer)  | ✅ Requires setup            |\n| 🔄 Automatic Fallback (403/Blank)| ✅ Yes (seamless)                     | ❌ Manual logic needed       | ❌ Manual logic needed        | ❌ No                       | ❌ Needs per-request config  |\n| 🔁 Uses Browser Engine           | ✅ Only when needed (Playwright)      | ✅ Always                   | ✅ Always                    | ✅ (Unstable, slow)         | ✅ Always (if enabled)       |\n| ✅ Sync + Async Support         | ✅ Built-in                           | ❌ Async only               | ❌ Manual (via threading)    | ❌ Sync only                | ❌ Async only (via plugin)   |\n| 📝 JSON/Markdown/HTML Output    | ✅ Built-in formats                   | ❌ Manual parsing           | ❌ Manual parsing            | ❌ Basic only               | ❌ Custom pipeline needed    |\n| ⚡ Minimal Setup                 | ✅ Near zero                          | ❌ Code + browser install   | ❌ Driver + setup            | ✅ Simple pip install       | ❌ Complex + plugin setup    |\n| 🔁 Retries, Timeouts, Agents    | ✅ Smart defaults built-in            | ❌ Manual handling          | ❌ Manual handling           | ❌ Limited                  | ⚠️ Partial via settings      |\n| 🧪 Pytest-Ready Out-of-the-box  | ✅ Fully testable                     | ⚠️ Requires mocks           | ❌ Hard to test              | ❌ Minimal                  | ⚠️ Needs testing harness     |\n| ⚙️ Config via .env / Inline     | ✅ Flexible and optional              | ❌ Code/config only         | ❌ Manual via code           | ❌ Hardcoded mostly         | ⚠️ Project settings          |\n| 📦 Install \u0026 Run in \u003c1 Min      | ✅ 
Yes                                | ❌ Setup required           | ❌ Driver + config needed    | ✅ Yes                      | ❌ Needs project + plugin    |\n\n## 📦 Installation\n\n```bash\npip install scrapesome\n```\n\n\n## Playwright Setup\n\nScrapeSome uses Playwright as its JavaScript rendering fallback. To enable it, install Playwright and its dependencies.\n\n### 1. Install the Playwright Python package (if not already installed)\n\n```bash\npip install playwright\n```\n\n### 2. Install Playwright browsers\n\n```bash\nplaywright install\n```\n\n### 3. Install system dependencies\n\nPlaywright requires some system libraries to run browsers, and these vary by operating system.\n\n#### Windows\n\n`playwright install` sets up everything you need automatically, so no additional setup is usually required.\n\n#### Linux (Ubuntu/Debian)\n\nRun the following command to install the required system libraries:\n\n```bash\nplaywright install-deps\n```\n\nIf the `playwright` CLI is not available, you can install the dependencies manually:\n\n```bash\nsudo apt-get update\nsudo apt-get install -y libwoff1 libopus0 libwebp6 libharfbuzz-icu0 libwebpmux3 \\\n                        libenchant-2-2 libhyphen0 libegl1 libglx0 libgudev-1.0-0 \\\n                        libevdev2 libgles2 libx264-160\n```\n\nNote: Package names may vary depending on your distribution and version.\n\n#### macOS\n\nYou can install the required libraries using Homebrew:\n\n```bash\nbrew install harfbuzz enchant\n```\n\nAfter this setup, you should be able to use ScrapeSome with full Playwright rendering support!\n\n## ⚡ Quick Start\n\nSynchronous Example\n\n```python\nfrom scrapesome import sync_scraper\nhtml = sync_scraper(\"https://example.com\")\nhtml\n```\n\n\nAsynchronous Example\n\n```python\nimport asyncio\nfrom scrapesome import async_scraper\nhtml = asyncio.run(async_scraper(\"https://example.com\"))\nhtml\n```\n\n## 🖥️ CLI Usage\n\nScrapeSome also includes a powerful CLI for quick and easy scraping from the 
command line.\n\n### 📦 Installation with CLI Support\n\nTo use the CLI, install with the optional `cli` extras:\n\n```bash\npip install scrapesome[cli]\n```\n\n### 🔧 Basic Usage\n\n```bash\nscrapesome scrape --url https://example.com\n```\nThis performs a synchronous scrape and outputs plain text by default.\n\n### ⚙️ Available Options\n| Option             | Description                               | Default |\n|--------------------|-------------------------------------------|---------|\n| `--async-mode`     | Use asynchronous scraping                  | False   |\n| `--force-playwright`| Force JavaScript rendering using Playwright | False   |\n| `--output-format`  | Choose `text`, `json`, `markdown`, or `html` | html    |\n\n\n### Examples\n\n#### Basic scrape\n```bash\nscrapesome scrape --url https://example.com\n```\n\n#### Force Playwright rendering\n```bash\nscrapesome scrape --url https://example.com --force-playwright\n```\n\n#### Get JSON output\n```bash\nscrapesome scrape --url https://example.com --output-format json\n```\n\n#### Async scrape with markdown output\n```bash\nscrapesome scrape --url https://example.com --async-mode --output-format markdown\n```\n\n## 📄 File Saving\n\nScrapeSome allows you to format and save your scraped content with zero hassle—both via the **CLI** and in **Python code**.\n\n---\n\n### 💻 Save Output to File\n\nUse these flags to save your output directly from the command line:\n\n- `--save-to-file` or `-s`: Enable saving to a file\n- `--file-name` or `-n`: Desired filename (extension added automatically)\n- `--output-format` or `-f`: One of `html`, `text`, `markdown`, or `json`\n\n⚠️ **Note:** When saving to a file, only one URL can be scraped at a time.\n\n#### 📦 Example:\n\n```bash\nscrapesome scrape --url \"https://example.com\" --output-format markdown  --save-to-file --file-name output\n```\n\n👉 This saves the result as `output.md`.\n\n---\n\n### Save Output in Code\n\nThe `sync_scraper` function supports saving to 
file using two optional arguments:\n\n- `save_to_file=True`: Enables saving\n- `file_name=\"your_file_name\"`: Sets the base filename (extension inferred from format)\n\nThe output is returned as a dictionary:\n\n```python\n{\n    \"data\": \"\u003cformatted content\u003e\",\n    \"file\": \"your_file_name.\u003cext\u003e\"  # if saving is enabled\n}\n```\n\n#### 📌 Example:\n\n```python\nfrom scrapesome import sync_scraper\n\n# Scrape the page as JSON and write it to a file named from file_name\nresult = sync_scraper(url=\"https://example.com\", output_format_type=\"json\", save_to_file=True, file_name=\"example_output\")\nprint(f\"Saved output to {result.get('file')}\")\n```\n\nNow you're set to save clean, readable data in your preferred format, programmatically or from the CLI.\n\n## 🧰 Advanced Usage\n\nForce Rendering (Playwright)\n\n```python\nfrom scrapesome import sync_scraper\ncontent = sync_scraper(\"https://example.com\", force_playwright=True)\ncontent\n```\n\nCustom User Agents\n\n```python\nfrom scrapesome import sync_scraper\ncontent = sync_scraper(\"https://example.com\", user_agents=[\"MyCustomAgent/1.0\"])\ncontent\n```\n\nControl Redirects\n\n```python\nfrom scrapesome import sync_scraper\ncontent = sync_scraper(\"https://example.com\", allow_redirects=False)\ncontent\n```\n\n**async_scraper** can be used in the same way.\n\n## 🧪 Testing\n\nRun tests with:\n\n```bash\npytest --cov=scrapesome tests/\n```\n\nTarget coverage: 75–100%\n\n## ⚙️ Environment Configuration\n\nScrapeSome reads its configuration from a `.env` file if one is present.\n\nExample `.env`:\n\n```env\nLOG_LEVEL=INFO\nOUTPUT_FORMAT=text\nFETCH_PLAYWRIGHT_TIMEOUT=10\nFETCH_PAGE_TIMEOUT=10\nUSER_AGENTS=[\"Mozilla/5.0 (Windows NT 10.0; Win64; x64).......\"]\n```\n\n| Key                      | Description                                          |\n|--------------------------|------------------------------------------------------|\n| FETCH_PLAYWRIGHT_TIMEOUT | Timeout for Playwright-rendered pages (in seconds)  |\n| FETCH_PAGE_TIMEOUT       | Timeout for standard page fetch (in seconds)        |\n| 
LOG_LEVEL                | Logging verbosity (DEBUG, INFO, WARNING, etc.)      |\n| OUTPUT_FORMAT            | Default output format (text, markdown, json, html)  |\n| USER_AGENTS              | Default user agents (\"Mozilla/5.0 (Windows NT 10.0; Win64; x64).......\")  |\n\n## 📄 Output Formats\n\nJSON Example\n\nGet `json` version\n\n```python\nfrom scrapesome import sync_scraper\ncontent = sync_scraper(\"https://example.com\", output_format_type=\"json\")\ncontent\n```\n\nOutput\n\n```json\n{\n  \"title\": \"Example Domain\",\n  \"description\": \"This domain is for use in illustrative examples.\",\n  \"url\": \"https://example.com\"\n}\n```\n\n## Markdown\n\nConvert HTML to Markdown with:\n\n```python\nfrom scrapesome import sync_scraper\ncontent = sync_scraper(\"https://adenuniversity.us\", output_format_type=\"markdown\")\ncontent\n```\nOutput\n\n```text\n# Online Global Masters that boost your global career\n\n**ADEN University** offers students access to professionals who operate in the world of business and administration, who share their knowledge and acumen collaboratively with their students in all **academic programs** offered at ADEN.\n\n[About Us](about-aden-university)\n\n\nWatch testimonial video \n\n\n##### Watch testimonial video\n\n×\n\n[\n\n](https://res.cloudinary.com/cruminott/video/upload/vc_auto,w_auto,q_auto,f_auto/adenu/aden-university-3.mp4)\n\n\n\n## ADEN University offers the following academic programs\n\n[![EXECUTIVE MBA. Master of Business Administration](https://adenuniversity.us/files/2016/06/ADENU_miniatura_Emba_900-1-820x400.jpg \"EXECUTIVE MBA. Master of Business Administration\")](https://adenuniversity.us/academics/executive-mba/  \"EXECUTIVE MBA. Master of Business Administration\")\n\n##### [EXECUTIVE MBA. Master of Business Administration](https://adenuniversity.us/academics/executive-mba/ \"EXECUTIVE MBA. 
Master of Business Administration\")\n\nThe ADEN University Executive MBA is designed to strengthen business leaders to manage...\n\n* **37** credits\n* **15** months\n* **Spanish Only**\n\n[Visit EMBA Course](https://adenuniversity.us/academics/executive-mba/ \"EXECUTIVE MBA. Master of Business Administration\")\n\n[![GLOBAL MBA. Master of Business Administration](https://adenuniversity.us/files/2016/06/ADENU_miniatura_MBAgl1_900-820x400.jpg \"GLOBAL MBA. Master of Business Administration\")](https://adenuniversity.us/academics/global-mba/  \"GLOBAL MBA. Master of Business Administration\")\n\n##### [GLOBAL MBA. Master of Business Administration](https://adenuniversity.us/academics/global-mba/ \"GLOBAL MBA. Master of Business Administration\")\n\nThe Global MBA is designed to prepare business leaders to manage companies in an...\n\n* **36** credits\n* **14** months\n* **Spanish and English**\n```\n\nsimilarly **async_scraper** can also be used.\n\n## 📁 Project Structure\n\n```text\nscrapesome/\n├── .gitignore\n├── pytest.ini\n├── mkdocs.yml\n├── .github/\n│   ├── workflows/\n│   │   └── deploy.yml\n│   ├── ISSUE_TEMPLATE/\n│   │   └── index.md\n│   ├── PULL_REQUEST_TEMPLATE.md\n│   ├── CODE_OF_CONDUCT.md\n│   └── SECURITY.md\n├── __init__.py\n├── cli.py\n├── config.py\n├── exceptions.py\n├── formatter/\n│   ├── __init__.py\n│   └── output_formatter.py\n├── logging.py\n├── scraper/\n│   ├── __init__.py\n│   ├── async_scraper.py\n│   ├── sync_scraper.py\n│   └── rendering.py\n├── utils/\n│   ├── __init__.py\n│   └── file_writer.py\n├── docs/\n│   ├── index.md\n│   ├── getting_started.md\n│   ├── usage.md\n│   ├── config.md\n│   ├── examples.md\n│   ├── cli.md\n│   ├── about.md\n│   ├── licence.md\n│   ├── file-saving.md\n│   ├── contribution.md\n│   ├── output-formats.md\n│   └── assets/\n│       └── images/\n│           └── favicon.png\n├── tests/\n│   ├── __init__.py\n│   ├── test_sync_scraper.py\n│   ├── test_async_scraper.py\n│   ├── test_config.py\n│   ├── 
test_logging.py\n│   ├── test_rendering.py\n│   ├── test_file_writer.py\n│   ├── test_output_formatter.py\n│   └── test_cli.py\n├── setup.py\n├── requirements.txt\n├── pyproject.toml\n├── LICENSE\n└── README.md\n```\n\n## 🔒 License\nMIT License © 2025\n\n## 🤝 Contributions\n\nContributions are welcome! Whether it's bug reports, feature suggestions, or pull requests — your help is appreciated.\n\nTo get started:\n\n```bash\ngit clone https://github.com/scrapesome/scrapesome.git\ncd scrapesome\n```\n\n## Community\n\n- [Contributing Guidelines](./docs/contribution.md)\n- [Code of Conduct](.github/CODEOFCONDUCT.md)\n- [Issue Templates](.github/issue_templates/index.md)\n- [Pull Request Templates](.github/pull_request_template.md)\n- [GitHub Discussions](https://github.com/scrapesome/scrapesome/discussions)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapesome%2Fscrapesome","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapesome%2Fscrapesome","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapesome%2Fscrapesome/lists"}