{"id":15115432,"url":"https://github.com/clemlesne/scrape-it-now","last_synced_at":"2025-05-15T11:03:43.902Z","repository":{"id":253249073,"uuid":"842905162","full_name":"clemlesne/scrape-it-now","owner":"clemlesne","description":"Web scraper made for AI and simplicity in mind. It runs as a CLI that can be parallelized and outputs high-quality markdown content.","archived":false,"fork":false,"pushed_at":"2025-02-10T10:19:45.000Z","size":42008,"stargazers_count":515,"open_issues_count":16,"forks_count":19,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-13T10:09:03.868Z","etag":null,"topics":["ai","azure","cli","markdown","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clemlesne.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-15T11:04:34.000Z","updated_at":"2025-03-30T22:45:07.000Z","dependencies_parsed_at":"2024-08-26T18:54:35.798Z","dependency_job_id":"969f4c2d-cc69-41bd-86fa-bde974bb7901","html_url":"https://github.com/clemlesne/scrape-it-now","commit_stats":null,"previous_names":["clemlesne/scrape-it-now"],"tags_count":32,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clemlesne%2Fscrape-it-now","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clemlesne%2Fscrape-it-now/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clemlesne%2Fscrape-it-now/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/clemlesne%2Fscrape-it-now/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clemlesne","download_url":"https://codeload.github.com/clemlesne/scrape-it-now/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248933340,"owners_count":21185460,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","azure","cli","markdown","scraper"],"created_at":"2024-09-26T01:43:51.418Z","updated_at":"2025-04-14T18:10:38.871Z","avatar_url":"https://github.com/clemlesne.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# 🛰️ Scrape It Now!\n\nWeb scraper made for AI and simplicity in mind. 
It runs as a CLI that can be parallelized and outputs high-quality markdown content.\n\n[![GitHub last release date](https://img.shields.io/github/release-date/clemlesne/scrape-it-now)](https://github.com/clemlesne/scrape-it-now/releases)\n[![GitHub project license](https://img.shields.io/github/license/clemlesne/scrape-it-now)](https://github.com/clemlesne/scrape-it-now/blob/main/LICENSE)\n[![PyPI package version](https://img.shields.io/pypi/v/scrape-it-now)](https://pypi.org/project/scrape-it-now)\n[![PyPI supported Python versions](https://img.shields.io/pypi/pyversions/scrape-it-now)](https://pypi.org/project/scrape-it-now)\n\n## Features\n\nShared:\n\n- 🏗️ Decoupled architecture with [Azure Queue Storage](https://learn.microsoft.com/en-us/azure/storage/queues) or local [sqlite](https://sqlite.org)\n- ⚙️ Idempotent operations that can be run in parallel\n- 💾 Scraped content is stored in [Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs) or local disk\n\nScraper:\n\n- 🛑 Avoid re-scraping a page if it hasn't changed\n- 🚫 Block ads to lower network costs with [The Block List Project](https://github.com/blocklistproject/Lists)\n- 🔗 Explore pages in depth by detecting links and de-duplicating them\n- ✍️ Extract markdown content from a page with [Pandoc](https://github.com/jgm/pandoc)\n- 🏷️ Extract [metadata elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta) from the page\n- 🖥️ Load dynamic JavaScript content with [Playwright](https://github.com/microsoft/playwright-python) and [Chromium](https://www.chromium.org/Home)\n- 🕵️‍♂️ Preserve anonymity with a random user agent, random viewport size, and no client hints headers\n- 📊 Show progress with a status command\n- 🖼️ Store images collected on the page\n- 📸 Store screenshot of the page\n- 📡 Track progress of total network usage\n\nIndexer:\n\n- 🧠 AI Search index is created automatically\n- ✂️ Chunk markdown while keeping the content coherent\n- 📈 Embed chunks with OpenAI 
embeddings\n- 🔍 Indexed content is semantically searchable with [Azure AI Search](https://learn.microsoft.com/en-us/azure/search)\n\n## Installation\n\n### From PyPI\n\n```bash\n# Install the package\npython3 -m pip install scrape-it-now\n# Run the CLI\nscrape-it-now --help\n```\n\nTo configure the CLI (including authentication to the backend services), use environment variables, a `.env` file or command line options.\n\n### From sources\n\nApplication must be run with Python 3.13 or later. If this version is not installed, an easy way to install it is [pyenv](https://github.com/pyenv/pyenv).\n\n```bash\n# Download the source code\ngit clone https://github.com/clemlesne/scrape-it-now.git\n# Move to the directory\ncd scrape-it-now\n# Run install scripts\nmake install dev\n# Run the CLI\nscrape-it-now --help\n```\n\n## How to use\n\n### Scrape a website\n\n#### Run a job\n\nUsage with Azure Blob Storage and Azure Queue Storage:\n\n```bash\n# Azure Storage configuration\nexport AZURE_STORAGE_ACCESS_KEY=xxx\nexport AZURE_STORAGE_ACCOUNT_NAME=xxx\n# Run the job\nscrape-it-now scrape run https://nytimes.com\n```\n\nUsage with Local Disk Blob and Local Disk Queue:\n\n```bash\n# Local disk configuration\nexport BLOB_PROVIDER=local_disk\nexport QUEUE_PROVIDER=local_disk\n# Run the job\nscrape-it-now scrape run https://nytimes.com\n```\n\nExample:\n\n```bash\n❯ scrape-it-now scrape run https://nytimes.com\n2024-11-08T13:18:49.169320Z [info     ] Start scraping job lydmtyz\n2024-11-08T13:18:49.169392Z [info     ] Installing dependencies if needed, this may take a few minutes\n2024-11-08T13:18:52.542422Z [info     ] Queued 1/1 URLs\n2024-11-08T13:18:58.509221Z [info     ] Start processing https://nytimes.com depth=1 process=scrape-lydmtyz-4 task=63dce50\n2024-11-08T13:19:04.173198Z [info     ] Loaded 154554 ads and trackers process=scrape-lydmtyz-4\n2024-11-08T13:19:16.393045Z [info     ] Queued 310/311 URLs            depth=1 process=scrape-lydmtyz-4 
task=63dce50\n2024-11-08T13:19:16.393323Z [info     ] Scraped                        depth=1 process=scrape-lydmtyz-4 task=63dce50\n...\n```\n\nMost frequent options are:\n\n| `Options` | Description | `Environment variable` |\n|-|-|-|\n| `--azure-storage-access-key`\u003c/br\u003e`-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |\n| `--azure-storage-account-name`\u003c/br\u003e`-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |\n| `--blob-provider`\u003c/br\u003e`-bp` | Blob provider | `BLOB_PROVIDER` |\n| `--job-name`\u003c/br\u003e`-jn` | Job name | `JOB_NAME` |\n| `--max-depth`\u003c/br\u003e`-md` | Maximum depth | `MAX_DEPTH` |\n| `--queue-provider`\u003c/br\u003e`-qp` | Queue provider | `QUEUE_PROVIDER` |\n| `--save-images`\u003c/br\u003e`-si` | Save images | `SAVE_IMAGES` |\n| `--save-screenshot`\u003c/br\u003e`-ss` | Save screenshot | `SAVE_SCREENSHOT` |\n| `--whitelist`\u003c/br\u003e`-w` | Whitelist | `WHITELIST` |\n\nFor documentation on all available options, run:\n\n```bash\nscrape-it-now scrape run --help\n```\n\n#### Show job status\n\nUsage with Azure Blob Storage:\n\n```bash\n# Azure Storage configuration\nexport AZURE_STORAGE_CONNECTION_STRING=xxx\n# Show the job status\nscrape-it-now scrape status [job_name]\n```\n\nUsage with Local Disk Blob:\n\n```bash\n# Local disk configuration\nexport BLOB_PROVIDER=local_disk\n# Show the job status\nscrape-it-now scrape status [job_name]\n```\n\nExample:\n\n```bash\n❯ scrape-it-now scrape status lydmtyz\n{\"created_at\":\"2024-11-08T13:18:52.839060Z\",\"last_updated\":\"2024-11-08T13:19:16.528370Z\",\"network_used_mb\":2.6666793823242188,\"processed\":1,\"queued\":311}\n```\n\nMost frequent options are:\n\n| `Options` | Description | `Environment variable` |\n|-|-|-|\n| `--azure-storage-access-key`\u003c/br\u003e`-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |\n| `--azure-storage-account-name`\u003c/br\u003e`-asan` | Azure Storage account name | 
`AZURE_STORAGE_ACCOUNT_NAME` |\n| `--blob-provider`\u003c/br\u003e`-bp` | Blob provider | `BLOB_PROVIDER` |\n\nFor documentation on all available options, run:\n\n```bash\nscrape-it-now scrape status --help\n```\n\n### Index a scraped website\n\n#### Run a job\n\nUsage with Azure Blob Storage, Azure Queue Storage and Azure AI Search:\n\n```bash\n# Azure OpenAI configuration\nexport AZURE_OPENAI_API_KEY=xxx\nexport AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx\nexport AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx\nexport AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx\nexport AZURE_OPENAI_ENDPOINT=xxx\n\n# Azure Search configuration\nexport AZURE_SEARCH_API_KEY=xxx\nexport AZURE_SEARCH_ENDPOINT=xxx\n\n# Azure Storage configuration\nexport AZURE_STORAGE_ACCESS_KEY=xxx\nexport AZURE_STORAGE_ACCOUNT_NAME=xxx\n\n# Run the job\nscrape-it-now index run [job_name]\n```\n\nUsage with Local Disk Blob, Local Disk Queue and Azure AI Search:\n\n```bash\n# Azure OpenAI configuration\nexport AZURE_OPENAI_API_KEY=xxx\nexport AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx\nexport AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx\nexport AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx\nexport AZURE_OPENAI_ENDPOINT=xxx\n# Azure Search configuration\nexport AZURE_SEARCH_API_KEY=xxx\nexport AZURE_SEARCH_ENDPOINT=xxx\n# Local disk configuration\nexport BLOB_PROVIDER=local_disk\nexport QUEUE_PROVIDER=local_disk\n# Run the job\nscrape-it-now index run [job_name]\n```\n\nExample:\n\n```bash\n❯ scrape-it-now index run lydmtyz\n2024-11-08T13:20:37.129411Z [info     ] Start indexing job lydmtyz\n2024-11-08T13:20:38.945954Z [info     ] Start processing https://nytimes.com process=index-lydmtyz-4 task=63dce50\n2024-11-08T13:20:39.162692Z [info     ] Chunked into 7 parts           process=index-lydmtyz-4 task=63dce50\n2024-11-08T13:20:42.407391Z [info     ] Indexed 7 chunks               process=index-lydmtyz-4 task=63dce50\n...\n```\n\nMost frequent options are:\n\n| `Options` | Description | `Environment variable` |\n|-|-|-|\n| 
`--azure-openai-api-key`\u003c/br\u003e`-aoak` | Azure OpenAI API key | `AZURE_OPENAI_API_KEY` |\n| `--azure-openai-embedding-deployment-name`\u003c/br\u003e`-aoedn` | Azure OpenAI embedding deployment name | `AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME` |\n| `--azure-openai-embedding-dimensions`\u003c/br\u003e`-aoed` | Azure OpenAI embedding dimensions | `AZURE_OPENAI_EMBEDDING_DIMENSIONS` |\n| `--azure-openai-embedding-model-name`\u003c/br\u003e`-aoemn` | Azure OpenAI embedding model name | `AZURE_OPENAI_EMBEDDING_MODEL_NAME` |\n| `--azure-openai-endpoint`\u003c/br\u003e`-aoe` | Azure OpenAI endpoint | `AZURE_OPENAI_ENDPOINT` |\n| `--azure-search-api-key`\u003c/br\u003e`-asak` | Azure Search API key | `AZURE_SEARCH_API_KEY` |\n| `--azure-search-endpoint`\u003c/br\u003e`-ase` | Azure Search endpoint | `AZURE_SEARCH_ENDPOINT` |\n| `--azure-storage-access-key`\u003c/br\u003e`-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |\n| `--azure-storage-account-name`\u003c/br\u003e`-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |\n| `--blob-provider`\u003c/br\u003e`-bp` | Blob provider | `BLOB_PROVIDER` |\n| `--queue-provider`\u003c/br\u003e`-qp` | Queue provider | `QUEUE_PROVIDER` |\n\nFor documentation on all available options, run:\n\n```bash\nscrape-it-now index run --help\n```\n\n## Architecture\n\n### Scrape\n\n```mermaid\n---\ntitle: Scrape process with Azure Storage\n---\ngraph LR\n  cli[\"CLI\"]\n  web[\"Website\"]\n\n  subgraph \"Azure Queue Storage\"\n    to_chunk[\"To chunk\"]\n    to_scrape[\"To scrape\"]\n  end\n\n  subgraph \"Azure Blob Storage\"\n    subgraph \"Container\"\n      job[\"job\"]\n      scraped[\"scraped\"]\n      state[\"state\"]\n    end\n  end\n\n  cli -- (1) Pull message --\u003e to_scrape\n  cli -- (2) Get cache --\u003e scraped\n  cli -- (3) Browse --\u003e web\n  cli -- (4) Update cache --\u003e scraped\n  cli -- (5) Push state --\u003e state\n  cli -- (6) Add message --\u003e to_scrape\n  cli -- (7) Add 
message --\u003e to_chunk\n  cli -- (8) Update state --\u003e job\n```\n\n### Index\n\n```mermaid\n---\ntitle: Index process with Azure Storage and Azure AI Search\n---\ngraph LR\n  search[\"Azure AI Search\"]\n  cli[\"CLI\"]\n  embeddings[\"Azure OpenAI Embeddings\"]\n\n  subgraph \"Azure Queue Storage\"\n    to_chunk[\"To chunk\"]\n  end\n\n  subgraph \"Azure Blob Storage\"\n    subgraph \"Container\"\n      scraped[\"scraped\"]\n    end\n  end\n\n  cli -- (1) Pull message --\u003e to_chunk\n  cli -- (2) Get cache --\u003e scraped\n  cli -- (3) Chunk --\u003e cli\n  cli -- (4) Embed --\u003e embeddings\n  cli -- (5) Push to search --\u003e search\n```\n\n## Design\n\nBlob storage is organized in folders:\n\n```txt\n[job_name]-scraping/            # Job name (either defined by the user or generated)\n    scraped/                    # All the data from the pages\n        [page_id]/              # Assets from a page\n            screenshot.jpeg     # Screenshot (if enabled)\n            [image_id].[ext]    # Image binary (if enabled)\n            [image_id].json     # Image metadata (if enabled)\n        [page_id].json          # Data from a page\n    state/                      # Job states (cache \u0026 parallelization)\n        [page_id]               # Page state\n    job.json                    # Job state (aggregated stats)\n```\n\nPage data is treated as a stable API (it won't break until the next major version) and is stored in JSON format:\n\n```json\n{\n  \"created_at\": \"2024-09-11T14:06:43.566187Z\",\n  \"redirect\": \"https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html\",\n  \"status\": 200,\n  \"url\": \"https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html\",\n  \"content\": \"## Listen to the trailer for Serial Season 4...\",\n  \"etag\": null,\n  \"links\": [\n    \"https://podcasts.apple.com/us/podcast/serial/id917918570\",\n    
\"https://music.amazon.com/podcasts/d1022069-8863-42f3-823e-857fd8a7b616/serial?ref=dm_sh_OVBHkKYvW1poSzCOsBqHFXuLc\",\n    ...\n  ],\n  \"metas\": {\n    \"description\": \"“Serial” returns with a history of Guantánamo told by people who lived through key moments in Guantánamo’s evolution, who know things the rest of us don’t about what it’s like to be caught inside an improvised justice system.\",\n    \"articleid\": \"100000009373583\",\n    \"twitter:site\": \"@nytimes\",\n    ...\n  },\n  \"network_used_mb\": 1.041460037231445,\n  \"raw\": \"\u003chead\u003e...\u003c/head\u003e\u003cbody\u003e...\u003c/body\u003e\",\n  \"valid_until\": \"2024-09-11T14:11:37.790570Z\"\n}\n```\n\nThen, indexed data is stored in Azure AI Search:\n\n| Field | Type | Description |\n|-|-|-|\n| `chunck_number` | `Edm.Int32` | Chunk number, from `0` to *`x`* |\n| `content` | `Edm.String` | Chunk content |\n| `created_at` | `Edm.DateTimeOffset` | Source scrape date |\n| `id` | `Edm.String` | Chunk ID |\n| `title` | `Edm.String` | Source page title |\n| `url` | `Edm.String` | Source page URL |\n\n## Advanced usage\n\n### Whitelist\n\nThe whitelist option restricts scraping to a set of domains and can ignore sub paths within them. It is a space-separated list of entries; each entry is a domain regular expression, optionally followed by comma-separated regular expressions for the paths to ignore:\n\n```txt\ndomain1,regexp1,regexp2 domain2,regexp3\n```\n\nFor example:\n\nTo whitelist `learn.microsoft.com`:\n\n```txt\nlearn\\.microsoft\\.com\n```\n\nTo whitelist `learn.microsoft.com` and `go.microsoft.com`, but ignore all sub paths except `/en-us`:\n\n```txt\nlearn\\.microsoft\\.com,^/(?!en-us).* go\\.microsoft\\.com\n```\n\n### Source environment variables\n\nTo configure the CLI easily, source environment variables from a `.env` file. For example, for the `--azure-storage-access-key` option:\n\n```bash\nAZURE_STORAGE_ACCESS_KEY=xxx\n```\n\nFor arguments that accept multiple values, use a space-separated list. 
For example, for the `--whitelist` option:\n\n```bash\nWHITELIST=learn\\.microsoft\\.com go\\.microsoft\\.com\n```\n\n### Application cache directory\n\nThe cache directory depends on the operating system:\n\n- `~/.config/scrape-it-now` (Unix)\n- `~/Library/Application Support/scrape-it-now` (macOS)\n- `C:\\Users\\\u003cuser\u003e\\AppData\\Roaming\\scrape-it-now` (Windows)\n\n### Browser binary installation\n\nBrowser binaries are automatically downloaded or updated at each run. The browser is Chromium and is not configurable (feel free to open an issue if you need another browser); it weighs around 450 MB. The binaries are stored in the cache directory.\n\n### How Local Disk storage works\n\nLocal Disk storage is used for both blob and queue. It is not recommended for production use, as it is neither easily scalable nor fault-tolerant. It is useful for testing and development or when you cannot use Azure services.\n\nImplementation:\n\n- Local Disk Blob uses a directory structure to store blobs. Each blob is stored in a file with the blob name as the file name. Leases are implemented with lock files. By default, files are stored in a directory relative to the command execution directory.\n- Local Disk Queue uses a SQLite database to store messages. The database is stored in the cache directory. It implements visibility timeouts and deletion tokens to stay consistent with stateless queue services like Azure Queue Storage.\n\n### Use proxies for anonymity\n\nProxies are not implemented in the application. Network security cannot be achieved at the application level. Use a VPN (e.g. your own or a third-party one) or a proxy service (e.g. residential proxies, Tor) to ensure anonymity, and configure the system firewall to limit the application's network access to it.\n\n### Bundle with a container\n\nAs the application is published on PyPI, it can easily be bundled with a container. At every start, the application will download the dependencies (browser, etc.) and cache them. 
You can pre-download them by running the command `scrape-it-now scrape install`.\n\nFor better performance, you can also parallelize the scraping and indexing jobs by running multiple containers of each. This can be achieved with [KEDA](https://keda.sh), by configuring a [queue scaler](https://keda.sh/docs/2.16/scalers/azure-storage-queue).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclemlesne%2Fscrape-it-now","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclemlesne%2Fscrape-it-now","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclemlesne%2Fscrape-it-now/lists"}