{"id":46451417,"url":"https://github.com/jonasrenault/scrapix","last_synced_at":"2026-03-06T00:31:08.449Z","repository":{"id":321231320,"uuid":"1081010286","full_name":"jonasrenault/scrapix","owner":"jonasrenault","description":"Scrapix - Smart, fast, and simple image scraper for Google Images Search","archived":false,"fork":false,"pushed_at":"2025-11-25T14:11:35.000Z","size":621,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-28T15:47:36.120Z","etag":null,"topics":["google","google-images","google-images-crawler","google-search","headless","python","recaptcha-slover","scraper","scraping","search","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jonasrenault.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-22T07:23:18.000Z","updated_at":"2025-11-25T14:10:49.000Z","dependencies_parsed_at":"2025-10-28T15:32:43.111Z","dependency_job_id":null,"html_url":"https://github.com/jonasrenault/scrapix","commit_stats":null,"previous_names":["jonasrenault/scrapix"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/jonasrenault/scrapix","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fscrapix","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fscrapix/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fscrapix/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fscrapix/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jonasrenault","download_url":"https://codeload.github.com/jonasrenault/scrapix/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasrenault%2Fscrapix/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30156285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T22:39:40.138Z","status":"ssl_error","status_checked_at":"2026-03-05T22:39:24.771Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["google","google-images","google-images-crawler","google-search","headless","python","recaptcha-slover","scraper","scraping","search","web-scraping"],"created_at":"2026-03-06T00:31:07.815Z","updated_at":"2026-03-06T00:31:08.440Z","avatar_url":"https://github.com/jonasrenault.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🖼️ Scrapix - Smart, fast, and simple image scraper for Google Images Search\n\nScrapix is an automated image scraper designed to collect pictures from Google Images Search based on user-defined queries. 
It streamlines the process of fetching, filtering, and storing image results for use in datasets, research, or creative projects.\n\n## Installation\n\nScrapix requires a recent version of Python: ![python_version](https://img.shields.io/badge/Python-%3E=3.12-blue).\n\nScrapix uses the [pydoll](https://github.com/autoscrape-labs/pydoll) library to automate control of a Chrome browser. **A recent version of the Chrome browser must therefore be installed in order to run Scrapix.**\n\n### Install from GitHub\n\nClone the repository and install the project in your Python environment, either using `pip`\n\n```bash\ngit clone https://github.com/jonasrenault/scrapix.git\ncd scrapix\npip install --editable .\n```\n\nor [uv](https://docs.astral.sh/uv/)\n\n```bash\ngit clone https://github.com/jonasrenault/scrapix.git\ncd scrapix\nuv sync\n```\n\n## Usage\n\n### Command-line\n\nWhen you install Scrapix in a virtual environment, it creates a CLI script called `scrapix`. Run\n\n```bash\nscrapix --help\n```\n\nto see the various commands available. The main command is `scrape`, which searches Google Search for images matching a query, saves the image urls to a file on disk, and optionally downloads the images to disk as well.\n\nFor example, the following command\n\n```bash\nscrapix scrape duck -l 5 --min 640 640 -k rubber -k toy\n```\n\nwill search for 5 images of `duck`, keeping only images with a minimum resolution of `640x640` pixels, and excluding images whose url or title contains the words `rubber` or `toy`.\n\nBy default, the results are saved in `~/.cache/scrapix/{query}` (this can be changed with the `--dir` option). Scrapix saves a JSON file containing the scraped image urls and titles, and also downloads the images if the `--download` flag is set (it is on by default). 
Here are the image urls that the above command will save in the file `~/.cache/scrapix/duck/urls.json`:\n\n```json\n[\n  {\n    \"title\": \"Mallard Duck | National Geographic Kids\",\n    \"url\": \"https://i.natgeofe.com/k/7ce14b7f-df35-4881-95ae-650bce0adf4d/mallard-male-standing_square.jpg\"\n  },\n  {\n    \"title\": \"Ten Things You Didn't Know About Ducks\",\n    \"url\": \"https://assets.farmsanctuary.org/content/uploads/2025/06/17071818/2021_04-28_FSNY_Macka_and_Milo_ducks_DSC_3924_CREDIT_Farm_Sanctuary-1600x1068.jpg\"\n  },\n  {\n    \"title\": \"Mallard Duck | National Geographic Kids\",\n    \"url\": \"https://i.natgeofe.com/k/327b01e8-be2e-4694-9ae9-ae7837bd8aea/mallard-male-swimming.jpg\"\n  },\n  {\n    \"title\": \"10 Facts About Ducks - FOUR PAWS International - Animal Welfare Organisation\",\n    \"url\": \"https://media.4-paws.org/a/f/4/7/af47ae6aa55812faa4d7fd857a6e283a8c8226bc/VIER%20PFOTEN_2019-07-18_013-2890x2000-1920x1329.jpg\"\n  },\n  {\n    \"title\": \"Duck - Wikipedia\",\n    \"url\": \"https://upload.wikimedia.org/wikipedia/commons/b/bf/Bucephala-albeola-010.jpg\"\n  }\n]\n```\n\n#### Arguments\n\n```bash\nscrapix scrape --help\n\n Usage: scrapix scrape [OPTIONS] QUERY\n\n╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n│ *    query      TEXT  Search query. [required]                                                                              │\n╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n│ --dir       -d                   DIRECTORY             Save directory.   [default: ~/.cache/scrapix]                        │\n│ --limit     -l                   INTEGER               Max number of images to download. 
[default: 10]                      │\n│ --skip      -s                   INTEGER               Number of results to skip. [default: 0]                              │\n│ --keywords  -k                   TEXT                  Keywords to exclude.                                                 │\n│ --min                            \u003cINTEGER INTEGER\u003e...  Minimum resolution of images.                                        │\n│ --max                            \u003cINTEGER INTEGER\u003e...  Maximum resolution of images.                                        │\n│ --download      --no-download                          Save images on disk after scraping the urls. [default: download]     │\n│ --force         --no-force                             Force redownload of images already present on disk.                  │\n│                                                        [default: no-force]                                                  │\n│ --headless      --no-headless                          Run browser in headless mode. [default: no-headless]                 │\n│ --help                                                 Show this message and exit.                                          
│\n╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n```\n\n### Python\n\n```python\nimport asyncio\nfrom pathlib import Path\n\nfrom scrapix import GoogleImageScraper\n\n\nasync def scrape(\n    query: str,\n    save_dir: Path,\n    limit: int,\n    keywords: list[str],\n    min_res: tuple[int, int] | None,\n    max_res: tuple[int, int] | None,\n):\n    scraper = await GoogleImageScraper.create(save_dir)\n    # search for images and download each image to disk\n    async for url in scraper.get_image_urls(\n        query, limit=limit, keywords=keywords, min_res=min_res, max_res=max_res\n    ):\n        url.download(save_dir=save_dir)\n\n\nasyncio.run(\n    scrape(\n        query=\"duck\",\n        save_dir=Path(\"./images\"),\n        limit=10,\n        keywords=[\"rubber\", \"toy\"],\n        min_res=(640, 640),\n        max_res=(1200, 1200),\n    )\n)\n```\n\n### Graphical User Interface\n\nScrapix provides a simple GUI built with [Streamlit](https://streamlit.io/) to showcase its use. Run\n\n```bash\nstreamlit run scrapix/gui/app.py\n```\n\nto start the GUI in a browser.\n\n\u003cdiv align=\"center\" width=\"100%\"\u003e\n  \u003cimg src=\"resources/screenshots/gui-1.png\" width=\"45%\"/\u003e\n  \u003cimg src=\"resources/screenshots/gui-2.png\" width=\"45%\"/\u003e\n\u003c/div\u003e\n\n## Headless scraping\n\n`headless` mode runs a browser without a user interface, allowing the scraping to run in the background. You can turn on `headless` mode by using the `--headless` option with the CLI, or by passing `headless=True` to `GoogleImageScraper`. 
**In headless mode, the viewport cannot be modified and is set to Chrome's headless default (800 x 600 pixels).**\n\n## CSS selectors\n\nGoogle Search uses generated class names and ids for the HTML elements displayed on the results page, and these names and ids change from time to time, making it harder to select the relevant elements on the page. The CSS selectors that Scrapix uses can be configured with environment variables. There are two CSS selectors:\n\n1. `THUMBNAIL_DIV_SELECTOR=\"div.F0uyec\"` is the CSS selector for thumbnail divs on the Google Image results page.\n2. `IMAGE_CLASSES='[\"n3VNCb\", \"iPVvYb\", \"r48jcc\", \"pT0Scc\"]'` is the list of possible CSS classes for the source image displayed after clicking on a thumbnail in the results page.\n\nThese values can be overridden either by setting the `THUMBNAIL_DIV_SELECTOR` or `IMAGE_CLASSES` environment variables in your shell, or by creating a `.env` file with these variables in your working directory.\n\n## Debug\n\nBy default, Scrapix will save a screenshot (`screenshot.png`) and the source HTML (`page.html`) of the current page whenever an exception occurs during scraping and at the end of scraping. These can be used for debugging, to check what may have caused the error.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonasrenault%2Fscrapix","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjonasrenault%2Fscrapix","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonasrenault%2Fscrapix/lists"}