{"id":38668533,"url":"https://github.com/openculinary/crawler","last_synced_at":"2026-01-17T09:52:28.814Z","repository":{"id":48064802,"uuid":"223436854","full_name":"openculinary/crawler","owner":"openculinary","description":"The RecipeRadar crawler provides an abstraction layer over external recipe websites, returning data in a format which can be ingested into the RecipeRadar search engine","archived":false,"fork":false,"pushed_at":"2025-05-08T13:21:53.000Z","size":657,"stargazers_count":6,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-08T14:32:33.727Z","etag":null,"topics":["flask"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openculinary.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-11-22T15:54:10.000Z","updated_at":"2025-05-08T13:21:57.000Z","dependencies_parsed_at":"2024-01-11T20:55:27.428Z","dependency_job_id":"b72ed26d-5b8b-445f-b81b-b16df3b263af","html_url":"https://github.com/openculinary/crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/openculinary/crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openculinary%2Fcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openculinary%2Fcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openculinary%2Fcrawler/releases","manifests_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/openculinary%2Fcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openculinary","download_url":"https://codeload.github.com/openculinary/crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openculinary%2Fcrawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28505565,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T06:57:29.758Z","status":"ssl_error","status_checked_at":"2026-01-17T06:56:03.931Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flask"],"created_at":"2026-01-17T09:52:27.140Z","updated_at":"2026-01-17T09:52:28.801Z","avatar_url":"https://github.com/openculinary.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RecipeRadar Crawler\n\nThe RecipeRadar crawler provides an abstraction layer over external recipe websites, returning data in a format which can be ingested into the RecipeRadar search engine.\n\nMuch of this is possible thanks to the open source [recipe-scrapers](https://pypi.org/project/recipe-scrapers) library; any improvements, fixes, and site coverage added there will benefit the crawler service.\n\nIn addition, scripts are provided to crawl from two readily-available sources of recipe URLs:\n\n* `openrecipes` - a set of 
~175k public recipe URLs\n* `reciperadar` - the set of recipe URLs already known to RecipeRadar\n\nThe `reciperadar` set is useful during changes to the crawling and indexing components of the RecipeRadar application itself; it provides a quick way to recrawl and reindex existing recipes.\n\nOutbound requests are routed via [squid](https://www.squid-cache.org) to avoid burdening origin recipe sites with repeated content retrieval requests.\n\n## Install dependencies\n\nMake sure to follow the RecipeRadar [infrastructure](https://codeberg.org/openculinary/infrastructure) setup to ensure all cluster dependencies are available in your environment.\n\n## Development\n\nTo install development tools and run linting and tests locally, execute the following command:\n\n```sh\n$ make lint tests\n```\n\n## Local Deployment\n\nTo deploy the service to the local infrastructure environment, execute the following commands:\n\n```sh\n$ make\n$ make deploy\n```\n\n## Operations\n\n### Initial data load\n\nTo crawl and index `openrecipes` from scratch, execute the following commands:\n\n```sh\n$ cd openrecipes\n$ make\n$ venv/bin/python crawl.py\n```\n\nNB: This requires you to download the [openrecipes](https://github.com/fictivekin/openrecipes) dataset and extract it to a file named 'recipes.json'.\n\n### Recrawling and reindexing\n\nTo recrawl and reindex the entire known `reciperadar` recipe set, execute the following commands:\n\n```sh\n$ cd reciperadar\n$ make\n$ venv/bin/python crawl_urls.py --recrawl\n```\n\nTo reindex `reciperadar` recipes containing products named `tofu`, execute the following commands:\n\n```sh\n$ cd reciperadar\n$ make\n$ venv/bin/python recipes.py --reindex --where \"exists (select * from recipe_ingredients as ri join product_names as pn on pn.id = ri.product_name_id where ri.recipe_id = recipes.id and pn.singular = 'tofu')\"\n```\n\nNB: Running either of these commands without the `--reindex` / `--recrawl` argument will run in a 'safe mode' and tell 
you about the entities which match your query, without performing any actions on them.\n\n### Proxy selection\n\nIndividual websites may sometimes block or rate-limit the crawler; it's best to avoid making too many requests to any one website, and to be as respectful as possible of their operational and network costs.\n\nWhen that happens, it can be worth temporarily switching the crawler to use an anonymized proxy service. Until this is available as a configuration setting, this can be done by updating the crawler application code and redeploying the service.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenculinary%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenculinary%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenculinary%2Fcrawler/lists"}