{"id":16398572,"url":"https://github.com/metalwarrior665/actor-rust-scraper","last_synced_at":"2025-09-03T15:31:03.208Z","repository":{"id":53777947,"uuid":"176044622","full_name":"metalwarrior665/actor-rust-scraper","owner":"metalwarrior665","description":"Experimental scraper in Rust suited for running locally or on the Apify platform. Inspired by Apify SDK.","archived":false,"fork":false,"pushed_at":"2023-12-12T19:51:09.000Z","size":174,"stargazers_count":13,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-04T12:52:49.272Z","etag":null,"topics":["apify","rust","web-scraper"],"latest_commit_sha":null,"homepage":"https://apify.com/lukaskrivka/rust-scraper","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/metalwarrior665.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-17T01:52:56.000Z","updated_at":"2025-02-26T23:08:08.000Z","dependencies_parsed_at":"2024-10-28T15:26:32.154Z","dependency_job_id":"030798ed-686d-4f77-8b6a-abe22f044c63","html_url":"https://github.com/metalwarrior665/actor-rust-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/metalwarrior665/actor-rust-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalwarrior665%2Factor-rust-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalwarrior665%2Factor-rust-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/metalwarrior665%2Factor-rust-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalwarrior665%2Factor-rust-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/metalwarrior665","download_url":"https://codeload.github.com/metalwarrior665/actor-rust-scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalwarrior665%2Factor-rust-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273465776,"owners_count":25110829,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-03T02:00:09.631Z","response_time":76,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apify","rust","web-scraper"],"created_at":"2024-10-11T05:13:10.420Z","updated_at":"2025-09-03T15:31:02.808Z","avatar_url":"https://github.com/metalwarrior665.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- toc start --\u003e\r\n## Rust Scraper\r\n\r\n\u003c!-- toc end --\u003e\r\n**This is a very early version meant for experimentation. Use at your own risk!**\r\n\r\nSpeed-of-light scraping with the Rust programming language. 
This is meant to be a faster (but less flexible) version of Apify's JavaScript-based [Cheerio Scraper](https://apify.com/apify/cheerio-scraper).\r\n\r\nRust is one of the fastest programming languages out there. In many cases, it matches the speed of C. Although JavaScript offers huge flexibility and development speed, we can use Rust to significantly speed up the crawling and/or reduce costs. Rust Scraper is both faster and uses less memory.\r\n\r\n### Changelog\r\nYou can read about fixes and updates in the detailed [changelog file](https://github.com/metalwarrior665/actor-rust-scraper/blob/master/CHANGELOG.md).\r\n\r\n### WARNING!!! Don't DDoS a website!\r\nBecause this scraper is so fast, you can easily take a website down. This matters especially if you scrape **more than a few hundred URLs** and use the **async** scraping mode.\r\nHow to prevent that:\r\n- Set a reasonable value for the `max_concurrency` input field. You can still scrape very fast and with a tiny memory footprint if you set it below `10`.\r\n- If you want to set a high `max_concurrency`, only scrape large websites that can handle a load of 1000 requests/second or more.\r\n- Use a large pool of proxies so they are not immediately banned.\r\n\r\n**If we see you abusing this scraper for attacks on the Apify platform, your account can be banned**.\r\n\r\n### Why is it faster/cheaper than Cheerio Scraper?\r\nRust is a statically typed language compiled directly into machine code. Because of this, the compiler can optimize the code into very efficient structures and algorithms. 
Of course, it is also the job of the programmer to write efficient code, so we expect further improvements to this scraper.\r\n\r\n- HTML parsing is about 3 times faster because of efficient data structures.\r\n- HTTP requests are also faster.\r\n- Very efficient async implementation with futures (promises in JS).\r\n- Can offload work to other CPU cores via system threads and scales to full actor memory (native JS doesn't support user-created threads).\r\n- Much lower memory usage due to efficient data structures.\r\n\r\n### Limitations of this actor (some will be solved in the future)\r\n- This actor only works for scraping pure HTML websites (basically an alternative to [Cheerio Scraper](https://apify.com/apify/cheerio-scraper)).\r\n- You can only provide a static list of URLs; it cannot enqueue any more.\r\n- It doesn't have a page function, only a simplified interface (the `extract` object) to define what should be scraped.\r\n- Retries are very simplistic.\r\n- It doesn't have a sophisticated concurrency system. It will grow to `max_concurrency` unless the CPU gets overwhelmed.\r\n\r\n### Input\r\nInput is a JSON object. Its properties are explained in detail on the [Apify Store page](https://apify.com/lukaskrivka/rust-scraper/input-schema). You can also set it up on the Apify platform with a nice UI.\r\n\r\n### Data extraction\r\nYou need to provide an [extraction configuration object](https://apify.com/lukaskrivka/rust-scraper/input-schema#extract). 
This object defines the selectors to find on the page, what to extract from each selected element, and the names of the fields the data should be saved under.\r\n\r\n`extract` (array) is an array of objects where each object has:\r\n- `field_name` (string) Defines which field the data will be assigned to in your resulting dataset\r\n- `selector` (string) CSS selector to find the data to extract\r\n- `extract_type` (object) What to extract\r\n    - `type` (string) Can be `Text` or `Attribute`\r\n    - `content` (string) Name of the attribute to extract. Provide only when `type` is `Attribute`\r\n\r\nFull INPUT example:\r\n```\r\n{\r\n    \"proxy_settings\": {\r\n        \"useApifyProxy\": true,\r\n        \"apifyProxyGroups\": [\"SHADER\"]\r\n    },\r\n    \"urls\": [\r\n        { \"url\": \"https://www.amazon.com/dp/B01CYYU8YW\" },\r\n        { \"url\": \"https://www.amazon.com/dp/B01FXMDA2O\" },\r\n        { \"url\": \"https://www.amazon.com/dp/B00UNT0Y2M\" }\r\n    ],\r\n    \"extract\": [\r\n        {\r\n            \"field_name\": \"title\",\r\n            \"selector\": \"#productTitle\",\r\n            \"extract_type\": {\r\n                \"type\": \"Text\"\r\n            }\r\n        },\r\n        {\r\n            \"field_name\": \"customer_reviews\",\r\n            \"selector\": \"#acrCustomerReviewText\",\r\n            \"extract_type\": {\r\n                \"type\": \"Text\"\r\n            }\r\n        },\r\n        {\r\n            \"field_name\": \"seller_link\",\r\n            \"selector\": \"#bylineInfo\",\r\n            \"extract_type\": {\r\n                \"type\": \"Attribute\",\r\n                \"content\": \"href\"\r\n            }\r\n        }\r\n    ]\r\n}\r\n```\r\n\r\nOutput example in JSON (this depends purely on your `extract` config):\r\n```\r\n[\r\n    {\r\n        \"seller_link\":\"/Propack/b/ref=bl_dp_s_web_3039360011?ie=UTF8\u0026node=3039360011\u0026field-lbr_brands_browse-bin=Propack\",\r\n        \"customer_reviews\":\"208 customer reviews\",\r\n        \"title\":\"Propack 
Twist - Tie Gallon Size Storage Bags 100 Bags Pack Of 4\"\r\n    },\r\n    {\r\n        \"seller_link\":\"/Ziploc/b/ref=bl_dp_s_web_2581449011?ie=UTF8\u0026node=2581449011\u0026field-lbr_brands_browse-bin=Ziploc\",\r\n        \"customer_reviews\":\"561 customer reviews\",\r\n        \"title\":\"Ziploc Gallon Slider Storage Bags, 96 Count\"\r\n    },\r\n    {\r\n        \"seller_link\":\"/Reynolds/b/ref=bl_dp_s_web_2599601011?ie=UTF8\u0026node=2599601011\u0026field-lbr_brands_browse-bin=Reynolds\",\r\n        \"customer_reviews\":\"456 customer reviews\",\r\n        \"title\":\"Reynolds Wrap Aluminum Foil (200 Square Foot Roll)\"\r\n    }\r\n]\r\n```\r\n\r\n### Local usage\r\nYou can run this locally if you have Rust installed. You need to build it before running. If you want to use Apify Proxy, don't forget to add your `APIFY_PROXY_PASSWORD` to the environment; otherwise you will get a nasty error.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmetalwarrior665%2Factor-rust-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmetalwarrior665%2Factor-rust-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmetalwarrior665%2Factor-rust-scraper/lists"}