{"id":18623561,"url":"https://github.com/codica2/simple-scraper","last_synced_at":"2025-04-11T03:31:45.996Z","repository":{"id":54985915,"uuid":"183619990","full_name":"codica2/simple-scraper","owner":"codica2","description":"A fairly simple gem that will help you simplify the parsing of web pages.","archived":false,"fork":false,"pushed_at":"2021-01-17T18:30:41.000Z","size":79,"stargazers_count":11,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-25T09:01:41.513Z","etag":null,"topics":["gem","parser","parsing","scraper"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codica2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-26T11:52:46.000Z","updated_at":"2024-05-31T16:37:29.000Z","dependencies_parsed_at":"2022-08-14T08:10:51.204Z","dependency_job_id":null,"html_url":"https://github.com/codica2/simple-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codica2%2Fsimple-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codica2%2Fsimple-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codica2%2Fsimple-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codica2%2Fsimple-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codica2","download_url":"https://codeload.github.com/codica2/simple-scraper/tar.gz/refs/heads/master","hos
t":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248335474,"owners_count":21086601,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gem","parser","parsing","scraper"],"created_at":"2024-11-07T04:25:02.534Z","updated_at":"2025-04-11T03:31:40.984Z","avatar_url":"https://github.com/codica2.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Simple Scraper\n\nThis is a fairly simple gem that will help you simplify the parsing of web pages.\n\n## How it works\n\nGem is based on several libraries that do most of the work:\n- [HTTParty](https://github.com/jnunemaker/httparty) is an HTTP client\n- [Parallel](https://github.com/grosser/parallel) allows performing queries in multiple threads\n- [Nokogiri](https://github.com/sparklemotion/nokogiri) is an HTML, XML, SAX, and Reader parser\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'simple-scraper'\n```\n\nAnd then execute:\n\n    $ bundle\n\nOr install it yourself in the following way:\n\n    $ gem install simple-scraper\n\n## Usage\n\n```ruby\nrequire 'simple/scraper'\n\nscraper = Simple::Scraper::Parser.new(\n    title: { selector: \"//h1[@class='title']\", handler: -\u003e(els) { els.first.text }, default: 'Ruby' },\n    summary: { selector: \"//h2[@class='summary']\", handler: -\u003e(els) { els.first.text } },\n    link: { selector: \"//a[@class='link']\", handler: -\u003e(els) { els.first['href'] } },\n    text_array: { selector: \"//*[@class='link']\", handler: -\u003e(els) { els.map(\u0026:text) } }\n)\n\nresult1 = 
scraper.parse('https://www.codica.com/')\nresult2 = scraper.parse(['https://www.codica.com/1', 'https://www.codica.com/2'])\n```\nThe response will be similar to:\n```json\n[\n  {\n    \"title\": \"scraped title text\",\n    \"summary\": \"scraped summary text\",\n    \"link\": \"https://www.codica.com/blog/top-ruby-gems-we-cant-live-without/\",\n    \"text_array\": [\"text\", \"text\" ...]\n  },\n  ...\n]\n```\nOr just find a page:\n```ruby\nSimple::Scraper::Finder.find(url: 'https://www.codica.com/', query: {}, headers: {}) do |page|\n  # page is an instance of Nokogiri::HTML::Document\nend\n```\n\n### Scraper attributes\n\n- *`title, summary, link, text_array`* - Arbitrary hash keys; name them whatever you want.\n- *`selector`* - An XPath expression used to find the desired elements on the page.\n- *`handler`* - Any Ruby object that responds to `#call` (a `proc`, a `lambda`, or an instance of a plain Ruby class that defines `#call`). The handler receives one argument: an array of the elements found on the page. Each element is an instance of `Nokogiri::XML::Element`. 
You can read [Nokogiri](https://github.com/sparklemotion/nokogiri) documentation for more info.\n- *`default`* - In case scraper cannot find the desired element using `selector`, the value provided for the `default` attribute will be returned.\n\n### Query parameters and headers\n\n```ruby\nquery = { page: 2 }\nheaders = { 'Authorization': 'Bearer' }\nresult = scraper.parse('https://www.codica.com/', query: query, headers: headers)\n```\n\n## Configuration\n\n### Proxy\n\n```ruby\nSimple::Scraper.configure do |config|\n  config.proxy_addr = 'proxy.something.com'\n  config.proxy_port = 80\n  config.proxy_user = 'user:'\n  config.proxy_pass = 'password'\nend\n```\n\n### Logging\n\n```ruby\nSimple::Scraper.configure do |config|\n  config.logger = Logger.new('path/to/my/logs')\nend\n```\n\u003e By default the logging is turned off\n\n### Multithreading\n\n```ruby\nSimple::Scraper.configure do |config|\n  config.number_of_threads = 20\nend\n```\n\u003e By default scraper works in 1 thread.\n\n### Reset\n\nYou might need to reset configuration to defaults\n\n```ruby\nSimple::Scraper.reset\n```\n\n\u003e Now you can provide new configuration if needed\n\n## License\nCopyright © 2015-2019 Codica. It is released under the [MIT License](https://opensource.org/licenses/MIT).\n\n## About Codica\n\n[![Codica logo](https://www.codica.com/assets/images/logo/logo.svg)](https://www.codica.com)\n\nsimple-scraper is maintained and funded by Codica. The names and logos for Codica are trademarks of Codica.\n\nWe love open source software! See [our other projects](https://github.com/codica2) or [hire us](https://www.codica.com/) to design, develop, and grow your product.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodica2%2Fsimple-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodica2%2Fsimple-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodica2%2Fsimple-scraper/lists"}