{"id":26294599,"url":"https://github.com/afiore/extraloop","last_synced_at":"2025-10-12T08:43:03.680Z","repository":{"id":2094164,"uuid":"3034617","full_name":"afiore/extraloop","owner":"afiore","description":"Ruby online data extraction toolkit","archived":false,"fork":false,"pushed_at":"2012-03-27T15:41:42.000Z","size":216,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-12T08:43:03.052Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://github.com/afiore/extraloop","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/afiore.png","metadata":{"files":{"readme":"README.md","changelog":"History.txt","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-12-22T15:23:36.000Z","updated_at":"2013-12-18T07:23:58.000Z","dependencies_parsed_at":"2022-07-14T20:00:33.810Z","dependency_job_id":null,"html_url":"https://github.com/afiore/extraloop","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/afiore/extraloop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afiore%2Fextraloop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afiore%2Fextraloop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afiore%2Fextraloop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afiore%2Fextraloop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/afiore","download_url":"https://codeload.github.com/afiore/extraloop/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/afiore%2Fextraloop/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279010789,"owners_count":26084807,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-15T03:31:09.039Z","updated_at":"2025-10-12T08:43:03.664Z","avatar_url":"https://github.com/afiore.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Extra Loop\n\nA Ruby library for extracting structured data from websites and web based APIs. \nSupports most common document formats (i.e. 
HTML, XML, CSV, and JSON), and comes with a handy mechanism \nfor iterating over paginated datasets.\n\n## Installation:\n\n    gem install extraloop\n\n## Usage:\n\nA basic scraper that fetches the top 25 websites from [Alexa's daily top 100](http://www.alexa.com/topsites) list:\n\n    alexa_scraper = ExtraLoop::ScraperBase.\n      new(\"http://www.alexa.com/topsites\").\n      loop_on(\"li.site-listing\").\n        extract(:site_name, \"h2\").\n        extract(:url, \"h2 a\").\n        extract(:description, \".description\").\n      on(:data) { |data| data.each { |record| puts record.site_name } }\n\n    alexa_scraper.run\n\nAn iterative scraper that fetches URL, title, and publisher from some 110 Google News articles mentioning the keyword _'Egypt'_.\n\n    results = []\n\n    ExtraLoop::IterativeScraper.\n      new(\"https://www.google.com/search?tbm=nws\u0026q=Egypt\").\n      set_iteration(:start, (1..101).step(10)).\n      loop_on(\"h3\") { |nodes| nodes.map(\u0026:parent) }.\n        extract(:title, \"h3.r a\").\n        extract(:url, \"h3.r a\", :href).\n        extract(:source, \"br\") { |node| node.next.text.split(\"-\").first }.\n      on(:data) { |data, response| data.each { |record| results \u003c\u003c record } }.\n      run()\n\n\n## Scraper initialisation signature\n\n    #new(urls, scraper_options, http_options)\n\n- __urls__ - a single URL, or an array of several URLs.\n- __scraper_options__ - hash of scraper options (see below).\n- __http_options__ - hash of request options for `Typhoeus::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).\n\n### scraper options:\n\n* __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'. 
\n* __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.\n* __log__ - Logging options hash:\n     * __loglevel__  - a symbol specifying the desired log level (defaults to `:info`).\n     * __appenders__ - a list of `Logging.appenders` objects (defaults to `Logging.appenders.stderr`).\n\n## Extractors\n\nExtraLoop fetches structured data from online documents by looping through a list of elements matching a given selector.\nFor each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such a loop, the `extract` \nmethod extracts a specific piece of information from an element (e.g. a story's title) and stores it into a record's field.\n\n    # looping over a set of document elements using a CSS3 (or XPath) selector\n    loop_on('div.post')\n\n    # looping over the elements returned by a block\n\n    loop_on { |doc| doc.search('div.post') }\n\n    # using both a selector and a proc (the matched element list is passed in to the proc as its first argument)\n\n    loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }\n\nBoth the `loop_on` and the `extract` methods may be called with a selector, a block or a combination of the two. By default, when parsing DOM documents, `extract` will call\n`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and a block. The latter is evaluated in the context of the current iteration's element. 
\n\n    # extract a story's title \n    extract(:title, 'h3')\n\n    # extract a story's url\n    extract(:url, \"a.link-to-story\", :href)\n\n    # extract a description text, separating paragraphs with newlines \n    extract(:description, \"div.description\") { |node| node.css(\"p\").map(\u0026:text).join(\"\\n\") }\n\n### Extracting data from JSON Documents\n\nWhile processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at \nthe `Content-Type` header sent by the server. This value can be overridden by providing a `:format` key in the scraper's \ninitialization options. When the format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash. \nIn this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except that they do not support \nCSS3/XPath selectors.\n\nWhen working with JSON data, you can just use a block and have it return the document elements you want to loop on.\n\n    # Fetch a portion of a document using a proc\n    loop_on { |data| data['query']['categorymembers'] }\n\nAlternatively, the same loop can be defined by passing an array of keys pointing at a hash value located \nseveral levels deep in the parsed document structure.\n\n    # Same as above, using a hash path\n    loop_on(['query', 'categorymembers'])\n\nWhen fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys.\n
If called with only\none argument, it will try to fetch a hash value using the provided field name as key.\n\n    # current node:\n    #\n    # {\n    #  'from_user' =\u003e \"johndoe\", \n    #  'text' =\u003e 'bla bla bla',\n    #  'from_user_id'..\n    # }\n\n    # \u003e\u003e extract(:from_user)\n    # =\u003e \"johndoe\"\n\n\n## Iteration methods\n\nThe `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.\n\n### set\\_iteration\n\n* __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.\n* __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non-empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non-empty array, the iteration stops.\n\n### continue\\_with\n\nThe second iteration method, `#continue_with`, continues an iteration as long as a block of code returns a truthy value (to be assigned to the iteration parameter).\n\n* __iteration_parameter__ - the scraper's iteration parameter.\n* __\u0026block__ - An arbitrary block of Ruby code; its return value will be used to determine the value of the next iteration's offset parameter.\n\n## Running tests\n\nExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:\n\n    cd spec\n    rspec *\n    \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fafiore%2Fextraloop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fafiore%2Fextraloop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fafiore%2Fextraloop/lists"}