{"id":13492880,"url":"https://github.com/glaucocustodio/tanakai","last_synced_at":"2025-03-28T11:30:57.543Z","repository":{"id":56740726,"uuid":"291950621","full_name":"glaucocustodio/tanakai","owner":"glaucocustodio","description":"Tanakai is a modern web scraping framework written in Ruby. A fork of Kimurai.","archived":false,"fork":true,"pushed_at":"2023-12-14T18:45:10.000Z","size":210,"stargazers_count":260,"open_issues_count":0,"forks_count":15,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-04-14T06:55:25.451Z","etag":null,"topics":["chrome-headless","crawler","kimurai","scraper","scrapy","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"vifreefly/kimuraframework","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/glaucocustodio.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-09-01T09:10:25.000Z","updated_at":"2024-04-09T18:33:50.000Z","dependencies_parsed_at":"2023-02-12T05:17:00.765Z","dependency_job_id":null,"html_url":"https://github.com/glaucocustodio/tanakai","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glaucocustodio%2Ftanakai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glaucocustodio%2Ftanakai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glaucocustodio%2Ftanakai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glaucocustodio%2Ftanakai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/glaucocustodio"
,"download_url":"https://codeload.github.com/glaucocustodio/tanakai/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246020801,"owners_count":20710823,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chrome-headless","crawler","kimurai","scraper","scrapy","webscraping"],"created_at":"2024-07-31T19:01:10.139Z","updated_at":"2025-03-28T11:30:57.222Z","avatar_url":"https://github.com/glaucocustodio.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"# 🕷 Tanakai\n\n\u003csub\u003e[Liphistius tanakai](https://wsc.nmbe.ch/species/58479/Liphistius_tanakai)\u003c/sub\u003e\n\nTanakai intends to be a maintained fork of [Kimurai](https://github.com/vifreefly/kimuraframework), a modern web scraping framework written in Ruby which **works out of the box with Apparition, Cuprite, Headless Chromium/Firefox and PhantomJS**, or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**\n\n### Goals of this fork:\n\n- [x] add support to [Apparition](https://github.com/twalpole/apparition) and [Cuprite](https://github.com/rubycdp/cuprite)\n- [x] add support to Ruby 3\n- [ ] write tests with RSpec\n- [ ] improve configuration options for Apparition and Cuprite (both have been recently added)\n- [ ] create an awesome logo in the likes of [this](https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png)\n- [ ] have you as new contributor\n\nTanakai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and 
[Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:\n\n```ruby\n# github_spider.rb\nrequire 'tanakai'\n\nclass GithubSpider \u003c Tanakai::Base\n  @name = \"github_spider\"\n  @engine = :selenium_chrome\n  @start_urls = [\"https://github.com/search?q=Ruby%20Web%20Scraping\"]\n  @config = {\n    user_agent: \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36\",\n    before_request: { delay: 4..7 }\n  }\n\n  def parse(response, url:, data: {})\n    response.xpath(\"//ul[@class='repo-list']/div//h3/a\").each do |a|\n      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)\n    end\n\n    if next_page = response.at_xpath(\"//a[@class='next_page']\")\n      request_to :parse, url: absolute_url(next_page[:href], base: url)\n    end\n  end\n\n  def parse_repo_page(response, url:, data: {})\n    item = {}\n\n    item[:owner] = response.xpath(\"//h1//a[@rel='author']\").text\n    item[:repo_name] = response.xpath(\"//h1/strong[@itemprop='name']/a\").text\n    item[:repo_url] = url\n    item[:description] = response.xpath(\"//span[@itemprop='about']\").text.squish\n    item[:tags] = response.xpath(\"//div[@id='topics-list-container']/div/a\").map { |a| a.text.squish }\n    item[:watch_count] = response.xpath(\"//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]\").text.squish\n    item[:star_count] = response.xpath(\"//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]\").text.squish\n    item[:fork_count] = response.xpath(\"//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]\").text.squish\n    item[:last_commit] = response.xpath(\"//span[@itemprop='dateModified']/*\").text\n\n    save_to \"results.json\", item, format: :pretty_json\n  end\nend\n\nGithubSpider.crawl!\n```\n\n\u003cdetails/\u003e\n  \u003csummary\u003eRun: \u003ccode\u003e$ ruby 
github_spider.rb\u003c/code\u003e\u003c/summary\u003e\n\n```\nI, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Spider: started: github_spider\nD, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance\nD, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled `browser before_request delay`\nD, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 7 seconds before request...\nD, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled custom user-agent\nD, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode\nI, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping\nI, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping\nI, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Info: visits: requests: 1, responses: 1\nD, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 107968\nD, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 5 seconds before request...\nI, [2018-08-22 13:08:32 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping\nI, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping\nI, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Info: 
visits: requests: 2, responses: 2\nD, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 212542\nD, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 4 seconds before request...\nI, [2018-08-22 13:08:37 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector\n\n...\n\nI, [2018-08-22 13:23:07 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight\nI, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight\nI, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Info: visits: requests: 140, responses: 140\nD, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 204198\nI, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: driver selenium_chrome has been destroyed\n\nI, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Spider: stopped: {:spider_name=\u003e\"github_spider\", :status=\u003e:completed, :environment=\u003e\"development\", :start_time=\u003e2018-08-22 13:08:03 +0400, :stop_time=\u003e2018-08-22 13:23:08 +0400, :running_time=\u003e\"15m, 5s\", :visits=\u003e{:requests=\u003e140, :responses=\u003e140}, :error=\u003enil}\n```\n\u003c/details\u003e\n\n\u003cdetails/\u003e\n  \u003csummary\u003eresults.json\u003c/summary\u003e\n\n```json\n[\n  {\n    \"owner\": \"lorien\",\n    \"repo_name\": \"awesome-web-scraping\",\n    \"repo_url\": \"https://github.com/lorien/awesome-web-scraping\",\n    \"description\": \"List of libraries, tools and APIs for web scraping and data processing.\",\n    \"tags\": [\n      \"awesome\",\n      \"awesome-list\",\n      \"web-scraping\",\n      
\"data-processing\",\n      \"python\",\n      \"javascript\",\n      \"php\",\n      \"ruby\"\n    ],\n    \"watch_count\": \"159\",\n    \"star_count\": \"2,423\",\n    \"fork_count\": \"358\",\n    \"last_commit\": \"4 days ago\",\n    \"position\": 1\n  },\n\n  ...\n\n  {\n    \"owner\": \"preston\",\n    \"repo_name\": \"idclight\",\n    \"repo_url\": \"https://github.com/preston/idclight\",\n    \"description\": \"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.\",\n    \"tags\": [\n\n    ],\n    \"watch_count\": \"6\",\n    \"star_count\": \"1\",\n    \"fork_count\": \"0\",\n    \"last_commit\": \"on Apr 12, 2012\",\n    \"position\": 127\n  }\n]\n```\n\u003c/details\u003e\u003cbr\u003e\n\nOkay, that was easy. How about JavaScript rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:\n\n```ruby\n# infinite_scroll_spider.rb\nrequire 'tanakai'\n\nclass InfiniteScrollSpider \u003c Tanakai::Base\n  @name = \"infinite_scroll_spider\"\n  @engine = :selenium_chrome\n  @start_urls = [\"https://infinite-scroll.com/demo/full-page/\"]\n\n  def parse(response, url:, data: {})\n    posts_headers_path = \"//article/h2\"\n    count = response.xpath(posts_headers_path).count\n\n    loop do\n      browser.execute_script(\"window.scrollBy(0,10000)\") ; sleep 2\n      response = browser.current_response\n\n      new_count = response.xpath(posts_headers_path).count\n      if count == new_count\n        logger.info \"\u003e Pagination is done\" and break\n      else\n        count = new_count\n        logger.info \"\u003e Continue scrolling, current count is #{count}...\"\n      end\n    end\n\n    posts_headers = response.xpath(posts_headers_path).map(\u0026:text)\n    logger.info \"\u003e All posts from page: #{posts_headers.join('; ')}\"\n  
end\nend\n\nInfiniteScrollSpider.crawl!\n```\n\n\u003cdetails/\u003e\n  \u003csummary\u003eRun: \u003ccode\u003e$ ruby infinite_scroll_spider.rb\u003c/code\u003e\u003c/summary\u003e\n\n```\nI, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider\nD, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): created browser instance\nD, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode\nI, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/\nI, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/\nI, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1\nD, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: Browser: driver.current_memory: 95463\nI, [2018-08-22 13:33:05 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: \u003e Continue scrolling, current count is 5...\nI, [2018-08-22 13:33:18 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: \u003e Continue scrolling, current count is 9...\nI, [2018-08-22 13:33:20 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: \u003e Continue scrolling, current count is 11...\nI, [2018-08-22 13:33:26 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: \u003e Continue scrolling, current count is 13...\nI, [2018-08-22 13:33:28 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: \u003e Continue scrolling, current count is 15...\nI, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- 
infinite_scroll_spider: \u003e Pagination is done\nI, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: \u003e All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released\nI, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed\nI, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Spider: stopped: {:spider_name=\u003e\"infinite_scroll_spider\", :status=\u003e:completed, :environment=\u003e\"development\", :start_time=\u003e2018-08-22 13:32:57 +0400, :stop_time=\u003e2018-08-22 13:33:30 +0400, :running_time=\u003e\"33s\", :visits=\u003e{:requests=\u003e1, :responses=\u003e1}, :error=\u003enil}\n\n```\n\u003c/details\u003e\u003cbr\u003e\n\n\n## Features\n* Scrape JavaScript rendered websites out of the box\n* Supported engines: [Apparition](https://github.com/twalpole/apparition), [Cuprite](https://github.com/rubycdp/cuprite), [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)\n* Write spider code once, and use it with any supported engine later\n* All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. 
to interact with web pages\n* Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**\n* Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates) to skip duplicates\n* Automatically [handle request errors](#handling-request-errors)\n* Automatically restart browsers when reaching the memory limit [**(memory control)**](#spider-config) or the request limit\n* Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)\n* [Parallel scraping](#parallel-crawling-using-in_parallel) using the simple `in_parallel` method\n* **Two modes:** use a single file for a simple spider, or [generate](#project-mode) a Scrapy-like **project**\n* Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))\n* Automated [server environment setup](#setup) (for Ubuntu 18.04) and [deploy](#deploy) using commands `tanakai setup` and `tanakai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)\n* Command-line [runner](#runner) to run all project spiders one-by-one or in parallel\n\n## Table of Contents\n* [Tanakai](#tanakai)\n  * [Features](#features)\n  * [Table of Contents](#table-of-contents)\n  * [Installation](#installation)\n  * [Getting to Know](#getting-to-know)\n    * [Interactive console](#interactive-console)\n    * [Available engines](#available-engines)\n    * [Minimum required spider structure](#minimum-required-spider-structure)\n    * [Method arguments response, url and data](#method-arguments-response-url-and-data)\n    * [browser object](#browser-object)\n    * [request_to method](#request_to-method)\n    * [save_to helper](#save_to-helper)\n    * [Skip duplicates](#skip-duplicates)\n      * 
[Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)\n      * [Storage object](#storage-object)\n    * [Handling request errors](#handling-request-errors)\n      * [skip_request_errors](#skip_request_errors)\n      * [retry_request_errors](#retry_request_errors)\n    * [Logging custom events](#logging-custom-events)\n    * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)\n    * [TANAKAI_ENV](#tanakai_env)\n    * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)\n    * [Active Support included](#active-support-included)\n    * [Schedule spiders using Cron](#schedule-spiders-using-cron)\n    * [Configuration options](#configuration-options)\n    * [Using Tanakai inside existing Ruby applications](#using-tanakai-inside-existing-ruby-applications)\n      * [crawl! method](#crawl-method)\n      * [parse! method](#parsemethod_name-url-method)\n      * [Tanakai.list and Tanakai.find_by_name](#tanakailist-and-tanakaifind_by_name)\n    * [Automated sever setup and deployment](#automated-sever-setup-and-deployment)\n      * [Setup](#setup)\n      * [Deploy](#deploy)\n  * [Spider @config](#spider-config)\n    * [All available @config options](#all-available-config-options)\n    * [@config settings inheritance](#config-settings-inheritance)\n  * [Project mode](#project-mode)\n    * [Generate new spider](#generate-new-spider)\n    * [Crawl](#crawl)\n    * [List](#list)\n    * [Parse](#parse)\n    * [Pipelines, send_item method](#pipelines-send_item-method)\n    * [Runner](#runner)\n      * [Runner callbacks](#runner-callbacks)\n  * [Chat Support and Feedback](#chat-support-and-feedback)\n  * [License](#license)\n\n\n## Installation\nTanakai requires Ruby version `\u003e= 2.5.0`. 
Supported platforms: `Linux` and `Mac OS X`.\n\n1) If your system doesn't have the appropriate Ruby version, install it:\n\n\u003cdetails/\u003e\n  \u003csummary\u003eUbuntu 18.04\u003c/summary\u003e\n\n```bash\n# Install required packages for ruby-build\nsudo apt update\nsudo apt install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libreadline6-dev libyaml-dev libxml2-dev libxslt1-dev libcurl4-openssl-dev libffi-dev\n\n# Install rbenv and ruby-build\ncd \u0026\u0026 git clone https://github.com/rbenv/rbenv.git ~/.rbenv\necho 'export PATH=\"$HOME/.rbenv/bin:$PATH\"' \u003e\u003e ~/.bashrc\necho 'eval \"$(rbenv init -)\"' \u003e\u003e ~/.bashrc\nexec $SHELL\n\ngit clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build\necho 'export PATH=\"$HOME/.rbenv/plugins/ruby-build/bin:$PATH\"' \u003e\u003e ~/.bashrc\nexec $SHELL\n\n# Install latest Ruby\nrbenv install 2.5.3\nrbenv global 2.5.3\n\ngem install bundler\n```\n\u003c/details\u003e\n\n\u003cdetails/\u003e\n  \u003csummary\u003eMac OS X\u003c/summary\u003e\n\n```bash\n# Install Homebrew if you don't have it https://brew.sh/\n# Install rbenv and ruby-build:\nbrew install rbenv ruby-build\n\n# Add rbenv to bash so that it loads every time you open a terminal\necho 'if which rbenv \u003e /dev/null; then eval \"$(rbenv init -)\"; fi' \u003e\u003e ~/.bash_profile\nsource ~/.bash_profile\n\n# Install latest Ruby\nrbenv install 2.5.3\nrbenv global 2.5.3\n\ngem install bundler\n```\n\u003c/details\u003e\n\n2) Install Tanakai gem: `$ gem install tanakai` or `bundle add tanakai`\n\n3) Install browsers with webdrivers:\n\n\u003cdetails/\u003e\n  \u003csummary\u003eUbuntu 18.04\u003c/summary\u003e\n\nNote: for Ubuntu 16.04-18.04 there is available automatic installation using `setup` command:\n```bash\n$ tanakai setup localhost --local --ask-sudo\n```\nIt works using [Ansible](https://github.com/ansible/ansible) so you need to install it first: `$ sudo apt install ansible`. 
You can find the playbooks it uses [here](lib/tanakai/automation).\n\nIf you chose automatic installation, you can skip the rest of this section and go to the [\"Getting to Know\"](#getting-to-know) part. If you want to install everything manually instead:\n\n```bash\n# Install basic tools\nsudo apt install -q -y unzip wget tar openssl\n\n# Install xvfb (for virtual_display headless mode, in addition to native)\nsudo apt install -q -y xvfb\n\n# Install chromium-browser and firefox\nsudo apt install -q -y chromium-browser firefox\n\n# Install chromedriver (2.44 version)\n# All versions are located here: https://sites.google.com/a/chromium.org/chromedriver/downloads\ncd /tmp \u0026\u0026 wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip\nsudo unzip chromedriver_linux64.zip -d /usr/local/bin\nrm -f chromedriver_linux64.zip\n\n# Install geckodriver (0.23.0 version)\n# All versions are located here: https://github.com/mozilla/geckodriver/releases/\ncd /tmp \u0026\u0026 wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz\nsudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin\nrm -f geckodriver-v0.23.0-linux64.tar.gz\n\n# Install PhantomJS (2.1.1)\n# All versions are located here: http://phantomjs.org/download.html\nsudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev\ncd /tmp \u0026\u0026 wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2\ntar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2\nsudo mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib\nsudo ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin\nrm -f phantomjs-2.1.1-linux-x86_64.tar.bz2\n```\n\n\u003c/details\u003e\n\n\u003cdetails/\u003e\n  \u003csummary\u003eMac OS X\u003c/summary\u003e\n\n```bash\n# Install chrome and firefox\nbrew cask install google-chrome firefox\n\n# Install chromedriver (latest)\nbrew cask 
install chromedriver\n\n# Install geckodriver (latest)\nbrew install geckodriver\n\n# Install PhantomJS (latest)\nbrew install phantomjs\n```\n\u003c/details\u003e\u003cbr\u003e\n\nAlso, if you want to save scraped items to a database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install database clients/servers:\n\n\u003cdetails/\u003e\n  \u003csummary\u003eUbuntu 18.04\u003c/summary\u003e\n\nSQlite: `$ sudo apt -q -y install libsqlite3-dev sqlite3`.\n\nIf you want to connect to a remote database, you don't need database server on a local machine (only client):\n```bash\n# Install MySQL client\nsudo apt -q -y install mysql-client libmysqlclient-dev\n\n# Install Postgres client\nsudo apt install -q -y postgresql-client libpq-dev\n\n# Install MongoDB client\nsudo apt install -q -y mongodb-clients\n```\n\nBut if you want to save items to a local database, a database server is required as well:\n```bash\n# Install MySQL client and server\nsudo apt -q -y install mysql-server mysql-client libmysqlclient-dev\n\n# Install  Postgres client and server\nsudo apt install -q -y postgresql postgresql-contrib libpq-dev\n\n# Install MongoDB client and server\n# version 4.0 (check here https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)\nsudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4\n# for 16.04:\n# echo \"deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse\" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list\n# for 18.04:\necho \"deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse\" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list\nsudo apt update\nsudo apt install -q -y mongodb-org\nsudo service mongod 
start\n```\n\u003c/details\u003e\n\n\u003cdetails/\u003e\n  \u003csummary\u003eMac OS X\u003c/summary\u003e\n\nSQlite: `$ brew install sqlite3`\n\n```bash\n# Install MySQL client and server\nbrew install mysql\n# Start server if you need it: brew services start mysql\n\n# Install Postgres client and server\nbrew install postgresql\n# Start server if you need it: brew services start postgresql\n\n# Install MongoDB client and server\nbrew install mongodb\n# Start server if you need it: brew services start mongodb\n```\n\u003c/details\u003e\n\n\n## Getting to Know\n### Interactive console\nBefore you get to know all of Tanakai's features, there is `$ tanakai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).\n\n```bash\n$ tanakai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework\n```\n\n\u003cdetails/\u003e\n  \u003csummary\u003eShow output\u003c/summary\u003e\n\n```\n$ tanakai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework\n\nD, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance\nD, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode\nI, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760]  INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework\nI, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760]  INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework\nD, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701\n\nFrom: /home/victor/code/tanakai/lib/tanakai/base.rb @ line 189 Tanakai::Base#console:\n\n    188: def console(response = nil, url: nil, 
data: {})\n =\u003e 189:   binding.pry\n    190: end\n\n[1] pry(#\u003cTanakai::Base\u003e)\u003e response.xpath(\"//title\").text\n=\u003e \"GitHub - vifreefly/kimuraframework: Modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites\"\n\n[2] pry(#\u003cTanakai::Base\u003e)\u003e ls\nTanakai::Base#methods: browser  console  logger  request_to  save_to  unique?\ninstance variables: @browser  @config  @engine  @logger  @pipelines\nlocals: _  __  _dir_  _ex_  _file_  _in_  _out_  _pry_  data  response  url\n\n[3] pry(#\u003cTanakai::Base\u003e)\u003e ls response\nNokogiri::XML::PP::Node#methods: inspect  pretty_print\nNokogiri::XML::Searchable#methods: %  /  at  at_css  at_xpath  css  search  xpath\nEnumerable#methods:\n  all?         collect         drop        each_with_index   find_all    grep_v    lazy    member?    none?      reject        slice_when  take_while  without\n  any?         collect_concat  drop_while  each_with_object  find_index  group_by  many?   min        one?       reverse_each  sort        to_a        zip\n  as_json      count           each_cons   entries           first       include?  map     min_by     partition  select        sort_by     to_h\n  chunk        cycle           each_entry  exclude?          flat_map    index_by  max     minmax     pluck      slice_after   sum         to_set\n  chunk_while  detect          each_slice  find              grep        inject    max_by  minmax_by  reduce     slice_before  take        uniq\nNokogiri::XML::Node#methods:\n  \u003c=\u003e                   append_class       classes                 document?             has_attribute?      matches?          node_name=        processing_instruction?  to_str\n  ==                    attr               comment?                each                  html?               
name=             node_type         read_only?               to_xhtml\n  \u003e                     attribute          content                 elem?                 inner_html          namespace=        parent=           remove                   traverse\n  []                    attribute_nodes    content=                element?              inner_html=         namespace_scopes  parse             remove_attribute         unlink\n  []=                   attribute_with_ns  create_external_subset  element_children      inner_text          namespaced_key?   path              remove_class             values\n  accept                before             create_internal_subset  elements              internal_subset     native_content=   pointer_id        replace                  write_html_to\n  add_class             blank?             css_path                encode_special_chars  key?                next              prepend_child     set_attribute            write_to\n  add_next_sibling      cdata?             decorate!               external_subset       keys                next=             previous          text                     write_xhtml_to\n  add_previous_sibling  child              delete                  first_element_child   lang                next_element      previous=         text?                    write_xml_to\n  after                 children           description             fragment?             lang=               next_sibling      previous_element  to_html                  xml?\n  ancestors             children=          do_xinclude             get_attribute         last_element_child  node_name         previous_sibling  to_s\nNokogiri::XML::Document#methods:\n  \u003c\u003c         canonicalize  collect_namespaces  create_comment  create_entity     decorate    document  encoding   errors   name        remove_namespaces!  
root=  to_java  url       version\n  add_child  clone         create_cdata        create_element  create_text_node  decorators  dup       encoding=  errors=  namespaces  root                slop!  to_xml   validate\nNokogiri::HTML::Document#methods: fragment  meta_encoding  meta_encoding=  serialize  title  title=  type\ninstance variables: @decorators  @errors  @node_cache\n\n[4] pry(#\u003cTanakai::Base\u003e)\u003e exit\nI, [2018-08-22 13:43:47 +0400#26079] [M: 47461994677760]  INFO -- : Browser: driver selenium_chrome has been destroyed\n$\n```\n\u003c/details\u003e\u003cbr\u003e\n\nCLI arguments:\n* `--engine` (optional) [engine](#available-engines) to use. Default is `mechanize`\n* `--url` (optional) url to process. If url is omitted, the `response` and `url` objects inside the console will be `nil` (use the [browser](#browser-object) object to navigate to any webpage).\n\n### Available engines\nTanakai supports the following engines and can mostly switch between them without the need to rewrite any code:\n\n* `:apparition` - a Chrome driver for Capybara via the [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). It started as a fork of Poltergeist and attempts to maintain as much compatibility with the Poltergeist API as possible.\n* `:cuprite` - a pure Ruby driver for Capybara. It allows you to run Capybara tests on a headless Chrome or Chromium. Under the hood it uses [Ferrum](https://github.com/rubycdp/ferrum#index), which is a high-level API to the browser via the [CDP protocol](https://chromedevtools.github.io/devtools-protocol/) (no selenium or chromedriver needed). The design of the driver is as close to Poltergeist as possible, though that's not a goal.\n* `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and has no notion of the DOM. It can only parse the original HTML code of a page. 
Because of this, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. Use mechanize whenever you can, as long as the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).\n* `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render JavaScript. In general, PhantomJS is still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leakage issues, but Tanakai has a [memory control feature](#crawler-config), so you shouldn't consider it a problem. Also, some websites can recognize PhantomJS and block access. Like mechanize (and unlike selenium engines), `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).\n* `:selenium_chrome` - Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.\n* `:selenium_firefox` - Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.\n\n**Tip:** prepend a `HEADLESS=false` environment variable on the command line (`$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). 
It works for the [console](#interactive-console) command as well.\n\n\n### Minimum required spider structure\n\u003e You can manually create a spider file, or use the generate command instead: `$ tanakai generate spider simple_spider`\n\n```ruby\nrequire 'tanakai'\n\nclass SimpleSpider \u003c Tanakai::Base\n  @name = \"simple_spider\"\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example.com/\"]\n\n  def parse(response, url:, data: {})\n  end\nend\n\nSimpleSpider.crawl!\n```\n\nWhere:\n* `@name`: name of the spider. You can omit the name if you use a single-file spider\n* `@engine`: engine for the spider\n* `@start_urls`: array of start urls to process one by one inside the `parse` method\n* The `parse` method is the entry point, and should always be present in a spider class\n\n\n### Method arguments `response`, `url` and `data`\n\n```ruby\ndef parse(response, url:, data: {})\nend\n```\n\n* `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object): contains the parsed HTML code of the processed webpage\n* `url` (String): url of the processed webpage\n* `data` (Hash): used to pass data between requests\n\n\u003cdetails/\u003e\n  \u003csummary\u003e\u003cstrong\u003eExample of how to use \u003ccode\u003edata\u003c/code\u003e\u003c/strong\u003e\u003c/summary\u003e\n\nImagine that there is a product page which doesn't contain the product category. The category name is present only on the category page with pagination. 
This is a case where we can use `data` to pass the category name from `parse` to the `parse_product` method:\n\n```ruby\nclass ProductsSpider \u003c Tanakai::Base\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example-shop.com/example-product-category\"]\n\n  def parse(response, url:, data: {})\n    category_name = response.xpath(\"//path/to/category/name\").text\n    response.xpath(\"//path/to/products/urls\").each do |product_url|\n      # Merge category_name into the current data hash and pass it on to the parse_product method\n      request_to(:parse_product, url: product_url[:href], data: data.merge(category_name: category_name))\n    end\n\n    # ...\n  end\n\n  def parse_product(response, url:, data: {})\n    item = {}\n    # Assign an item's category_name from data[:category_name]\n    item[:category_name] = data[:category_name]\n\n    # ...\n  end\nend\n\n```\n\u003c/details\u003e\u003cbr\u003e\n\n**You can query `response` using [XPath or CSS selectors](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Searchable)**. Check Nokogiri tutorials to understand how to work with `response`:\n* [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) - ruby.bastardsbook.com\n* [HOWTO parse HTML with Ruby \u0026 Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) - readysteadycode.com\n* [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) - rubydoc.info\n\n\n### `browser` object\n\nA `browser` object is available from any spider instance method. It is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object, used to process requests and get the page response (`current_response` method). 
Usually you don't need to touch it directly, because there is `response` (see above), which contains the page response after it was loaded.\n\nBut if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc), `browser` is ready for you:\n\n```ruby\nclass GoogleSpider \u003c Tanakai::Base\n  @name = \"google_spider\"\n  @engine = :selenium_chrome\n  @start_urls = [\"https://www.google.com/\"]\n\n  def parse(response, url:, data: {})\n    browser.fill_in \"q\", with: \"Tanakai web scraping framework\"\n    browser.click_button \"Google Search\"\n\n    # Update response with current_response after interaction with a browser\n    response = browser.current_response\n\n    # Collect results\n    results = response.xpath(\"//div[@class='g']//h3/a\").map do |a|\n      { title: a.text, url: a[:href] }\n    end\n\n    # ...\n  end\nend\n```\n\nCheck out the **Capybara cheat sheets**, where you can see all available methods **to interact with the browser**:\n* [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) - cheatrags.com\n* [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) - thoughtbot.com\n* [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) - rubydoc.info\n\n### `request_to` method\n\nTo make a request and process it with a particular method, use `request_to`. It requires at least two arguments: `:method_name` and `url:`. The optional arguments are `data:` (see above for what it is used for) and `response_type:` (defaults to `:html`). 
Example:\n\n```ruby\nclass Spider \u003c Tanakai::Base\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example.com/\"]\n\n  def parse(response, url:, data: {})\n    # Send a request to `https://example.com/some_product.json` and process it with the `parse_product` method:\n    request_to :parse_product, url: \"https://example.com/some_product.json\", response_type: :json\n  end\n\n  def parse_product(response, url:, data: {})\n    puts \"JSON parsed from page https://example.com/some_product.json\"\n    puts response\n  end\nend\n```\n\nUnder the hood `request_to` simply calls [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`) and then the required method with arguments:\n\n\u003cdetails/\u003e\n  \u003csummary\u003erequest_to\u003c/summary\u003e\n\n```ruby\ndef request_to(handler, url:, data: {})\n  request_data = { url: url, data: data }\n\n  browser.visit(url)\n  public_send(handler, browser.current_response, **request_data)\nend\n```\n\u003c/details\u003e\u003cbr\u003e\n\n`request_to` just makes things simpler. Without it we could do something like:\n\n\u003cdetails/\u003e\n  \u003csummary\u003eCheck the code\u003c/summary\u003e\n\n```ruby\nclass Spider \u003c Tanakai::Base\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example.com/\"]\n\n  def parse(response, url:, data: {})\n    url_to_process = \"https://example.com/some_product\"\n\n    browser.visit(url_to_process)\n    parse_product(browser.current_response, url: url_to_process)\n  end\n\n  def parse_product(response, url:, data: {})\n    puts \"From page https://example.com/some_product !\"\n  end\nend\n```\n\u003c/details\u003e\n\n### `save_to` helper\n\nSometimes all you need is to simply save scraped data to a file in a format like JSON or CSV. 
You can use `save_to` for it:\n\n```ruby\nclass ProductsSpider \u003c Tanakai::Base\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example-shop.com/\"]\n\n  # ...\n\n  def parse_product(response, url:, data: {})\n    item = {}\n\n    item[:title] = response.xpath(\"//title/path\").text\n    item[:description] = response.xpath(\"//desc/path\").text.squish\n    item[:price] = response.xpath(\"//price/path\").text[/\\d+/]\u0026.to_f\n\n    # Add each new item to the `scraped_products.json` file:\n    save_to \"scraped_products.json\", item, format: :json\n  end\nend\n```\n\nSupported formats:\n* `:json` JSON\n* `:pretty_json` \"pretty\" JSON (`JSON.pretty_generate`)\n* `:jsonlines` [JSON Lines](http://jsonlines.org/)\n* `:csv` CSV\n\nNote: `save_to` requires the data (item to save) to be a `Hash`.\n\nBy default `save_to` adds a `position` key to the item hash. You can disable it with `position: false`: `save_to \"scraped_products.json\", item, format: :json, position: false`.\n\n**How the helper works:**\n\nWhile the spider is running, each new item is appended to the output file. On the next run, the helper clears the contents of the output file, then starts appending items to it again.\n\n\u003e If you don't want the file to be cleared before each run, add the option `append: true`: `save_to \"scraped_products.json\", item, format: :json, append: true`\n\n### Skip duplicates\n\nIt's pretty common for websites to have duplicated pages, for example when an e-commerce shop lists the same products in different categories. 
To skip duplicates, there is a simple `unique?` helper:\n\n```ruby\nclass ProductsSpider \u003c Tanakai::Base\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example-shop.com/\"]\n\n  def parse(response, url:, data: {})\n    response.xpath(\"//categories/path\").each do |category|\n      request_to :parse_category, url: category[:href]\n    end\n  end\n\n  # Check products for uniqueness using product url inside of parse_category:\n  def parse_category(response, url:, data: {})\n    response.xpath(\"//products/path\").each do |product|\n      # Skip url if it's not unique:\n      next unless unique?(:product_url, product[:href])\n      # Otherwise process it:\n      request_to :parse_product, url: product[:href]\n    end\n  end\n\n  # Or/and check products for uniqueness using product sku inside of parse_product:\n  def parse_product(response, url:, data: {})\n    item = {}\n    item[:sku] = response.xpath(\"//product/sku/path\").text.strip.upcase\n    # Don't save the product and return from the method if an item with the same sku was already saved:\n    return unless unique?(:sku, item[:sku])\n\n    # ...\n    save_to \"results.json\", item, format: :json\n  end\nend\n```\n\nThe `unique?` helper works in a simple way:\n\n```ruby\n# Check string \"http://example.com\" in scope `url` for the first time:\nunique?(:url, \"http://example.com\")\n# =\u003e true\n\n# Try again:\nunique?(:url, \"http://example.com\")\n# =\u003e false\n```\n\nTo check something for uniqueness, you need to provide a scope:\n\n```ruby\n# `product_url` scope\nunique?(:product_url, \"http://example.com/product_1\")\n\n# `id` scope\nunique?(:id, 324234232)\n\n# `custom` scope\nunique?(:custom, \"Lorem Ipsum\")\n```\n\n#### Automatically skip all duplicated requests urls\n\nIt is possible to automatically skip all already visited urls when calling the `request_to` method, using the [@config](#all-available-config-options) option `skip_duplicate_requests: true`. 
Also check the [@config](#all-available-config-options) section for additional options of this setting.\n\n#### `storage` object\n\nThe `unique?` method is just an alias for `storage#unique?`. Storage has several methods:\n\n* `#all` - display the storage hash, where keys are existing scopes.\n* `#include?(scope, value)` - return `true` if the value exists in the scope, and `false` if not\n* `#add(scope, value)` - add the value to the scope\n* `#unique?(scope, value)` - the method already described above: returns `false` if the value exists in the scope; otherwise adds the value to the scope and returns `true`.\n* `#clear!` - reset the whole storage by deleting all values from all scopes.\n\n\n### Handling request errors\nIt is quite common for some pages of the website being crawled to return a response code other than `200 ok`. In such cases, the `request_to` method (or `browser.visit`) can raise an exception. Tanakai provides the `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:\n\n#### skip_request_errors\nYou can automatically skip some errors while requesting a page using the `skip_request_errors` [config](#spider-config) option. If the raised error matches one of the errors in the list, the error is caught and the request is skipped. It is a good idea to skip errors like NotFound(404), etc.\n\nFormat for the option: an array whose elements are error classes and/or hashes. You can use the _hash_ format for more flexibility:\n\n```ruby\n@config = {\n  skip_request_errors: [{ error: RuntimeError, message: \"404 =\u003e Net::HTTPNotFound\" }]\n}\n```\nIn this case, the provided `message:` will be compared with the full error message using `String#include?`. 
You can also use a regex instead: `{ error: RuntimeError, message: /404|403/ }`.\n\n#### retry_request_errors\nYou can automatically retry some errors a few times while requesting a page using the `retry_request_errors` [config](#spider-config) option. If the raised error matches one of the errors in the list, the error is caught and the request is processed again after a delay.\n\nThere are 3 attempts: first: delay _15 sec_, second: delay _30 sec_, third: delay _45 sec_. If there is still an exception after 3 attempts, the exception will be raised. It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.\n\nFormat for the option: same as for the `skip_request_errors` option.\n\nIf you would like to skip (not raise) the error after all retries have been used, you can specify the `skip_on_failure: true` option:\n\n```ruby\n@config = {\n  retry_request_errors: [{ error: RuntimeError, skip_on_failure: true }]\n}\n```\n\n### Logging custom events\n\nIt is possible to save custom messages to the [run_info](#open_spider-and-close_spider-callbacks) hash using the `add_event('Some message')` method. This feature helps you keep track of important things that happened during crawling without checking the whole spider log (in case you're logging these messages using `logger`). 
Example:\n\n```ruby\ndef parse_product(response, url:, data: {})\n  unless response.at_xpath(\"//path/to/add_to_card_button\")\n    add_event(\"Product is sold\") and return\n  end\n\n  # ...\nend\n```\n\n```\n...\nI, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640]  INFO -- example_spider: Spider: new event (scope: custom): Product is sold\n...\nI, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640]  INFO -- example_spider: Spider: stopped: {:events=\u003e{:custom=\u003e{\"Product is sold\"=\u003e1}}}\n```\n\n### `open_spider` and `close_spider` callbacks\n\nYou can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before spider started or after spider has been stopped:\n\n```ruby\nrequire 'tanakai'\n\nclass ExampleSpider \u003c Tanakai::Base\n  @name = \"example_spider\"\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example.com/\"]\n\n  def self.open_spider\n    logger.info \"\u003e Starting...\"\n  end\n\n  def self.close_spider\n    logger.info \"\u003e Stopped!\"\n  end\n\n  def parse(response, url:, data: {})\n    logger.info \"\u003e Scraping...\"\n  end\nend\n\nExampleSpider.crawl!\n```\n\n\u003cdetails/\u003e\n  \u003csummary\u003eOutput\u003c/summary\u003e\n\n```\nI, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Spider: started: example_spider\nI, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840]  INFO -- example_spider: \u003e Starting...\nD, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance\nD, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode\nI, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Browser: started get request to: https://example.com/\nI, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Browser: finished get request 
to: https://example.com/\nI, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Info: visits: requests: 1, responses: 1\nD, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: Browser: driver.current_memory: 82415\nI, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: \u003e Scraping...\nI, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Browser: driver selenium_chrome has been destroyed\nI, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: \u003e Stopped!\nI, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Spider: stopped: {:spider_name=\u003e\"example_spider\", :status=\u003e:completed, :environment=\u003e\"development\", :start_time=\u003e2018-08-22 14:26:32 +0400, :stop_time=\u003e2018-08-22 14:26:34 +0400, :running_time=\u003e\"1s\", :visits=\u003e{:requests=\u003e1, :responses=\u003e1}, :error=\u003enil}\n```\n\u003c/details\u003e\u003cbr\u003e\n\nInside `open_spider` and `close_spider` class methods there is available `run_info` method which contains useful information about spider state:\n\n```ruby\n    11: def self.open_spider\n =\u003e 12:   binding.pry\n    13: end\n\n[1] pry(example_spider)\u003e run_info\n=\u003e {\n  :spider_name=\u003e\"example_spider\",\n  :status=\u003e:running,\n  :environment=\u003e\"development\",\n  :start_time=\u003e2018-08-05 23:32:00 +0400,\n  :stop_time=\u003enil,\n  :running_time=\u003enil,\n  :visits=\u003e{:requests=\u003e0, :responses=\u003e0},\n  :error=\u003enil\n}\n```\n\nInside `close_spider`, `run_info` will be updated:\n\n```ruby\n    15: def self.close_spider\n =\u003e 16:   binding.pry\n    17: end\n\n[1] pry(example_spider)\u003e run_info\n=\u003e {\n  :spider_name=\u003e\"example_spider\",\n  :status=\u003e:completed,\n  :environment=\u003e\"development\",\n  :start_time=\u003e2018-08-05 23:32:00 +0400,\n  :stop_time=\u003e2018-08-05 23:32:06 
+0400,\n  :running_time=\u003e6.214,\n  :visits=\u003e{:requests=\u003e1, :responses=\u003e1},\n  :error=\u003enil\n}\n```\n\n`run_info[:status]` helps to determine if spider was finished successfully or failed (possible values: `:completed`, `:failed`):\n\n```ruby\nclass ExampleSpider \u003c Tanakai::Base\n  @name = \"example_spider\"\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example.com/\"]\n\n  def self.close_spider\n    puts \"\u003e\u003e\u003e run info: #{run_info}\"\n  end\n\n  def parse(response, url:, data: {})\n    logger.info \"\u003e Scraping...\"\n    # Let's try to strip nil:\n    nil.strip\n  end\nend\n```\n\n\u003cdetails/\u003e\n  \u003csummary\u003eOutput\u003c/summary\u003e\n\n```\nI, [2018-08-22 14:34:24 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Spider: started: example_spider\nD, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance\nD, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode\nI, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Browser: started get request to: https://example.com/\nI, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Browser: finished get request to: https://example.com/\nI, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Info: visits: requests: 1, responses: 1\nD, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: Browser: driver.current_memory: 83351\nI, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400]  INFO -- example_spider: \u003e Scraping...\nI, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Browser: driver selenium_chrome has been destroyed\n\n\u003e\u003e\u003e run info: {:spider_name=\u003e\"example_spider\", :status=\u003e:failed, 
:environment=\u003e\"development\", :start_time=\u003e2018-08-22 14:34:24 +0400, :stop_time=\u003e2018-08-22 14:34:26 +0400, :running_time=\u003e2.01, :visits=\u003e{:requests=\u003e1, :responses=\u003e1}, :error=\u003e\"#\u003cNoMethodError: undefined method `strip' for nil:NilClass\u003e\"}\n\nF, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] FATAL -- example_spider: Spider: stopped: {:spider_name=\u003e\"example_spider\", :status=\u003e:failed, :environment=\u003e\"development\", :start_time=\u003e2018-08-22 14:34:24 +0400, :stop_time=\u003e2018-08-22 14:34:26 +0400, :running_time=\u003e\"2s\", :visits=\u003e{:requests=\u003e1, :responses=\u003e1}, :error=\u003e\"#\u003cNoMethodError: undefined method `strip' for nil:NilClass\u003e\"}\nTraceback (most recent call last):\n        6: from example_spider.rb:19:in `\u003cmain\u003e'\n        5: from /home/victor/code/tanakai/lib/tanakai/base.rb:127:in `crawl!'\n        4: from /home/victor/code/tanakai/lib/tanakai/base.rb:127:in `each'\n        3: from /home/victor/code/tanakai/lib/tanakai/base.rb:128:in `block in crawl!'\n        2: from /home/victor/code/tanakai/lib/tanakai/base.rb:185:in `request_to'\n        1: from /home/victor/code/tanakai/lib/tanakai/base.rb:185:in `public_send'\nexample_spider.rb:15:in `parse': undefined method `strip' for nil:NilClass (NoMethodError)\n```\n\u003c/details\u003e\u003cbr\u003e\n\n**Usage example:** if spider finished successfully, send JSON file with scraped items to a remote FTP location, otherwise (if spider failed), skip incompleted results and send email/notification to slack about it:\n\n\u003cdetails/\u003e\n  \u003csummary\u003eExample\u003c/summary\u003e\n\nAlso you can use additional methods `completed?` or `failed?`\n\n```ruby\nclass Spider \u003c Tanakai::Base\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example.com/\"]\n\n  def self.close_spider\n    if completed?\n      send_file_to_ftp(\"results.json\")\n    else\n      
send_error_notification(run_info[:error])\n    end\n  end\n\n  def self.send_file_to_ftp(file_path)\n    # ...\n  end\n\n  def self.send_error_notification(error)\n    # ...\n  end\n\n  # ...\n\n  def parse_item(response, url:, data: {})\n    item = {}\n    # ...\n\n    save_to \"results.json\", item, format: :json\n  end\nend\n```\n\u003c/details\u003e\n\n\n### `TANAKAI_ENV`\nTanakai has environments; the default is `development`. To use a custom environment, set the `TANAKAI_ENV` environment variable before the command: `$ TANAKAI_ENV=production ruby spider.rb`. To access the current environment, use the `Tanakai.env` method.\n\nUsage example:\n```ruby\nclass Spider \u003c Tanakai::Base\n  @engine = :selenium_chrome\n  @start_urls = [\"https://example.com/\"]\n\n  def self.close_spider\n    if failed? \u0026\u0026 Tanakai.env == \"production\"\n      send_error_notification(run_info[:error])\n    else\n      # Do nothing\n    end\n  end\n\n  # ...\nend\n```\n\n### Parallel crawling using `in_parallel`\nTanakai can process web pages concurrently with a single line: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is the method that processes each page, `urls` is an array of urls to crawl and `threads:` is the number of threads:\n\n```ruby\n# amazon_spider.rb\nrequire 'tanakai'\n\nclass AmazonSpider \u003c Tanakai::Base\n  @name = \"amazon_spider\"\n  @engine = :mechanize\n  @start_urls = [\"https://www.amazon.com/\"]\n\n  def parse(response, url:, data: {})\n    browser.fill_in \"field-keywords\", with: \"Web Scraping Books\"\n    browser.click_on \"Go\"\n\n    # Walk through pagination and collect products urls:\n    urls = []\n    loop do\n      response = browser.current_response\n      response.xpath(\"//li//a[contains(@class, 's-access-detail-page')]\").each do |a|\n        urls \u003c\u003c a[:href].sub(/ref=.+/, \"\")\n      end\n\n      browser.find(:xpath, \"//a[@id='pagnNextLink']\", wait: 1).click rescue break\n    end\n\n    # Process all collected urls concurrently 
within 3 threads:\n    in_parallel(:parse_book_page, urls, threads: 3)\n  end\n\n  def parse_book_page(response, url:, data: {})\n    item = {}\n\n    item[:title] = response.xpath(\"//h1/span[@id]\").text.squish\n    item[:url] = url\n    item[:price] = response.xpath(\"(//span[contains(@class, 'a-color-price')])[1]\").text.squish.presence\n    item[:publisher] = response.xpath(\"//h2[text()='Product details']/following::b[text()='Publisher:']/following-sibling::text()[1]\").text.squish.presence\n\n    save_to \"books.json\", item, format: :pretty_json\n  end\nend\n\nAmazonSpider.crawl!\n```\n\n\u003cdetails/\u003e\n  \u003csummary\u003eRun: \u003ccode\u003e$ ruby amazon_spider.rb\u003c/code\u003e\u003c/summary\u003e\n\n```\nI, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Spider: started: amazon_spider\nD, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance\nI, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/\nI, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/\nI, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Info: visits: requests: 1, responses: 1\n\nI, [2018-08-22 14:48:43 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Spider: in_parallel: starting processing 52 urls within 3 threads\nD, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance\nI, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/\nD, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance\nI, 
[2018-08-22 14:48:44 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/\nD, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance\nI, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/\nI, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/\nI, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Info: visits: requests: 4, responses: 2\nI, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/\nI, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/\nI, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Info: visits: requests: 5, responses: 3\nI, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/\nI, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/\nI, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Info: visits: requests: 6, responses: 4\nI, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: started get request to: 
https://www.amazon.com/Web-Scraping-Excel-Effective-Scrapes-ebook/dp/B01CMMJGZ8/\n\n...\n\nI, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Info: visits: requests: 51, responses: 49\nI, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed\nI, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Ice-Life-Bill-Rayburn-ebook/dp/B00C0NF1L8/\nI, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Info: visits: requests: 51, responses: 50\nI, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/\nI, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Php-architects-Guide-Scraping-Author/dp/B010DTKYY4/\nI, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Info: visits: requests: 52, responses: 51\nI, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/\nI, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/\nI, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Info: visits: requests: 53, responses: 52\nI, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed\nI, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/\nI, 
[2018-08-22 14:49:12 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Info: visits: requests: 53, responses: 53\nI, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed\n\nI, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Spider: in_parallel: stopped processing 52 urls within 3 threads, total time: 29s\nI, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed\n\nI, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Spider: stopped: {:spider_name=\u003e\"amazon_spider\", :status=\u003e:completed, :environment=\u003e\"development\", :start_time=\u003e2018-08-22 14:48:37 +0400, :stop_time=\u003e2018-08-22 14:49:12 +0400, :running_time=\u003e\"35s\", :visits=\u003e{:requests=\u003e53, :responses=\u003e53}, :error=\u003enil}\n\n```\n\u003c/details\u003e\n\n\u003cdetails/\u003e\n  \u003csummary\u003ebooks.json\u003c/summary\u003e\n\n```json\n[\n  {\n    \"title\": \"Web Scraping with Python: Collecting More Data from the Modern Web2nd Edition\",\n    \"url\": \"https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/\",\n    \"price\": \"$26.94\",\n    \"publisher\": \"O'Reilly Media; 2 edition (April 14, 2018)\",\n    \"position\": 1\n  },\n  {\n    \"title\": \"Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, micro services, Docker and AWS\",\n    \"url\": \"https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/\",\n    \"price\": \"$39.99\",\n    \"publisher\": \"Packt Publishing - ebooks Account (February 9, 2018)\",\n    \"position\": 2\n  },\n  {\n    \"title\": \"Web Scraping with Python: Collecting Data from the Modern Web1st Edition\",\n    \"url\": \"https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/\",\n    \"price\": \"$15.75\",\n    \"publisher\": 
\"O'Reilly Media; 1 edition (July 24, 2015)\",\n    \"position\": 3\n  },\n\n  ...\n\n  {\n    \"title\": \"Instant Web Scraping with Java by Ryan Mitchell (2013-08-26)\",\n    \"url\": \"https://www.amazon.com/Instant-Scraping-Java-Mitchell-2013-08-26/dp/B01FEM76X2/\",\n    \"price\": \"$35.82\",\n    \"publisher\": \"Packt Publishing (2013-08-26) (1896)\",\n    \"position\": 52\n  }\n]\n```\n\u003c/details\u003e\u003cbr\u003e\n\n\u003e Note that [save_to](#save_to-helper) and [unique?](#skip-duplicates-unique-helper) helpers are thread-safe (protected by [Mutex](https://ruby-doc.org/core-2.5.1/Mutex.html)) and can be freely used inside threads.\n\n`in_parallel` can take additional options:\n* `data:` pass with urls custom data hash: `in_parallel(:method, urls, threads: 3, data: { category: \"Scraping\" })`\n* `delay:` set delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, delay number will be chosen randomly for each request: `rand (2..5) # =\u003e 3`\n* `engine:` set custom engine than a default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`\n* `config:` pass custom options to config (see [config section](#crawler-config))\n* `response_type:` response should be returned as `:html` or `:json`, defaults to `:html`\n\n### Active Support included\n\nYou can use all the power of familiar [Rails core-ext methods](https://guides.rubyonrails.org/active_support_core_extensions.html#loading-all-core-extensions) for scraping inside Tanakai. 
Especially take a look at [squish](https://apidock.com/rails/String/squish), [truncate_words](https://apidock.com/rails/String/truncate_words), [titleize](https://apidock.com/rails/String/titleize), [remove](https://apidock.com/rails/String/remove), [present?](https://guides.rubyonrails.org/active_support_core_extensions.html#blank-questionmark-and-present-questionmark) and [presence](https://guides.rubyonrails.org/active_support_core_extensions.html#presence).\n\n### Schedule spiders using Cron\n\n1) Inside spider directory generate [Whenever](https://github.com/javan/whenever) config: `$ tanakai generate schedule`.\n\n\u003cdetails/\u003e\n  \u003csummary\u003e\u003ccode\u003eschedule.rb\u003c/code\u003e\u003c/summary\u003e\n\n```ruby\n### Settings ###\nrequire 'tzinfo'\n\n# Export current PATH to the cron\nenv :PATH, ENV[\"PATH\"]\n\n# Use 24 hour format when using `at:` option\nset :chronic_options, hours24: true\n\n# Use local_to_utc helper to setup execution time using your local timezone instead\n# of server's timezone (which is probably and should be UTC, to check run `$ timedatectl`).\n# Also maybe you'll want to set same timezone in tanakai as well (use `Tanakai.configuration.time_zone =` for that),\n# to have spiders logs in a specific time zone format.\n# Example usage of helper:\n# every 1.day, at: local_to_utc(\"7:00\", zone: \"Europe/Moscow\") do\n#   crawl \"google_spider.com\", output: \"log/google_spider.com.log\"\n# end\ndef local_to_utc(time_string, zone:)\n  TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time_string))\nend\n\n# Note: by default Whenever exports cron commands with :environment == \"production\".\n# Note: Whenever can only append log data to a log file (\u003e\u003e). 
If you want\n# to overwrite (\u003e) log file before each run, pass lambda:\n# crawl \"google_spider.com\", output: -\u003e { \"\u003e log/google_spider.com.log 2\u003e\u00261\" }\n\n# Project job types\njob_type :crawl,  \"cd :path \u0026\u0026 TANAKAI_ENV=:environment bundle exec tanakai crawl :task :output\"\njob_type :runner, \"cd :path \u0026\u0026 TANAKAI_ENV=:environment bundle exec tanakai runner --jobs :task :output\"\n\n# Single file job type\njob_type :single, \"cd :path \u0026\u0026 TANAKAI_ENV=:environment ruby :task :output\"\n# Single with bundle exec\njob_type :single_bundle, \"cd :path \u0026\u0026 TANAKAI_ENV=:environment bundle exec ruby :task :output\"\n\n### Schedule ###\n# Usage (check examples here https://github.com/javan/whenever#example-schedulerb-file):\n# every 1.day do\n  # Example to schedule a single spider in the project:\n  # crawl \"google_spider.com\", output: \"log/google_spider.com.log\"\n\n  # Example to schedule all spiders in the project using runner. Each spider will write\n  # it's own output to the `log/spider_name.log` file (handled by a runner itself).\n  # Runner output will be written to log/runner.log file.\n  # Argument number it's a count of concurrent jobs:\n  # runner 3, output:\"log/runner.log\"\n\n  # Example to schedule single spider (without project):\n  # single \"single_spider.rb\", output: \"single_spider.log\"\n# end\n\n### How to set a cron schedule ###\n# Run: `$ whenever --update-crontab --load-file config/schedule.rb`.\n# If you don't have whenever command, install the gem: `$ gem install whenever`.\n\n### How to cancel a schedule ###\n# Run: `$ whenever --clear-crontab --load-file config/schedule.rb`.\n```\n\u003c/details\u003e\u003cbr\u003e\n\n2) Add at the bottom of `schedule.rb` following code:\n\n```ruby\nevery 1.day, at: \"7:00\" do\n  single \"example_spider.rb\", output: \"example_spider.log\"\nend\n```\n\n3) Run: `$ whenever --update-crontab --load-file schedule.rb`. 
Done!\n\nYou can check Whenever examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.\n\n### Configuration options\nYou can configure several options using `configure` block:\n\n```ruby\nTanakai.configure do |config|\n  # Default logger has colored mode in development.\n  # If you would like to disable it, set `colorize_logger` to false.\n  # config.colorize_logger = false\n\n  # Logger level for default logger:\n  # config.log_level = :info\n\n  # Custom logger:\n  # config.logger = Logger.new(STDOUT)\n\n  # Custom time zone (for logs):\n  # config.time_zone = \"UTC\"\n  # config.time_zone = \"Europe/Moscow\"\n\n  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):\n  # config.selenium_chrome_path = \"/usr/bin/chromium-browser\"\n  # Provide custom selenium chromedriver path (default is \"/usr/local/bin/chromedriver\"):\n  # config.chromedriver_path = \"~/.local/bin/chromedriver\"\nend\n```\n\n### Using Tanakai inside existing Ruby applications\n\nYou can integrate Tanakai spiders (which are just Ruby classes) to an existing Ruby application like Rails or Sinatra, and run them using background jobs (for example). Check the following info to understand the running process of spiders:\n\n#### `.crawl!` method\n\n`.crawl!` (class method) performs a _full run_ of a particular spider. 
This method returns `run_info` if the run was successful, or an exception if something went wrong.\n\n```ruby\nclass ExampleSpider \u003c Tanakai::Base\n  @name = \"example_spider\"\n  @engine = :mechanize\n  @start_urls = [\"https://example.com/\"]\n\n  def parse(response, url:, data: {})\n    title = response.xpath(\"//title\").text.squish\n  end\nend\n\nExampleSpider.crawl!\n# =\u003e { :spider_name =\u003e \"example_spider\", :status =\u003e :completed, :environment =\u003e \"development\", :start_time =\u003e 2018-08-22 18:20:16 +0400, :stop_time =\u003e 2018-08-22 18:20:17 +0400, :running_time =\u003e 1.216, :visits =\u003e { :requests =\u003e 1, :responses =\u003e 1 }, :items =\u003e { :sent =\u003e 0, :processed =\u003e 0 }, :error =\u003e nil }\n```\n\nYou can't `.crawl!` a spider from a different thread while it is still running (because spider instances store some shared data in the `@run_info` class variable while `crawl`ing):\n\n```ruby\n2.times do |i|\n  Thread.new { p i, ExampleSpider.crawl! }\nend # =\u003e\n\n# 1\n# false\n\n# 0\n# {:spider_name=\u003e\"example_spider\", :status=\u003e:completed, :environment=\u003e\"development\", :start_time=\u003e2018-08-22 18:49:22 +0400, :stop_time=\u003e2018-08-22 18:49:23 +0400, :running_time=\u003e0.801, :visits=\u003e{:requests=\u003e1, :responses=\u003e1}, :items=\u003e{:sent=\u003e0, :processed=\u003e0}, :error=\u003enil}\n```\n\nYou can also pass `data` to `crawl!`:\n\n```ruby\nExampleSpider.crawl!(data: { foo: \"bar\" })\n```\n\nSo what if you don't care about stats and just want to process a request with a particular spider method and get that method's return value? Use `.parse!` instead:\n\n#### `.parse!(:method_name, url:)` method\n\n`.parse!` (class method) creates a new spider instance and performs a request to the given method with the given url. 
The value returned by the method is passed back to the caller:\n\n```ruby\nclass ExampleSpider \u003c Tanakai::Base\n  @name = \"example_spider\"\n  @engine = :mechanize\n  @start_urls = [\"https://example.com/\"]\n\n  def parse(response, url:, data: {})\n    title = response.xpath(\"//title\").text.squish\n  end\nend\n\nExampleSpider.parse!(:parse, url: \"https://example.com/\")\n# =\u003e \"Example Domain\"\n```\n\nLike `.crawl!`, the `.parse!` method takes care of a browser instance and kills it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, the `.parse!` method can be called from different threads at the same time:\n\n```ruby\nurls = [\"https://www.google.com/\", \"https://www.reddit.com/\", \"https://en.wikipedia.org/\"]\n\nurls.each do |url|\n  Thread.new { p ExampleSpider.parse!(:parse, url: url) }\nend # =\u003e\n\n# \"Google\"\n# \"Wikipedia, the free encyclopedia\"\n# \"reddit: the front page of the internetHotHot\"\n```\n\nKeep in mind that the [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe when used with the `.parse!` method.\n\n#### `Tanakai.list` and `Tanakai.find_by_name()`\n\n```ruby\nclass GoogleSpider \u003c Tanakai::Base\n  @name = \"google_spider\"\nend\n\nclass RedditSpider \u003c Tanakai::Base\n  @name = \"reddit_spider\"\nend\n\nclass WikipediaSpider \u003c Tanakai::Base\n  @name = \"wikipedia_spider\"\nend\n\n# To get the list of all available spider classes:\nTanakai.list\n# =\u003e {\"google_spider\"=\u003eGoogleSpider, \"reddit_spider\"=\u003eRedditSpider, \"wikipedia_spider\"=\u003eWikipediaSpider}\n\n# To find a particular spider class by its name:\nTanakai.find_by_name(\"reddit_spider\")\n# =\u003e RedditSpider\n```\n\n\n### Automated sever setup and deployment\n\u003e **EXPERIMENTAL**\n\n#### Setup\nYou can automatically set up the [required environment](#installation) for Tanakai on a remote server (currently only Ubuntu Server 18.04 is supported) using the `$ tanakai setup` command. 
`setup` installs the latest Ruby with Rbenv, browsers with webdrivers, and additionally database clients (only clients) for MySQL, Postgres and MongoDB (so you can connect to a remote database from Ruby).\n\n\u003e To perform remote server setup, [Ansible](https://github.com/ansible/ansible) is required **on the desktop** machine (to install: Ubuntu: `$ sudo apt install ansible`, Mac OS X: `$ brew install ansible`)\n\n\u003e It's recommended to use a regular user to set up the server, not `root`. To create a new user, log in to the server with `$ ssh root@your_server_ip`, type `$ adduser username` to create a user, and `$ gpasswd -a username sudo` to add the new user to the sudo group.\n\nExample:\n\n```bash\n$ tanakai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key\n```\n\nCLI arguments:\n* `--ask-sudo` pass this option to be asked for the sudo (user) password for system-wide installation of packages (`apt install`)\n* `--ssh-key-path path/to/private_key` authorization on the server using a private ssh key. You can omit it if the required key is already [added to keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))\n* `--ask-auth-pass` authorization on the server using the user password, an alternative to `--ssh-key-path`.\n* `-p port_number` custom port for ssh connection (`-p 2222`)\n\n\u003e You can check the setup playbook [here](lib/tanakai/automation/setup.yml)\n\n#### Deploy\n\nAfter a successful `setup` you can deploy a spider to the remote server using the `$ tanakai deploy` command. 
Each deploy performs several tasks: 1) pulls the repo from the remote origin to the `~/repo_name` user directory 2) runs `bundle install` 3) updates the crontab with `whenever --update-crontab` (to update the spider schedule from the schedule.rb file).\n\nBefore `deploy`, make sure that inside the spider directory you have: 1) a git repository with a remote origin (bitbucket, github, etc.) 2) a `Gemfile` 3) schedule.rb inside the `config` subfolder (`config/schedule.rb`).\n\nExample:\n\n```bash\n$ tanakai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key\n```\n\nCLI arguments: _same as for the [setup](#setup) command_ (except `--ask-sudo`), plus\n* `--repo-url` provide a custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise the current `origin/master` will be taken (output from `$ git remote get-url origin`)\n* `--repo-key-path` if the git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. 
You can omit it if the required key is already added to the keychain on your desktop (same as with the `--ssh-key-path` option)\n\n\u003e You can check the deploy playbook [here](lib/tanakai/automation/deploy.yml)\n\n## Spider `@config`\n\nUsing `@config` you can set several options for a spider, like proxy, user-agent, default cookies/headers, delay between requests, browser **memory control** and so on:\n\n```ruby\nclass Spider \u003c Tanakai::Base\n  USER_AGENTS = [\"Chrome\", \"Firefox\", \"Safari\", \"Opera\"]\n  PROXIES = [\"2.3.4.5:8080:http:username:password\", \"3.4.5.6:3128:http\", \"1.2.3.4:3000:socks5\"]\n\n  @engine = :poltergeist_phantomjs\n  @start_urls = [\"https://example.com/\"]\n  @config = {\n    headers: { \"custom_header\" =\u003e \"custom_value\" },\n    cookies: [{ name: \"cookie_name\", value: \"cookie_value\", domain: \".example.com\" }],\n    user_agent: -\u003e { USER_AGENTS.sample },\n    proxy: -\u003e { PROXIES.sample },\n    window_size: [1366, 768],\n    disable_images: true,\n    restart_if: {\n      # Restart browser if provided memory limit (in kilobytes) is exceeded:\n      memory_limit: 350_000\n    },\n    before_request: {\n      # Change user agent before each request:\n      change_user_agent: true,\n      # Change proxy before each request:\n      change_proxy: true,\n      # Clear all cookies and set default cookies (if provided) before each request:\n      clear_and_set_cookies: true,\n      # Process delay before each request:\n      delay: 1..3\n    }\n  }\n\n  def parse(response, url:, data: {})\n    # ...\n  end\nend\n```\n\n### All available `@config` options\n\n```ruby\n@config = {\n  # Custom headers, format: hash. 
Example: { \"some header\" =\u003e \"some value\", \"another header\" =\u003e \"another value\" }\n  # Works only for :mechanize and :poltergeist_phantomjs engines (Selenium doesn't allow setting/getting headers)\n  headers: {},\n\n  # Custom User Agent, format: string or lambda.\n  # Use lambda if you want to rotate user agents before each run:\n  # user_agent: -\u003e { ARRAY_OF_USER_AGENTS.sample }\n  # Works for all engines\n  user_agent: \"Mozilla/5.0 Firefox/61.0\",\n\n  # Custom cookies, format: array of hashes.\n  # Format for a single cookie: { name: \"cookie name\", value: \"cookie value\", domain: \".example.com\" }\n  # Works for all engines\n  cookies: [],\n\n  # Proxy, format: string or lambda. Format of a proxy string: \"ip:port:protocol:user:password\"\n  # `protocol` can be http or socks5. User and password are optional.\n  # Use lambda if you want to rotate proxies before each run:\n  # proxy: -\u003e { ARRAY_OF_PROXIES.sample }\n  # Works for all engines, but keep in mind that Selenium drivers don't support proxies\n  # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http)\n  proxy: \"3.4.5.6:3128:http:user:pass\",\n\n  # If enabled, browser will ignore any https errors. It's handy while using a proxy\n  # with self-signed SSL cert (for example Crawlera or Mitmproxy)\n  # Also, it allows visiting webpages with an expired SSL certificate.\n  # Works for all engines\n  ignore_ssl_errors: true,\n\n  # Custom window size, works for all engines\n  window_size: [1366, 768],\n\n  # Skip image downloading if true, works for all engines\n  disable_images: true,\n\n  # Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)\n  # Although native mode has better performance, virtual display mode\n  # can sometimes be useful. 
For example, some websites can detect (and block)\n  # headless chrome, so you can use virtual_display mode instead\n  headless_mode: :native,\n\n  # This option tells the browser not to use a proxy for the provided list of domains or IP addresses.\n  # Format: array of strings. Works only for :selenium_firefox and :selenium_chrome\n  proxy_bypass_list: [],\n\n  # Option to provide custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize\n  ssl_cert_path: \"path/to/ssl_cert\",\n\n  # Inject some JavaScript code into the browser.\n  # Format: array of strings, where each string is a path to a JS file.\n  # Works only for poltergeist_phantomjs engine (Selenium doesn't support JS code injection)\n  extensions: [\"lib/code_to_inject.js\"],\n\n  # Automatically skip duplicated (already visited) urls when using `request_to` method.\n  # Possible values: `true` or `hash` with options.\n  # In case of `true`, all visited urls will be added to the storage's scope `:requests_urls`\n  # and if the url is already in this scope, the request will be skipped.\n  # You can configure this setting by providing additional options as hash:\n  # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:\n  # `scope:` - use a custom scope instead of `:requests_urls`\n  # `check_only:` - if true, the scope will only be checked for the url; the url will not\n  # be added to the scope if the scope doesn't contain it.\n  # Works for all drivers\n  skip_duplicate_requests: true,\n\n  # Automatically skip provided errors while requesting a page.\n  # If raised error matches one of the errors in the list, then this error will be caught,\n  # and request will be skipped.\n  # It is a good idea to skip errors like NotFound(404), etc.\n  # Format: array where elements are error classes and/or hashes. 
You can use hash format\n  # for more flexibility: `{ error: \"RuntimeError\", message: \"404 =\u003e Net::HTTPNotFound\" }`.\n  # Provided `message:` will be compared with a full error message using `String#include?`. Also\n  # you can use regex instead: `{ error: \"RuntimeError\", message: /404|403/ }`.\n  skip_request_errors: [{ error: RuntimeError, message: \"404 =\u003e Net::HTTPNotFound\" }],\n\n  # Automatically retry provided errors with a few attempts while requesting a page.\n  # If raised error matches one of the errors in the list, then this error will be caught\n  # and the request will be processed again after a delay. There are 3 attempts:\n  # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.\n  # If after 3 attempts there is still an exception, then the exception will be raised.\n  # It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.\n  # Format: same as for the `skip_request_errors` option.\n  retry_request_errors: [Net::ReadTimeout],\n\n  # Handle page encoding while parsing html response using Nokogiri. There are two modes:\n  # Auto (`:auto`) (try to fetch correct encoding from \u003cmeta http-equiv=\"Content-Type\"\u003e or \u003cmeta charset\u003e tags)\n  # Manual: set the required encoding explicitly, example: `encoding: \"GB2312\"`\n  # By default this option is unset.\n  encoding: nil,\n\n  # Restart browser if one of the options is true:\n  restart_if: {\n    # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)\n    memory_limit: 350_000,\n\n    # Restart browser if provided requests limit is exceeded (works for all engines)\n    requests_limit: 100\n  },\n\n  # Perform several actions before each request:\n  before_request: {\n    # Change proxy before each request. The `proxy:` option above must be present\n    # and must be a lambda. 
Works only for poltergeist and mechanize engines\n    # (Selenium doesn't support proxy rotation).\n    change_proxy: true,\n\n    # Change user agent before each request. The `user_agent:` option above must be present\n    # and must be a lambda. Works only for poltergeist and mechanize engines\n    # (Selenium doesn't support getting/setting headers).\n    change_user_agent: true,\n\n    # Clear all cookies before each request, works for all engines\n    clear_cookies: true,\n\n    # If you want to clear all cookies + set custom cookies (`cookies:` option above must be present)\n    # use this option instead (works for all engines)\n    clear_and_set_cookies: true,\n\n    # Global option to set delay between requests.\n    # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,\n    # delay number will be chosen randomly for each request: `rand(2..5) # =\u003e 3`\n    delay: 1..3\n  }\n}\n```\n\nAs you can see, most of the options are universal for any engine.\n\n### `@config` settings inheritance\nSettings can be inherited:\n\n```ruby\nclass ApplicationSpider \u003c Tanakai::Base\n  @engine = :poltergeist_phantomjs\n  @config = {\n    user_agent: \"Firefox\",\n    disable_images: true,\n    restart_if: { memory_limit: 350_000 },\n    before_request: { delay: 1..2 }\n  }\nend\n\nclass CustomSpider \u003c ApplicationSpider\n  @name = \"custom_spider\"\n  @start_urls = [\"https://example.com/\"]\n  @config = {\n    before_request: { delay: 4..6 }\n  }\n\n  def parse(response, url:, data: {})\n    # ...\n  end\nend\n```\n\nHere, `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with `ApplicationSpider` config, so `CustomSpider` will keep all inherited options with only `delay` updated.\n\n## Project mode\n\nTanakai can work in project mode ([Like Scrapy](https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project)). 
To generate a new project, run: `$ tanakai generate project web_spiders` (where `web_spiders` is the name of the project).\n\nStructure of the project:\n\n```bash\n.\n├── config/\n│   ├── initializers/\n│   ├── application.rb\n│   ├── automation.yml\n│   ├── boot.rb\n│   └── schedule.rb\n├── spiders/\n│   └── application_spider.rb\n├── db/\n├── helpers/\n│   └── application_helper.rb\n├── lib/\n├── log/\n├── pipelines/\n│   ├── validator.rb\n│   └── saver.rb\n├── tmp/\n├── .env\n├── Gemfile\n├── Gemfile.lock\n└── README.md\n```\n\n\u003cdetails/\u003e\n  \u003csummary\u003eDescription\u003c/summary\u003e\n\n* `config/` folder for configuration files\n  * `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code when the framework starts\n  * `config/application.rb` configuration settings for Tanakai (`Tanakai.configure do` block)\n  * `config/automation.yml` specify some settings for [setup and deploy](#automated-sever-setup-and-deployment)\n  * `config/boot.rb` loads framework and project\n  * `config/schedule.rb` Cron [schedule for spiders](#schedule-spiders-using-cron)\n* `spiders/` folder for spiders\n  * `spiders/application_spider.rb` Base parent class for all spiders\n* `db/` store all database files here (`sqlite`, `json`, `csv`, etc.)\n* `helpers/` Rails-like helpers for spiders\n  * `helpers/application_helper.rb` all methods inside ApplicationHelper module will be available for all spiders\n* `lib/` put custom Ruby code here\n* `log/` folder for logs\n* `pipelines/` folder for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines. One file = one pipeline\n  * `pipelines/validator.rb` example pipeline to validate item\n  * `pipelines/saver.rb` example pipeline to save item\n* `tmp/` folder for temp. 
files\n* `.env` file to store ENV variables for project and load them using [Dotenv](https://github.com/bkeepers/dotenv)\n* `Gemfile` dependency file\n* `README.md` example project readme\n\u003c/details\u003e\n\n\n### Generate new spider\nTo generate a new spider in the project, run:\n\n```bash\n$ tanakai generate spider example_spider\n      create  spiders/example_spider.rb\n```\n\nThe command generates a new spider class inherited from `ApplicationSpider`:\n\n```ruby\nclass ExampleSpider \u003c ApplicationSpider\n  @name = \"example_spider\"\n  @start_urls = []\n  @config = {}\n\n  def parse(response, url:, data: {})\n  end\nend\n```\n\n### Crawl\nTo run a particular spider in the project, run: `$ bundle exec tanakai crawl example_spider`. Don't forget to add `bundle exec` before the command to load the required environment.\n\n### List\nTo list all project spiders, run: `$ bundle exec tanakai list`\n\n### Parse\nFor project spiders you can use the `$ tanakai parse` command, which helps to debug spiders:\n\n```bash\n$ bundle exec tanakai parse example_spider parse_product --url https://example-shop.com/product-1\n```\n\nwhere `example_spider` is the spider to run, `parse_product` is the spider method to process and `--url` is the url to open inside the processing method.\n\n### Pipelines, `send_item` method\nYou can use item pipelines to keep the item processing logic for all project spiders in one place (also check Scrapy [description of pipelines](https://doc.scrapy.org/en/latest/topics/item-pipeline.html#item-pipeline)).\n\nImagine you have three spiders, each crawling a different e-commerce shop and saving only shoe products. For each spider, you want to save only items with the \"shoe\" category, a unique sku, a valid title/price and existing images. 
To avoid code duplication between spiders, use pipelines:\n\n\u003cdetails/\u003e\n  \u003csummary\u003eExample\u003c/summary\u003e\n\npipelines/validator.rb\n```ruby\nclass Validator \u003c Tanakai::Pipeline\n  def process_item(item, options: {})\n    # Here you can validate item and raise `DropItemError`\n    # if one of the validations failed. Examples:\n\n    # Drop item if its category is not \"shoe\":\n    if item[:category] != \"shoe\"\n      raise DropItemError, \"Wrong item category\"\n    end\n\n    # Check item sku for uniqueness using built-in unique? helper:\n    unless unique?(:sku, item[:sku])\n      raise DropItemError, \"Item sku is not unique\"\n    end\n\n    # Drop item if title is shorter than 5 characters:\n    if item[:title].size \u003c 5\n      raise DropItemError, \"Item title is short\"\n    end\n\n    # Drop item if price is not present:\n    unless item[:price].present?\n      raise DropItemError, \"item price is not present\"\n    end\n\n    # Drop item if it doesn't contain any images:\n    unless item[:images].present?\n      raise DropItemError, \"Item images are not present\"\n    end\n\n    # Pass item to the next pipeline (if it wasn't dropped):\n    item\n  end\nend\n\n```\n\npipelines/saver.rb\n```ruby\nclass Saver \u003c Tanakai::Pipeline\n  def process_item(item, options: {})\n    # Here you can save item to the database, send it to a remote API or\n    # simply save item to a file format using `save_to` helper:\n\n    # To get the name of current spider: `spider.class.name`\n    save_to \"db/#{spider.class.name}.json\", item, format: :json\n\n    item\n  end\nend\n```\n\nspiders/application_spider.rb\n```ruby\nclass ApplicationSpider \u003c Tanakai::Base\n  @engine = :selenium_chrome\n  # Define pipelines (by order) for all spiders:\n  @pipelines = [:validator, :saver]\nend\n```\n\nspiders/shop_spider_1.rb\n```ruby\nclass ShopSpiderOne \u003c ApplicationSpider\n  @name = \"shop_spider_1\"\n  @start_urls = 
[\"https://shop-1.com\"]\n\n  # ...\n\n  def parse_product(response, url:, data: {})\n    # ...\n\n    # Send item to pipelines:\n    send_item item\n  end\nend\n```\n\nspiders/shop_spider_2.rb\n```ruby\nclass ShopSpiderTwo \u003c ApplicationSpider\n  @name = \"shop_spider_2\"\n  @start_urls = [\"https://shop-2.com\"]\n\n  def parse_product(response, url:, data: {})\n    # ...\n\n    # Send item to pipelines:\n    send_item item\n  end\nend\n```\n\nspiders/shop_spider_3.rb\n```ruby\nclass ShopSpiderThree \u003c ApplicationSpider\n  @name = \"shop_spider_3\"\n  @start_urls = [\"https://shop-3.com\"]\n\n  def parse_product(response, url:, data: {})\n    # ...\n\n    # Send item to pipelines:\n    send_item item\n  end\nend\n```\n\u003c/details\u003e\u003cbr\u003e\n\nWhen you start using pipelines, there are stats for items appears:\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample\u003c/summary\u003e\n\npipelines/validator.rb\n```ruby\nclass Validator \u003c Tanakai::Pipeline\n  def process_item(item, options: {})\n    if item[:star_count] \u003c 10\n      raise DropItemError, \"Repository doesn't have enough stars\"\n    end\n\n    item\n  end\nend\n```\n\nspiders/github_spider.rb\n```ruby\nclass GithubSpider \u003c ApplicationSpider\n  @name = \"github_spider\"\n  @engine = :selenium_chrome\n  @pipelines = [:validator]\n  @start_urls = [\"https://github.com/search?q=Ruby%20Web%20Scraping\"]\n  @config = {\n    user_agent: \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36\",\n    before_request: { delay: 4..7 }\n  }\n\n  def parse(response, url:, data: {})\n    response.xpath(\"//ul[@class='repo-list']/div//h3/a\").each do |a|\n      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)\n    end\n\n    if next_page = response.at_xpath(\"//a[@class='next_page']\")\n      request_to :parse, url: absolute_url(next_page[:href], base: url)\n    end\n  end\n\n  def parse_repo_page(response, url:, 
data: {})\n    item = {}\n\n    item[:owner] = response.xpath(\"//h1//a[@rel='author']\").text\n    item[:repo_name] = response.xpath(\"//h1/strong[@itemprop='name']/a\").text\n    item[:repo_url] = url\n    item[:description] = response.xpath(\"//span[@itemprop='about']\").text.squish\n    item[:tags] = response.xpath(\"//div[@id='topics-list-container']/div/a\").map { |a| a.text.squish }\n    item[:watch_count] = response.xpath(\"//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]\").text.squish.delete(\",\").to_i\n    item[:star_count] = response.xpath(\"//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]\").text.squish.delete(\",\").to_i\n    item[:fork_count] = response.xpath(\"//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]\").text.squish.delete(\",\").to_i\n    item[:last_commit] = response.xpath(\"//span[@itemprop='dateModified']/*\").text\n\n    send_item item\n  end\nend\n```\n\n```\n$ bundle exec tanakai crawl github_spider\n\nI, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Spider: started: github_spider\nD, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance\nI, [2018-08-22 15:56:40 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping\nI, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping\nI, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: visits: requests: 1, responses: 1\nD, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 116182\nD, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 5 seconds before request...\n\nI, [2018-08-22 15:56:49 +0400#1358] [M: 47347279209980]  INFO -- 
github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping\nI, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping\nI, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: visits: requests: 2, responses: 2\nD, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 217432\nD, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...\nI, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Pipeline: processed: {\"owner\":\"lorien\",\"repo_name\":\"awesome-web-scraping\",\"repo_url\":\"https://github.com/lorien/awesome-web-scraping\",\"description\":\"List of libraries, tools and APIs for web scraping and data processing.\",\"tags\":[\"awesome\",\"awesome-list\",\"web-scraping\",\"data-processing\",\"python\",\"javascript\",\"php\",\"ruby\"],\"watch_count\":159,\"star_count\":2423,\"fork_count\":358,\"last_commit\":\"4 days ago\"}\nI, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: items: sent: 1, processed: 1\nD, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 6 seconds before request...\n\n...\n\nI, [2018-08-22 16:11:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight\nI, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight\nI, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: visits: requests: 140, responses: 140\nD, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 211713\n\nD, [2018-08-22 16:11:51 +0400#1358] 
[M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...\nE, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] ERROR -- github_spider: Pipeline: dropped: #\u003cTanakai::Pipeline::DropItemError: Repository doesn't have enough stars\u003e, item: {:owner=\u003e\"preston\", :repo_name=\u003e\"idclight\", :repo_url=\u003e\"https://github.com/preston/idclight\", :description=\u003e\"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.\", :tags=\u003e[], :watch_count=\u003e6, :star_count=\u003e1, :fork_count=\u003e0, :last_commit=\u003e\"on Apr 12, 2012\"}\n\nI, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: items: sent: 127, processed: 12\n\nI, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: driver selenium_chrome has been destroyed\nI, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Spider: stopped: {:spider_name=\u003e\"github_spider\", :status=\u003e:completed, :environment=\u003e\"development\", :start_time=\u003e2018-08-22 15:56:35 +0400, :stop_time=\u003e2018-08-22 16:11:51 +0400, :running_time=\u003e\"15m, 16s\", :visits=\u003e{:requests=\u003e140, :responses=\u003e140}, :items=\u003e{:sent=\u003e127, :processed=\u003e12}, :error=\u003enil}\n```\n\u003c/details\u003e\u003cbr\u003e\n\nYou can also pass custom options to a pipeline from a particular spider to change the pipeline's behavior for that spider only:\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample\u003c/summary\u003e\n\nspiders/custom_spider.rb\n```ruby\nclass CustomSpider \u003c ApplicationSpider\n  @name = \"custom_spider\"\n  @start_urls = [\"https://example.com\"]\n  @pipelines = [:validator]\n\n  # ...\n\n  def parse_item(response, url:, data: {})\n    # ...\n\n    # Pass 
custom option `skip_uniq_checking` for Validator pipeline:\n    send_item item, validator: { skip_uniq_checking: true }\n  end\nend\n\n```\n\npipelines/validator.rb\n```ruby\nclass Validator \u003c Tanakai::Pipeline\n  def process_item(item, options: {})\n\n    # Do not check item sku for uniqueness if options[:skip_uniq_checking] is true\n    if options[:skip_uniq_checking] != true\n      raise DropItemError, \"Item sku is not unique\" unless unique?(:sku, item[:sku])\n    end\n\n    # Return the item so it is passed on for further processing\n    item\n  end\nend\n```\n\u003c/details\u003e\n\n\n### Runner\n\nYou can run project spiders one by one or in parallel using the `$ tanakai runner` command:\n\n```\n$ bundle exec tanakai list\ncustom_spider\nexample_spider\ngithub_spider\n\n$ bundle exec tanakai runner -j 3\n\u003e\u003e\u003e Runner: started: {:id=\u003e1533727423, :status=\u003e:processing, :start_time=\u003e2018-08-08 15:23:43 +0400, :stop_time=\u003enil, :environment=\u003e\"development\", :concurrent_jobs=\u003e3, :spiders=\u003e[\"custom_spider\", \"github_spider\", \"example_spider\"]}\n\u003e Runner: started spider: custom_spider, index: 0\n\u003e Runner: started spider: github_spider, index: 1\n\u003e Runner: started spider: example_spider, index: 2\n\u003c Runner: stopped spider: custom_spider, index: 0\n\u003c Runner: stopped spider: example_spider, index: 2\n\u003c Runner: stopped spider: github_spider, index: 1\n\u003c\u003c\u003c Runner: stopped: {:id=\u003e1533727423, :status=\u003e:completed, :start_time=\u003e2018-08-08 15:23:43 +0400, :stop_time=\u003e2018-08-08 15:25:11 +0400, :environment=\u003e\"development\", :concurrent_jobs=\u003e3, :spiders=\u003e[\"custom_spider\", \"github_spider\", \"example_spider\"]}\n```\n\nEach spider runs in a separate process. Spider logs are available in the `log/` folder. 
Pass the `-j` option to specify how many spiders should run at the same time (default is 1).\n\nYou can provide additional arguments like `--include` or `--exclude` to specify which spiders to run:\n\n```bash\n# Run only custom_spider and example_spider:\n$ bundle exec tanakai runner --include custom_spider example_spider\n\n# Run all except github_spider:\n$ bundle exec tanakai runner --exclude github_spider\n```\n\n#### Runner callbacks\n\nYou can perform custom actions before the runner starts and after it stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. See [config/application.rb](lib/tanakai/template/config/application.rb) for an example.\n\n## Testing\nTo run tests:\n```bash\nbundle exec rspec\n```\n\n## Chat Support and Feedback\nSubmit an issue on GitHub and we'll try to address it in a timely manner.\n\n## License\nThis gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglaucocustodio%2Ftanakai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fglaucocustodio%2Ftanakai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglaucocustodio%2Ftanakai/lists"}