{"id":26207536,"url":"https://github.com/sbatururimi/nutch-test","last_synced_at":"2026-05-21T13:06:25.616Z","repository":{"id":87479524,"uuid":"158403799","full_name":"sbatururimi/nutch-test","owner":"sbatururimi","description":"Different example of using Nutch: with Solr, Selenium Hub, standalone web drivers","archived":false,"fork":false,"pushed_at":"2019-02-12T13:32:35.000Z","size":263,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-12-27T00:03:12.609Z","etag":null,"topics":["apache-nutch","apache-solr","selenium"],"latest_commit_sha":null,"homepage":"","language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sbatururimi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2018-11-20T14:39:07.000Z","updated_at":"2023-06-02T20:04:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"512833f9-19ce-41ae-80f7-a6e6030fb607","html_url":"https://github.com/sbatururimi/nutch-test","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sbatururimi/nutch-test","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sbatururimi%2Fnutch-test","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sbatururimi%2Fnutch-test/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sbatururimi%2Fnutch-test/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sbatururimi%2Fnutch-test/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sbatururimi","download_url":"https://codeload.github.com/sbatururimi/nutch-test/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sbatururimi%2Fnutch-test/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33301552,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-21T12:23:38.849Z","status":"ssl_error","status_checked_at":"2026-05-21T12:22:11.673Z","response_time":62,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-nutch","apache-solr","selenium"],"created_at":"2025-03-12T05:35:06.268Z","updated_at":"2026-05-21T13:06:25.611Z","avatar_url":"https://github.com/sbatururimi.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Installing Nutch\n## Option 1: Nutch only\ndocker build --force-rm  -t nutch .\n\n## Option 2: selenium hub + nutch + solr\nSelenium hub with 10 Chrome nodes and 10 Firefox nodes, each in headless mode\n```\ndocker-compose -f docker-compose_selenium_nutch_solr.yaml up -d --scale chrome=10 --scale firefox=10\n```\n## Option 3: nutch + solr\n\n```\ndocker-compose -f docker-compose_nutch_solr.yaml up -d\n```\n\n## Option 4: selenium hub + nutch + solr + tor 
instances\n```\ndocker-compose -f docker-compose_selenium_nutch_solr_tor.yaml up -d --scale firefox=40\n```\n\n# Installing Chrome Driver\n\nThis is an option when not using Selenium HUB.\n\n1) Install Chrome browser:\n* Edit sources.list\n\n```\nvi /etc/apt/sources.list\n# add at the bottom of the file\ndeb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main\n```\n\n* Download the signing key\n```\nwget https://dl.google.com/linux/linux_signing_key.pub\napt-key add linux_signing_key.pub\n```\n\n* Install the stable version of Google Chrome\n```\napt update\napt install google-chrome-stable\n```\n\n**NB**\nYou may need to upgrade and then update your packages:\n```\napt upgrade\napt update\n```\n\n2) Download ChromeDriver from the [download page](http://chromedriver.chromium.org/downloads)\n```\ncd ~\nwget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip\nunzip chromedriver_linux64.zip\nrm chromedriver_linux64.zip\n```\n3) If necessary, change the ChromeDriver binary path in nutch-default.xml or nutch-site.xml by setting\nthe value of `selenium.grid.binary`\n\n# Installing Firefox Driver\n\nThis is an option when not using Selenium HUB.\n\n1) Install Firefox browser:\n\n```\napt install firefox\n```\n\n2) Download geckodriver from the [download page](https://www.softwaretestinghelp.com/selenium-webdriver-selenium-tutorial-8/)\n```\ncd ~\nwget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz\ntar -zxvf geckodriver-v0.23.0-linux64.tar.gz\nrm geckodriver-v0.23.0-linux64.tar.gz\n```\n3) If necessary, change the geckodriver binary path in nutch-default.xml or nutch-site.xml by setting\nthe value of `selenium.grid.binary`\n\n# Installing Opera Driver\n\nThis is an option when not using Selenium HUB. 
\n\n1) Install Opera browser by downloading the latest version from [link](http://download4.operacdn.com/ftp/pub/opera/desktop)\n\n```\nwget http://download4.operacdn.com/ftp/pub/opera/desktop/56.0.3051.99/linux/opera-stable_56.0.3051.99_amd64.deb\ndpkg -i opera-stable_56.0.3051.99_amd64.deb\napt install -f\n```\n**NB**\nUpdate to the appropriate Opera version.\n\n2) Download the Opera driver from the [download page](https://github.com/operasoftware/operachromiumdriver/releases)\n```\ncd ~\nwget https://github.com/operasoftware/operachromiumdriver/releases/download/v.2.40/operadriver_linux64.zip\nunzip operadriver_linux64.zip\nrm operadriver_linux64.zip\nmv operadriver_linux64/operadriver /root\nchmod +x operadriver\n```\n\n3) If necessary, change the Opera driver binary path in nutch-default.xml or nutch-site.xml by setting\nthe value of `selenium.grid.binary`\n\n\n# Run a test\n1) Set the value of `selenium.driver` in `conf/nutch-site.xml` to the Selenium driver you want to test\n2) If no screen is attached to the server, set `selenium.enable.headless` to `true`\n3) Crawl\n```\n# connect to the nutch container\ndocker exec -it nutch bash\n\n# execute the crawl\n/root/nutch/bin/crawl -i -D solr.server.url=http://solr:8983/solr/mycore -s urls crawler 1\n```\n\n4) Check the result\n- Test your result in Solr by opening in your browser:\nlocalhost:8983/\n- navigate to the created node `mycore`,\n- execute the default query:\n```\n*:*\n```\n\n# Hints\n\nRegarding redirects: if you want the fetcher to follow redirects immediately, adjust `http.redirect.max` (e.g., set it to 3).\nFor quick testing, you can also set the required parameters on the command line, e.g.:\n```\n% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \\\n   -Dselenium.grid.binary=.../geckodriver \\\n   -Dselenium.enable.headless=true \\\n   
-followRedirects \\\n   -dumpText https://nutch.apache.org\n\n```\n\n ## License\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/sbatururimi/nutch-test/blob/master/LICENSE.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsbatururimi%2Fnutch-test","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsbatururimi%2Fnutch-test","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsbatururimi%2Fnutch-test/lists"}