https://github.com/sbatururimi/nutch-test
Different example of using Nutch: with Solr, Selenium Hub, standalone web drivers
https://github.com/sbatururimi/nutch-test
apache-nutch apache-solr selenium
Last synced: about 1 month ago
JSON representation
Different example of using Nutch: with Solr, Selenium Hub, standalone web drivers
- Host: GitHub
- URL: https://github.com/sbatururimi/nutch-test
- Owner: sbatururimi
- License: mit
- Created: 2018-11-20T14:39:07.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-02-12T13:32:35.000Z (over 7 years ago)
- Last Synced: 2025-12-27T00:03:12.609Z (6 months ago)
- Topics: apache-nutch, apache-solr, selenium
- Language: Dockerfile
- Homepage:
- Size: 257 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
# Installating Nutch
## Option 1: Nutch only
docker build --force-rm -t nutch .
## Option 2: selenium hub + nutch + solr
Selenium hub with 10 Chrome nodes and 10 Firefox nodes each in headless mode
```
docker-compose -f docker-compose_selenium_nutch_solr.yaml up -d --scale chrome=10 --scale firefox=10
```
## Option 3: nutch + solr
```
docker-compose -f docker-compose_nutch_solr.yaml up -d
```
## Option 4: selenium hub + nutch + solr + tor instances
```
docker-compose -f docker-compose_selenium_nutch_solr_tor.yaml up -d --scale firefox=40
```
# Installing Chrome Driver
This is an option when not using Selenium HUB.
1) Install Chrome browser:
* edit sources.list
```
vi /etc/apt/sources.list
# add at the bottom of the file
deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main
```
* Download the signing key
```
wget https://dl.google.com/linux/linux_signing_key.pub
apt-key add linux_signing_key.pub
```
* Install the stable version of Google Chrome
```
apt update
apt install google-chrome-stable
```
**NB**
You may need to upgrade and then update your packages:
```
apt upgrade
apt update
```
2) download chrome driver from the [download page](http://chromedriver.chromium.org/downloads)
```
cd ~
wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
rm chromedriver_linux64.zip
```
3) Change the location of the ChromeDriver binary path if necessary in nutch-default.xml or nutch-site.xml by specifying
the value for `selenium.grid.binary`
# Installing Firefox Driver
This is an option when not using Selenium HUB.
1) Install Firefox browser:
```
apt install firefox
```
2) download gecko driver from the [download page](https://www.softwaretestinghelp.com/selenium-webdriver-selenium-tutorial-8/)
```
cd ~
wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
tar -zxvf geckodriver-v0.23.0-linux64.tar.gz
rm geckodriver-v0.23.0-linux64.tar.gz
```
3) Change the location of the gecko binary path if necessary in nutch-default.xml or nutch-site.xml by specifying
the value for `selenium.grid.binary`
# Installing Opera Driver
This is an option when not using Selenium HUB.
1) Install Opera browser by downloading the last version from [link](hhttp://http://download4.operacdn.com/ftp/pub/opera/desktop)
```
wget http://download4.operacdn.com/ftp/pub/opera/desktop/56.0.3051.99/linux/opera-stable_56.0.3051.99_amd64.deb
dpkg -i opera-stable_56.0.3051.99_amd64.deb
apt install -f
```
**NB**
Update to the appropriate Opera version.
2) download opera driver from the [download page](https://github.com/operasoftware/operachromiumdriver/releases)
```
cd ~
wget wget https://github.com/operasoftware/operachromiumdriver/releases/download/v.2.40/operadriver_linux64.zip
unzip operadriver_linux64.zip
rm operadriver_linux64.zip
mv operadriver_linux64/operadriver /root
chmod +x operadriver
```
3) Change the location of the gecko binary path if necessary in nutch-default.xml or nutch-site.xml by specifying
the value for `selenium.grid.binary`
# Run a test
1) Set the value for `selenium.driver` in `conf/nutch-site.xml` to the selenium driver you want to test
2) If you don't have a screen being attached to the server, set `selenium.enable.headless` to `true`
3) crawl
```
# connect to the nutch container
docker exec -it nutch bash
# execute the crawl
/root/nutch/bin/crawl -i -D solr.server.url=http://solr:8983/solr/mycore -s urls crawler 1
```
4) check the result
- Test your result in Solr by opening in your browser:
localhost:8983/
- navigate to the created node `mycore`,
- execute the default query fetch:
```
*:*
```
# Hints
Regarding the redirects: if you want to follow redirects immediately in the fetcher you simply could adjust `http.redirect.max` (e.g., set it to 3) and Fetcher will follow the redirects immediately.
Btw., for quick testing you could just set the required parameters in the command-line, e.g.:
```
% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
-Dselenium.grid.binary=.../geckodriver \
-Dselenium.enable.headless=true \
-followRedirects \
-dumpText https://nutch.apache.org
```
## License
[](https://github.com/sbatururimi/nutch-test/blob/master/LICENSE.md)