https://github.com/commoncrawl/cc-nutch-example
Apache Nutch example project to archive content in WARC files
- Host: GitHub
- URL: https://github.com/commoncrawl/cc-nutch-example
- Owner: commoncrawl
- License: apache-2.0
- Created: 2020-07-13T15:50:46.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-12-22T12:32:33.000Z (over 3 years ago)
- Last Synced: 2024-05-07T18:23:39.076Z (about 2 years ago)
- Language: Shell
- Size: 7.81 KB
- Stars: 3
- Watchers: 4
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files
==================================================================================
A short description of how to set up [Common Crawl's
fork](https://github.com/commoncrawl/nutch/) of [Apache
Nutch](https://nutch.apache.org/) for crawling and how to store the
crawled content in WARC files.
# Requirements and installation
- Linux (tested on Ubuntu 20.04)
- Java 11 (higher Java versions should also work)
- [ant](https://ant.apache.org/) and [maven](https://maven.apache.org/)
- [Compact Language Detector 2](https://github.com/CLD2Owners/cld2)
```bash
sudo apt install libcld2-0 libcld2-dev ant maven
```
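Before compiling, it can help to confirm that the required tools are on the `PATH`. The following is only a minimal sketch: the helper name `missing_tools` is made up for illustration and is not part of any of the projects used here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each given command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing="$(missing_tools java ant mvn)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  # Java 11 or higher is expected; the version line is printed for a manual check.
  java -version 2>&1 | head -n 1
fi
```

The script only reports what it finds and does not install anything.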
# Compile Nutch and required projects
```bash
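# Build and install crawler-commons into the local Maven repository
git clone https://github.com/crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..
# Build and install the CLD2 language detection wrapper
git clone https://github.com/commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..
# Build Common Crawl's fork of Nutch
git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..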
```
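A quick way to see whether the build produced a runtime is to look for the Nutch launcher script. This sketch assumes that the fork keeps the layout of upstream Apache Nutch, where `ant runtime` creates `runtime/local/bin/nutch`. The helper name `nutch_is_built` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the Nutch launcher script exists and is
# executable below the given checkout directory.
nutch_is_built() {
  [ -x "$1/runtime/local/bin/nutch" ]
}

if nutch_is_built nutch-cc; then
  echo "Nutch runtime found in nutch-cc/runtime/local"
else
  echo "No Nutch runtime found, check the output of 'ant runtime'"
fi
```

Run it from the directory that contains the `nutch-cc` checkout.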
# Configuration
Go to the project root folder `nutch-cc` and edit the files in the
folder `conf/`, especially `conf/nutch-site.xml`. The URL filter
configuration files may also need to be adapted to your use case.
Notes:
- at least the property `http.agent.name` must be configured
in the file `conf/nutch-site.xml`
- if the configuration is changed, Nutch needs to be recompiled because the
configuration files are contained in the job file (`runtime/local/apache-nutch-*.job`)
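As an illustration of the first note, the sketch below prints a property element in the usual Nutch configuration format. The helper name `agent_name_property` and the value `my-test-crawler` are placeholders chosen for this example, not names used by the project.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a Nutch <property> element for http.agent.name.
# The agent name passed in is a placeholder, choose one that identifies your crawler.
agent_name_property() {
  cat <<EOF
<property>
  <name>http.agent.name</name>
  <value>$1</value>
</property>
EOF
}

agent_name_property "my-test-crawler"
```

The printed element belongs inside the `<configuration>` element of `conf/nutch-site.xml`. After editing the file, run `ant runtime` in `nutch-cc` again so that the change is picked up.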
# Run crawl
```bash
echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt
./crawl.sh crawl 3 urls.txt
```
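The seed file used above holds one URL per line, followed by tab-separated `key=value` metadata such as `nutch.score`. For more than one seed, a small generator may be handier than `echo -e`. This is a sketch only, and the helper name `make_seeds` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one seed line per URL, each with the given
# initial score as tab-separated metadata.
make_seeds() {
  score="$1"
  shift
  for url in "$@"; do
    printf '%s\tnutch.score=%s\n' "$url" "$score"
  done
}

# Usage: make_seeds 1.0 https://nutch.apache.org/ >urls.txt
make_seeds 1.0 https://nutch.apache.org/
```

Using `printf` instead of `echo -e` avoids differences in how shells interpret escape sequences.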