https://github.com/commoncrawl/cc-nutch-example

Apache Nutch example project to archive content in WARC files

Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files
==================================================================================

A short description of how to set up [Common Crawl's
fork](https://github.com/commoncrawl/nutch/) of [Apache
Nutch](https://nutch.apache.org/) for crawling and how to store the
crawled content in WARC files.

# Requirements and installation

- Linux (tested on Ubuntu 20.04)
- Java 11 (higher Java versions should also work)
- [Ant](https://ant.apache.org/) and [Maven](https://maven.apache.org/)
- [Compact Language Detector 2](https://github.com/CLD2Owners/cld2)

```bash
sudo apt install libcld2-0 libcld2-dev ant maven
```
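Before building, it may help to confirm that the required tools are on the `PATH`. The following is a minimal sketch, assuming the command names `java`, `ant`, `mvn` and `git`; the helper `missing_tools` is hypothetical and not part of Nutch or this project.

```bash
# Hypothetical helper: print every command name from the arguments
# that cannot be found on the PATH, one name per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
  return 0
}
```

Calling `missing_tools java ant mvn git` should print nothing when everything is installed. The Java version itself can be checked with `java -version`.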

# Compile Nutch and required projects

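```bash
# crawler-commons, installed into the local Maven repository
# (cloned via HTTPS so that no GitHub SSH key is needed)
git clone https://github.com/crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

# Java bindings for the CLD2 language detector
git clone https://github.com/commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

# Common Crawl's fork of Nutch, built into nutch-cc/runtime/
git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..
```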

# Configuration

Go to the project root folder `nutch-cc` and edit the files in the
folder `conf/`, especially `conf/nutch-site.xml`. The URL filter
configuration files may also need to be adapted to your use case.

Notes:
- at least the property `http.agent.name` must be configured
  in the file `conf/nutch-site.xml`
- if the configuration is changed, Nutch needs to be recompiled because
  the configuration files are contained in the job file (`runtime/local/apache-nutch-*.job`)
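
The two notes above could be sketched as follows. Both helpers are hypothetical and not part of Nutch. `write_site_config` writes a scratch file showing the `<property>` layout that Nutch's Hadoop-style configuration files use, and `agent_name_configured` is a line-based heuristic, not a real XML parser.

```bash
# Hypothetical helper: write a minimal nutch-site.xml to OUT_FILE that sets
# only http.agent.name. Intended for a scratch file, not for overwriting
# an existing conf/nutch-site.xml.
write_site_config() {
  local out_file=$1 agent_name=$2
  cat >"$out_file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>${agent_name}</value>
  </property>
</configuration>
EOF
}

# Hypothetical check: succeed if FILE sets a non-empty http.agent.name.
# Heuristic only: assumes <value> appears within three lines after <name>.
agent_name_configured() {
  grep -A3 '<name>http.agent.name</name>' "$1" \
    | grep -Eq '<value>[^<[:space:]][^<]*</value>'
}
```

For example, `write_site_config /tmp/nutch-site.xml my-test-crawler` produces a file to compare against, where `my-test-crawler` is a placeholder agent name. Run from `nutch-cc/`, `agent_name_configured conf/nutch-site.xml && ant runtime` would rebuild only when the property is set.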

# Run crawl

```bash
echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt

./crawl.sh crawl 3 urls.txt
```
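
For more than one seed URL, the seed file can be built line by line. This is a minimal sketch, assuming each line follows the format shown above (URL, a tab, then `nutch.score=<score>`); the helper `seed_line` is hypothetical and not part of Nutch or `crawl.sh`.

```bash
# Hypothetical helper: print one seed line, i.e. the URL, a tab,
# and the metadata pair nutch.score=<score> (score defaults to 1.0).
seed_line() {
  printf '%s\tnutch.score=%s\n' "$1" "${2:-1.0}"
}
```

For example, `seed_line https://nutch.apache.org/ >urls.txt` followed by `seed_line https://example.org/ 0.5 >>urls.txt` would write a two-line seed file, where `https://example.org/` is a placeholder URL. Using `printf` avoids relying on `echo -e`, which not every shell supports.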