https://github.com/commoncrawl/cc-nutch-example

Apache Nutch example project to archive content in WARC files

Host: GitHub
URL: https://github.com/commoncrawl/cc-nutch-example
Owner: commoncrawl
License: apache-2.0
Created: 2020-07-13T15:50:46.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2022-12-22T12:32:33.000Z (over 3 years ago)
Last Synced: 2024-05-07T18:23:39.076Z (about 2 years ago)
Language: Shell
Size: 7.81 KB
Stars: 3
Watchers: 4
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Example Usage of Common Crawl's Fork of Apache Nutch to Crawl and Write WARC files
==================================================================================

A short description of how to set up
[Common Crawl's Fork](https://github.com/commoncrawl/nutch/) of
[Apache Nutch](https://nutch.apache.org/) for crawling and how to store
the crawled content in WARC files.

# Requirements and installation

- Linux (tested on Ubuntu 20.04)

- Java 11 (higher Java versions should also work)

- [ant](https://ant.apache.org/) and [maven](https://maven.apache.org/)

- [Compact Language Detector 2](https://github.com/CLD2Owners/cld2)

```bash
sudo apt install libcld2-0 libcld2-dev ant maven
```
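
Before compiling, it can help to confirm that the required tools are on the `PATH`. The following is a minimal sketch, not part of the original project; the helper name `missing_tools` is hypothetical, and the tool list (`java`, `ant`, `mvn`, `git`) is taken from the requirements above.

```bash
# Print every tool from the argument list that is not found on the PATH.
# (hypothetical helper, for illustration only)
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools java ant mvn git)"
if [ -n "$missing" ]; then
  echo "missing tools:" $missing >&2
else
  echo "all required tools found"
fi
```

This only checks that the commands exist; it does not verify the Java version or that the CLD2 libraries are installed.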

# Compile Nutch and required projects

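```bash
# build crawler-commons and install it into the local Maven repository
git clone https://github.com/crawler-commons/crawler-commons.git
cd crawler-commons/
mvn install
cd ..

# build language-detection-cld2 and install it into the local Maven repository
git clone https://github.com/commoncrawl/language-detection-cld2.git
cd language-detection-cld2/
mvn install
cd ..

# build Common Crawl's fork of Nutch in the folder nutch-cc
git clone https://github.com/commoncrawl/nutch.git nutch-cc
cd nutch-cc/
ant runtime
cd ..
```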

# Configuration

Go to the project root folder `nutch-cc` and edit the files in the
folder `conf/`, especially `conf/nutch-site.xml`. The URL filter
configuration files may also need to be adapted to your use case.

Notes:

- at least the property `http.agent.name` must be configured in the
  file `conf/nutch-site.xml`

- if the configuration is changed, Nutch needs to be recompiled because
  the configuration files are contained in the job file
  (`runtime/local/apache-nutch-*.job`)
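
As an illustration of the first note, the sketch below writes a minimal configuration fragment that sets `http.agent.name`. It is an assumption-laden example, not the project's own tooling: the agent name `my-test-crawler` is a placeholder, and the output goes to a scratch file by default so that an existing `conf/nutch-site.xml` is not overwritten.

```bash
# Placeholder agent name; replace it with the name of your own crawler.
AGENT_NAME="${AGENT_NAME:-my-test-crawler}"
# Scratch location by default (assumption), to avoid clobbering a real config.
SITE_XML="${SITE_XML:-$(mktemp -d)/nutch-site.xml}"

# Write a configuration file containing only the http.agent.name property.
cat >"$SITE_XML" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>${AGENT_NAME}</value>
  </property>
</configuration>
EOF

echo "wrote $SITE_XML"
# After merging the property into nutch-cc/conf/nutch-site.xml, rebuild so
# that the job file picks up the change:
#   (cd nutch-cc && ant runtime)
```

The `<property>` element would normally be merged into the existing `conf/nutch-site.xml` rather than replacing the file, since that file may already contain other settings.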

# Run crawl

```bash
echo -e "https://nutch.apache.org/\tnutch.score=1.0" >urls.txt
./crawl.sh crawl 3 urls.txt
```
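
The seed file above holds one URL per line, followed by a tab and a `nutch.score` metadata entry. A seed file with more than one URL could be written as sketched below; the second URL and its score are illustrative assumptions, and `printf` is used instead of `echo -e` only because its handling of `\t` is consistent across shells.

```bash
# Seed file name as used in the crawl command above.
SEEDS="${SEEDS:-urls.txt}"

# printf repeats the format for each URL/score pair, writing one
# tab-separated line per seed. The second pair is an illustrative assumption.
printf '%s\tnutch.score=%s\n' \
  "https://nutch.apache.org/" "1.0" \
  "https://commoncrawl.org/" "0.5" >"$SEEDS"

cat "$SEEDS"
```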
