https://github.com/erikgartner/prometheus-cc-extractor

This repository contains mapreduce extractors to preprocess and extract websites from the common crawl corpus.
big-data common-crawl data-extraction mapreduce spark

Host: GitHub
URL: https://github.com/erikgartner/prometheus-cc-extractor
Owner: ErikGartner
Created: 2017-03-20T09:47:58.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2017-03-21T13:12:12.000Z (almost 8 years ago)
Last Synced: 2024-12-13T10:49:18.427Z (2 months ago)
Topics: big-data, common-crawl, data-extraction, mapreduce, spark
Language: Python
Homepage:
Size: 173 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Prometheus Common Crawl Extractors
*This repository contains MapReduce extractors to preprocess and extract websites
from the Common Crawl corpus.*

Use `mrjob.conf` to configure the jobs to run on AWS EMR.

## Installation
The original ccmrjob repo uses Python 2.7; this repository has been upgraded to Python 3, which requires a different library to read the WARC files.

For Python 3:
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements_python3.text
```

For local testing, the `get-data.sh` script downloads 100 WET files.
It uses [httpie](https://httpie.org/#installation) for the downloads, so either install that or change the script to use cURL or wget.

```
./get-data.sh input/test-100.wet
```
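
To sanity-check a downloaded WET file before running a job, a record reader can be sketched with the standard library alone. This is an illustration of the WARC record layout (version line, headers, blank line, `Content-Length` bytes of payload), not the library this repository uses; the function name `iter_wet_records` and the sample record are made up for the example.

```python
import io


def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WET stream.

    `stream` is a binary file-like object. Each record is a "WARC/x.y"
    version line, header lines, a blank line, then exactly Content-Length
    bytes of payload followed by blank separator lines.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # skip blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # a blank line ends the header block
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        text = stream.read(length).decode("utf-8", errors="replace")
        yield headers, text


# Hypothetical in-memory record, used here instead of a real download.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
    b"\r\n\r\n"
)

for headers, text in iter_wet_records(io.BytesIO(sample)):
    print(headers["WARC-Target-URI"], len(text))
```

For a gzip-compressed file, passing the object returned by `gzip.open(path, "rb")` should work the same way, where `path` is wherever the file was saved.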

## Extractors

### Obama Born Extractor
This simple extractor finds documents that match a regex for the phrase "obama born in".

Test locally using:
```
python obama_born_extractor.py --conf-path mrjob.conf --no-output --output-dir out input/test-1.wet
```
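
The map and reduce steps behind an extractor like this can be sketched in plain Python without mrjob. This is not the code in `obama_born_extractor.py`: the pattern, the `mapper` and `reducer` names, and the sample documents are assumptions made for illustration, since the README only states that the regex targets "obama born in".

```python
import re
from collections import defaultdict

# Assumed pattern: the phrase "obama born in", case-insensitive,
# with any run of whitespace between the words.
PATTERN = re.compile(r"\bobama\s+born\s+in\b", re.IGNORECASE)


def mapper(url, text):
    """Emit (url, n) when the document text contains n >= 1 matches."""
    hits = len(PATTERN.findall(text))
    if hits:
        yield url, hits


def reducer(pairs):
    """Sum the match counts emitted for each URL."""
    totals = defaultdict(int)
    for url, hits in pairs:
        totals[url] += hits
    return dict(totals)


# Hypothetical documents standing in for WET records.
docs = [
    ("http://example.com/a", "Some claim Obama born in Kenya; Obama  born in Hawaii."),
    ("http://example.com/b", "Nothing relevant here."),
]

pairs = [pair for url, text in docs for pair in mapper(url, text)]
print(reducer(pairs))
```

In a real job the framework would group the mapper output by key and call the reducer once per key; the single `reducer` call here stands in for that shuffle step.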