https://github.com/erikgartner/prometheus-cc-extractor
This repository contains mapreduce extractors to preprocess and extract websites from the common crawl corpus.
big-data common-crawl data-extraction mapreduce spark
- Host: GitHub
- URL: https://github.com/erikgartner/prometheus-cc-extractor
- Owner: ErikGartner
- Created: 2017-03-20T09:47:58.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2017-03-21T13:12:12.000Z (almost 8 years ago)
- Last Synced: 2024-12-13T10:49:18.427Z (2 months ago)
- Topics: big-data, common-crawl, data-extraction, mapreduce, spark
- Language: Python
- Homepage:
- Size: 173 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Prometheus Common Crawl Extractors
*This repository contains MapReduce extractors to preprocess and extract websites
from the Common Crawl corpus.*

You may use `mrjob.conf` to configure running the jobs on AWS EMR.
## Installation
The original ccmrjob repo uses Python 2.7; however, this repository has been upgraded to Python 3, which entails using a different library to read the WARC files.

For Python 3:
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements_python3.text
```

For local testing, the `get-data.sh` script downloads 100 WET files.
It uses [httpie](https://httpie.org/#installation) for downloading, so either install that or change the script to use cURL or wget.

```
./get-data.sh input/test-100.wet
```

## Extractors
### Obama Born Extractor
This simple extractor finds documents that match a regex for the phrase "obama born in".

Test locally using:
```
python obama_born_extractor.py --conf-path mrjob.conf --no-output --output-dir out input/test-1.wet
```
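
To illustrate the idea behind this extractor, here is a minimal plain-Python sketch of the map and reduce steps. It is not the code in `obama_born_extractor.py` and does not use mrjob or read WET files. The pattern `OBAMA_BORN`, the functions `mapper`, `reducer` and `run`, and the `(url, text)` input shape are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern: the README only says the regex specifies "obama born in".
OBAMA_BORN = re.compile(r"obama\s+born\s+in", re.IGNORECASE)


def mapper(url, text):
    """Emit (url, 1) for every regex match in one document's text."""
    for _ in OBAMA_BORN.finditer(text):
        yield url, 1


def reducer(pairs):
    """Sum the counts per URL, as a MapReduce reduce step would."""
    totals = Counter()
    for url, count in pairs:
        totals[url] += count
    return dict(totals)


def run(documents):
    """Run the map and reduce steps over an iterable of (url, text) pairs."""
    pairs = []
    for url, text in documents:
        pairs.extend(mapper(url, text))
    return reducer(pairs)


if __name__ == "__main__":
    docs = [
        ("http://example.com/a", "Some pages say Obama born in Hawaii."),
        ("http://example.com/b", "Nothing relevant here."),
    ]
    print(run(docs))
```

Documents without a match emit nothing in the map step, so only matching URLs appear in the result. In a real job the `(url, text)` pairs would come from the WET records rather than an in-memory list.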