https://github.com/rhsimplex/commoncrawljob
Extract data from common crawl using elastic map reduce
- Host: GitHub
- URL: https://github.com/rhsimplex/commoncrawljob
- Owner: rhsimplex
- License: apache-2.0
- Created: 2016-04-03T21:02:39.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2016-04-03T21:05:34.000Z (about 10 years ago)
- Last Synced: 2025-01-28T05:27:25.533Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 122 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
Common Crawl Data Extraction
============================
Extract data from Common Crawl using Elastic MapReduce.

Note: This project uses Python 2.7.11.

CommonCrawlJob is a framework that wraps the ``MRJob`` Hadoop streaming library for
analytics over internet-scale data.

For more information, see the `MRJob`_ documentation.
Setup
-----
To develop locally, you will need to install the ``mrjob`` Hadoop
streaming framework and the ``boto`` library for accessing Amazon's
public dataset resources.
Use pip to install these libraries.
.. code:: sh
$ pip install CommonCrawlJob
Getting Started
---------------
To get started, we will build a Google Analytics extractor from start to finish:
a Common Crawl job that uses regular expression capture groups to extract
Google Analytics tracker IDs.
First let's create a file ``GoogleAnalytics.py``.
.. code:: sh
$ touch GoogleAnalytics.py
Using a text editor, write to this file
.. code:: python

    import re

    from ccjob import CommonCrawl


    class GATagJob(CommonCrawl):

        def process_record(self, body):
            # Regular expression for a quoted Google Analytics tracker ID,
            # e.g. 'UA-12345-1'. The first group captures the account number.
            pat = re.compile(r"[\"\']UA-(\d+)-(\d+)[\'\"]")
            for match in pat.finditer(body):
                yield match.group(1)
            # Count each document once, after all of its matches are emitted.
            self.increment_counter('commoncrawl', 'processed_document', 1)


    if __name__ == '__main__':
        GATagJob.run()

Our ``GATagJob`` class has one method, ``process_record``, which takes one argument containing
the body of an HTML file and yields the results matching our regular expression.
All Common Crawl jobs will generally follow this pattern.
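
The capture-group logic can be tried without a cluster. The following is a
minimal sketch in plain Python, independent of ``ccjob`` and ``mrjob``; the
helper name ``extract_ga_ids`` and the sample HTML are made up for illustration.

.. code:: python

    import re

    # Same idea as the pattern in GATagJob: a quoted "UA-<account>-<property>" string.
    GA_PATTERN = re.compile(r"[\"']UA-(\d+)-(\d+)[\"']")


    def extract_ga_ids(body):
        """Return the account number (first capture group) of each tracker ID in body."""
        return [match.group(1) for match in GA_PATTERN.finditer(body)]


    # Hypothetical page fragment with one single-quoted and one double-quoted ID.
    sample_html = """
    <script>
      ga('create', 'UA-12345-1', 'auto');
      _gaq.push(['_setAccount', "UA-67890-22"]);
    </script>
    """

    print(extract_ga_ids(sample_html))

Only the first group is returned here, mirroring what the job yields; returning
``match.groups()`` instead would give both the account and property numbers.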
Testing Locally
---------------
Run the Google Analytics extractor locally to test your script.
.. code:: sh
$ python GoogleAnalytics.py -r local <(tail -n 1 data/latest.txt)
Region Configuration
--------------------
For best performance, you should launch the cluster in the same region
as your data. Currently data from `aws-publicdatasets`_ are stored in
``us-east-1``, which is where you want to point your EMR cluster.
Common Crawl Region
-------------------
:S3: US Standard
:EMR: US East (N. Virginia)
:API: ``us-east-1``
Create an Amazon EC2 Key Pair and PEM File
------------------------------------------
Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair
to ensure that you alone have access to the instances that you launch.
The PEM file associated with this key pair is required to ssh directly to the master node of the cluster.
To create an Amazon EC2 key pair:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Go to the Amazon EC2 console
2. In the Navigation pane, click Key Pairs
3. On the Key Pairs page, click Create Key Pair
4. In the Create Key Pair dialog box, enter a name for your key pair, such as ``mykeypair``
5. Click Create
6. Save the resulting PEM file in a safe location
Configuring ``mrjob.conf``
--------------------------
Make sure to download an EC2 Key Pair ``pem`` file for your map reduce
job. Set ``ec2_key_pair`` to the name of the key pair and
``ec2_key_pair_file`` to the path of the ``pem`` file.
Make sure that the ``PEM`` file has permissions set properly by running
.. code:: sh
$ chmod 600 $MY_PEM_FILE
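
Owner-only read/write permissions (mode ``600``) can also be set and checked
from Python. This is a minimal sketch using only the standard library; the
function name ``restrict_pem`` is hypothetical, and a POSIX file system is assumed.

.. code:: python

    import os
    import stat


    def restrict_pem(path):
        """Set owner read/write only (mode 600) and return the resulting mode bits."""
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
        return stat.S_IMODE(os.stat(path).st_mode)

Comparing the return value against ``0o600`` confirms that the change took effect.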
Download the Python 2.7.11 source tarball to send to your EMR instances.
.. code:: sh
$ wget https://www.python.org/ftp/python/2.7.11/Python-2.7.11.tgz
Create a ``mrjob.conf`` file to set up your configuration parameters to match
that of AWS.
There is a default configuration template located at ``mrjob.conf.template`` that you can use.
.. code:: yaml

    runners:
      emr:
        aws_region: 'us-east-1'
        aws_access_key_id:
        aws_secret_access_key:
        cmdenv:
          AWS_ACCESS_KEY_ID:
          AWS_SECRET_ACCESS_KEY:
        ec2_key_pair:
        ec2_key_pair_file:
        ssh_tunnel_to_job_tracker: true
        ec2_instance_type: 'm1.xlarge'
        ec2_master_instance_type: 'm1.xlarge'
        emr_tags:
          name: ''
        num_ec2_instances: 12
        ami_version: '2.4.10'
        python_bin: python2.7
        interpreter: python2.7
        bootstrap_action:
          - s3://elasticmapreduce/bootstrap-actions/install-ganglia
        upload_files:
          - CommonCrawl.py
        bootstrap:
          - tar xfz Python-2.7.11.tgz#
          - cd Python-2.7.11
          - ./configure && make && sudo make install
          - sudo python2.7 get-pip.py#
          - sudo pip2 install --upgrade pip setuptools wheel
          - sudo pip2 install -r requirements.txt#

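
Before launching a cluster it can help to check that the blank fields in the
configuration have been filled in. The sketch below assumes the file has already
been parsed into a dictionary (for example with a YAML loader such as PyYAML's
``yaml.safe_load``); ``check_emr_config`` and ``REQUIRED_KEYS`` are hypothetical
names and are not part of ``mrjob``.

.. code:: python

    # Settings that the template leaves blank and that must be filled in by hand.
    REQUIRED_KEYS = (
        "aws_access_key_id",
        "aws_secret_access_key",
        "ec2_key_pair",
        "ec2_key_pair_file",
    )


    def check_emr_config(config):
        """Return a list of problems found in the ``emr`` runner section."""
        emr = (config.get("runners") or {}).get("emr") or {}
        problems = ["%s is empty" % key for key in REQUIRED_KEYS if not emr.get(key)]
        if emr.get("aws_region") != "us-east-1":
            problems.append("aws_region should be us-east-1 to match the data")
        return problems


    # Stand-in for the parsed template: blank YAML values are loaded as None.
    template = {
        "runners": {
            "emr": {
                "aws_region": "us-east-1",
                "aws_access_key_id": None,
                "aws_secret_access_key": None,
                "ec2_key_pair": None,
                "ec2_key_pair_file": None,
            }
        }
    }

    for problem in check_emr_config(template):
        print(problem)

An empty list means every required field has a value and the region matches the
one recommended above.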
Run on Amazon Elastic MapReduce
-------------------------------
First, copy ``mrjob.conf.template`` to ``mrjob.conf``.

Note: Make sure to fill in the necessary AWS credentials with your own
information.
.. code:: sh
$ python GoogleAnalytics.py -r emr \
--conf-path="mrjob.conf" \
--output-dir="s3n://$S3_OUTPUT_BUCKET" \
data/arcindex.txt
.. _MRJob: https://pythonhosted.org/mrjob/
.. _aws-publicdatasets: https://aws.amazon.com/public-data-sets/