https://github.com/mozilla-services/data-pipeline

Mozilla Services Data Pipeline

Host: GitHub
URL: https://github.com/mozilla-services/data-pipeline
Owner: mozilla-services
License: mpl-2.0
Archived: true
Created: 2015-01-15T17:34:58.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2019-03-28T12:10:32.000Z (over 6 years ago)
Last Synced: 2024-11-04T05:33:47.492Z (8 months ago)
Language: Lua
Size: 14.3 MB
Stars: 30
Watchers: 37
Forks: 22
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md


# Mozilla Services Data Pipeline

This repository contains the extra bits and pieces needed to build heka
for use in the [Cloud Services Data Pipeline](https://wiki.mozilla.org/CloudServices/DataPipeline).

Visit us on irc.mozilla.org in `#datapipeline`.

## Building a Data Pipeline RPM

Run `bash bin/build_pipeline_heka.sh` from the top level of this repo to build a heka RPM.
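
Before running the build, it can help to confirm that the two dependencies this README lists in its build steps (OpenSSL v1.0+ and libpq) are present. The following is a minimal sketch and not part of the repo: `openssl_version_ok` and `preflight` are hypothetical helper names, and treating the `openssl` and `pg_config` commands as a proxy for the development files is an assumption. The build script may detect these libraries differently.

```shell
# Preflight sketch (hypothetical helpers, not shipped with this repo).

# Succeed when a version banner such as "OpenSSL 1.0.2k-fips  26 Jan 2017"
# reports major version 1 or higher, which covers "v1.0+".
openssl_version_ok() {
  version=$(printf '%s\n' "$1" | awk '{print $2}')
  major=${version%%.*}
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # banner did not contain a numeric major version
  esac
  [ "$major" -ge 1 ]
}

# Report on both dependencies; return non-zero if either looks missing.
preflight() {
  status=0
  if command -v openssl >/dev/null 2>&1 && openssl_version_ok "$(openssl version)"; then
    echo "ok: $(openssl version)"
  else
    echo "missing: OpenSSL 1.0 or newer" >&2
    status=1
  fi
  # Assumption: pg_config is installed alongside the libpq development files.
  if command -v pg_config >/dev/null 2>&1; then
    echo "ok: $(pg_config --version)"
  else
    echo "missing: pg_config not found (libpq development files?)" >&2
    status=1
  fi
  return "$status"
}
```

Sourcing the snippet and running `preflight && bash bin/build_pipeline_heka.sh` would start the build only when both checks pass.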

## Using the Data Pipeline

If you just want to test some data analysis plugins and don't want to set up your own pipeline, here is the fastest way to get going:
https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Using+the+sandbox+manager+in+the+prod+prototype+pipeline

## Running/Testing Your Own Data Pipeline

You can set up a bare-bones data pipeline of your own. You will get an endpoint that listens for HTTP POST requests, performs GeoIP lookups on them, and wraps them up in protobuf messages. These messages are relayed to a stream processor and written to a local store on disk. You also get basic web-based monitoring and the ability to add your own stream processing filters.

1. Clone this data-pipeline GitHub repo

```
git clone https://github.com/mozilla-services/data-pipeline.git
```

2. Build and configure heka. If you are unable to build heka, drop by `#datapipeline` on irc.mozilla.org and we will try to provide you with a pre-built version.
    1. Make sure you have the dependencies installed:
        1. OpenSSL v1.0+ (required by lua_openssl)
        2. libpq, the PostgreSQL API
    2. Run `bash bin/build_pipeline_heka.sh`
    3. Install lua modules

    ```
    mkdir lua_modules
    rsync -av build/heka/build/heka/lib/luasandbox/modules/ lua_modules/
    rsync -av hindsight/modules/ hindsight/io_modules/ lua_modules/
    ```

    4. Procure a `GeoLiteCity.dat` file and put it in the current directory

    ```
    wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
    gunzip GeoLiteCity.dat.gz
    ```

3. Set up the main pipeline using the `examples/basic_local_pipeline.toml` config file. This will listen for HTTP POSTs on port 8080, log the raw and decoded messages to stdout, run the example filter, and output the records to a file.

```
build/heka/build/heka/bin/hekad -config examples/basic_local_pipeline.toml
```

4. Check the monitoring dashboard at [http://localhost:4352](http://localhost:4352)
5. Fire off some test submissions!

```
for f in $(seq 1 20); do
    curl -X POST "http://localhost:8080/submit/test/$f/foo/bar/baz" -d "{\"test\":$f}"
done
```

6. Verify that your data was stored in the output files using the `heka-cat` utility

```
build/heka/build/heka/bin/heka-cat data_raw.out
build/heka/build/heka/bin/heka-cat data_decoded.out
```

7. Experiment with sandbox filters, outputs, and configurations.
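
Steps 5 and 6 above can be wrapped in a few reusable functions for repeated runs. This is a minimal sketch, assuming `hekad` is already running with `examples/basic_local_pipeline.toml` and that the output files are `data_raw.out` and `data_decoded.out` in the current directory, as in step 6. The function names and the `ENDPOINT` variable are illustrative and not part of the repo.

```shell
# Smoke-test sketch for a locally running pipeline (hypothetical helpers).

ENDPOINT="${ENDPOINT:-http://localhost:8080}"

# Print the submission URL for test document number $1.
submit_url() {
  printf '%s/submit/test/%s/foo/bar/baz' "$ENDPOINT" "$1"
}

# Print the JSON body for test document number $1.
submit_body() {
  printf '{"test":%s}' "$1"
}

# POST documents 1..$1 (default 20); stop at the first failed request.
send_batch() {
  count="${1:-20}"
  i=1
  while [ "$i" -le "$count" ]; do
    curl --silent --show-error --fail -X POST \
      "$(submit_url "$i")" -d "$(submit_body "$i")" || return 1
    i=$((i + 1))
  done
}

# Succeed only if both output files exist and are non-empty.
check_outputs() {
  for f in data_raw.out data_decoded.out; do
    [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
  done
}
```

After sourcing the snippet, `send_batch 20 && check_outputs` sends the same twenty submissions as step 5 and then confirms that both files have content. Inspecting the records themselves still requires `heka-cat`.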

### Useful things to know

- GeoIP
    - It’s not terribly interesting to do GeoIP lookups on 127.0.0.1, so you may want to provide a `--header "X-Forwarded-For: 8.8.8.8"` argument to your curl commands. That will force a GeoIP lookup on the specified IP address (Google’s DNS server in this example).
- How to configure namespaces
    - The example config allows submissions to either `/submit/telemetry/docid/more/path/stuff` or `/submit/test/id/and/so/on`
    - You can add more endpoints by modifying the `namespace_config` parameter in `basic_local_pipeline.edge.toml`.
    - The namespace config is more manageable if you keep the JSON in a separate file, and run it through something like `jq -c '.' < my_namespaces.json` before putting it into the toml config.
- Where to get more info about configuring heka
    - http://hekad.readthedocs.org/en/latest/index.html
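
The GeoIP tip above can be folded into a small helper so that each test submission carries a chosen client address. This is a minimal sketch: `forwarded_header` and `submit_from` are hypothetical names, the port and path come from the example config described earlier, and the default address is the 8.8.8.8 example used in this README.

```shell
# Sketch: send one test submission as if it came from a given client IP.

# Print the header line for client IP $1 (defaults to 8.8.8.8).
forwarded_header() {
  printf 'X-Forwarded-For: %s' "${1:-8.8.8.8}"
}

# POST document id $1 to the test namespace, presenting client IP $2.
submit_from() {
  curl --silent --show-error -X POST \
    "http://localhost:8080/submit/test/$1/foo/bar/baz" \
    --header "$(forwarded_header "$2")" \
    -d "{\"test\":$1}"
}
```

For example, `submit_from 1` sends document 1 with the default address, and `submit_from 2 1.1.1.1` sends document 2 with a different one, so the decoded output can be compared across addresses.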
