# PRX Metrics Ingest

A Lambda that processes metrics data coming from one or more Kinesis streams
and sends that data to multiple destinations.

# Description

The Lambda subscribes to Kinesis streams containing metric event records. Each
record is either recognized by an input source in `lib/inputs`, or ignored and
logged as a warning at the end of the Lambda execution.

Because of differences in retry logic, this repo
is actually deployed as **3 different lambdas**, each subscribed to one or more Kinesis streams.

## BigQuery

Records with type `postbytes` are parsed
into BigQuery table formats, and inserted into their corresponding BigQuery
tables in parallel. This uses [streaming inserts](https://cloud.google.com/bigquery/streaming-data-into-bigquery);
if an insert fails, it is attempted 2 more times before the Lambda
fails with an error. Since each insert includes a unique `insertId`,
re-running the inserts doesn't cause any data consistency issues.
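
For a rough idea of what such an insert looks like, here is a minimal sketch
using the `@google-cloud/bigquery` Node client with per-row `insertId`s; the
dataset, table, and row field names are hypothetical, not the repo's actual
schema:

```
const { BigQuery } = require('@google-cloud/bigquery');

// Sketch only: insert rows with deduplicating insertIds, so a retried
// insert is deduplicated by BigQuery instead of double-counted.
async function insertRows(rows) {
  const bigquery = new BigQuery();
  const rawRows = rows.map((r) => ({ insertId: r.requestUuid, json: r })); // requestUuid is hypothetical
  await bigquery
    .dataset('dovetail') // hypothetical dataset name
    .table('dt_downloads') // hypothetical table name
    .insert(rawRows, { raw: true }); // raw lets each row carry its own insertId
}
```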

BigQuery now supports partitioning based on a [specific timestamp field](https://cloud.google.com/bigquery/docs/partitioned-tables#partitioned_tables),
so any inserts streamed to a table will be automatically moved to the correct
daily partition.

## Pingbacks

Records with type `postbytes` and an `impressions[]` array will POST those
impression counts to the [Dovetail Router](https://github.com/PRX/dovetail-router.prx.org)
Flight Increments API, at `/api/v1/flight_increments/:date`. This gives some
semblance of live flight-impression counts so we can stop serving flights as
close to their goals as possible.

Additionally, records with a special `impressions[].pings` array will be pinged via
an HTTP GET. This "ping" follows redirects, but expects to land on a 200
response at the end. Although 500 errors are retried internally in the
code, any ping that still errors out or times out is allowed to fail.

Unlike BigQuery, these operations are not idempotent, so we don't want to
over-ping a URL. All errors are handled internally so Kinesis doesn't
attempt to re-execute the batch of records.
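
A rough sketch of that ping behavior, using the global `fetch` available in
Node 18+ (the repo's actual HTTP client and retry counts may differ):

```
const MAX_ATTEMPTS = 3; // illustrative retry count

async function ping(url) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      // fetch follows redirects by default; check the final response status
      const res = await fetch(url, { redirect: 'follow' });
      if (res.status === 200) {
        return true;
      }
      if (res.status < 500) {
        break; // non-5xx failure: don't bother retrying
      }
    } catch (err) {
      // network error or timeout: fall through and retry
    }
  }
  console.warn(`failed to ping: ${url}`); // log it, but never throw
  return false;
}
```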

### URI Templates

Pingback URLs should be valid [RFC 6570](https://tools.ietf.org/html/rfc6570) URI
templates. Valid parameters are:

| Parameter Name | Description |
| ----------------- | ----------------------------------------------------------------------------------------------- |
| `ad` | Ad id (intersection of creative and flight) |
| `agent` | Requester user-agent string |
| `agentmd5` | An md5'd user-agent string |
| `episode` | Feeder episode guid |
| `campaign` | Campaign id |
| `creative` | Creative id |
| `flight` | Flight id |
| `ip` | Request ip address |
| `ipmask` | Masked ip, with the last octet changed to 0s |
| `listener` | Unique string for this "listener" |
| `listenerepisode` | Unique string for "listener + url" |
| `podcast` | Feeder podcast id |
| `randomstr` | Random string |
| `randomint` | Random integer |
| `referer` | Requester http referer |
| `timestamp` | Epoch milliseconds of request |
| `url` | Full url of request, including host and query parameters, but _without_ the protocol `https://` |
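
For illustration, a template might be expanded like the following sketch,
using the `uri-templates` npm package (the library the repo actually uses,
and the parameter values, may differ):

```
const uriTemplates = require('uri-templates');

// expand an RFC 6570 template with some of the parameters listed above
const template = uriTemplates(
  'https://tracker.example.com/ping{?podcast,episode,campaign,timestamp}'
);
const url = template.fill({
  podcast: '70',
  episode: 'some-episode-guid',
  campaign: '1234',
  timestamp: '1624299980000',
});
// => https://tracker.example.com/ping?podcast=70&episode=some-episode-guid&campaign=1234&timestamp=1624299980000
```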

## DynamoDB

When a listener requests an episode from [Dovetail Router](https://github.com/PRX/dovetail-router.prx.org),
it emits Kinesis records of type `antebytes`, meaning
the bytes haven't been downloaded yet. These records are inserted into DynamoDB,
and saved until the CDN-bytes are actually downloaded.

This lambda also picks up type `bytes` and `segmentbytes` records, meaning that
the [dovetail-counts-lambda](https://github.com/PRX/dovetail-counts-lambda) has
decided enough of the segment/file-as-a-whole has been downloaded to be counted.

As both of those records are keyed by the same request id (the `id` column in
the example below), we avoid a race condition by waiting for _both_ to be
present before logging the real downloads/impressions. Some example DynamoDB data:

```
+-----------+-----------------------+-------------------------+
| id        | payload               | segments                |
+-----------+-----------------------+-------------------------+
| 1234.abcd |                       | 1624299980 1624299942.2 |
| 1234.efgh |                       | 1624300094.1            |
| 5678.efgh |                       |                         |
+-----------+-----------------------+-------------------------+
```

The `segments` [String Set](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes)
contains the epoch `timestamp` that came in on each `bytes` or `segmentbytes`
record (the time the bytes were actually downloaded from the CDN), optionally
followed by a `.` and the segment number. This field acts as a gatekeeper, so we
never double-count the same `bytes`/`segmentbytes` on the same UTC day.

(**NOTE:** a single `antebytes` record _could_ legally be counted twice on 2
different UTC days, if the listener downloaded the episode from the CDN twice
just before and after midnight).
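
As a simplified sketch of that gatekeeper, the new timestamp can be added to
the string set atomically, and the previous values used to decide whether this
segment was already counted on the same UTC day. The table name, helper, and
exact logic here are illustrative (AWS SDK v3):

```
const { DynamoDBClient, UpdateItemCommand } = require('@aws-sdk/client-dynamodb');

// "1624299942.2" -> { day: "2021-06-21", segment: "2" }
function parseSegment(value) {
  const [epoch, segment] = value.split('.');
  const day = new Date(parseInt(epoch, 10) * 1000).toISOString().slice(0, 10);
  return { day, segment };
}

async function addSegment(id, value) {
  const client = new DynamoDBClient({});

  // atomically ADD the new value and get back what was already in the set
  const result = await client.send(
    new UpdateItemCommand({
      TableName: process.env.DDB_TABLE, // illustrative env var
      Key: { id: { S: id } },
      UpdateExpression: 'ADD segments :ts',
      ExpressionAttributeValues: { ':ts': { SS: [value] } },
      ReturnValues: 'UPDATED_OLD',
    })
  );

  // count only if no previous value was for the same segment on the same UTC day
  const previous = result.Attributes?.segments?.SS || [];
  const current = parseSegment(value);
  return !previous.some((prev) => {
    const p = parseSegment(prev);
    return p.segment === current.segment && p.day === current.day;
  });
}
```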

Once we decide to count a segment impression or overall download, the original
`antebytes` record is unzipped from the `payload`, its type is changed to
`postbytes` and its timestamp set to match when the CDN bytes were downloaded,
and the record is re-emitted to Kinesis.

These `postbytes` records are then processed by the previous 2 lambdas.
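
That re-emit step might look roughly like the following sketch; the stream
name, partition key, and field names are assumptions for illustration (AWS SDK v3):

```
const zlib = require('zlib');
const { KinesisClient, PutRecordCommand } = require('@aws-sdk/client-kinesis');

async function reemitPostbytes(gzippedPayload, downloadedAtMs) {
  // unzip the original antebytes record and rewrite its type/timestamp
  const record = JSON.parse(zlib.gunzipSync(gzippedPayload).toString('utf-8'));
  record.type = 'postbytes';
  record.timestamp = downloadedAtMs; // when the CDN bytes were downloaded

  // put it back onto Kinesis for the BigQuery/Pingbacks lambdas to process
  const kinesis = new KinesisClient({});
  await kinesis.send(
    new PutRecordCommand({
      StreamName: process.env.KINESIS_STREAM, // illustrative env var
      PartitionKey: record.listenerEpisode, // assumed field name
      Data: Buffer.from(JSON.stringify(record)),
    })
  );
}
```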

## Frequency Impressions

Records with type `postbytes` have their impressions examined; if a frequency
cap applies, the impression is recorded to DynamoDB so that Dovetail Router
can check how many impressions already exist for this campaign and listener.

# Installation

To get started, first run `yarn`. Then run `yarn dbs` to download the
remote datacenter IP lists, and domain threat lists.

## Unit Tests

And hey, to just run the unit tests locally, you don't need anything! Just
`yarn test` to your heart's content.

There are some DynamoDB tests that use an actual table, and will be skipped by
default. To also run these, set `TEST_DDB_TABLE` and `TEST_DDB_ROLE` to
something in AWS you have access to.

## Integration Tests

The integration test simply runs the lambda function against a test-event (the
same way you might in the lambda web console), and outputs the result.

Copy `env-example` to `.env`, and fill in your information. Now when you run
`yarn start`, you should see the test event run 3 times, and do some work for
all of the lambda functions.

## BigQuery

To enable BigQuery inserts, you'll need to first [create a Google Cloud Platform Project](https://cloud.google.com/resource-manager/docs/creating-managing-projects),
create a BigQuery dataset, and create the tables referenced by your `lib/inputs`.
Sorry -- no help on creating the correct table schemas yet!

Then [create a Service Account](https://developers.google.com/identity/protocols/OAuth2ServiceAccount#creatinganaccount) for this app. Make sure it has BigQuery Data Editor permissions.

## DynamoDB

To enable DynamoDB gets/writes, you'll need to set up a [DynamoDB table](https://docs.aws.amazon.com/dynamodb/index.html#lang/en_us)
that your account has access to. You can use your local AWS CLI credentials, or
set up AWS client/secret environment variables.

You can also optionally access a DynamoDB table in a different account by specifying
a `DDB_ROLE` that the lambda should assume while doing gets/writes.
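
A minimal sketch of what that cross-account client setup might look like,
using the AWS SDK v3 credential providers (the repo's actual SDK version and
wiring may differ; only the `DDB_ROLE` variable name comes from this README):

```
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { fromTemporaryCredentials } = require('@aws-sdk/credential-providers');

function dynamoClient() {
  if (process.env.DDB_ROLE) {
    // assume the cross-account role for all gets/writes
    return new DynamoDBClient({
      credentials: fromTemporaryCredentials({
        params: { RoleArn: process.env.DDB_ROLE },
      }),
    });
  }
  return new DynamoDBClient({}); // fall back to default credentials
}
```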

# Deployment

The 3 lambda functions are deployed via a Cloudformation stack in the [Infrastructure repo](https://github.com/PRX/Infrastructure/blob/master/stacks/apps/dovetail-analytics.yml):

- `AnalyticsBigqueryFunction` - insert downloads/impressions into BigQuery
- `AnalyticsPingbacksFunction` - increment flight impressions and 3rd-party pingbacks
- `AnalyticsDynamoDbFunction` - temporary store for IAB compliant downloads

# Docker

This repo is now dockerized! You'll need some read-only S3 credentials in your
`.env` file for the `bin/getdatacenters.js` script to succeed during build:

```
docker-compose build
docker-compose run test
docker-compose run start
```

And you can easily-ish get the lambda zip built by the Dockerfile:

```
docker ps -a | grep analyticsingestlambda
docker cp {{container-id-here}}:/app/build.zip myzipfile.zip
unzip -l myzipfile.zip
```

# Datacenter Updates

Periodically, datacenter IP ranges should be updated in the S3 bucket they're stored in.

The CSV from [ipcat](https://github.com/client9/ipcat) (or [this fork](https://github.com/growlfm/ipcat))
can be pasted in as-is.

The list [from Amazon](https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html) contains
overlapping CIDRs, so you'll need to combine those to be compatible with `prx-ip-filter`.
(And we don't need the IPv4 ranges from them, as ipcat already has them).

```
wget https://ip-ranges.amazonaws.com/ip-ranges.json
jq '.ipv6_prefixes[].ipv6_prefix' ip-ranges.json -r > ip-ranges.csv
# pip install netaddr
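# merge overlapping prefixes into non-overlapping CIDRs, each labeled "Amazon AWS"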
cat ip-ranges.csv | python -c "exec(\"import sys\nfrom netaddr import *\ndata = sys.stdin.readlines()\nif len(data) == 1:\n data = data[0].split()\nnets = IPSet(data)\nfor cidr in nets.iter_cidrs(): print(f'{cidr},Amazon AWS')\")" > datacenters.awsv6.csv
```

# License

[AGPL License](https://www.gnu.org/licenses/agpl-3.0.html)