{"id":22546611,"url":"https://github.com/prx/analytics-ingest-lambda","last_synced_at":"2025-03-28T08:45:56.972Z","repository":{"id":38862870,"uuid":"82971433","full_name":"PRX/analytics-ingest-lambda","owner":"PRX","description":"(will think of a better name later)","archived":false,"fork":false,"pushed_at":"2025-03-05T15:19:33.000Z","size":665,"stargazers_count":0,"open_issues_count":9,"forks_count":0,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-03-05T16:29:29.596Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PRX.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-02-23T21:06:21.000Z","updated_at":"2025-03-05T15:19:34.000Z","dependencies_parsed_at":"2024-08-12T16:00:42.814Z","dependency_job_id":"13cb5db9-2d74-49f8-be6c-d879a5e2f828","html_url":"https://github.com/PRX/analytics-ingest-lambda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRX%2Fanalytics-ingest-lambda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRX%2Fanalytics-ingest-lambda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRX%2Fanalytics-ingest-lambda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRX%2Fanalytics-ingest-lambda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PRX","download_ur
l":"https://codeload.github.com/PRX/analytics-ingest-lambda/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245999320,"owners_count":20707554,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-07T15:08:21.926Z","updated_at":"2025-03-28T08:45:56.956Z","avatar_url":"https://github.com/PRX.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PRX Metrics Ingest\n\nLambda to process metrics data coming from one or more kinesis streams, and\nsend that data to multiple destinations.\n\n# Description\n\nThe lambda subscribes to kinesis streams, containing metric event records. These\nvarious metric records are either recognized by an input source in `lib/inputs`,\nor ignored and logged as a warning at the end of the lambda execution.\n\nBecause of differences in retry logic, this repo\nis actually deployed as **3 different lambdas**, subscribed to one or more kinesis streams.\n\n## BigQuery\n\nRecords with type `postbytes` will be parsed\ninto BigQuery table formats, and inserted into their corresponding BigQuery\ntables in parallel. This is called [streaming inserts](https://cloud.google.com/bigquery/streaming-data-into-bigquery),\nand in case the insert fails, it will be attempted 2 more times before the Lambda\nfails with an error. 
And since each insert includes a unique `insertId`, we\ndon't have any data consistency issues with re-running the inserts.\n\nBigQuery now supports partitioning based on a [specific timestamp field](https://cloud.google.com/bigquery/docs/partitioned-tables#partitioned_tables),\nso any inserts streamed to a table will be automatically moved to the correct\ndaily partition.\n\n## Pingbacks\n\nRecords with type `postbytes` and an `impressions[]` array will POST those\nimpression counts to the [Dovetail Router](https://github.com/PRX/dovetail-router.prx.org)\nFlight Increments API, at `/api/v1/flight_increments/:date`. This gives some\nsemblance of live flight-impression counts so we can stop serving flights as\nclose to their goals as possible.\n\nAdditionally, records with a special `impressions[].pings` array will be pinged via\nan HTTP GET. This \"ping\" does follow redirects, but expects to land on a 200\nresponse afterwards. Although 500 errors will be retried internally in the\ncode, a ping that still errors or times out will be allowed to fail.\n\nUnlike BigQuery, these operations are not idempotent, so we don't want to\nover-ping a url. All errors will be handled internally so Kinesis doesn't\nattempt to re-execute the batch of records.\n\n### URI Templates\n\nPingback urls should be valid [RFC 6570](https://tools.ietf.org/html/rfc6570) URI\ntemplates. 
Valid parameters are:\n\n| Parameter Name    | Description                                                                                     |\n| ----------------- | ----------------------------------------------------------------------------------------------- |\n| `ad`              | Ad id (intersection of creative and flight)                                                     |\n| `agent`           | Requester user-agent string                                                                     |\n| `agentmd5`        | An md5'd user-agent string                                                                      |\n| `episode`         | Feeder episode guid                                                                             |\n| `campaign`        | Campaign id                                                                                     |\n| `creative`        | Creative id                                                                                     |\n| `flight`          | Flight id                                                                                       |\n| `ip`              | Request ip address                                                                              |\n| `ipmask`          | Masked ip, with the last octet changed to 0s                                                    |\n| `listener`        | Unique string for this \"listener\"                                                               |\n| `listenerepisode` | Unique string for \"listener + url\"                                                              |\n| `podcast`         | Feeder podcast id                                                                               |\n| `randomstr`       | Random string                                                                                   |\n| `randomint`       | Random integer                                                                                  |\n| `referer`         | Requester 
http referer                                                                          |\n| `timestamp`       | Epoch milliseconds of request                                                                   |\n| `url`             | Full url of request, including host and query parameters, but _without_ the protocol `https://` |\n\n## DynamoDB\n\nWhen a listener requests an episode from [Dovetail Router](https://github.com/PRX/dovetail-router.prx.org),\nit will emit kinesis records of type `antebytes`, meaning\nthe bytes haven't been downloaded yet. These records are inserted into DynamoDB,\nand saved until the CDN-bytes are actually downloaded.\n\nThis lambda also picks up type `bytes` and `segmentbytes` records, meaning that\nthe [dovetail-counts-lambda](https://github.com/PRX/dovetail-counts-lambda) has\ndecided enough of the segment/file-as-a-whole has been downloaded to be counted.\n\nAs both of those records are keyed by the `\u003clistener_episode\u003e.\u003cdigest\u003e` of the\nrequest, we avoid a race condition by waiting for _both_ to be present before\nlogging the real download/impressions. Some example DynamoDB data:\n\n```\n+-----------+-----------------------+-------------------------+\n| id        | payload               | segments                |\n+-----------+-----------------------+-------------------------+\n| 1234.abcd | \u003cbinary gzipped json\u003e | 1624299980 1624299942.2 |\n| 1234.efgh |                       | 1624300094.1            |\n| 5678.efgh | \u003cbinary gzipped json\u003e |                         |\n+-----------+-----------------------+-------------------------+\n```\n\nThe `segments` [String Set](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes)\ncontains the epoch `timestamp` that came in on each `bytes` or `segmentbytes`\nrecord (the time the bytes were actually downloaded from the CDN), optionally\nfollowed by a `.` and the segment number. 
This field acts as a gatekeeper, so we\nnever double-count the same `bytes/segmentbytes` on the same UTC day.\n\n(**NOTE:** a single `antebytes` record _could_ legally be counted twice on 2\ndifferent UTC days, if the listener downloaded the episode from the CDN twice\njust before and after midnight).\n\nOnce we decide to count a segment impression or overall download, the original\n`antebytes` is unzipped from the `payload`, we change the type of the record\nto `postbytes` and the timestamp to match when the CDN bytes were downloaded,\nthen re-emit the record to kinesis.\n\nThese `postbytes` records are then processed by the previous 2 lambdas.\n\n## Frequency Impressions\n\nRecords with type `postbytes` will have their impressions looked at and if\nthere is a frequency cap, then the impression will be recorded to DynamoDB\nto allow Dovetail Router to check how many impressions exist already for this\ncampaign and listener.\n\n# Installation\n\nTo get started, just run `yarn`.\n\n## Unit Tests\n\nAnd hey, to just run the unit tests locally, you don't need anything! Just\n`yarn test` to your heart's content.\n\nThere are some dynamodb tests that use an actual table, and will be skipped. To\nalso run these, set `TEST_DDB_TABLE` and `TEST_DDB_ROLE` to something in AWS you\nhave access to.\n\n## Integration Tests\n\nThe integration test simply runs the lambda function against a test-event (the\nsame way you might in the lambda web console), and outputs the result.\n\nCopy `env-example` to `.env`, and fill in your information. 
Now when you run\n`yarn start`, you should see the test event run 3 times, and do some work for\nall of the lambda functions.\n\n## BigQuery\n\nTo enable BigQuery inserts, you'll need to first [create a Google Cloud Platform Project](https://cloud.google.com/resource-manager/docs/creating-managing-projects),\ncreate a BigQuery dataset, and create the tables referenced by your `lib/inputs`.\nSorry -- no help on creating the correct table schema yet!\n\nThen [create a Service Account](https://developers.google.com/identity/protocols/OAuth2ServiceAccount#creatinganaccount) for this app. Make sure it has BigQuery Data Editor permissions.\n\n## DynamoDB\n\nTo enable DynamoDB gets/writes, you'll need to set up a [DynamoDB table](https://docs.aws.amazon.com/dynamodb/index.html#lang/en_us)\nthat your account has access to. You can use your local AWS CLI credentials, or\nset up AWS client/secret environment variables.\n\nYou can also optionally access a DynamoDB table in a different account by specifying\na `DDB_ROLE` that the lambda should assume while doing gets/writes.\n\n# Deployment\n\nThe 3 lambda functions are deployed via a CloudFormation stack in the [Infrastructure repo](https://github.com/PRX/Infrastructure/blob/master/stacks/apps/dovetail-analytics.yml):\n\n- `AnalyticsBigqueryFunction` - insert downloads/impressions into BigQuery\n- `AnalyticsPingbacksFunction` - increment flight impressions and send 3rd-party pingbacks\n- `AnalyticsDynamoDbFunction` - temporary store for IAB-compliant downloads\n\n# Docker\n\nThis repo is now dockerized!\n\n```\ndocker-compose build\ndocker-compose run test\ndocker-compose run start\n```\n\nAnd you can easily-ish get the lambda zip built by the Dockerfile:\n\n```\ndocker ps -a | grep analyticsingestlambda\ndocker cp {{container-id-here}}:/app/build.zip myzipfile.zip\nunzip -l myzipfile.zip\n```\n\n# License\n\n[AGPL 
License](https://www.gnu.org/licenses/agpl-3.0.html)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprx%2Fanalytics-ingest-lambda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprx%2Fanalytics-ingest-lambda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprx%2Fanalytics-ingest-lambda/lists"}