Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mbta/data_platform
- Host: GitHub
- URL: https://github.com/mbta/data_platform
- Owner: mbta
- License: mit
- Created: 2021-11-30T15:32:34.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-08-06T13:12:24.000Z (5 months ago)
- Last Synced: 2024-08-06T20:17:26.132Z (5 months ago)
- Language: Elixir
- Size: 618 KB
- Stars: 2
- Watchers: 12
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Local
### Setup
Run the following:
```sh
asdf plugin-add adr-tools
asdf plugin-add elixir
asdf plugin-add erlang
asdf plugin-add java
asdf plugin-add poetry
asdf plugin-add python
asdf plugin-add terraform
asdf install
```
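To sanity-check the install, `asdf current` should report an installed version for each tool pinned in the repo's `.tool-versions`:

```sh
# verify that every asdf plugin resolved to an installed version
asdf current
```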
### Environment
**Note:** Some local but sensitive information is stored in the 'App: Data Platform' 1Password vault.
Please copy `.env.template` to `.env` and make the following updates.
Replace `{s3_bucket}` with an S3 bucket you have access to. The Data Platform team has a default one; feel free to ask what it is and how to get access to it.
Replace `{username}` with your AWS username, e.g. `ggjura`.
```
# buckets
S3_BUCKET_OPERATIONS={s3_bucket}
S3_BUCKET_INCOMING={s3_bucket}
S3_BUCKET_ARCHIVE={s3_bucket}
S3_BUCKET_ERROR={s3_bucket}
S3_BUCKET_SPRINGBOARD={s3_bucket}
# prefixes
S3_BUCKET_PREFIX_OPERATIONS={username}/operations/
S3_BUCKET_PREFIX_INCOMING={username}/incoming/
S3_BUCKET_PREFIX_ARCHIVE={username}/archive/
S3_BUCKET_PREFIX_ERROR={username}/error/
S3_BUCKET_PREFIX_SPRINGBOARD={username}/springboard/
```
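As a minimal sketch, the substitutions can be scripted; the bucket name and username below are placeholders, not real values:

```sh
# copy the template, then fill in the placeholders
# (replace the example values with your own bucket and username)
cp .env.template .env
sed -i.bak \
  -e 's/{s3_bucket}/your-s3-bucket/g' \
  -e 's/{username}/ggjura/g' \
  .env
```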
If you have set up local infrastructure (see [the terraform README](https://github.com/mbta/data_platform/blob/main/terraform/README.md)), you can update the following accordingly.
**Note:** Setting this configuration is NOT required.
```
# glue
GLUE_DATABASE_INCOMING={username}_incoming
GLUE_DATABASE_SPRINGBOARD={username}_springboard
GLUE_JOB_CUBIC_INGESTION_INGEST_INCOMING={username}_cubic_ingestion_ingest_incoming
```
For the following, the Data Platform team will need to provide you with the `{dmap_base_url}` and `{dmap_api_key}`.
**Note:** Setting this configuration is NOT required.
```
# cubic dmap
CUBIC_DMAP_BASE_URL={dmap_base_url}
CUBIC_DMAP_API_KEY={dmap_api_key}
```
### Docker
To build and stand up the database and Glue containers:
```sh
# start docker, and then
docker-compose up
```
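Once the containers are up, `docker-compose ps` should list the database and Glue services as running:

```sh
# confirm the containers started successfully
docker-compose ps
```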
To log in to the database:
```sh
# assuming `docker-compose up`
docker exec -it db__local bash
# in docker bash
psql -U postgres -d data_platform
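# inside psql, \dt lists the tables and \q quits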
```
To run Glue jobs:
```sh
# ex.
docker-compose run --rm glue_3_0__local /glue/bin/gluesparksubmit /data_platform/aws/s3/glue_jobs/{glue_script_name}.py --JOB_NAME {glue_job_name} [--ARGS "..."]
```
### App: ex_cubic_ingestion
Run the following to allow this application to run locally:
```sh
cd ex_cubic_ingestion
mix deps.get
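# the database container (via `docker-compose up`) must be running before migrating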
mix ecto.migrate
```
You should then be able to run the application with:
```sh
iex -S mix
```
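Assuming the standard Mix project layout, the Elixir test suite can likely be run with:

```sh
# from the ex_cubic_ingestion directory
mix test
```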
### App: py_cubic_ingestion
Run the following to allow this application to run locally:
```sh
cd py_cubic_ingestion
poetry install
```
You should then be able to run the application with:
```sh
docker-compose run --rm glue_3_0__local /glue/bin/gluesparksubmit /data_platform/aws/s3/glue_jobs/cubic_ingestion/ingest_incoming.py --JOB_NAME cubic_ingestion_ingest_incoming --ENV "..." --INPUT "..."
```
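The package includes tests (see the folder structure below); assuming they are pytest-based, something like the following should run them, though the exact invocation isn't confirmed here:

```sh
# from the py_cubic_ingestion directory, inside the poetry environment
poetry run pytest
```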
# Folder Structure
### aws
The `s3/` folder within this folder contains the files that will be synced up to S3 during a `glue-python-deploy` CI run. Additionally, `s3/glue_jobs/` contains the Glue jobs' code as it will be run by AWS Glue.
### doc
The `adr/` folder here contains the various architectural decision records made over the course of the Data Platform's development. Further documentation can be found in [Notion](https://www.notion.so/mbta-downtown-crossing/Data-Platform-9f78ea9ad675432c87ab08d6d38280c2).
### docker
Contains Docker files that are used for local development of the Data Platform. These Docker files are separate from the applications that operate various parts of the Data Platform.
### ex_cubic_ingestion
An Elixir application that runs the Cubic Ingestion process. Further documentation can be found in [Notion](https://www.notion.so/mbta-downtown-crossing/Data-Platform-9f78ea9ad675432c87ab08d6d38280c2).
### py_cubic_ingestion
A Python package that holds all of the `cubic_ingestion_ingest_incoming` Glue job code, including tests and package requirements.
### sample_data
Sample data that is similar in structure to what we currently have coming into the 'Incoming' S3 bucket.
### terraform
A space for engineers to create infrastructure that supports local development. See the [README](https://github.com/mbta/data_platform/blob/main/terraform/README.md).
# Links
* [Architecture Designs (Miro)](https://miro.com/app/board/o9J_liWCxTw=/)
* [Notion](https://www.notion.so/mbta-downtown-crossing/Data-Platform-9f78ea9ad675432c87ab08d6d38280c2)