{"id":21958095,"url":"https://github.com/lisad/phaser-example","last_synced_at":"2025-04-23T16:24:07.311Z","repository":{"id":240781300,"uuid":"733686435","full_name":"lisad/phaser-example","owner":"lisad","description":"Example of project that uses the phaser library to collect, merge and transform data","archived":false,"fork":false,"pushed_at":"2025-03-03T02:03:51.000Z","size":667,"stargazers_count":3,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-23T16:24:05.372Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lisad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-12-19T22:30:31.000Z","updated_at":"2025-03-05T00:37:40.000Z","dependencies_parsed_at":"2024-05-22T16:31:06.961Z","dependency_job_id":"9d27327e-39d6-43b8-aec1-edc8afecb24d","html_url":"https://github.com/lisad/phaser-example","commit_stats":null,"previous_names":["lisad/phaser-example"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lisad%2Fphaser-example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lisad%2Fphaser-example/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lisad%2Fphaser-example/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lisad%2Fphaser-example/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/lisad","download_url":"https://codeload.github.com/lisad/phaser-example/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250468580,"owners_count":21435511,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-29T08:59:40.510Z","updated_at":"2025-04-23T16:24:07.298Z","avatar_url":"https://github.com/lisad.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# phaser-example\n\nThis repository has working examples of projects that use the phaser library to collect, merge and transform data.\n\n## To try yourself\n\nSet up (using Python 3):\n\n```\n\n\u003e python3 -m venv venv\n\u003e source venv/bin/activate\n\u003e pip install -r requirements.txt\n\u003e pip install phaser\n\n```    \n\nIf phaser is locally cloned from the phaser repository, install it with `pip install -e ../phaser` (or use the \nappropriate relative directory path)\n\nTo run the bike count pipelines:\n```    \n\u003e cd bikecounts\n\u003e mkdir output\n\u003e python3 -m phaser run boston output  sources/boston_bike_ped_counts.csv\n\u003e python3 -m phaser run seattle output \"sources/Seattle Burke Gilman Trail NE 70th 2024.csv\"\n\u003e python3 -m phaser run seattle output \"sources/Seattle Thomas St Overpass 2024.csv\"\n```\n\nTODO - instructions for the continuous glucose monitoring data\n\n## Desired output for bike data \n\nThe output should have one row per count value per timestamp, with these column values all filled in:\n\n* location_id: location ID of bike/ped 
counter sensor provided by city\n* latitude, longitude\n* count_id: ID provided by city for the day's count values?\n* description: descriptive name of the location of the counter\n* municipality: civil municipality where the counter resides\n* count: number of bikes counted and registered at this time\n* counted_at: timestamp of the count value\n\nStill to add:\n* Timezone\n\n## Overview of Boston pipeline\n\nThe Boston bike and pedestrian count data looks like this: \n\n* BP_LOC_ID: location id\n* LATITUDE, LONGITUDE\n* COUNT_ID: unique identifier for this count record which will contain many values in many columns\n* MUNICIPALITY\n* FACILITY_NAME\n* CNT_LOC_DESCRIPTION\n* CNT_DESCRIPTION\n* TEMPERATURE\n* SKY\n* COUNT_TYPE:  \"B\" for bike, \"P\" for pedestrian, etc.\n* Additional fields describing streets, directions of traffic\n* COUNT_DATE\n* 58 columns named after a time of day, e.g. 'CNT_0630', 'CNT_0645', 'CNT_0700' etc.  The values in\n  these columns are the counts for those 15-minute periods, e.g. from 6:30 to 6:45.\n\nThe challenging thing in working with this data is turning it into individual timestamped counts which would allow\nbetter totals, graphing, etc, rather than the 58 counts per day across each row.\n\nBefore this, however, we must deal with multiple count rows for the same location: direction coming into the \nlocation, and direction leaving the location (e.g. northbound on Harvard St going to westbound on Beacon St).\nFor our analysis, we add these all up for total traffic at that location in that time period.\n\n### Why the pipeline was organized in 3 phases\n\nSince phaser calculates and keeps row numbers, any phase that reshapes the data significantly by splitting out\nrows or pivoting makes those row numbers invalid. In the Boston data pipeline, there are two phases that really\nchange the shape of the data - the 2nd and 3rd phases.  
The first phase will be a cleanup phase.\n\nColumns are declared once for the whole pipeline, because several phases need to apply the same column value parsing\n(e.g. parsing counts as ints, and dates as dates). The first time the column definitions are used, in the first phase,\nthis results in dropping some rows with invalid values.\n\nIn the __select-bike-counts__ phase, which runs first to eliminate all the data we don't want to work with (a good \npractice to avoid having to add code to work with data you don't even want), only rows with bike counts and\nonly columns we want are kept.  This is done with the __phaser__ builtin __filter_rows__ function to choose only bike\ncount rows, and a custom function to drop all columns not declared.\n\nThe __aggregate-counts__ phase adds the counts together for all the incoming and outgoing locations as those \nare all broken into separate rows with the same COUNT_ID in the source data.\n\nFinally, the __pivot-timestamps__ phase does a wide-to-long pivot, so that each count gets its own row and timestamp, \nnow ready for graphing or analysis.  TODO: the pivot-timestamps phase needs to tell the phaser library not to \nkeep row numbers or warn about extra rows created, because it's doing a pivot.\n\nThe declaration of the pipeline, columns and steps is only ~35 lines, because many of the operations are performed\nby the __phaser__ library (the rest of the lines in boston.py are mostly the pivot function).  The __phaser__ library\ntakes care of:\n\n* Making sure the int columns like CNT_0630 are all treated as integers\n* Parsing the date column\n* Dropping empty rows rather than having null values\n* Input and output\n* Collecting a summary of what was done in an 'errors_and_warnings.txt' file - e.g. 
how many rows dropped with\n  COUNT_TYPE other than 'B'\n\nAfter running the pipeline, the output of each phase can be seen in a checkpoint to make sure each phase separately is \ndoing its job.\n\n## Overview of Seattle pipeline\n\nProvenance: e.g.  https://data.seattle.gov/Transportation/Thomas-St-Overpass-Bike-Ped-Counter/t8i6-tipf/about_data (and\nother pages for other sensor locations - the Burke Gilman Trail NE 70th data has also been tested)\n\nSeattle organizes its sensor data very differently.  Rather than having one file for many locations, there's one file\nper location, and the location name is in the filename as well as in the column names.  Column names are different\nacross different files.  Also, the counts are separated by direction, so there must be a step to sum directions\ntogether to get a comparable value to the Boston data.  TODO: add lat/long based on location, and add temperature\nbased on lat/long and day.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flisad%2Fphaser-example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flisad%2Fphaser-example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flisad%2Fphaser-example/lists"}