# ID3C: Infectious Disease Data Distribution Center

Data logistics system enabling real-time genomic epidemiology. Built for the
[Seattle Flu Study](https://seattleflu.org).

## Navigation
* [Database](#database)
* [Web API](#web-api)
* [CLI](#cli)
* [Development Setup](#development-setup)

## Database

Currently [PostgreSQL 15](https://www.postgresql.org/about/news/postgresql-15-released-2526/).

Initially aims to provide:

* Access via SQL and [REST APIs](#web-api), with
  [PostgREST](http://postgrest.org) and/or an ES2018 web app that may come later

* De-identified metadata for participants (age, sex, address token, etc.)
and
  samples (tissue, date, location, etc.)

* Sample diagnostic results (positive/negative for influenza, RSV, and more)

* Sequencing read sets and genome assemblies stored in the cloud and referenced
  via URLs in the database

* Rich data types (key/value, JSON, geospatial, etc.)

* Strong data integrity and validation controls

* Role-based authentication and restricted data fields using row- and
  column-level access control

* Encryption at rest and TLS-only connections

* Administration via standard command-line tools (and maybe later
  [pgAdmin4](https://www.pgadmin.org/))


### Design

The database is designed as a [distribution center][] which receives data from
external providers, repackages and stores it in a data warehouse, and ships
data back out of the warehouse via views, web APIs, and other means.  Each of
these three conceptual areas is organized into its own PostgreSQL schema
within a single database.

The "receiving" area contains tables to accept minimally- or un-controlled data
from external providers.  The general expectation is that most tables here are
logs ([in the journaling sense][the log]) and will be processed later in
sequential order.  For example, participant enrollment documents from our
consent and questionnaire app partner, Audere, are stored here when received by
our web API.

The "warehouse" area contains a hybrid relational + document model utilizing
standard relational tables that each have a JSON column for additional details.
Data enters the warehouse primarily through extract-transform-load (ETL)
routines which process received data and copy it into suitable warehouse rows
and documents.
These ETL routines are run via `bin/id3c etl` subcommands, where
they're defined in Python (though they lean heavily on PostgreSQL itself).

The "shipping" area contains views of the warehouse designed with specific data
consumers and purposes in mind, such as the incidence modeling team.

While the receiving and shipping areas are expected to be fairly fluid and
reactive to new and changing external requirements, the warehouse area is
expected to change at a somewhat slower pace, informed by a longer-term vision
for it.

[distribution center]: https://en.wikipedia.org/wiki/Distribution_center
[the log]: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


### Guidelines

General principles to follow when developing the schema.

* Do the ~simplest thing that meets our immediate needs.  Aim for ease of
  modification in the future rather than trying to guess future needs in
  advance.

  It can be hard to stick to this principle, but the best way to make
  something flexible for future needs is to make it as simple as possible now
  so it can be modified later.

* Columns should be maximally typed and constrained, unless there is a
  concrete use case for something less.

* Consider if a column should ever be unknown (null).

* Consider if a column should have a default.

* Consider what constraints make sense at both the column and table level.
  Would a `CHECK` constraint be useful to express domain logic?

* Write a description (comment) for all schemas, tables, columns, etc.

* Grant only the minimal privileges necessary to the read-only and read-write
  roles.
  For example, if the read-write role isn't expected to `UPDATE`
  existing records, then grant it only `INSERT`.

* Consider expected data access patterns and create indexes to match.


### Integration with other data systems

Although we're building our own data system, we want to design and create it
with interoperability in mind.  To that end, our system should adopt or
parallel practices and terminology from other systems when appropriate.
For example:

* Nouns (tables, columns, etc.) in our system should consider adopting the
  equivalent terms used by [FHIR R4](http://www.hl7.org/implement/standards/fhir/)
  resources.

  This will aid with producing FHIR documents in the future and provides a
  consistent terminology on which to discuss concepts more broadly than our
  group.  FHIR is a large specification and there is a lot to digest; it's
  easy to be daunted or confused by it, so please don't hesitate to ask
  questions.

* Value vocabulary (specimen types, organism names, diagnostic tests, etc.)
  should consider using and/or referencing the preferred terms from an
  appropriate ontology like
  [SNOMED CT](https://www.snomed.org/snomed-ct/why-snomed-ct),
  [LOINC](https://loinc.org),
  or [GenEpiO](https://genepio.org/).


### Deploying

The database schema is deployed using [Sqitch](https://sqitch.org), a database
change management tool.  It can be installed in a number of ways, so pick the
one that makes the most sense to you.

### Development

For development, you'll need a PostgreSQL server and superuser credentials for
it.  The following commands assume the database server is running locally and
your local user account maps directly to a database superuser.

Create a database named `seattleflu` using the standard Pg tools.
(You can use
another name if you want, maybe to have different dev instances, but you'll
need to adjust the [sqitch target][] you deploy to.)

    createdb --encoding=UTF-8 seattleflu

Then use `sqitch` to deploy to it.  (`dev` is a [sqitch target][] configured in
_sqitch.conf_ which points to a local database named `seattleflu`.)

    sqitch deploy dev

Now you can connect to it for interactive use with:

    psql seattleflu

### Testing and production

Our [testing and production databases][databases doc] are configured as the
`testing` and `production` sqitch targets.  When running sqitch against these
targets, you'll need to provide a username via `PGUSER` and a password via an
entry in _~/.pgpass_.


[sqitch target]: https://metacpan.org/pod/distribution/App-Sqitch/lib/sqitch-target.pod
[databases doc]: https://github.com/seattleflu/documentation/blob/master/infrastructure.md#databases-postgresql


## Web API

Python 3 + [Flask](http://flask.pocoo.org)

* Consumes and stores enrollment documents from the Audere backend systems

### Config

* Database connection details are set entirely using the [standard libpq
  environment variables](https://www.postgresql.org/docs/current/libpq-envars.html),
  such as `PGHOST` and `PGDATABASE`.  You may provide these when starting the
  API server.

  User authentication is performed against the database for each request, so
  you do not (and should not) provide a username and password when starting the
  API server.

* The maximum accepted Content-Length defaults to 20 MB.  You can override this
  by setting the environment variable `FLASK_MAX_CONTENT_LENGTH`.

* The `LOG_LEVEL` environment variable controls the level of terminal output.
  Levels are strings: `debug`, `info`, `warning`, `error`.

### Starting the server

Either of the commands `pipenv run python -m id3c.api` or `pipenv run flask run`
will run the application's __development__ server.
To provide database
connection details while starting the development server, run
`PGDATABASE=DB_NAME pipenv run flask run`, substituting `DB_NAME` with the name
of your database.

For production, a standard `api.wsgi` file is provided which can be used by any
web server with WSGI support.

### Examples

User authentication must be provided when making POST requests to the API.  For
example, you can run the following `curl` command to send a JSON file named
`enrollments.json` to the `/enrollment` endpoint on a local development server:

```sh
curl http://localhost:5000/enrollment \
  --header "Content-Type: application/json" \
  --data-binary @enrollments.json \
  --user USERNAME
```

Substitute your own local database username for `USERNAME`.  This will prompt
you for a password; you can also specify it directly by using `--user
"USERNAME:PASSWORD"`, though be aware it will be saved to your shell history.

## CLI

Python 3 + [click](https://click.palletsprojects.com)

Interact with the database on the command line in your shell to:

* Mint identifiers and barcodes

* Run ETL routines, e.g. enrollments, to process received data into the
  warehouse

* Parse, diff, and upload sample manifests

* Preprocess clinical data and upload it into receiving

* Send Slack notifications from the [Reportable Conditions Notifications Slack
  App](https://api.slack.com/apps/ALJJAQGKH)

The `id3c` command is the entry point.
It must be run within the project
environment, for example by using `pipenv run id3c`.

The `LOG_LEVEL` environment variable controls the level of terminal output.
Levels are strings: `debug`, `info`, `warning`, `error`.


## Development setup

### Dependencies

Python dependencies are managed using [Pipenv](https://pipenv.readthedocs.io).

Install all the (locked, known-good) dependencies by running:

    pipenv sync

Add new dependencies to `setup.py` and run:

    pipenv lock
    pipenv sync

and then commit the changes to `Pipfile` and `Pipfile.lock`.

### Connection details

Details for connecting to the ID3C database are, by convention, controlled
entirely by the [standard libpq environment variables](https://www.postgresql.org/docs/current/libpq-envars.html),
[service definitions](https://www.postgresql.org/docs/current/libpq-pgservice.html),
and [password files](https://www.postgresql.org/docs/current/libpq-pgpass.html).

For example, if you want to list the identifier sets available in the Seattle
Flu Study testing database, you could create the following files:

_~/.pg\_service.conf_

    [seattleflu-testing]
    host=testing.db.seattleflu.org
    user=your_username
    dbname=testing

_~/.pgpass_

    testing.db.seattleflu.org:5432:*:your_username:your_password

Make sure the _~/.pgpass_ file is only readable by you, since it contains your
password:

    chmod u=rw,og= ~/.pgpass

and then run:

    PGSERVICE=seattleflu-testing pipenv run bin/id3c identifier set ls

These files will also allow you to connect using `psql`:

    psql service=seattleflu-testing

### Tests

Run all tests with:

    pipenv run pytest -v

Run just the type-checking tests with:

    ./dev/mypy
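For orientation, the tests collected by `pipenv run pytest` are ordinary
pytest-style functions.  Below is a minimal, self-contained sketch of what such
a test file looks like; the `checksum` helper is hypothetical and exists only
for illustration, it is not part of ID3C:

```python
# test_example.py -- a minimal pytest-style test sketch.
# The checksum() helper below is hypothetical, for illustration
# only; it is NOT part of ID3C.

def checksum(barcode: str) -> int:
    """Toy check digit: sum of character code points, modulo 97."""
    return sum(ord(c) for c in barcode) % 97

def test_checksum_is_stable():
    # The same barcode always yields the same check digit.
    assert checksum("ABCD1234") == checksum("ABCD1234")

def test_checksum_detects_single_edit():
    # Changing one character changes the check digit in this case.
    assert checksum("ABCD1234") != checksum("ABCD1235")
```

Running `pipenv run pytest -v` against a file like this collects and runs both
test functions by name.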