{"id":21601344,"url":"https://github.com/dominodatalab/secure-datasets","last_synced_at":"2025-03-18T13:18:37.799Z","repository":{"id":248705013,"uuid":"829432571","full_name":"dominodatalab/secure-datasets","owner":"dominodatalab","description":null,"archived":false,"fork":false,"pushed_at":"2024-07-23T13:42:42.000Z","size":487,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-24T18:27:12.411Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dominodatalab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-16T12:18:52.000Z","updated_at":"2024-07-23T13:42:45.000Z","dependencies_parsed_at":"2024-07-16T16:33:17.380Z","dependency_job_id":"0c4a3864-6e41-4b72-9a5b-039bed67c031","html_url":"https://github.com/dominodatalab/secure-datasets","commit_stats":null,"previous_names":["dominodatalab/secure-datasets"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dominodatalab%2Fsecure-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dominodatalab%2Fsecure-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dominodatalab%2Fsecure-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dominodatalab%2Fsecure-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/dominodatalab","download_url":"https://codeload.github.com/dominodatalab/secure-datasets/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244227560,"owners_count":20419261,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-24T19:09:06.478Z","updated_at":"2025-03-18T13:18:37.780Z","avatar_url":"https://github.com/dominodatalab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Service to Access Datasets Securely From an App\n\nThis is an implementation of a custom service which can be invoked from a Domino App. It provides an object\nstore view to Datasets.\n\n## Personas\n\nTo understand how this service works we establish two main personas from the perspective of the service\n\n| Persona Name        | Identification Mechanism                                                                         |\n|---------------------|--------------------------------------------------------------------------------------------------|\n| App/Workspace Owner | Pass the token returned by `$DOMINO_API_PROXY/access-token` from the app or workspace            |\n| App Caller          | This is the plain text `domino-username`. 
App receive as request header parameter with this name |\n\n## Concepts and Terminology\n\n| Concept                | Description                                                                                                                          |\n|------------------------|--------------------------------------------------------------------------------------------------------------------------------------|\n| Domino Service Account | This will be the App owner. Apps can only be started using a Service Account                                                         |\n| App Admission Webhook  | A K8s Admission Webhook which will prevent and App pod from being started unless the starting user is explicitly permitted           |\n| Environment            | One of [dev/test/prod]. A Domino Service Account permitted to start an app will be assigned to **exactly** one of these environments |\n| Dataset Id             | The id of the dataset as returned by one of the Domino endpoint `/v4/datasetrw/datasets-v2`|  \n| Object Key             | The fully qualified path of the file being requested for a specific `dataset-id`|\n\n\n## Service Endpoints\n\n\n| Endpoint                                  | Description                                                                                                                                  |\n|-------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|\n| /dataset/all                              | This is a convinence wrapper around the Domino Endpoint `/v4/datasetrw/datasets-v2`                                                          |\n| /dataset/canread/\u003cdataset_id\u003e/\u003cdomino_username\u003e | Does the `domino_username` have access to the `dataset_id`                                                                                   |\n| /dataset/list/\u003cdataset_id\u003e       
         | Return a listing of the folder for a specific domino `dataset_id`                                                                            |\n| /dataset/fetch/\u003cdataset_id\u003e         | Return a the requested object (file) for a dataset in `dataset_id` in the folder `/secure/datasets/data/{workload-owner}/{calling-username}` |  \n\n\nExample output for `/dataset/canread/\u003cdataset_id\u003e/\u003cdomino_username\u003e |`\n\n```json\n{\n    \"app_owner\": {\n        \"can_read\": true,\n        \"username\": \"user-ds-1\"\n    },\n    \"calling_user\": {\n        \"can_read\": true,\n        \"username\": \"user-ds-2\"\n    },\n    \"dataset_id\": \"662fbf1c4d68ee2895b43742\"\n}\n```\n\n\n### Mounts in the Service Endpoint \nThe last endpoints need a more detailed discussion. The service does not actually stream out the file. Instead, the following\nmounts are added to the secure datasets service to share the files with the App\n\n```text\n/secure/datasets/data\n/secure/datasets/data\n/secure/datasets/data\n/secure/datasets/data\n```\n\n\nThe following is also mounted in the service `/domino/datasets/filecache`. This is the path in the NFS (EFS for AWS) where\nDomino maintains all the dataset snapshots by `snapshot-id`\n\nAll above mounts are backed by the PVC `domino-shared-store-domino-compute-vol` which in turn is backed by the storage\nclass `dominoshared`. \n\n### Process in the endpoint `/dataset/fetch/\u003cdataset_id\u003e`\n\nWhen the endpoint `/dataset/fetch/\u003cdataset_id\u003e` is invoked the following steps take place\n\n1. The endpoint has two args:\n   - `path` - This is the fully qualified path of the file being requested\n   - `ttl` - This is an optional parameter. Default is `300`(s)\n2. The endpoint identifies of the caller of the endpoint based on either a bearer token in the request header parameter \n   `Authorization` or the DOMINO API KEY in the request header parameter `X-Domino-Api-Key`\n3. 
The endpoint extracts the App caller from the header parameter `domino-username`\n4. The endpoint then makes a call to the Domino endpoint `/v4/datasetrw/dataset/{dataset-id}/grants` using the App Owner token \n   to verify that both the app-owner and the app-caller have access to the dataset being requested\n5. If (4) is true for both principals, the service gets the `ReadWriteSnapshotId` of the `dataset-id`. In fact \n   the solution can be extended to any `snapshotId`. It does this by invoking the endpoint `v4/datasetrw/datasets-v2`\n   using the App-Owner token.\n6. Lastly we need the exact folder which contains the files in the snapshot. This information is obtained by calling\n   the endpoint `v4/datasetrw/snapshot/{snapshot_id}` using the App Owner token. We will refer to this as `{sub-folder-for-snapshot-id}`\n7. Now the service reads the file directly using the path `/domino/datasets/fileshare/{sub-folder-for-snapshot-id}`\n8. A random file name (`{rnd_file_name}`) is generated and the file `/domino/datasets/fileshare/{sub-folder-for-snapshot-id}/{path}` is copied\n   to the folder `/secure/datasets/data/{workload-owner}/{calling-user}/{rnd_file_name}`\n9. Another file is added to the folder `/secure/datasets/metadata/{workload-owner}/{calling-user}/{rnd_file_name}.json` which stores\n   - `source_file` ==  `/domino/datasets/fileshare/{sub-folder-for-snapshot-id}/{path}`\n   - `target_file` ==  `/secure/datasets/data/{workload-owner}/{calling-user}/{rnd_file_name}`\n   - `ttl` == (Default 300s or user specified in the request)\n10. The service checks the folder `/secure/datasets/metadata/` every minute for expired files and removes them\n     from the `/secure/datasets/data/` folder. The App has no control over it.\n\n11. The process is outlined in the ![figure below](./assets/app-service-interaction.png)\n\n\n### Service Endpoints\n\nThe following endpoints are exposed by the service. 
Each of these endpoints need to be invoked by \npassing an Authentication bearer token obtained from `$DOMINO_API_PROXY/access-token` or `DOMINO_API_KEY` in the header\nparameter `X-Domino-Api-Key`. This principal is known as the \"Workload Owner\"\n\n| Endpoint     | Description                                                                                                | \n|--------------|------------------------------------------------------------------------------------------------------------|\n| /dataset/all | This is a convinience wrapper around `v4/datasetrw/datasets-v2`. The app could just call the main endpoint |\n| /dataset/canread/\u003cdataset_id\u003e   | Status code 200 means both the caller identified by the bearer token and `domino_username` has access to the dataset id |\n| /dataset/list/\u003cdataset_id\u003e   | If permitted, list of folder contents. If not status code 403 is returned                                  |\n| /dataset/fetch/\u003cenv\u003e/\u003cdataset_id\u003e   | For a path defined as a argument, returns the dataset file if permitted to access dataset           |\n\n\n### Service Parameters\n\nFor each of the endpoints below, the access token recognizes the `workload-owner` and the request header contains\na parameter `domino-username` which identifies the `calling-username`\n\n| Endpoint     | Parameters                                                    | \n|--------------|---------------------------------------------------------------|\n| /dataset/all | None                                                          |\n| /dataset/canread/\u003cdataset_id\u003e | None                                                          |\n| /dataset/list/\u003cdataset_id\u003e   | `path` the folder for which listing is desired                |\n| /dataset/fetch/\u003cdataset_id\u003e   | `path` full path of the file, `ttl` time to live for the file |\n\n\n## Scenarios\n\nIn this section we will run through scenarios. 
\n\n### Named Users\nLet us assume the following named users \n\n\n| User               | Role          | Is Power User for Secure Datasets Service | \n|--------------------|---------------|-------------------------------------------|\n| `integration-test` | Administrator | Yes                                       |\n| `user-ds-1`        | Practitioner  | No                                        |\n| `user-ds-2`        | Practitioner  | No                                        |\n| `user-ds-3`        | Practitioner  | No                                        |\n| `dev-app-1-owner`  | Practitioner  | No                                        |\n| `test-app-1-owner` | Practitioner  | No                                        |\n| `prod-app-1-owner` | Practitioner  | No                                        |\n\nThe power user will have a designated project (via the helm install) where the following\nfolders are mounted\n```text\n/secure/datasets/data\n/secure/datasets/metadata\n```\nThe purpose is to allow a power user to debug situations. Note that a Domino\nAdministrator has full access to all Datasets. This does not violate any existing \naccess control contract in Domino. 
This is a powerful super-user feature, and \nwe recommend using it very cautiously.\n\n### Domino Service Accounts\n\nNext we assume the following Domino Service Accounts \n\n| User              | Role | \n|-------------------|------|\n| `svc-user-ds-1`   | Dev  |\n| `svc-user-ds-2`   | Dev  |\n| `svc-user-ds-3`   | Dev  |\n| `svc-test-app-1`  | Test |\n| `svc-prod-app-1`  | Prod |\n\n| User             | Token owned by user | \n|------------------|---------------------|\n| `svc-ds-user-1`  | `user-ds-1`         |\n| `svc-ds-user-2`  | `user-ds-2`         |\n| `svc-ds-user-3`  | `user-ds-3`         |\n| `svc-test-app-1` | `user-ds-2`         |\n| `svc-prod-app-1` | `integration-test`  |\n\nNote that user `user-ds-2` holds both the developer and tester roles, while\n`integration-test` holds the `production` role.\n\nOne user can assume multiple roles, but a service account can belong to at most one\nenvironment (from an App perspective). This decides which folder is mounted for the App\nto share dataset objects with the Secure Datasets Service. 
\n\n### Domino Datasets\n\n| Dataset          | Owner              | Writer             | Reader                                                    | \n|------------------|--------------------|--------------------|-----------------------------------------------------------|\n| `dataset-1`      | `user-ds-1`        | `user-ds-1`        | `user-ds-1`,`user-ds-2`,`user-ds-3`                       |\n| `dataset-2`      | `user-ds-2`        | `user-ds-2`        | `user-ds-2`,`user-ds-3`                                   |\n| `dataset-3`      | `user-ds-3`        | `user-ds-3`        | `user-ds-3`                                               |\n| `dev-dataset-1`  | `dev-app-1-owner`  | `dev-app-1-owner`  | `svc-user-ds-[1,2,3]`,`user-ds-1`,`user-ds-2`,`user-ds-3` |\n| `dev-dataset-2`  | `dev-app-1-owner`  | `dev-app-1-owner`  | `svc-user-ds-[1,2,3]`,`user-ds-2`,`user-ds-3`             |\n| `dev-dataset-3`  | `dev-app-1-owner`  | `dev-app-1-owner`  | `svc-user-ds-[1,2,3]`,`user-ds-3`                         |\n| `test-dataset-1` | `test-app-1-owner` | `test-app-1-owner` | `svc-test-app-1`,`user-ds-1`,`user-ds-2`,`user-ds-3`      |\n| `test-dataset-2` | `test-app-1-owner` | `test-app-1-owner` | `svc-test-app-1`,`user-ds-2`,`user-ds-3`                  |\n| `test-dataset-3` | `test-app-1-owner` | `test-app-1-owner` | `svc-test-app-1`,`user-ds-3`                              |\n| `prod-dataset-1` | `prod-app-1-owner` | `prod-app-1-owner` | `svc-prod-app-1`,`user-ds-1`,`user-ds-2`,`user-ds-3`      |\n| `prod-dataset-2` | `prod-app-1-owner` | `prod-app-1-owner` | `svc-prod-app-1`,`user-ds-2`,`user-ds-3`                   |\n| `prod-dataset-3` | `prod-app-1-owner` | `prod-app-1-owner` | `svc-prod-app-1`,`user-ds-3`                               |\n\nThe above should make it clear how adding data to datasets and having read access to datasets is \nclearly separated. 
Named Domino users have permissions to write to the datasets they own.\n\nService Account users can have read access to specific datasets and other named users can also \nhave read access.\n\nAlso note how the development dataset are recommended to be separate from production datasets. This enforces a conscious \ndecision making  process regarding allowing access control to datasets via apps . This is not required but having this \nlevel of governance ensures that the integrity of your data access protocols is ensured.\n\n\n### How a workload mount looks like depending on Workload Type and Starting User\n\n#### Workspaces \n\n| Workload Owner | Mount                                                   | \n|----------------|---------------------------------------------------------|\n| `integration-test` | /secure/datasets/data/integration-test/integration-test |\n| `user-ds-1`        | /secure/datasets/data/user-ds-1/user-ds-1               |\n| `user-ds-2`        | /secure/datasets/data/user-ds-2/user-ds-2               |\n| `user-ds-3`        | /secure/datasets/data/user-ds-3/user-ds-3               |\n| `dev-app-1-owner`  | /secure/datasets/data/dev-app-1-owner/dev-app-1-owner   |\n| `test-app-1-owner` | /secure/datasets/data/test-app-1-owner/test-app-1-owner |\n| `prod-app-1-owner` | /secure/datasets/data/prod-app-1-owner/prod-app-1-owner                 |\n\nFrom inside a workspace anyone can invoke the `secure-datasets-svc` but they can only\nrequest a dataset object for themselves. 
If they invoke it for any other user even if the \nworkspace owner has access to the dataset, it won't be visible to the caller because the\nuser folder for the \"calling-username\" that is mounted is identical to the \"workload-owner\"\n\n\n#### App \n\nAn App has the following mounts\n`/secure/datasets/data/{workload-owner}`\n`/secure/datasets/metadata/{workload-owner}`\n\n| Workload Owner     | Mount                                                                                 | \n|--------------------|---------------------------------------------------------------------------------------|\n| `svc-ds-user-1`    | `/secure/datasets/data/svc-ds-user-1` \u003cbr/\u003e `/secure/datasets/metadata/svc-ds-user-1` |\n| `svc-ds-user-2`    | `/secure/datasets/data/svc-ds-user-2` \u003cbr/\u003e `/secure/datasets/metadata/svc-ds-user-2` |\n| `svc-ds-user-3`    | `/secure/datasets/data/svc-ds-user-3` \u003cbr/\u003e `/secure/datasets/metadata/svc-ds-user-3` |\n| `svc-test-app-1`   | `/secure/datasets/data/svc-test-app-1` \u003cbr/\u003e `/secure/datasets/metadata/svc-test-app-1`|\n| `prod-test-app-1`  | `/secure/datasets/data/prod-test-app-1` \u003cbr/\u003e `/secure/datasets/metadata/prod-test-app-1`    |\n\nNote that while the service accounts are owned by named users, they have access to different datasets\nas compared to the named users. It is imperative that a proper process be followed to add data to\ndatasets accessible to service accounts.\n\n\nThe `secure-datasets-svc` has all the possible mounts. 
\n\n`/secure/datasets/data/`\n`/secure/datasets/metadata/`\n\n\u003e An important aspect of this design is that datasets need to be logically divided by {workload-owners}.\n\u003e Consequently, for a given App workload the only datasets that can be read are the ones available to the App Owner.\n\u003e The Caller can only access datasets that are accessible to both the App-Owner and the App-Caller.\n\u003e Allowing only designated service accounts to deploy Apps ensures that a proper process is \n\u003e followed when providing dataset access to service accounts, including code review to ensure the code is not malicious or buggy \n\u003e in a way that causes data leaks.\n\n\n## Process\n\n1. A user will develop code against raw datasets in the workspace\n2. A user will deploy code to an App using a service account assigned to them\n   - The user will request a dev dataset to be created and allocated to them by `dev-app-1-owner`\n   - It is the responsibility of the `dev-app-1-owner` to not include sensitive data in these datasets\n3. Next the test user will deploy the App using a service account assigned to them\n   - The user will request a test dataset to be created and allocated to them by `test-app-1-owner`\n   - It is the responsibility of the `test-app-1-owner` to not include sensitive data in these datasets\n4. Lastly a prod user will deploy the App using a service account assigned to them\n   - The user will request a prod dataset to be created by `prod-app-1-owner`\n   - It is the responsibility of the `prod-app-1-owner` to ensure that only the right users have access to these datasets\n\n## Installation Prerequisites\n\nInstall [Domsed](https://github.com/dominodatalab/domino-field-solutions-installations/tree/main/domsed)\n\n## Installation\n\n0. Build the image by running the following command\n```shell\nexport tag=v0.0.1\n./create_image.sh $tag\n```\n\n1. First create a Service Account in Domino and note its name and the token value. 
Our Service Account name \n   in the demo env is `secure-ds-access-sa`\n\n2. The installation is a helm chart. See the `values.yaml` in the root folder. The only values unique to your deployment\nwill be:\n   \n    - **admin_user**: `integration-test` (Pick one Domino Administrator who can test this service from a local \n        workspace. This Admin's workspace for the project with the Domino Project Id mentioned in the next step \n        will have the same configuration as the chosen App started by a Service Account)\n   \n    - **admin_project_id**: `662a70a548ce7c5c14d8d3ce` (This is the Domino Project Id for the `quick-start`\n   project owned by the `integration-test` user) \n   \n    - **sa_user_name**: `secure-ds-access-sa` (This is the service account that will have the token\n   mounted, which permits it to call the secure datasets access service)\n    \n   Perform the helm install\n   ```shell\n        export compute_namespace=domino-compute\n        helm install -f values.yaml secure-dataset-access-service helm/secure-dataset-access-service -n ${compute_namespace}\n    ```\n    The corresponding commands for helm delete and upgrade are:\n    ```shell\n        export compute_namespace=domino-compute\n        helm delete secure-dataset-access-service -n ${compute_namespace}\n    ```\n\n   ```shell\n        export compute_namespace=domino-compute\n        helm upgrade secure-dataset-access-service helm/secure-dataset-access-service -n ${compute_namespace}\n    ```\n\n   \n## Improvements\n\n1. Verify the token is signed \n2. Use gunicorn to scale the secure datasets service\n3. 
Can we use symlinks instead of copying files?\n\n\n## Starting an App from Code\n\n\nTo start the app from code, first start it manually, note the id of the `modelProducts` entry, and shut the app down\n\n```python\nimport requests\nimport json\nimport os\n\ndomino_api_host = os.environ[\"DOMINO_API_HOST\"]\n\nmodelProductId='6632694cb0361c7a46730640'\ntoken_file = \"/etc/tokens/token\"\nsa_token = \"\"\nwith open(token_file, 'r') as file:\n    # Read the token; strip whitespace so a trailing newline does not corrupt the header\n    sa_token = file.read().strip()\nurl = f\"{domino_api_host}/{modelProductId}/start\"\n\npayload = json.dumps({\n  \"hardwareTierId\": \"small-k8s\",\n  \"environmentId\": \"662a613292c1117e2b322d02\",\n  \"externalVolumeMountIds\": []\n})\nheaders = {\n  'Content-Type': 'application/json',\n  'Authorization': f\"Bearer {sa_token}\"\n  }\n\nresponse = requests.request(\"POST\", url, headers=headers, data=payload)\nprint(response.status_code)\nprint(response.text)\n```\nTo stop the App, do the following:\n```python\nimport requests\nimport os\n\ndomino_api_host = os.environ[\"DOMINO_API_HOST\"]\n\nmodelProductId='6632694cb0361c7a46730640'\ntoken_file = \"/etc/tokens/token\"\nsa_token = \"\"\nwith open(token_file, 'r') as file:\n    # Read the token; strip whitespace so a trailing newline does not corrupt the header\n    sa_token = file.read().strip()\nurl = f\"{domino_api_host}/{modelProductId}/stop?force=true\"\n\nheaders = {\n  'Content-Type': 'application/json',\n  'Authorization': f\"Bearer {sa_token}\"\n  }\n\nresponse = requests.request(\"POST\", url, headers=headers, data={})\n\nprint(response.status_code)\nprint(response.text)\n\n```\n\nAn app started by any other user (including an Admin) will not have the service account tokens mounted\nto access the secure-dataset access service. \n\nThis is why, to test the functionality, we designate a Domino Admin user and a Project owned by that Admin.\n\nThe [notebook](./notebooks/app_emulation.ipynb) is provided for this test.\n\n## Next Steps\n\n1. 
Add audit capabilities\n   - Datetime\n   - Dataset Id\n   - Snapshot Id\n   - Original Source filepath\n   - Target Temporary filepath\n   - TTL\n   - Workload Owner\n   - Workload Caller\n2. Support reading of any snapshots\n3. Remove dataset mounts\n4. Horizontal Scaling of Secure Datasets Service\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdominodatalab%2Fsecure-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdominodatalab%2Fsecure-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdominodatalab%2Fsecure-datasets/lists"}