{"id":22096101,"url":"https://github.com/perimeterx/data-defender","last_synced_at":"2025-03-24T00:55:46.379Z","repository":{"id":180362519,"uuid":"610323329","full_name":"PerimeterX/Data-Defender","owner":"PerimeterX","description":"A tool to help organizations improve efficiency and saving cost of BigQuery data","archived":false,"fork":false,"pushed_at":"2023-03-06T14:45:16.000Z","size":14,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-01-29T07:30:29.688Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PerimeterX.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-06T14:44:31.000Z","updated_at":"2023-09-27T18:00:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"3bcbdb2a-53c1-4313-82e3-27ea9916bf9c","html_url":"https://github.com/PerimeterX/Data-Defender","commit_stats":null,"previous_names":["perimeterx/data-defender"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PerimeterX%2FData-Defender","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PerimeterX%2FData-Defender/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PerimeterX%2FData-Defender/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PerimeterX%2FData-Defender/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/PerimeterX","download_url":"https://codeload.github.com/PerimeterX/Data-Defender/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245191636,"owners_count":20575248,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-01T04:09:41.017Z","updated_at":"2025-03-24T00:55:46.366Z","avatar_url":"https://github.com/PerimeterX.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data-Defender\n### A tool to help organizations improve efficiency and minimize costs of BigQuery data\n\n![alt text](https://img.shields.io/badge/Licence-MIT-green)\n![alt text](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue)\n\n\n## Usage\n**Note: Since some packages aren't yet ported to the M1 architecture for macOS you may need to use another operating system in order to run Data-Defender**\nTo install the package:\n\n```\npython -m venv .virtualenv\nsource .virtualenv/bin/activate\npip install -r requirements.txt\n```\n\nThen run `main.py` passing in various values as follows:\n```\nusage: main.py [-h] --project_name PROJECT_NAME --credential_path CREDENTIAL_PATH --query QUERY [QUERY ...] 
[--discount DISCOUNT]\n\nAnalyse BigQuery tables for usage\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --project_name PROJECT_NAME\n                        Name of the BigQuery project to use\n  --credential_path CREDENTIAL_PATH\n                        Path to the JSON credentials file used to access BigQuery\n  --query QUERY [QUERY ...]\n                        Which query to run, valid values are 'unused_tables' and 'unused_columns'\n  --discount DISCOUNT   A decimal representation of any discount, if applicable, for BigQuery.\n```\n\nYou can pass one or more values for the `query` parameter, which controls which checks are performed.\nValid values are as follows:\n- `unused_tables` - checks for tables that haven't been used recently.\n- `unused_columns` - checks for columns that haven't been used recently.\n\nWe recommend running `unused_tables` separately from `unused_columns` so that each run finishes faster.\nWe also recommend running `unused_columns` on specific datasets.\n\nFor example, to run the `unused_tables` checks with a 5% discount (passed as the decimal `0.05`), you would do something like this:\n```\n python main.py --project_name myProject \\\n                --credential_path /path/to/my/credentials.json \\\n                --query unused_tables \\\n                --discount 0.05\n```\n\n### Procedure\nWhen the program runs, it issues a number of queries against tables in the relevant BigQuery `INFORMATION_SCHEMA` for your account. It then generates summary reports in a database named `Data_Defender`, in the tables described below. These tables are created on the first run and updated on each subsequent run. 
The user calling `main.py` will thus need the relevant permissions in BigQuery to issue the corresponding SELECT and DDL commands.\n\n- `total_logs` - Generated for every query type; it contains a summary of when each table was last accessed.\n\n- `unused_tables` - A report for each unused table is generated and stored in the `unused_tables` table.\n- `unused_columns` - The `used_columns` query runs first, and the resulting `used_columns` table is then used to identify the unused columns in the `unused_columns` query.\n\n#### total_logs table\n`Schema`:\\\n`user_email` - The e-mail address of the last person who called this table\\\n`job_type` - Whether it's a QUERY or VIEW\\\n`last_run_date` - The timestamp when the table was last queried\\\n`project_id` - The project ID\\\n`dataset_id` - The dataset ID\\\n`table_id` - The table ID\\\n`query` - The query that called the table\\\n`last_call` - Internal use; an ordering based on timestamp, used to find the actual last time the table was called\n\n#### unused_tables table\n`Schema`:\\\n`full_table` - Concatenation of project_id+dataset_id+table_id\\\n`last_modified_date` - The last time the table was modified\\\n`severity_groups` - How long the table has gone without being queried. 
Possible values:\n* Never Been Used\n* Not used for more than 6 m\n* Not used for 3 to 6 m\n\n`size_gb` - The size of the table in GB\\\n`monthly_cost` - The monthly cost of storing the table\\\n`annual_cost` - The annual cost of storing the table\\\n`last_called_by` - The last person (email address) who called this table\\\n`project_id` - The project ID\\\n`dataset_id` - The dataset ID\\\n`table_id` - The table ID\\\n`type` - Whether it's a QUERY or VIEW\\\n`creation_date` - The creation date of the table\n\n#### used_columns table\n`Schema`:\\\n`project_id` - The project ID\\\n`dataset_id` - The dataset ID\\\n`table_id` - The table ID\\\n`column_name` - The specific column inside the table\\\n`last_run_date` - The last timestamp this column was specified in a query\n\n\n#### unused_columns table\n`Schema`:\\\n`table_name` - Concatenation of project_id+dataset_id+table_id\\\n`column_name` - The specific column inside the table\\\n`last_run_date` - The last time this column was specified in a query\\\n`severity_group` - How long the column has gone without being specified in a query. Possible values:\n* Never Been Used\n* Not used for more than 6 m\n* Not used for 3 to 6 m\n\n\n## Example of output \n\n\u003cimg width=\"1381\" alt=\"image\" src=\"https://user-images.githubusercontent.com/68190218/217294100-00a56555-8df3-4298-96f1-484bb0c55638.png\"\u003e\n\n## Contribute\n\nAny type of contribution is warmly welcome and appreciated ❤️\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fperimeterx%2Fdata-defender","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fperimeterx%2Fdata-defender","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fperimeterx%2Fdata-defender/lists"}