{"id":16930684,"url":"https://github.com/avdata99/testing-datajson-external-harvester","last_synced_at":"2025-03-21T03:25:01.279Z","repository":{"id":72337919,"uuid":"193757866","full_name":"avdata99/testing-datajson-external-harvester","owner":"avdata99","description":null,"archived":false,"fork":false,"pushed_at":"2023-05-22T22:29:34.000Z","size":30191,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-26T00:11:56.712Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/avdata99.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-25T17:57:23.000Z","updated_at":"2019-06-28T18:45:21.000Z","dependencies_parsed_at":"2023-05-26T23:00:31.512Z","dependency_job_id":null,"html_url":"https://github.com/avdata99/testing-datajson-external-harvester","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avdata99%2Ftesting-datajson-external-harvester","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avdata99%2Ftesting-datajson-external-harvester/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avdata99%2Ftesting-datajson-external-harvester/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avdata99%2Ftesting-datajson-external-harvester/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/avdata99","download_url":"https://codeload.github.com/avdata99/testing-datajson-external-harvester/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244729662,"owners_count":20500279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T20:42:14.983Z","updated_at":"2025-03-21T03:25:01.256Z","avatar_url":"https://github.com/avdata99.png","language":"Python","readme":"# ETL for data.json\n\nTraining for ETL in data.json files.  \n\n## Read data.json\n\n```\nusage: harvest_data_json.py [-h] [--url URL] [--name NAME]\n                            [--request_timeout REQUEST_TIMEOUT]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --url URL             URL of the data.json\n  --name NAME           Name of the resource (for generate the containing\n                        folder)\n  --request_timeout REQUEST_TIMEOUT\n                        Request data.json URL timeout\n```\n### Real examples\n\n```\npython3 harvest_data_json.py --name exim --url http://data.exim.gov/data.json\n\npython3 harvest_data_json.py --name \"Western Pennsylvania Regional Data Center\" --url https://data.wprdc.org/data.json\n\npython3 harvest_data_json.py --name energy-data --url https://www.energy.gov/sites/prod/files/2019/04/f61/doe-pdl-4-8-2019_0.json\n\n#NASA requires more time (--request_timeout), it has 87MB (24432 datasets) of data.json.\npython3 harvest_data_json.py --name NASA-data --url https://data.nasa.gov/data.json --request_timeout 
90\n\n```\n\n## OK example\n\n```\nDownloaded OK\nJSON OK\nValidate OK. 1 datasets\n - Dataset: Authorizations From 10/01/2006 Thru 12/31/2018: This file contains all authorizations approved between 10/01/2006 and 12/31/2018\nPlease note that the asterisked Working Capital transactions were extended during the period of EXIM Bank’s lapse in authority in conformance with original authorization agreements. These deals were originally authorized before the lapse as multiyear facilities with annual extensions. This record represents the extension of the prior authorization. EXIM did not authorize new business during its lapse in authority. \n```\n\n## Error example\n```\nDownloaded OK\nJSON OK\nError validating data: Error validating JsonSchema: '[[REDACTED-EX B6]]' is not of type 'array'\n\nFailed validating 'type' in schema['properties']['dataset']['items']['properties']['keyword']:\n    {'description': 'Tags (or keywords) help users discover your dataset; '\n                    'please include terms that would be used by technical '\n                    'and non-technical users.',\n     'items': {'minLength': 1, 'type': 'string'},\n     'minItems': 1,\n     'title': 'Tags',\n     'type': 'array'}\n\nOn instance['dataset'][43]['keyword']:\n    '[[REDACTED-EX B6]]'\n----------------\nValidate FAILED. 2868 datasets\n\n - Dataset: Agency Parking: Agency parking application that provides the capability to record and query parking assignments. Access is limited to designated personnel of the Facilities and Logistics\n - Dataset: Congressional and Intergovernmental Affairs webpage: The Office of Congressional and Intergovernmental Affairs is dedicated to its mission of providing guidance on legislative and policy issues, informing constituencies on energy matters, and serving as a liaison between the Department, Congress, State, local, and Tribal governments, as well as other Federal agencies and stakeholder groups.\n - Dataset: DATA Act for U.S. 
Department of Energy: This is a link where the U.S. Department of Energy DATA Act reporting can be found.\n - Dataset: Agency IT Policy Archive: IT Policy Archive\n \n ...\n ...\n\n```\n\n## Read CKAN API\n\nGet paginated resources from a CKAN instance.\nTested with data.gov.\n\n```\npython3 data_json_harvest/data_gov_api.py\n\nSearching https://catalog.data.gov/api/3/action/package_list PAGE:1 start:0, rows:1000\n1000 results\n4348 total resources\nSearching https://catalog.data.gov/api/3/action/package_list PAGE:2 start:1000, rows:1000\n1000 results\n9615 total resources\n\n...\n\nSearching https://catalog.data.gov/api/3/action/package_list PAGE:153 start:152000, rows:1000\n1000 results\n735321 total resources\nSearching https://catalog.data.gov/api/3/action/package_list PAGE:154 start:153000, rows:1000\n1000 results\n744662 total resources\n\n...\n\n\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favdata99%2Ftesting-datajson-external-harvester","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Favdata99%2Ftesting-datajson-external-harvester","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favdata99%2Ftesting-datajson-external-harvester/lists"}