{"id":25405045,"url":"https://github.com/justingosses/data_dot_json_over_time","last_synced_at":"2025-09-09T06:54:06.662Z","repository":{"id":276556487,"uuid":"925877675","full_name":"JustinGOSSES/data_dot_json_over_time","owner":"JustinGOSSES","description":"Prototype for seeing changes in data.json for NASA and other agencies over time using wayback machine","archived":false,"fork":false,"pushed_at":"2025-02-11T04:33:18.000Z","size":232,"stargazers_count":1,"open_issues_count":4,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-29T06:35:45.449Z","etag":null,"topics":["civic-tech","data-dot-gov","data-dot-json","metadata","open-data","open-gov"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JustinGOSSES.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-02T00:31:28.000Z","updated_at":"2025-02-09T04:39:05.000Z","dependencies_parsed_at":"2025-02-09T02:23:35.696Z","dependency_job_id":"ea7e7f6d-1b6f-4b6d-aef7-8489f665f7cc","html_url":"https://github.com/JustinGOSSES/data_dot_json_over_time","commit_stats":null,"previous_names":["justingosses/data_dot_json_over_time"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/JustinGOSSES/data_dot_json_over_time","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Fdata_dot_json_over_time","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Fdata_dot_json_over_time/tags","releases_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Fdata_dot_json_over_time/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Fdata_dot_json_over_time/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JustinGOSSES","download_url":"https://codeload.github.com/JustinGOSSES/data_dot_json_over_time/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Fdata_dot_json_over_time/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274258772,"owners_count":25251555,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-09T02:00:10.223Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["civic-tech","data-dot-gov","data-dot-json","metadata","open-data","open-gov"],"created_at":"2025-02-16T04:29:58.381Z","updated_at":"2025-09-09T06:54:06.613Z","avatar_url":"https://github.com/JustinGOSSES.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# data_dot_json_over_time\nPrototype for seeing changes in data.json at NASA and other U.S. 
government agencies over\ntime using the Internet Archive's Wayback Machine.\n\n_To jump down to lessons learned, see the section [Early analysis learnings](#early-analysis-learnings)._\n\n## Function\n\nThe code in this repository analyzes and compares snapshots of\ndata.json files, which hold metadata about US government datasets.\nIt uses the [Internet Archive's snapshots](https://webcf.waybackmachine.org/)\nof content at a given URL to find older versions of a data.json and\nthen compares them to the data.json harvested from the live website today.\nThis allows for comparison of how metadata in data.json files has changed over time.\n\nFor example, one of the first questions analyzed is whether any dataset's\nmetadata that was in the data.json prior to January 20th, 2025 is no longer\nin a more recent data.json. Additionally, for these datasets, do the distribution\nURLs still respond with a 200 status when requested, or with a 404 indicating the distribution download\npage may have been removed?\n\nOther questions may or may not be analyzed over time if work on this side project\ncontinues.\n\n## What is data.json and data.gov?\n\nAs part of the US government's federal open data policies and processes,\n[data.gov](https://catalog.data.gov/dataset/) was stood up as a central catalog of dataset\nmetadata. There is a standardized metadata format and all U.S. 
government agencies were\nrequired to create their own data.json file documenting their datasets in this format. That file\nthen gets consumed into data.gov, which acts as a central catalog.\n\nMore details can be found at: \n- https://data.gov/user-guide/\n- https://resources.data.gov/about/governance/\n- https://resources.data.gov/resources/dcat-us/\n\n### Why analyze US Government agency data.json \n\nAs the centralized attempt to put dataset metadata in one place in one format, data.json is an\nobvious place to try to measure changes over time.\nAdditionally, because each agency has its own data.json (even if sometimes hard to find),\nit enables analysis at the agency level as well.\n\n### Complications of analyzing data.json files and data.gov\n\nA high-level list of reasons why things get complicated:\n\n- Data.json and data.gov almost always lag actual data availability:\n    - Data.json files are a lagging indicator. By the time datasets are removed from data.json, they\n        have likely already been removed from the actual download pages.\n- Speed of updates is highly variable:\n    - While some data is stored in systems that automatically send changes to data.gov and the system that updates each agency's data.json whenever the data itself changes, this is probably rare. In many cases, there can be a yearly\n    or monthly \"pull\" of data from various data systems into data.json and data.gov. In some cases, metadata for a\n    dataset in a data.json will only be changed if someone goes in and manually changes it.\n- Nested systems:\n    - Fundamentally, the metadata in data.json tends to be a description of a description\n    of the metadata of another system. 
Sometimes there are 2 layers to this, sometimes 5 or 6.\n    For example, at NASA there might be a dataset with one set of metadata from the system that collected it.\n    That metadata is used to partially create the metadata fields in the system that stores it.\n    There might be a catalog of conceptually similar datasets that consumes that metadata and converts it to its own\n    format, which is shared across those conceptually similar datasets. Then NASA converts, shrinks, and takes some\n    of that metadata to make its data.json at the agency level, which then gets consumed into the data.gov\n    metadata catalog.\n- Not everything is simply a single file:\n    - For datasets that get large, it is not uncommon to store them in systems that let a user extract\n    just the parts they are concerned about. This is common for geospatial data, where a system might\n    let a user extract data for a particular area of coverage, date, version, and conceptual layer.\n    As those change over time, or as how they are described in metadata changes, it can appear as if a\n    large amount of data is removed or added in data.json when in fact no data is actually being removed.\n- Versions:\n    - The same dataset can also appear to disappear or multiply as different versions and edits of it appear.\n    Sometimes older versions are viewed as \"not accurate\" and removed, but a newer and more accurate version still exists.\n\n\n## How to use the code in this repository\n\n### Structure \n\nThere is a configuration YAML file at [`config.yml`](./config.yml) that holds details of which agency you want to harvest\na data.json from, which dates, and where output data should be saved. The Python files consume this information.\nIt is expected that some users will not write any additional Python code and will just change a few variables in the\nconfig.yml to harvest and analyze different agency data from different dates. 
The particulars of what to\nchange and what not to will get addressed further on.\n\nThe Python file [`src/tools.py`](./src/tools.py) holds a variety of small functions that act as tools or utilities.\nIf following the \"Assumed user steps\" section below, you will not call any of these functions directly,\nbut you might if you reuse them to analyze different aspects of data.json than initially targeted.\n\nThe Python file\n[`src/get_list_available_snapshots_in_wayback_machine.py`](./src/get_list_available_snapshots_in_wayback_machine.py)\nreads the config.yml file and gets the URL of the agency data.json you're after. It then sends that URL to the\nInternet Archive's Wayback Machine to get a listing of all the times the Wayback Machine attempted to snapshot\nthat URL and whether the result was successful (200 status code) or returned some other HTTP status code.\nThe script processes this information into a JSON and saves it in the\n`/data/{agency_name}/snapshots_available_in_archive/` folder. The filename is the date on which you made the\nAPI call to get this information from the Wayback Machine.\n\n\nThe Python file [`src/harvest.py`](./src/harvest.py) starts off doing the exact same thing as the\n[`src/get_list_available_snapshots_in_wayback_machine.py`](./src/get_list_available_snapshots_in_wayback_machine.py)\nfile, but then additionally collects the data.json file from (1) the Wayback Machine for the date(s) defined in the\nconfig.yml and (2) the agency's live URL, which holds the current data.json.\nThe [`src/get_list_available_snapshots_in_wayback_machine.py`](./src/get_list_available_snapshots_in_wayback_machine.py)\nscript exists on its own because it is very likely people will want to first see all the dates where a snapshot of the data.json\nexists and then decide which to ask the Wayback Machine for. 
There can be 50 or 100 snapshots of some agency data.json.\nNext, `src/harvest.py` processes the data in those data.json files to make it easier to analyze. As all URLs in the data.json from the\nWayback Machine are preceded by the Wayback Machine address, it removes that prefix. Next, to make comparison of data files easier\nand faster, it changes the structure somewhat. In the data.json [DCAT](https://resources.data.gov/resources/dcat-us/)\nformat, each dataset is described by a separate object or dictionary with defined keys. All of those objects are in a big list.\nTo make comparison of data.json easier, we change the structure from a list to an object where the `identifier` field is extracted\nfrom each dataset object and made into the key for that object.\n\nA non-processed example snippet of a data.json is below. Each dataset is described in an object {} that follows a format defined in\n[DCAT](https://resources.data.gov/resources/dcat-us/). \n\n```\ndatasets:\n[\n        {\n            \"accessLevel\": \"public\",\n            \"landingPage\": \"https://pds.nasa.gov/ds-view/pds/viewDataset.jsp?dsid=RO-E-RPCMAG-2-EAR2-RAW-V3.0\",\n            \"bureauCode\": [\n                \"026:00\"\n            ],\n            \"issued\": \"2018-06-26\",\n            \"@type\": \"dcat:Dataset\",\n            \"modified\": \"2023-01-26\",\n            \"references\": [\n                \"https://pds.nasa.gov\"\n            ],\n            \"keyword\": [\n                \"international rosetta mission\",\n                \"unknown\",\n                \"earth\"\n            ],\n            \"contactPoint\": {\n                \"@type\": \"vcard:Contact\",\n                \"fn\": \"Thomas Morgan\",\n                \"hasEmail\": \"mailto:thomas.h.morgan@nasa.gov\"\n            },\n            \"publisher\": {\n                \"@type\": \"org:Organization\",\n                \"name\": \"National Aeronautics and Space Administration\"\n            },\n            \"identifier\": 
\"urn:nasa:pds:context_pds3:data_set:data_set.ro-e-rpcmag-2-ear2-raw-v3.0_222f-2gsy\",\n            \"description\": \"This dataset contains EDITED RAW DATA of the second Earth Flyby (EAR2). The closest approach (CA) took place on November 13, 2007 at 20:57\",\n            \"title\": \"ROSETTA-ORBITER EARTH RPCMAG 2 EAR2 RAW V3.0\",\n            \"programCode\": [\n                \"026:005\"\n            ],\n            \"distribution\": [\n                {\n                    \"@type\": \"dcat:Distribution\",\n                    \"downloadURL\": \"https://www.socrata.com\",\n                    \"mediaType\": \"text/html\"\n                }\n            ],\n            \"accrualPeriodicity\": \"irregular\",\n            \"theme\": [\n                \"Earth Science\"\n            ]\n        },\n        {'another datasets metadata object here'},\n        {'another datasets metadata object here'},\n        .....\n]\n```\nOnce we have the data in an object instead a list, comparison of two data.json to find\ndifferences is a little faster. The `missing_keys` sub-directory is where we put the\nresults of these comparisons. 
We also create a JSON that records the result of hitting every\nlanding page URL and dataset distribution URL with a link checker, to see what the status\nof each link is and to find 404s.\n\nWhat this gets us is a list of datasets that don't exist in the more\nrecent data.json but did exist in an older one, and a check for whether the URLs associated\nwith those datasets are active or not.\n\nIn the example below, all the data is for dataset identifiers that existed in the\nolder data.json but don't exist in the more recent version.\nThe URL `\"https://disc.gsfc.nasa.gov/datacollection/MICASA_FLUX_3H_1.html\"`\nis one of the dataset's download distribution points, `\"20250206001603\"` is the date and time\nit was checked, and `200` was the status result, meaning the page responded that it was still there.\n\n```\n            \"C3273640138-GES_DISC\": {\n                \"accessLevel\": \"public\",\n                \"landingPage\": \"https://doi.org/10.5067/AS9U6AWVTY69\",\n                \"bureauCode\": [\n                    \"026:00\"\n                ],\n                \"citation\": \"Brad Weir. 2024-09-25. MICASA_FLUX_3H. Version 1. MiCASA 3-hourly NPP NEE Fluxes 0.1 degree x 0.1 degree. Greenbelt, MD, USA. Archived by National Aeronautics and Space Administration, U.S. Government, Goddard Earth Sciences Data and Information Services Center (GES DISC). https://doi.org/10.5067/AS9U6AWVTY69. https://disc.gsfc.nasa.gov/datacollection/MICASA_FLUX_3H_1.html. 
Digital Science Data.\",\n                \"issued\": \"2024-09-22\",\n                \"temporal\": \"2001-01-01T00:00:00Z/2023-12-31T23:59:59.999Z\",\n                \"@type\": \"dcat:Dataset\",\n                \"modified\": \"2024-09-22\",\n                \"keyword\": [\n                    \"carbon flux\",\n                    \"climate indicators\",\n                    \"earth science\"\n                ],\n                \"data-presentation-form\": \"Digital Science Data\",\n                \"contactPoint\": {\n                    \"@type\": \"vcard:Contact\",\n                    \"fn\": \"Brad Weir\",\n                    \"hasEmail\": \"mailto:brad.weir@nasa.gov\"\n                },\n                \"publisher\": {\n                    \"@type\": \"org:Organization\",\n                    \"name\": \"NASA/GSFC/SED/ESD/ESISL/GESDISC\"\n                },\n                \"identifier\": \"C3273640138-GES_DISC\",\n                \"description\": \"MiCASA is an extensive revision of CASA-GFED3. CASA-GFED3 derives from Potter et al. (1993), diverging in development since Randerson et al. (1996). CASA is a light use efficiency model: NPP is expressed as the product of photosynthetically active solar radiation, a light use efficiency parameter, scalars that capture temperature and moisture limitations, and fractional absorption of photosynthetically active radiation (fPAR) by the vegetation canopy derived from satellite data. Fire parameterization was incorporated into the model by van der Werf et al. (2004) leading to CASA-GFED3 after several revisions (van der Werf et al., 2006, 2010). Development of the GFED module has continued, now at GFED5 (Chen et al., 2023) with less focus on the CASA module. MiCASA diverges from GFED development at version 3, although future reconciliation is possible. Input datasets include air temperature, precipitation, incident solar radiation, a soil classification map, and several satellite derived products. 
These products are primarily based on Moderate Resolution Imaging Spectroradiometer (MODIS) Terra and Aqua combined datasets including land cover classification (MCD12Q1), burned area (MCD64A1), Nadir BRDF-Adjusted Reflectance (NBAR; MCD43A4), from which fPAR is derived, and tree/herbaceous/bare vegetated fractions from Terra only (MOD44B). Emissions due to fire and burning of coarse woody debris (fuel wood) are estimated separately.\",\n                \"release-place\": \"Greenbelt, MD, USA\",\n                \"series-name\": \"MICASA_FLUX_3H\",\n                \"creator\": \"Brad Weir\",\n                \"title\": \"MiCASA 3-hourly NPP NEE Fluxes 0.1 degree x 0.1 degree\",\n                \"graphic-preview-file\": \"https://docserver.gesdisc.eosdis.nasa.gov/public/project/CMS/micasa_v1_sample.jpg\",\n                \"programCode\": [\n                    \"026:001\"\n                ],\n                \"distribution\": [\n                    {\n                        \"mediaType\": \"text/html\",\n                        \"downloadURL\": \"https://scholar.google.com/scholar?q=10.5067%2FAS9U6AWVTY69\",\n                        \"description\": \"Search results for publications that cite this dataset by its DOI.\",\n                        \"@type\": \"dcat:Distribution\",\n                        \"title\": \"Google Scholar search results\"\n                    },\n                    {\n                        \"@type\": \"dcat:Distribution\",\n                        \"downloadURL\": \"https://docserver.gesdisc.eosdis.nasa.gov/public/project/CMS/micasa_v1_sample.jpg\",\n                        \"mediaType\": \"image/jpeg\",\n                        \"title\": \"Get a related visualization\"\n                    },\n                    {\n                        \"mediaType\": \"text/html\",\n                        \"downloadURL\": \"https://disc.gsfc.nasa.gov/datacollection/MICASA_FLUX_3H_1.html\",\n                        \"description\": \"Access the 
dataset landing page from the GES DISC website.\",\n                        \"@type\": \"dcat:Distribution\",\n                        \"title\": \"This dataset's landing page\"\n                    },\n                    {\n                        \"mediaType\": \"text/html\",\n                        \"downloadURL\": \"https://acdisc.gsfc.nasa.gov/data/CMS/MICASA_FLUX_3H.1/\",\n                        \"description\": \"Access the data via HTTPS.\",\n                        \"@type\": \"dcat:Distribution\",\n                        \"title\": \"Download this dataset through a directory map\"\n                    },\n                    {\n                        \"mediaType\": \"text/html\",\n                        \"downloadURL\": \"https://acdisc.gsfc.nasa.gov/opendap/CMS/MICASA_FLUX_3H.1/\",\n                        \"description\": \"Access the data via the OPeNDAP protocol.\",\n                        \"@type\": \"dcat:Distribution\",\n                        \"title\": \"Use OPeNDAP to access the dataset's data\"\n                    },\n                    {\n                        \"mediaType\": \"application/pdf\",\n                        \"downloadURL\": \"https://acdisc.gsfc.nasa.gov/data/CMS/MICASA_FLUX_D.1/doc/MiCASA_README.pdf\",\n                        \"description\": \"README Document\",\n                        \"@type\": \"dcat:Distribution\",\n                        \"title\": \"View this dataset's read me document\"\n                    },\n                    {\n                        \"mediaType\": \"text/html\",\n                        \"downloadURL\": \"https://search.earthdata.nasa.gov/search?q=MICASA_FLUX_3H\",\n                        \"description\": \"Use the Earthdata Search to find and retrieve data sets across multiple data centers.\",\n                        \"@type\": \"dcat:Distribution\",\n                        \"title\": \"Download this dataset through Earthdata Search\"\n                    }\n                ],\n 
               \"spatial\": \"-180.0 -90.0 179.0 90.0\",\n                \"theme\": [\n                    \"CMS\",\n                    \"geospatial\"\n                ],\n                \"language\": [\n                    \"en-US\"\n                ],\n                \"url_status_checks\": {\n                    \"distributions_downloadURLs\": {\n                        \"https://disc.gsfc.nasa.gov/datacollection/MICASA_FLUX_3H_1.html\": {\n                            \"20250206001603\": 200\n                        },\n                        \"https://docserver.gesdisc.eosdis.nasa.gov/public/project/CMS/micasa_v1_sample.jpg\": {\n                            \"20250206001603\": 200\n                        },\n                        \"https://acdisc.gsfc.nasa.gov/opendap/CMS/MICASA_FLUX_3H.1/\": {\n                            \"20250206001603\": 200\n                        },\n                        \"https://scholar.google.com/scholar?q=10.5067%2FAS9U6AWVTY69\": {\n                            \"20250206001603\": 429\n                        },\n                        \"https://acdisc.gsfc.nasa.gov/data/CMS/MICASA_FLUX_D.1/doc/MiCASA_README.pdf\": {\n                            \"20250206001603\": 200\n                        },\n                        \"https://acdisc.gsfc.nasa.gov/data/CMS/MICASA_FLUX_3H.1/\": {\n                            \"20250206001603\": 200\n                        },\n                        \"https://search.earthdata.nasa.gov/search?q=MICASA_FLUX_3H\": {\n                            \"20250206001603\": 200\n                        }\n                    },\n                    \"landingPage\": {\n                        \"https://doi.org/10.5067/AS9U6AWVTY69\": {\n                            \"20250206001603\": 200\n                        }\n                    }\n                }\n            },\n```\n\n### Assumed user steps to use these python scripts\n\nAs noted above, the lower level functions in the tools.py file 
could probably be\nreused for different analyses, but initially this is just targeting the question \n_\"What datasets don't exist in a more recent data.json metadata file and do the download\nURLs associated with them 404?\"_\n\n#### 1. Find the data.json URL\n\nThis is less easy than expected. There used to be a listing of URLs where you could find\neach agency data.json, but I couldn't locate it. It is possible that page is no longer\npublic or no longer exists. \n\nExample URLs for live agency data.json:\n- https://data.nasa.gov/data.json\n- https://www.commerce.gov/sites/default/files/data.json\n- https://data.ed.gov/data.json\n\nThe data.gov resources page suggests agencies make them available at the following\nURL structure: `https://www.agency.gov/data.json`. However, that's probably often\ndifficult, as the front page of an agency tends to be produced from a content management\nplatform that doesn't allow for a direct download URL at that location.\n\nIt is possible that similar data might be extracted from the data.gov CKAN API and put into\na DCAT-formatted JSON, but this hasn't been tried yet. Doing a web search and\nspending a few minutes looking around will likely help you find an agency's data.json\nURL in many cases.\n\n#### 2. Make changes to the config.yml\n\nIf you're adding a new agency to the YAML, copy the existing YAML structure and change these fields.\n\n1. Change `name` key's value\n2. Change `agency_live_data_json_link` key's value\n3. Change `base_agency_path` key's value\n4. Change `dates_to_pull` key's value\n\n5. Change which agency the Python scripts run for in the config.yml field `agency_name_to_run_now`\n\nThe config.yml is set up roughly following a structure such that multiple agencies\ncan be documented in the config.yml but only called one at a time. 
In the example below,\nthe data for NASA would be collected as it is the first agency listed and the value for\nthe key `agency_name_to_run_now` is `1`.\n\n```\nagencies:\n  - agency_name_to_run_now: 1\n  - NASA:\n    - {more key:values pairs here specific to the agency in question} \n  - ANOTHER_AGENCY:\n    - {more key:values pairs here specific to the agency in question} \n  - ANOTHER_AGENCY:\n    - {more key:values pairs here specific to the agency in question} \n```\n\n\n#### 3. (optional) run [`src/get_list_available_snapshots_in_wayback_machine.py`](./src/get_list_available_snapshots_in_wayback_machine.py)\n\nYou can get timestamps of when the Internet Archive's Wayback Machine has snapshots\nof a data.json in two ways. \n\n(A.) You can manually go to https://web.archive.org/ and put the URL where the\nlive agency data.json is found into the Wayback search bar.\nAn example of a direct URL to the Wayback page for NASA's data.json is [https://web.archive.org/web/20250000000000*/https://data.nasa.gov/data.json](https://web.archive.org/web/20250000000000*/https://data.nasa.gov/data.json).\nFrom there, you can see when snapshots of the JSON occurred. \n\n![alt text](./assets/images/wayback_machine.png)\n\n(B.) Alternatively, you could run the\n[`src/get_list_available_snapshots_in_wayback_machine.py`](./src/get_list_available_snapshots_in_wayback_machine.py)\nscript by running from root in a terminal `python src/get_list_available_snapshots_in_wayback_machine.py`.\n\nYou will need an active Python environment with the packages in the `requirements.txt` file for this.\nThis script will write a JSON file to the `snapshots_available_in_archive` sub-folder with the same information.\n\n#### 4. Run [`src/harvest.py`](./src/harvest.py) \n\nIn a terminal from the top of the directory run `python src/harvest.py`. It will likely take a few\nminutes for the script to complete its work. Information should be printed to the terminal as it runs.\n\n\n#### 5. 
Look at or visualize the results\n\nThese will be in the results folder `/data/{agency_name}/missing_keys/{date1}_vs_{date2}_missing_keysInSecondNoTitleMatch.json`\n\nEventually, a webpage may be built to make understanding the results easier.\n\n\n## Early analysis learnings\n\n### Missing datasets\n\n#### Data.json lags reality \n\nDepartment of Education and Department of Commerce (NOAA and others) don't have any datasets\nmissing in data.json across the Biden/Trump administration boundary. Given reports of at least\nsome NOAA datasets being taken down, at least temporarily, this may reflect data.json lagging\nreality, meaning updates to data.json likely occur some time after dataset access has\nbeen removed or datasets no longer exist. \n\n#### Some removed dataset identifiers appear to be version updates or identifier changes with no other changes but not all\n\nOf the 3 agencies in the initial analysis set, only NASA's data.json showed dataset\nidentifiers in the first snapshot not present in the second snapshot, totaling 198 dataset identifiers.\nThis compares data.json snapshots on 2024-10-09 and 2025-02-07.\n\nOf those 198, it seems at least 33 were situations where the dataset identifier changed but the\ntitle did not, suggesting these might just be identifier changes related to underlying system\nupdates or version updates that didn't change the dataset title.\n\nOf those 198, another 32 had nearly the same dataset title with only the last part of the\ntitle changed in a way consistent with version updates. For example, a long title where\nthe end has \"-v3\" instead of \"-v2\". \n\nBoth of those situations likely do not reflect dataset removal so much as dataset evolution over time.\n\n148 of 198 did not fall into those two previous situations and require additional analysis to see\nif they are true dataset removals. 
Most of those 148 have landing page and distribution URLs\nthat all or mostly give 200 status codes, suggesting that data is still available.\nOthers give 403 status codes, suggesting there is a need to sign in first, especially for\nsome earth data systems that assume a user is signed in. There is a smaller subset that\nmay indeed be gone, but current programmatic checks here cannot yet confirm this with high accuracy.\n\n##### In progress work to visualize comparison results\n\nThere is an incomplete process to build a webpage that visualizes information from the\nJSONs in a static webpage without needing to look at the admittedly big and complex\nJSONs themselves. See the `/visualizations/` directory. One issue to work around is that\nthe JSONs are large enough that this repository is using LFS (Git Large File Storage)\nto store them in the GitHub repository. This is fine, except that when you attempt to have\nJavaScript read the JSONs in a deployed website, it doesn't work, as the file is no longer a\nJSON but an LFS reference. This means we'll probably need a Python\nscript that extracts relevant high-level metrics from the comparison\nJSON and then puts that data into a file or file format that doesn't get converted\nto LFS, perhaps a YAML file?\n\n_This work is in flux._\n\n\n### Possible future analysis pathways suggested by this analysis\n \nSome of the missing datasets whose metadata doesn't appear in the more recent data.json\nhave several dataset distribution URLs listed as well as a landing page. 
For several of the\ndatasets, only some of the URLs return 404 statuses and others 200 or other statuses.\nThere might be value in analyzing those statuses and seeing if it is possible to\nprogrammatically recognize when a landing page might be down but a download\ndistribution URL is still up, as those might be targets for archiving.\n\n## Contributing \n\nThis repository is currently being used for experiments to see how useful\nanalysis of agency-level data.json might be for understanding availability of\nU.S. government federal open data. It is a very narrow analysis.\nThere are lots of other efforts in the broader space. \n\nIf you have questions or want to get in touch on\nrelated efforts/analysis, open an issue on this repository or find my contact email.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustingosses%2Fdata_dot_json_over_time","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjustingosses%2Fdata_dot_json_over_time","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustingosses%2Fdata_dot_json_over_time/lists"}