{"id":15945560,"url":"https://github.com/aktech/nuforc_sightings_data","last_synced_at":"2025-04-03T22:14:17.980Z","repository":{"id":81552409,"uuid":"521315253","full_name":"aktech/nuforc_sightings_data","owner":"aktech","description":"Data collection and processing for the National UFO Reporting Center (NUFORC) database.","archived":false,"fork":false,"pushed_at":"2022-08-06T00:09:30.000Z","size":57,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-09T10:11:37.751Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aktech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-04T15:16:19.000Z","updated_at":"2022-08-06T00:22:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"3086fe9c-d4f8-4789-a4d9-27b0445ef5c1","html_url":"https://github.com/aktech/nuforc_sightings_data","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aktech%2Fnuforc_sightings_data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aktech%2Fnuforc_sightings_data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aktech%2Fnuforc_sightings_data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aktech%2Fnuforc_sightings_data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/aktech","download_url":"https://codeload.github.com/aktech/nuforc_sightings_data/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247086024,"owners_count":20881160,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-07T09:03:49.208Z","updated_at":"2025-04-03T22:14:17.958Z","avatar_url":"https://github.com/aktech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NUFORC Sighting Reports\n\nThis repository is forked from [timothyrenner/nuforc_sightings_data/](https://github.com/timothyrenner/nuforc_sightings_data/)\n\nThe National UFO Reporting Center ([NUFORC](http://www.nuforc.org/)) maintains an online database of over 100,000 UFO sightings including city, shape, and a text description.\nThis project contains the code necessary to collect the data in the database, perform some standardization and cleaning, and geocode the sightings at the city/state level.\n\n## Quickstart\n\n**NOTE** Requires the Anaconda python distribution.\n\n**NOTE** Requires the [MaxMind GeoLite2](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data?lang=en) database, which is free but requires an account.\nDownload the zip file to `data/external`.\n\nTo get started, cd into the project root and run\n\n```shell\nconda env create -f environment.yaml\nconda activate nuforc\npip install -r requirements.txt\n\n# Takes a very long time - about 3-4 hours.\ndvc repro\n```\n\nThis downloads two datasets: the raw reports as line delimited JSON and a CSV file that contains 
refined fields (i.e. standardized / corrected states, cities, shapes, etc) as well as additional fields for the latitude and longitude of the sighting at the city level for most of the sightings.\n\n## Raw Reports\n\nThe raw reports are pulled into `data/raw/nuforc_reports.json` as line delimited JSON.\nThe reports take a long time to download because there are a lot of them and because the scraper is throttled so as not to hit the NUFORC server very hard.\nAlso, the target doesn't simply pull the data, it also merges it with any past data.\nThat way if any reports are removed from the NUFORC site (it's happened in the past), they will persist as long as they were pulled at some point.\nEach record has the following schema:\n\n```javascript\n{\n    // Full text of the report.\n    \"text\": string,\n\n    // Summary stats as a string.\n    \"stats\": string, \n    \n    // The date-time as it appears in the report (M/DD/YY HH:MM).\n    \"date_time\": string, \n\n    // URL to the original report.\n    \"report_link\": string, \n\n    // City name.\n    \"city\": string,\n\n    // State as 2 character code.\n    \"state\": string,\n\n    // The shape of the object.\n    \"shape\": string,\n\n    // The duration of the sighting in no particular format.\n    \"duration\": string,\n\n    // A summary of the sighting. 
Seems to be just the first few sentences of \n    // the full report.\n    \"summary\": string,\n\n    // The date the sighting was posted to the NUFORC site as M/DD/YY.\n    \"posted\": string\n}\n```\n\nIn this file all fields are as they appear on the site.\nThe scraper does no parsing except on the HTML elements to extract the content text.\n\n## Enhanced Reports\n\nThe enhanced reports in CSV format are stored in `data/processed/nuforc_reports.csv`.\nThe data standardizations are as follows:\n\n* Parse and standardize date-time values to ISO 8601 where possible (null when not possible).\n* Standardize shape capitalization plus a few minor merges (circular -\u003e circle, etc).\n* Standardize the state codes to upper case and fix obvious defects and misprints.\n* Standardize the cities (e.g. Ft -\u003e Fort, St -\u003e Saint, etc) and remove irrelevant characters like parentheses.\n\nAll of the data cleaning code can be found in `scripts/process_report_data.py`.\nThe code is straightforward and pretty well commented.\n\nIn addition to the standardization, the reports are also geocoded where possible.\nThe geocoder, which uses the [MaxMind](https://dev.maxmind.com/geoip/geoip2/geolite2/) GeoLite2 database as a lookup mechanism for lat/lon, found a match for most of the reports (~90k out of ~110k).\nSince there's no country information, pretty much all of the geocoded reports are in the US and Canada.\nMost of the \"leftovers\" are either outside the US/Canada or pretty much impossible to geocode accurately (e.g. \"rural, CA\" or \"Unknown location (military video)\").\n\nHere's the schema for the file:\n\n| column name      | description                                                    |\n| ---------------- | -------------------------------------------------------------- |\n| `summary`        | The summary of the report (usually first couple of sentences). |\n| `city`           | The city where the sighting occurred.                          
|\n| `state`          | The 2 character state code where the sighting occurred.        |\n| `date_time`      | The date / time of the sighting in ISO 8601.                   |\n| `shape`          | The shape of the object.                                       |\n| `duration`       | The duration of the sighting in no particular format.          |\n| `stats`          | Key stats about the report.                                    |\n| `report_link`    | Link to the original report on the NUFORC site.                |\n| `text`           | The full text of the report.                                   |\n| `posted`         | The date the sighting was posted in ISO 8601.                  |\n| `city_latitude`  | The latitude of the city of the sighting.                      |\n| `city_longitude` | The longitude of the city of the sighting.                     |\n## Other Notes\n\nThis product uses GeoLite2 data created by MaxMind, available from\n\u003ca href=\"http://www.maxmind.com\"\u003ehttp://www.maxmind.com\u003c/a\u003e.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faktech%2Fnuforc_sightings_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faktech%2Fnuforc_sightings_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faktech%2Fnuforc_sightings_data/lists"}