{"id":18055339,"url":"https://github.com/agentmorris/usgs-geese","last_synced_at":"2025-04-10T23:13:54.349Z","repository":{"id":174766572,"uuid":"652752051","full_name":"agentmorris/usgs-geese","owner":"agentmorris","description":"Code for training and evaluating a detector for the USGS Izembek goose survey dataset","archived":false,"fork":false,"pushed_at":"2025-03-05T01:01:00.000Z","size":1483,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-10T23:13:49.089Z","etag":null,"topics":["birds","computer-vision","conservation","machine-learning","object-detection","wildlife"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/agentmorris.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-12T18:12:43.000Z","updated_at":"2025-03-05T01:01:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"7768f31e-1906-4461-84ec-b3e7820aa5d8","html_url":"https://github.com/agentmorris/usgs-geese","commit_stats":{"total_commits":51,"total_committers":2,"mean_commits":25.5,"dds":0.4509803921568627,"last_synced_commit":"e804c2147e738127d1ae5c9965fb026e2e580cd3"},"previous_names":["agentmorris/usgs-geese"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agentmorris%2Fusgs-geese","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agentmorris%2Fusgs-geese/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/agentmorris%2Fusgs-geese/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agentmorris%2Fusgs-geese/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/agentmorris","download_url":"https://codeload.github.com/agentmorris/usgs-geese/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248312135,"owners_count":21082638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["birds","computer-vision","conservation","machine-learning","object-detection","wildlife"],"created_at":"2024-10-31T00:14:38.168Z","updated_at":"2025-04-10T23:13:54.315Z","avatar_url":"https://github.com/agentmorris.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Izembek Brant Goose Detector\n\n### Overview\n\nThe code in this repo trains, runs, and evaluates models to detect geese in aerial images, based on the \u003ca href=\"https://www.usgs.gov/data/aerial-photo-imagery-fall-waterfowl-surveys-izembek-lagoon-alaska-2017-2019\"\u003eIzembek Lagoon dataset\u003c/a\u003e (complete citation [below](#data-source)).  That dataset is also [publicly available on LILA](https://lila.science/datasets/izembek-lagoon-waterfowl/).\n\nImages were originally annotated with points, labeled as brant, Canada, gull, emperor, and other.  The goal is accuracy on brant, which is by far the most common class (there are around 400k \"brant\" points, and less than 100k of everything else combined).  
\n\nThere are around 100,000 images total, about 95% of which contain no geese.  Images are 8688 x 5792.  A typical ground truth image looks like this:\n\n\u003cimg src=\"sample_image.jpg\" width=\"800px;\"\u003e\u003cbr/\u003e\n\nThe annotations you can vaguely see as different colors correspond to different species of goose.  Most of this repo operates on 1280x1280 patches that look like this:\n\n\u003cimg src=\"annotated_patch.png\" width=\"800px;\"\u003e\u003cbr/\u003e\n\n### Sample results\n\nHere's a random patch of predictions, but you should \u003ci\u003enever\u003c/i\u003e put any stock into a \"random\" image of results that someone shows you on the Internet:\n\n\u003cimg src=\"sample_results_patch.jpg\" width=\"800px;\"\u003e\u003cbr/\u003e\n\nMaybe all the results really look like that, maybe they don't.  I pinky-swear that the image from which this patch was cropped was not used in training, and that in general the results really do look like this, but... never trust random results on the Internet.\n\nIf you want to dig a little deeper, here is a set of patch-level previews for validation data (patches selected from images excluded from training, but from flights that were included in training) or test data (patches selected from flights that were excluded from training):\n\n\u003chttps://lila.science/public/usgs-izembek-results/\u003e\n\nNB: \u003cb\u003ethose results are from the slightly-buggy 1.0.0 model\u003c/b\u003e; we didn't bother to generate the result previews again when we released an updated 1.1.0 model.  
In as much as there's a list, re-generating those results is on the list.\n\n### Files\n\nThese are listed in roughly the order in which you would use them.\n\n#### usgs-geese-data-import.py\n\n* Match images to annotation files\n* Read the original annotations (in the format exported by [CountThings](https://countthings.com/))\n* Convert to COCO format\n* Do a bunch of miscellaneous consistency checking\n\n#### usgs-geese-training-data-prep.py\n\n* For all the images with at least one annotation, slice into mostly-non-overlapping patches\n* Optionally sample hard negatives (I did not end up actually using any hard negatives)\n* Split into train/val\n* Export to YOLO annotation format\n\n#### usgs-geese-training.py\n\n* Train the model (training happens at the YOLOv5 CLI, but this script documents the commands)\n* Run the YOLOv5 validation scripts\n* Convert YOLOv5 val results to MD .json format\n* Example code to use the MD visualization pipeline to visualize results\n* Example code to use the MD inference pipeline to run the trained model\n\n#### usgs-geese-inference.py\n\n* Run inference on a folder of images, which means, for each image:\n\n    * Split the image into overlapping patches\n    * Run inference on each patch\n    * Resolve redundant detections\n    * Convert YOLOv5 output to .json (in MegaDetector format)\n\n#### usgs-geese-postprocessing.py\n\n* Generate patch-level previews from image-level model results\n* Generate estimated image-level bird counts from image-level model results (and write to .csv)\n\n#### run_izembek_model.py\n\nThis is the main command-line entry point for inference; this is basically a command-line driver for usgs-geese-inference.py.\n\n## Running the model\n\nThis section describes the environment setup and command line process for running inference.  Training is not yet set up to be fully run from the command line, though it's close, and the environment should be the same.  
Much of the code and environment is borrowed from [MegaDetector](https://github.com/agentmorris/MegaDetector), a model that does similar stuff for camera trap images.\n\n### Environment setup\n\n#### 1. Install prerequisites: Mambaforge, Git, and NVIDIA stuff\n\nInstall prerequisites according to the [MegaDetector instructions for prerequisite setup](https://github.com/agentmorris/MegaDetector/blob/main/megadetector.md#1-install-prerequisites-mambaforge-git-and-nvidia-stuff).  If you already have Mambaforge, git, and the latest NVIDIA driver installed, nothing to see here.\n\n#### 2. Download the model file\n\nDownload the [Izembek bird detector](https://github.com/agentmorris/usgs-geese/releases/download/v1.1.0/usgs-geese-yolov5x-230820-b8-img1280-e200-best.pt) to your computer.  It can be anywhere that's convenient, you'll specify the full path to the model file later.\n\n#### 3. Clone git repos and set up your Python environment\n\nYou will need the contents of this repo and the [YOLOv5 repo](https://github.com/ultralytics/yolov5), and you will also need to set up a Python environment with all the Python packages that our code depends on.  
In this section, we provide \u003ca href=\"#windows-instructions-for-gitpython-stuff\"\u003eWindows\u003c/a\u003e, \u003ca href=\"#linux-instructions-for-gitpython-stuff\"\u003eLinux\u003c/a\u003e, and \u003ca href=\"#mac-instructions-for-gitpython-stuff\"\u003eMac\u003c/a\u003e instructions for doing all of this stuff.\n\n##### Windows instructions for git/Python stuff\n\nThe first time you set all of this up, open your Mambaforge prompt, and run:\n\n```batch\nmkdir c:\\git\ncd c:\\git\ngit clone https://github.com/ultralytics/yolov5\ngit clone https://github.com/agentmorris/usgs-geese\ncd c:\\git\\usgs-geese\nmamba create -n usgs-geese-inference python=3.11 pip -y\nmamba activate usgs-geese-inference\npip install -r requirements.txt\n```\n\n\u003ca name=\"windows-new-shell\"\u003e\u003c/a\u003e\nYour environment is set up now!  In the future, when you open your Mambaforge prompt, you only need to run:\n\n```batch\ncd c:\\git\\usgs-geese\nmamba activate usgs-geese-inference\n```\n\n##### Linux/Mac instructions for git/Python stuff\n\nIf you have installed Mambaforge on Linux or macOS, you are probably always at a Mambaforge prompt; i.e., you should see \"(base)\" at your command prompt.  Assuming you see that, the first time you set all of this up, run:\n\n```batch\nmkdir ~/git\ncd ~/git\ngit clone https://github.com/ultralytics/yolov5\ngit clone https://github.com/agentmorris/usgs-geese\ncd ~/git/usgs-geese\nmamba create -n usgs-geese-inference python=3.11 pip -y\nmamba activate usgs-geese-inference\npip install -r requirements.txt\n```\n\n\u003ca name=\"linux-new-shell\"\u003e\u003c/a\u003e\nYour environment is set up now!  In the future, whenever you start a new shell, you just need to do:\n\n```batch\ncd ~/git/usgs-geese\nmamba activate usgs-geese-inference\n```\n\n### Actually running the model\n\nYou can run the model with [run_izembek_model.py](run_izembek_model.py).  
First, when you open a new Mambaforge prompt, don't forget to do this (on Windows):\n\n```batch\ncd c:\\git\\usgs-geese\nmamba activate usgs-geese-inference\n```\n\n...or this (on Linux/Mac):\n\n```batch\ncd ~/git/usgs-geese\nmamba activate usgs-geese-inference\n```\n\nThen you can run the script like this (using Windows syntax), substituting real paths for all the arguments:\n\n```batch\npython run_izembek_model.py [MODEL_PATH] [IMAGE_FOLDER] [YOLO_FOLDER] [SCRATCH_FOLDER] --recursive --no_use_symlinks\n```\n\n* MODEL_PATH is the full path to the .pt you downloaded earlier, e.g. \"c:\\models\\usgs-geese-yolov5x-230820-b8-img1280-e200-best.pt\"\n* IMAGE_FOLDER is the root folder of all the images you want to process (recursively, if you specify \"--recursive\")\n* YOLO_FOLDER is the folder where you checked out the YOLOv5 repo, e.g. \"c:\\git\\yolov5\"\n* SCRATCH_FOLDER is a folder you have permission to write to, on a drive that has at least twice as much free space as the size of the image folder\n\nThe \"--no_use_symlinks\" argument tells the script not to attempt symbolic link creation.  We use symbolic links at one step to minimize temporary disk space use, but this requires admin privileges on Windows, so if you're running on Windows and don't have admin privileges, use the \"--no_use_symlinks\" option.\n\nYou can see a full list of options by running:\n\n`python run_izembek_model.py --help`\n\nIf you have an Nvidia GPU, and it's being utilized correctly, near the beginning of the output, you should see:\n\n`GPU available: True`\n\nIf you have an Nvidia GPU and you see \"GPU available: False\", your GPU environment may not be set up correctly.  95% of the time, this is fixed by \u003ca href=\"https://www.nvidia.com/en-us/geforce/drivers/\"\u003eupdating your Nvidia driver\u003c/a\u003e and rebooting.  
If you have an Nvidia GPU, and you've installed the latest driver, and you've rebooted, and you're still seeing \"GPU available: False\", \u003ca href=\"mailto:agentmorris+izembek@gmail.com\"\u003eemail me\u003c/a\u003e.\n\n### Where do the results go?\n\nIf everything worked correctly, in your scratch folder, there will be a folder called \"image_level_results\".  Within that, look for a file that looks like:\n\n`something_something_something_md_results_image_level_nms.json`\n\nThe first bit (something something something) corresponds to the folder name you just processed.  The idea is that you will use the same scratch folder every time, so this part gives the results files a unique name.  This .json file contains the locations of all detections, in the [MegaDetector results format](https://github.com/agentmorris/MegaDetector/tree/main/megadetector/api/batch_processing#megadetector-batch-output-format).\n\n### Previewing the results\n\nTo generate preview pages like the \u003ca href=\"https://lila.science/public/usgs-izembek-results/\"\u003esamples linked to above\u003c/a\u003e, use:\n\n```batch\npython izembek-model-postprocessing.py [RESULTS_FILE] --image_folder [IMAGE_FOLDER] --preview_folder [PREVIEW_FOLDER] --n_patches 100 --confidence_thresholds 0.5 0.6 --open_preview_pages\n```\n\nThe values \"100\" and \"0.5 0.6\" are just examples.\n\n* RESULTS_FILE is the full path to the .json results file produced during inference\n* IMAGE_FOLDER is the root folder on which you ran the model\n* PREVIEW_FOLDER is the folder where you want to write the preview pages\n* --n_patches specifies the number of 1280x1280 patches to sample for the preview.  100 is a good number just to make sure everything is working, but assuming you have a very high fraction of empty patches, 3000 is a good minimum number to really get the gestalt of the results.\n* --confidence_thresholds is a (space-separated) list of confidence thresholds to generate preview pages for.  
The string \"0.5 0.6\" is just an example.\n* --open_preview_pages will cause all the preview pages to open in your browser when the script is done\n\n### Generating counts\n\nTo generate a .csv file with per-species counts for each image, use:\n\n```batch\npython izembek-model-postprocessing.py [RESULTS_FILE] --count_file [COUNT_FILE] --confidence_thresholds 0.5 0.6\n```\n\n* RESULTS_FILE is the full path to the .json results file produced during inference\n* COUNT_FILE is the .csv file to which you want to write the resulting counts \n* --confidence_thresholds is a (space-separated) list of confidence thresholds to generate counts for.  The string \"0.5 0.6\" is just an example.\n\n### Random errors and how to fix them\n\n#### Getting the latest version of this repo\n\nIf something isn't working as expected, make sure you have the latest version of this repo, by running:\n\n```batch\ncd c:\\git\\usgs-geese\ngit fetch\ngit pull\n```\n\n#### SSL errors when running the model for the first time\n\nIf you get a bunch of errors that look like this:\n\n`WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)'))': /simple/gitpython/`\n\n...try this:\n\n```bash\npip install --trusted-host pypi.org --trusted-host files.pythonhosted.org \"gitpython\u003e=3.1.30\"\npip install --trusted-host pypi.org --trusted-host files.pythonhosted.org \"setuptools\u003e=65.5.1\"\n```\n\n...then try running the model again.\n\n## Data source\n\nAll images are sampled from:\n\nWeiser EL, Flint PL, Marks DK, Shults BS, Wilson HM, Thompson SJ, Fischer JB, 2022, Aerial photo imagery from fall waterfowl surveys, Izembek Lagoon, Alaska, 2017-2019: U.S. 
Geological Survey data release, \u003ca href=\"https://doi.org/10.5066/P9UHP1LE\"\u003ehttps://doi.org/10.5066/P9UHP1LE\u003c/a\u003e.\n\n## Open issues\n\n### Training\n\n* We tried YOLOv8x in place of YOLOv5x6.  In both cases, we used \"all the pixels\", i.e. we used a window size that matched the input resolution of the model (640px and 1280px, respectively).  YOLOv8x was slightly worse than YOLOv5x6.  It would be informative to try YOLOv9-e and YOLOv10x.\n\n### Inference\n\n* Add checkpointing; currently you lose the whole result set if your job crashes.  A reasonable alternative to checkpointing is just automatically dividing up the job into lots of smaller jobs.\n\n* Clean up the extensive scratch space use, especially when running without admin privileges on Windows, where we create patches, then copy all of those patches because we can't create symlinks\n\n* Patch generation should have an overwrite=False option, to avoid re-generating patches we already have.  When running long inference jobs, patch generation is maybe 5% of the overall time, but it's annoying to have to re-do this if the job crashes (e.g., if your neighborhood's power randomly goes out 95% of the way into a big inference job) (sigh).\n\n### Postprocessing\n\n* Allow confidence thresholds to vary by class (for both counting and preview generation, but especially for preview generation)\n\n* Parallelize patch generation in usgs-geese-postprocessing.py\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagentmorris%2Fusgs-geese","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagentmorris%2Fusgs-geese","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagentmorris%2Fusgs-geese/lists"}