https://github.com/barthoekstra/brc-data-preprocessor
The data preprocessor checks the raw Batumi Raptor Count data coming straight from the Trektellen database. It flags records containing possibly erroneous or suspicious information, but does not delete any data. It is up to coordinators and data technicians to decide what to do with the flagged records.
- Host: GitHub
- URL: https://github.com/barthoekstra/brc-data-preprocessor
- Owner: barthoekstra
- License: gpl-3.0
- Created: 2019-05-15T19:15:26.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-08-03T06:51:13.000Z (over 1 year ago)
- Last Synced: 2024-06-11T16:36:49.327Z (5 months ago)
- Language: Python
- Homepage: https://www.batumiraptorcount.org
- Size: 65.4 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# brc-data-preprocessor
The data preprocessor checks the raw [Batumi Raptor Count](https://www.batumiraptorcount.org) data coming straight from the [Trektellen](https://www.trektellen.org) database. It flags records containing possibly erroneous or suspicious information, but *does not delete any data*. It is up to coordinators and data technicians to decide what to do with the flagged records.

Author: Bart Hoekstra | Mail: [[email protected]](mailto:[email protected])
## General workflow
The preprocessor runs on [AWS Lambda](https://aws.amazon.com/lambda/) and regularly checks the [Trektellen](https://www.trektellen.org) site for newly uploaded [BRC counts](https://www.batumiraptorcount.org/migration-count-data). Once both stations have uploaded data for the day, the fetcher downloads the data and stores a raw version in Dropbox (e.g. in `2019/data/raw`). The preprocessor then checks a copy of the raw data for possible errors and flags them by adding a description of the potential problem to a `check` column in the file stored in `2019/data/inprogress`. It is then up to coordinators to use their experience and knowledge of the migration on a given day to judge the validity of the flags and act accordingly. Once they have dealt with these issues and emptied the `check` column, the file can be moved to `2019/data/clean`. A copy of the checked file is stored in `2019/data/inprogress-backup`, so data technicians can see what changes have been made to the data.
## Flagged records
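To illustrate the flag-only approach described in the workflow, here is a minimal sketch in Python. It is not the repository's actual implementation: the function names (`add_flag`, `flag_records`, `is_ready_for_clean`), the flag messages and the sample values are all hypothetical, and it assumes that records are dictionaries of strings and that the `location` column holds the distance code. Only the column names come from this README.

```python
# Hypothetical sketch: flag records in a `check` field without deleting anything.
CRITICAL_COLUMNS = ("datetime", "telpost", "speciesname", "count", "location")


def add_flag(record, message):
    """Append a description to the record's `check` field, keeping earlier flags."""
    existing = (record.get("check") or "").strip()
    record["check"] = f"{existing}; {message}" if existing else message


def flag_records(records):
    """Apply two simple example rules in place and return the same list."""
    for record in records:
        # Rule 1: critical information is missing or empty.
        missing = [
            column
            for column in CRITICAL_COLUMNS
            if record.get(column) is None or str(record[column]).strip() == ""
        ]
        if missing:
            add_flag(record, "missing critical information: " + ", ".join(missing))
        # Rule 2: bird recorded in >E3 (assumes `location` holds the distance code).
        if record.get("location") == ">E3":
            add_flag(record, "bird in >E3")
    return records


def is_ready_for_clean(records):
    """A file could move on only once every `check` field has been emptied."""
    return all((record.get("check") or "").strip() == "" for record in records)


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"datetime": "2019-09-01 10:05:00", "telpost": "1", "speciesname": "HB",
         "count": "12", "location": "E1", "check": ""},
        {"datetime": "2019-09-01 10:07:00", "telpost": "2", "speciesname": "",
         "count": "3", "location": ">E3", "check": ""},
    ]
    for record in flag_records(sample):
        print(record["check"] or "(no flags)")
    print("ready for clean:", is_ready_for_clean(sample))
```

Appending to `check` rather than overwriting it lets several rules flag the same record, and the input list keeps its length because no record is ever dropped.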
The following records will be flagged by the preprocessor:
- Records with invalid doublecount entries (e.g. not within 10 minutes or with the wrong distance code).
- Records containing >1 bird that is injured and/or killed (rare occurrence).
- Records lacking critical information in `datetime`, `telpost`, `speciesname`, `count` or `location` columns (very unlikely, but the possible result of a bug).
- Records of birds in >E3 (rare occurrence).
- Records with a registered morph for any species other than Booted Eagle (and Eleonora's Falcon).
- Records of `HB_NONJUV`, `HB_JUV`, `BK_NONJUV` and `BK_JUV` if the number of aged birds is higher than the number of counted birds (`HB` and `BK`) within a 10-minute window around the age record.
- Records of Honey Buzzards that should probably be single-counted (at Station 2 during the HB focus period).
- Records of aged Honey Buzzards and Black Kites outside of expected distance codes (i.e. outside of W1-O-E1).
- Records containing unexpected combinations of sex and/or age information.
- Records with no timestamps, which are set to 00:00:00 during processing.
- Records containing non-protocol species.
- Records with age details in `W3`, `E3` and `>E3`, excluding non-juvenile harriers with a sex, juvenile `MonPalHen` and juvenile/non-juvenile eagles.
- Records of female Pallid Harriers with `I` or `A` age (legal per protocol, though very difficult to age in the field).
## Todo
- [x] Implement automatic download of the data, flagging of suspicious records and storing of the data in Dropbox using AWS Lambda.
- [x] Automatically add `START` and `END` records to fetched data based on count start and end times.
## Future additions
- [ ] Implement checks for possibly erroneous records based on some statistical rules, e.g. the expected (daily) phenology of a species.
## Build Lambda deployment Docker image (requires Docker and AWS CLI)
1. Clone this repository.
2. `cd` into this directory.
3. Build the [Docker](https://docs.docker.com/install/) image to generate a deployment image for the function.
```
docker build --platform linux/amd64 -t brc-data-preprocessor-docker:v1 .
```
4. Tag the Docker image. Replace XXXXXX with your account ID.
```
docker tag brc-data-preprocessor-docker:v1 XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
```
5. Push the Docker image to Amazon Elastic Container Registry (ECR). Replace XXXXXX with your account ID.
```
docker push XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
```
6. Update the Lambda function to use the new image. Replace XXXXXX with your account ID.
```
aws lambda update-function-code --function-name brc-data-preprocessor-docker \
--image-uri XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
```
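Steps 3 to 6 can also be scripted. The following is a minimal sketch, not part of this repository: the helper names (`image_uri`, `deployment_commands`, `deploy`) are hypothetical, and it simply assembles the same `docker` and `aws` commands shown above. It assumes Docker and the AWS CLI are installed and that Docker is already authenticated against your registry.

```python
# Hypothetical helper that chains the build, tag, push and update steps.
import subprocess

NAME = "brc-data-preprocessor-docker"  # image, repository and function name used above
REGION = "eu-central-1"


def image_uri(account_id, region=REGION, repository=NAME, tag="latest"):
    """Build the registry image URI in the form used in steps 4 to 6."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def deployment_commands(account_id, local_tag="v1"):
    """Return the four commands from steps 3 to 6 as argument lists."""
    local_image = f"{NAME}:{local_tag}"
    remote_image = image_uri(account_id)
    return [
        ["docker", "build", "--platform", "linux/amd64", "-t", local_image, "."],
        ["docker", "tag", local_image, remote_image],
        ["docker", "push", remote_image],
        ["aws", "lambda", "update-function-code",
         "--function-name", NAME, "--image-uri", remote_image],
    ]


def deploy(account_id, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for command in deployment_commands(account_id):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Replace XXXXXX with your account ID; dry_run=True only prints the commands.
    deploy("XXXXXX", dry_run=True)
```

Keeping `dry_run=True` as the default makes it possible to review the exact commands before anything is built or pushed.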