https://github.com/agrc/sweeper
A CLI tool for ensuring high quality data 🧹
- Host: GitHub
- URL: https://github.com/agrc/sweeper
- Owner: agrc
- License: mit
- Created: 2019-07-25T22:00:21.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2025-01-27T20:27:01.000Z (9 months ago)
- Last Synced: 2025-07-01T01:49:03.146Z (3 months ago)
- Topics: government-app, scheduled-tool, spatial-data-life-cycle, terraform-managed
- Language: Python
- Homepage:
- Size: 2.2 MB
- Stars: 4
- Watchers: 7
- Forks: 3
- Open Issues: 16
Metadata Files:
- Readme: readme.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# ugrc-sweeper
The data cleaning service.

## Available Sweepers
### Addresses
Checks that addresses have minimum required parts and optionally normalizes them.
### Duplicates
Checks for duplicate features.
### Empties
Checks for empty geometries.
### Metadata
Checks to make sure that the metadata meets [the Basic SGID Metadata Requirements](https://gis.utah.gov/about/policy/metadata/#basic-sgid-metadata).
#### Tags
Checks to make sure that existing tags are cased appropriately. This means that they are title-cased, except for known abbreviations (e.g. UGRC, BLM) and articles (e.g. a, the, of).
This check also verifies that the data set contains a tag that matches the database name (e.g. `SGID`) and the schema (e.g. `Cadastre`).
`--try-fix` adds missing required tags and title-cases any existing tags.
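The casing and required-tag rules described above can be sketched roughly as follows. This is a minimal illustration, not sweeper's actual implementation: the word lists, the function names, and the choice to lower-case articles wherever they appear are all assumptions.

```python
# Sample word lists taken from the examples above; the lists sweeper really uses may differ.
ABBREVIATIONS = {"UGRC", "BLM"}
ARTICLES = {"a", "the", "of"}


def title_case_tag(tag):
    """Title-case a tag, keeping abbreviations upper-cased and articles lower-cased."""
    words = []
    for word in tag.split():
        if word.upper() in ABBREVIATIONS:
            words.append(word.upper())
        elif word.lower() in ARTICLES:
            words.append(word.lower())
        else:
            words.append(word.capitalize())
    return " ".join(words)


def missing_required_tags(tags, database, schema):
    """Return the required tags (database and schema names) that are not present."""
    return [required for required in (database, schema) if required not in tags]


print(title_case_tag("bureau of land management"))
print(title_case_tag("ugrc boundaries"))
print(missing_required_tags(["SGID", "Roads"], "SGID", "Cadastre"))
```

A fix step along the lines of `--try-fix` would then append the missing tags and replace each existing tag with its title-cased form.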
#### Summary
Checks to make sure that the summary is less than 2048 characters (a limitation of AGOL) and that it is shorter than the description.
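As a rough sketch of the two summary conditions above (the function name and message wording are hypothetical, and this is not sweeper's own code):

```python
AGOL_SUMMARY_LIMIT = 2048  # character limit mentioned above


def summary_issues(summary, description):
    """Return a list of problems found with the summary text."""
    issues = []
    # The summary must be strictly shorter than the AGOL limit.
    if len(summary) >= AGOL_SUMMARY_LIMIT:
        issues.append(f"summary is {len(summary)} characters; it must be less than {AGOL_SUMMARY_LIMIT}")
    # The summary must also be strictly shorter than the description.
    if len(summary) >= len(description):
        issues.append("summary is not shorter than the description")
    return issues


print(summary_issues("Road centerlines.", "Statewide road centerlines maintained by UGRC."))
```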
#### Description
Checks to make sure that the description contains a link to a data page on gis.utah.gov.
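One way such a link check could work is to pull URLs out of the description and compare their host names. The sketch below is an assumption about the approach, not sweeper's implementation; the regular expression, the function name, and the sample URL path are hypothetical, and it only verifies the host, not that the link points to a data page.

```python
import re
from urllib.parse import urlparse

# Loose URL matcher: stops at whitespace, quotes, angle brackets, and closing parentheses.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def has_gis_utah_link(description):
    """Return True if the description contains a link whose host is gis.utah.gov."""
    for url in URL_PATTERN.findall(description):
        host = urlparse(url).hostname or ""
        if host == "gis.utah.gov" or host.endswith(".gis.utah.gov"):
            return True
    return False


# Hypothetical description text and URL path, for illustration only.
sample = 'More information is on <a href="https://gis.utah.gov/example-data-page/">the data page</a>.'
print(has_gis_utah_link(sample))
```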
#### Use Limitations
Checks to make sure that the text in this section matches the [official text for UGRC](src/sweeper/sweepers/UseLimitations.html).
`--try-fix` updates the text to match the official text.
## Parsing Addresses
This project contains a module, `sweeper.address_parser`, that can be used as a standalone address parser. This allows developers to take advantage of sweeper's advanced address parsing and normalization without having to run the entire sweeper process.
### Usage Example
```python
from sweeper.address_parser import Address

address = Address('123 South Main Street')
print(address)

'''
--> Parsed Address:
{'address_number': '123',
 'normalized': '123 S MAIN ST',
 'prefix_direction': 'S',
 'street_name': 'MAIN',
 'street_type': 'ST'}
'''
```
### Available Address class properties
All properties default to None if there is no parsed value.
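Because unparsed properties are `None`, a caller can collect only the parts that were actually parsed. The sketch below uses the property names listed in this section; the `parsed_parts` helper is hypothetical, and a `SimpleNamespace` stand-in replaces a real `Address` instance so that the snippet runs without sweeper installed.

```python
from types import SimpleNamespace

# Property names as documented in this section.
ADDRESS_PROPERTIES = (
    "address_number", "address_number_suffix", "prefix_direction", "street_name",
    "street_direction", "street_type", "unit_type", "unit_id", "city", "zip_code",
    "po_box", "normalized",
)


def parsed_parts(address):
    """Return a dict of only the properties that hold a parsed (non-None) value."""
    parts = {}
    for name in ADDRESS_PROPERTIES:
        value = getattr(address, name, None)
        if value is not None:
            parts[name] = value
    return parts


# Stand-in for Address('123 South Main Street'); values mirror the usage example.
stand_in = SimpleNamespace(
    address_number="123", prefix_direction="S", street_name="MAIN",
    street_type="ST", normalized="123 S MAIN ST", unit_type=None, city=None,
)
print(parsed_parts(stand_in))
```

With sweeper installed, an `Address` instance could be passed in place of the stand-in, assuming its properties are readable as plain attributes.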
`address_number`
`address_number_suffix`
`prefix_direction`
`street_name`
`street_direction`
`street_type`
`unit_type`
`unit_id`
If no `unit_type` is found, this property is prefixed with `#` (e.g. `# 3`). If `unit_type` is found, `#` is stripped from this property.
`city`
`zip_code`
`po_box`
The PO Box if a po-box-type address was entered (e.g. `po_box` would be `1` for `p.o. box 1`).
`normalized`
A normalized string representing the entire address that was passed into the constructor. PO Boxes are normalized in this format `PO BOX `.
## Installation (requires Pro 2.7+)
1. clone arcgis conda environment
   - `conda create --name sweeper --clone arcgispro-py3`
1. activate environment
   - `activate sweeper`
1. install sweeper
   - `pip install ugrc-sweeper`
1. Optionally duplicate `config.sample.json` as `config.json` in the folder where you will run sweeper.

> [!CAUTION]
> This is required for the following functions:
>
> - `--scheduled` argument (required for sending emails)
> - `--change-detect` argument
> - using user-specific connection files via the `CONNECTIONS_FOLDER` config value

## Exclusions
Tables can be skipped by adding values to the `EXCLUSIONS.` config array. These values are matched against table names using [fnmatch](https://docs.python.org/3/library/fnmatch.html#fnmatch.fnmatch). Note that these do not apply when using the `--table-name` argument.
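To illustrate how `fnmatch` patterns select table names, here is a small sketch. The patterns, table names, and `is_excluded` helper are hypothetical; only `fnmatch.fnmatch` itself comes from the standard library.

```python
from fnmatch import fnmatch

# Hypothetical exclusion patterns and table names, for illustration only.
exclusions = ["SGID.Cadastre.*", "*.Address*"]
tables = [
    "SGID.Cadastre.Parcels",
    "SGID.Location.AddressPoints",
    "SGID.Transportation.Roads",
]


def is_excluded(table_name, patterns):
    """Return True if the table name matches any exclusion pattern."""
    return any(fnmatch(table_name, pattern) for pattern in patterns)


remaining = [table for table in tables if not is_excluded(table, exclusions)]
print(remaining)
```

Note that `fnmatch.fnmatch` normalizes case according to the operating system, so matching is case-insensitive on Windows and case-sensitive on most other platforms.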
## Development
1. clone arcgis conda environment
   - `conda create --name sweeper --clone arcgispro-py3`
1. activate environment
   - `activate sweeper`
1. install required dependencies to work on sweeper
   - `pip install -e ".[tests]"`
1. `test_metadata.py` uses a SQL database that needs to be restored via `src/sweeper/tests/data/Sweeper.bak` to your local SQL Server.
1. run sweeper: `sweeper`
1. test: `pytest`
1. lint: `ruff check .`
1. format: `ruff format .`