An open API service indexing awesome lists of open source software.

https://github.com/agrc/sweeper

A CLI tool for ensuring high quality data 🧹
https://github.com/agrc/sweeper

government-app scheduled-tool spatial-data-life-cycle terraform-managed

Last synced: 3 months ago
JSON representation

A CLI tool for ensuring high quality data 🧹

Awesome Lists containing this project

README

          

# ugrc-sweeper [![PyPI version](https://badge.fury.io/py/ugrc-sweeper.svg)](https://badge.fury.io/py/ugrc-sweeper)[![Push Events](https://github.com/agrc/sweeper/actions/workflows/push.yml/badge.svg)](https://github.com/agrc/sweeper/actions/workflows/push.yml)

The data cleaning service.

![sweeper_sm](https://user-images.githubusercontent.com/325813/90411835-91c4c080-e069-11ea-9d03-f3e60421b835.png)

## Available Sweepers

### Addresses

Checks that addresses have minimum required parts and optionally normalizes them.

### Duplicates

Checks for duplicate features.

### Empties

Checks for empty geometries.

### Metadata

Checks to make sure that the metadata meets [the Basic SGID Metadata Requirements](https://gis.utah.gov/about/policy/metadata/#basic-sgid-metadata).

#### Tags

Checks to make sure that existing tags are cased appropriately. This mean that the are title-cased other than known abbreviations (e.g. UGRC, BLM) and articles (e.g. a, the, of).

This check also verifies that the data set contains a tag that matches the database name (e.g. `SGID`) and the schema (e.g. `Cadastre`).

`--try-fix` adds missing required tags and title-cases any existing tags.

#### Summary

Checks to make sure that the summary is less than 2048 characters (a limitation of AGOL) and that it is shorter than the description.

#### Description

Checks to make sure that the description contains a link to a data page on gis.utah.gov.

#### Use Limitations

Checks to make sure that the text in this section matches the [official text for UGRC](src/sweeper/sweepers/UseLimitations.html).

`--try-fix` updates the text to match the official text.

## Parsing Addresses

This project contains a module that can be used as a standalone address parser, `sweeper.address_parser`. This allows developer to take advantage of sweepers advanced address parsing and normalization without having to run the entire sweeper process.

### Usage Example

```python
from sweeper.address_parser import Address

address = Address('123 South Main Street')
print(address)

'''
--> Parsed Address:
{'address_number': '123',
'normalized': '123 S MAIN ST',
'prefix_direction': 'S',
'street_name': 'MAIN',
'street_type': 'ST'}
'''
```

### Available Address class properties

All properties default to None if there is no parsed value.

`address_number`

`address_number_suffix`

`prefix_direction`

`street_name`

`street_direction`

`street_type`

`unit_type`

`unit_id`
If no `unit_type` is found, this property is prefixed with `#` (e.g. `# 3`). If `unit_type` is found, `#` is stripped from this property.

`city`

`zip_code`

`po_box`
The PO Box if a po-box-type address was entered (e.g. `po_box` would be `1` for `p.o. box 1`).

`normalized`
A normalized string representing the entire address that was passed into the constructor. PO Boxes are normalized in this format `PO BOX `.

## Installation (requires Pro 2.7+)

1. clone arcgis conda environment
- `conda create --name sweeper --clone arcgispro-py3`
1. activate environment
- `activate sweeper`
1. install sweeper
- `pip install ugrc-sweeper`
1. Optionally duplicate `config.sample.json` as `config.json` in the folder where you will run sweeper.

> [!CAUTION]
> This is required for the following functions:
>
> - `--scheduled` argument (required for sending emails)
> - `--change-detect` argument
> - using user-specific connection files via the `CONNECTIONS_FOLDER` config value

## Exclusions

Tables can be skipped by adding values to the `EXCLUSIONS.` config array. These values are matched against table names using [fnmatch](https://docs.python.org/3/library/fnmatch.html#fnmatch.fnmatch). Note that these do not apply when using the `--table-name` argument.

## Development

1. clone arcgis conda environment
- `conda create --name sweeper --clone arcgispro-py3`
1. activate environment
- `activate sweeper`
1. install required dependencies to work on sweeper
- `pip install -e ".[tests]"`
1. `test_metadata.py` uses a SQL database that needs to be restored via `src/sweeper/tests/data/Sweeper.bak` to your local SQL Server.
1. run sweeper: `sweeper`
1. test: `pytest`
1. lint: `ruff check .`
1. format: `ruff format .`