https://github.com/rectangletangle/atypical
Find the junk data hidden amongst the good data (Python 3.4)
- Host: GitHub
- URL: https://github.com/rectangletangle/atypical
- Owner: rectangletangle
- License: bsd-2-clause
- Created: 2014-10-15T08:23:54.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2014-10-24T21:28:29.000Z (about 10 years ago)
- Last Synced: 2024-05-27T12:08:50.136Z (6 months ago)
- Language: Python
- Homepage:
- Size: 285 KB
- Stars: 7
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
atypical
========

Find the junk data hidden amongst the good data (Python 3.4)

Automatically identifying and removing low-quality data is important whenever
you deal with large quantities of organically generated information. Many fields
can be held to a reasonable level of quality with a simple regex, e.g., URLs,
email addresses, and phone numbers. However, ensuring quality in data that
doesn't have a strict format or syntax can be much trickier. This library uses
a combination of the Markov property and character proportions to infer which
data points are the most out of place.

## Usage:
This example prints the strings ordered by how typical they are relative to the
other strings. `'ax'` is the least typical, while `'ab'` is the most typical.

```python
>>> from atypical import atypical
>>> scores = atypical(['abb', 'ax', 'ab', 'ab', 'abc'])
>>> list(scores.rounded())
[(-1.457, 'ax'), (-0.439, 'abc'), (0.146, 'abb'), (0.823, 'ab')]
>>> list(scores.objects())
['ax', 'abc', 'abb', 'ab']
>>> list(scores.standardized().rounded()) # z-scores
[(-1.268, 'ax'), (-0.215, 'abc'), (0.391, 'abb'), (1.092, 'ab')]
```

## Dependencies:
* Python 3.4
* [iterlib](https://github.com/rectangletangle/iterlib)
* requests *(only for the scripts)*
* Beautiful Soup *(only for the scripts)*

## Installation:
```bash
$ python3 setup.py install
```
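
## Illustrative sketch:
The Markov-property idea mentioned in the introduction can be illustrated with
a small standalone sketch. This is not the library's implementation and does not
use its API: the function name `transition_scores`, the boundary sentinels, and
the add-one smoothing are all assumptions made for illustration, and the
character-proportion component is left out entirely. It scores each distinct
string by the mean log-probability of its character-to-character transitions,
estimated from the whole collection, so rarer transitions pull a string toward
the "atypical" end.

```python
import math
from collections import Counter


def transition_scores(strings):
    """Return (score, string) pairs sorted from least to most typical.

    Hypothetical helper, not part of atypical: the score is the mean
    log-probability of a string's character transitions under a first-order
    character model fitted to all of the input strings.
    """
    START, END = "\x02", "\x03"  # assumed sentinels marking string boundaries
    pair_counts = Counter()      # counts of (previous char, current char)
    prev_counts = Counter()      # counts of each previous char
    alphabet = {END}             # every symbol that can follow another

    for s in strings:
        chars = [START] + list(s) + [END]
        alphabet.update(s)
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[prev, cur] += 1
            prev_counts[prev] += 1

    vocab = len(alphabet)

    def score(s):
        chars = [START] + list(s) + [END]
        # Add-one smoothing keeps unseen transitions from having zero probability.
        log_probs = [
            math.log((pair_counts[prev, cur] + 1) / (prev_counts[prev] + vocab))
            for prev, cur in zip(chars, chars[1:])
        ]
        return sum(log_probs) / len(log_probs)

    # Duplicates are collapsed so that each distinct string is scored once.
    return sorted((score(s), s) for s in set(strings))


if __name__ == "__main__":
    for value, string in transition_scores(['abb', 'ax', 'ab', 'ab', 'abc']):
        print(round(value, 3), string)
```

The scores produced by this sketch are on a different scale from the library's
output and should not be compared with it numerically; only the general idea of
ranking strings by how expected their transitions are carries over.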