An open API service indexing awesome lists of open source software.

https://github.com/jcp/datafilter

Quickly find flags (words, phrases, etc) within your data. :male_detective:
https://github.com/jcp/datafilter

csv data-clean data-cleansing hate-speech-detection parser python swear-filter text textfile

Last synced: 2 months ago
JSON representation

Quickly find flags (words, phrases, etc) within your data. :male_detective:

Awesome Lists containing this project

README

          

# Data Filter

[![pypi](https://img.shields.io/pypi/v/datafilter.svg?color=brightgreen)](https://pypi.org/project/datafilter/)
[![pypi](https://img.shields.io/pypi/pyversions/datafilter.svg)](https://pypi.org/project/datafilter/)
[![codecov](https://codecov.io/gh/jcp/datafilter/branch/master/graph/badge.svg)](https://codecov.io/gh/jcp/datafilter/)
[![Build Status](https://travis-ci.org/jcp/datafilter.svg?branch=master)](https://travis-ci.org/jcp/datafilter/)

Quickly find tokens (words, phrases, etc) within your data.

Data Filter is a lightweight [data cleansing](https://en.wikipedia.org/wiki/Data_cleansing) framework that can be easily extended to support different data types, structures or processing requirements. It natively supports the following data types:

* CSV files
* Text files
* Text strings

# Table of Contents

* [Requirements](#requirements)
* [Installation](#installation)
* [Basic Usage](#basic-usage)
* [Features](#features)
* [Base](#base)
* [Filters](#filters)
* [CSV](#csv)
* [Text](#text)
* [TextFile](#textfile)

# Requirements

* Python 3.6+

# Installation

To install, simply use [pipenv](http://pipenv.org/) (or pip):

```bash
>>> pipenv install datafilter
```

# Basic Usage

## CSV

```python
from datafilter import CSV

tokens = ["Lorem", "ipsum", "Volutpat est", "mi sit amet"]
data = CSV("test.csv", tokens=tokens)
data.save("filtered.csv")
```

In this example, we open a CSV file, search all rows for normalized tokens and flag them. The `save` method creates a new CSV file with all rows that weren't flagged.

## Text

```python
from datafilter import Text

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
data = Text(text, tokens=["Lorem"])
print(next(data.results()))
```

In this example, we search a text string for normalized tokens. We can then iterator over the results using the `.results()` method, which returns a generator that yields [formatted results](#parse).

## Text File

```python
from datafilter import TextFile

data = TextFile("test.txt", tokens=["Lorem", "ipsum"], re_split=r"(?<=\.)")
print(next(data.results()))
```

In this example, we open a text file and split the data based on a regular expression defined by `re_split`.

# Features

Data Filter was designed to be highly extensible. Common or useful filters can be easily reused and shared. A few example use cases include:

* Filters that can handle different data types such as Microsoft Word, Google Docs, etc.
* Filters that can handle incoming data from external APIs.

## Base

Abstract base class that's subclassed by every filter.

`Base` includes several methods to ensure data is properly normalized, formatted and returned. The `.results()` method is an `@abstractmethod` to enforce its use in subclasses.

### Parameters

#### `tokens`

`type `

A list of strings that will be searched for within a set of data.

#### `translations`

`type `

A list of strings that will be removed during normalization.

**Default**

```python
['0123456789', '(){}[]<>!?.:;,`\'"@#$%^&*+-|=~–—/\\_', '\t\n\r\x0c\x0b']
```

#### `bidirectional`

`type `

When `True`, token matching will be bidirectional.

**Default**

```python
True
```

> **Note:**
>
> A common obfuscation method is to reverse the offending string or phrase. This helps detect those instances.

#### `caseinsensitive`

`type `

When `True`, tokens and data are converted to lowercase during normalization.

**Default**

```python
True
```

### Methods

#### `.results()`

Abstract method used to return results within a filter. This is defined by a `Base` subclass

#### `.maketrans()`

Returns a translation table used during normalization.

**Returns**

`type `

#### `.normalize(data)`

Returns normalized data. Normalization includes converting data to [lowercase](#caseinsensitive) and [removing strings](#translations).

Accepts parameter `data`.

**Returns**

`type `

> **Note:**
>
> Normalized data is returned as a tuple. The first element is the original data. The second element is the normalized data.
>

#### `.parse(data)`

Returns parsed and formatted data.

Accepts parameter `data`.

**Returns**

`type `

**Example**

Assume we're searching for the token "Lorem" in a very short text string.

```python
data = Text("Lorem ipsum dolor sit amet", tokens=["Lorem"])
print(next(data.results()))
```

The returned result would be formatted as:

```python
{
"data": "Lorem ipsum dolor sit amet",
"flagged": True,
"describe": {
"tokens": {
"detected": ["Lorem"],
"count": 1,
"frequency": {
"Lorem": 1,
},
}
},
}
```

> **Note:**
>
> `.parse()` should never be called directly. Use `.results()` instead.

## Filters

Filters subclass and extend the `Base` class to support various data types and structure. This extensibility allows for the creation of powerful custom filters specifically tailored to a given task, data type or structure.

## CSV

### Parameters

`CSV` is a subclass of `Base` and inherits all parameters.

#### `path`

`type `

Path to a CSV file.

### Methods

`CSV` is a subclass of `Base` and inherits all methods.

#### `.save(path)`

Saves results to a file.

Accepts parameter `path`. `path` is the absolute path and filename of the new file.

## Text

### Parameters

`Text` is a subclass of `Base` and inherits all parameters.

#### `text`

`type `

A text string.

#### `re_split`

`type `

A regular expression pattern or string that will be applied to `text` with `re.split` before normalization.

### Methods

`Text` is a subclass of `Base` and inherits all methods.

#### `.save(path, endofline=" ")`

Saves results to a file.

Accepts parameter `path` and `endofline`. `path` is the absolute path and filename of the new file. `endofline` is a line delimiter that will be added to the end of every row.

## TextFile

### Parameters

`TextFile` is a subclass of `Base` and inherits all parameters.

#### `path`

`type `

Path to a text file.

#### `re_split`

`type `

A regular expression pattern or string that will be applied to `text` with `re.split` before normalization.

### Methods

`TextFile` is a subclass of `Base` and inherits all methods.

#### `.save(path, endofline=" ")`

Saves results to a file.

Accepts parameter `path` and `endofline`. `path` is the absolute path and filename of the new file. `endofline` is a line delimiter that will be added to the end of every row.