https://github.com/jcp/datafilter

Quickly find flags (words, phrases, etc) within your data. :male_detective:
csv data-clean data-cleansing hate-speech-detection parser python swear-filter text textfile

Host: GitHub
URL: https://github.com/jcp/datafilter
Owner: jcp
License: bsd-3-clause
Created: 2019-08-11T16:52:38.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2022-12-26T20:59:40.000Z (over 3 years ago)
Last Synced: 2026-01-01T15:15:20.270Z (7 months ago)
Topics: csv, data-clean, data-cleansing, hate-speech-detection, parser, python, swear-filter, text, textfile
Language: Python
Homepage:
Size: 88.9 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# Data Filter

[![pypi](https://img.shields.io/pypi/v/datafilter.svg?color=brightgreen)](https://pypi.org/project/datafilter/)

[![pypi](https://img.shields.io/pypi/pyversions/datafilter.svg)](https://pypi.org/project/datafilter/)

[![codecov](https://codecov.io/gh/jcp/datafilter/branch/master/graph/badge.svg)](https://codecov.io/gh/jcp/datafilter/)

[![Build Status](https://travis-ci.org/jcp/datafilter.svg?branch=master)](https://travis-ci.org/jcp/datafilter/)

Quickly find tokens (words, phrases, etc) within your data.

Data Filter is a lightweight [data cleansing](https://en.wikipedia.org/wiki/Data_cleansing) framework that can be easily extended to support different data types, structures or processing requirements. It natively supports the following data types:

* CSV files

* Text files

* Text strings

# Table of Contents

* [Requirements](#requirements)

* [Installation](#installation)

* [Basic Usage](#basic-usage)

* [Features](#features)

    * [Base](#base)

    * [Filters](#filters)

        * [CSV](#csv)

        * [Text](#text)

        * [TextFile](#textfile)

# Requirements

* Python 3.6+

# Installation

To install, simply use [pipenv](http://pipenv.org/) (or pip):

```bash
pipenv install datafilter
```

# Basic Usage

## CSV

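```python
from datafilter import CSV

# Tokens (words or phrases) to search for in every row.
tokens = ["Lorem", "ipsum", "Volutpat est", "mi sit amet"]

# Open the CSV file and flag the rows that contain any token.
data = CSV("test.csv", tokens=tokens)

# Write the rows that weren't flagged to a new CSV file.
data.save("filtered.csv")
```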

In this example, we open a CSV file, search every row for normalized tokens and flag the rows that match. The `save` method creates a new CSV file containing all rows that weren't flagged.
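
The flag-then-save flow described above can be approximated with the standard library alone. This is a minimal sketch, not Data Filter's implementation: `write_unflagged_rows` is a hypothetical helper, matching is a plain case-insensitive substring check, and in-memory streams stand in for real files.

```python
import csv
import io


def write_unflagged_rows(source, target, tokens):
    """Copy rows from source to target, skipping rows that contain a token."""
    lowered = [token.lower() for token in tokens]
    writer = csv.writer(target)
    kept = 0
    for row in csv.reader(source):
        # Join the cells so a token can be found anywhere in the row.
        line = " ".join(row).lower()
        if not any(token in line for token in lowered):
            writer.writerow(row)
            kept += 1
    return kept


# In-memory stand-ins for "test.csv" and "filtered.csv".
source = io.StringIO("id,text\n1,Lorem ipsum\n2,dolor sit amet\n")
target = io.StringIO()
kept = write_unflagged_rows(source, target, ["Lorem"])
print(kept, target.getvalue().splitlines())
```

Only the header row and the second data row reach `target`, because the first data row contains the token.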

## Text

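```python
from datafilter import Text

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"

# Search the string for the token "Lorem".
data = Text(text, tokens=["Lorem"])

# .results() returns a generator; next() fetches the first result.
print(next(data.results()))
```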

In this example, we search a text string for normalized tokens. We can then iterate over the results using the `.results()` method, which returns a generator that yields [formatted results](#parse).

## Text File

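```python
from datafilter import TextFile

# Split the file's contents with re_split before searching for tokens.
data = TextFile("test.txt", tokens=["Lorem", "ipsum"], re_split=r"(?<=\.)")

# .results() returns a generator; next() fetches the first result.
print(next(data.results()))
```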

In this example, we open a text file and split the data based on a regular expression defined by `re_split`. 
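
To see what the pattern `(?<=\.)` does on its own, the sketch below applies it with `re.split` from the standard library. It assumes Python 3.7 or later, where `re.split` can split on empty matches. Stripping whitespace and dropping empty chunks is a choice made for this illustration, not necessarily what `TextFile` does internally.

```python
import re

text = "Lorem ipsum. Dolor sit amet. Consectetur."

# The lookbehind matches the empty position right after each period,
# so the string is cut into sentence-like chunks that keep their period.
raw_chunks = re.split(r"(?<=\.)", text)

# Illustration-only cleanup: trim whitespace and drop empty chunks.
chunks = [chunk.strip() for chunk in raw_chunks if chunk.strip()]
print(chunks)
```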

# Features

Data Filter was designed to be highly extensible. Common or useful filters can be easily reused and shared. A few example use cases include:

* Filters that can handle different data types such as Microsoft Word, Google Docs, etc.

* Filters that can handle incoming data from external APIs.

## Base

Abstract base class that's subclassed by every filter.

`Base` includes several methods to ensure data is properly normalized, formatted and returned. The `.results()` method is an `@abstractmethod` to enforce its use in subclasses.

### Parameters

#### `tokens`

`type <list>`

A list of strings that will be searched for within a set of data.

#### `translations`

`type <list>`

A list of strings that will be removed during normalization.

**Default**

```python

['0123456789', '(){}[]<>!?.:;,`\'"@#$%^&*+-|=~–—/\\_', '\t\n\r\x0c\x0b']

```

#### `bidirectional`

`type <bool>`

When `True`, tokens are matched both as written and in reverse.

**Default**

```python

True

```

> **Note:**
>
> A common obfuscation method is to reverse the offending string or phrase. Bidirectional matching helps detect those instances.
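
As a rough illustration of the idea, the sketch below checks for a token and, optionally, its reversed spelling. `contains_token` is a hypothetical helper written for this example. It is not part of Data Filter and may differ from how the library matches tokens.

```python
def contains_token(data: str, token: str, bidirectional: bool = True) -> bool:
    """Return True if token occurs in data, optionally also reversed."""
    candidates = [token]
    if bidirectional:
        # Also look for the token spelled backwards, e.g. "lorem" -> "merol".
        candidates.append(token[::-1])
    return any(candidate in data for candidate in candidates)


print(contains_token("merol ipsum", "lorem"))                       # True
print(contains_token("merol ipsum", "lorem", bidirectional=False))  # False
```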

#### `caseinsensitive`

`type <bool>`

When `True`, tokens and data are converted to lowercase during normalization.

**Default**

```python

True

```

### Methods

#### `.results()`

Abstract method used to return results within a filter. Each `Base` subclass defines its own implementation.

#### `.maketrans()`

Returns a translation table used during normalization.

**Returns**

`type <dict>`

#### `.normalize(data)`

Returns normalized data. Normalization includes converting data to [lowercase](#caseinsensitive) and [removing strings](#translations).

Accepts parameter `data`.

**Returns**

`type <tuple>`

> **Note:**
>
> Normalized data is returned as a tuple. The first element is the original data. The second element is the normalized data.
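
The normalization steps can be sketched with Python's built-in `str.maketrans` and `str.translate`. This is an approximation under stated assumptions: `normalize_sketch` is a hypothetical function, the character groups are copied from the documented `translations` default (the two dash characters are written as Unicode escapes), and the library's own `.maketrans()` and `.normalize()` may work differently.

```python
from typing import List, Tuple

# Character groups copied from the documented `translations` default.
TRANSLATIONS: List[str] = [
    "0123456789",
    "(){}[]<>!?.:;,`'\"@#$%^&*+-|=~\u2013\u2014/\\_",
    "\t\n\r\x0c\x0b",
]


def normalize_sketch(data: str, caseinsensitive: bool = True) -> Tuple[str, str]:
    """Return (original, normalized), mirroring the tuple described above."""
    # The third argument lists characters that translate() should delete.
    table = str.maketrans("", "", "".join(TRANSLATIONS))
    normalized = data.translate(table)
    if caseinsensitive:
        normalized = normalized.lower()
    return data, normalized


original, normalized = normalize_sketch("L0rem, IPSUM!")
print(original)    # L0rem, IPSUM!
print(normalized)  # lrem ipsum
```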

#### `.parse(data)`

Returns parsed and formatted data.

Accepts parameter `data`.

**Returns**

`type <dict>`

**Example**

Assume we're searching for the token "Lorem" in a very short text string.

```python
from datafilter import Text

data = Text("Lorem ipsum dolor sit amet", tokens=["Lorem"])
print(next(data.results()))
```

The returned result would be formatted as:

```python
{
    "data": "Lorem ipsum dolor sit amet",
    "flagged": True,
    "describe": {
        "tokens": {
            "detected": ["Lorem"],
            "count": 1,
            "frequency": {
                "Lorem": 1,
            },
        }
    },
}
```

> **Note:**
>
> `.parse()` should never be called directly. Use `.results()` instead.
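
Because every result has the same shape, downstream code can rely on the `data` and `flagged` keys. The sketch below separates flagged items from clean ones. `split_by_flag` is a hypothetical helper, and the sample records are hand-written in the format shown above. The shape of the unflagged record is an assumption, since the README only shows a flagged one.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_flag(results: Iterable[Dict]) -> Tuple[List[str], List[str]]:
    """Split result dicts into (flagged data, clean data)."""
    flagged: List[str] = []
    clean: List[str] = []
    for result in results:
        bucket = flagged if result["flagged"] else clean
        bucket.append(result["data"])
    return flagged, clean


# Hand-written records; the second one's "describe" content is assumed.
sample = [
    {
        "data": "Lorem ipsum dolor sit amet",
        "flagged": True,
        "describe": {"tokens": {"detected": ["Lorem"], "count": 1, "frequency": {"Lorem": 1}}},
    },
    {
        "data": "consectetur adipiscing elit",
        "flagged": False,
        "describe": {"tokens": {"detected": [], "count": 0, "frequency": {}}},
    },
]

flagged, clean = split_by_flag(sample)
print(flagged)
print(clean)
```

With a real filter, the same helper could be fed the generator returned by `.results()`.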

## Filters

Filters subclass and extend the `Base` class to support various data types and structures. This extensibility allows for the creation of powerful custom filters specifically tailored to a given task, data type or structure.
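
The subclassing pattern can be illustrated without the library. In the sketch below, `SketchBase` and `ListFilter` are hypothetical stand-ins written for this example. They are not Data Filter's `Base` or any of its filters, and the real constructor arguments and helper methods may differ. The sketch only shows how an abstract `results()` method forces each filter to define how its data is walked.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class SketchBase(ABC):
    """Stand-in for an abstract filter base (not datafilter's Base)."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = [token.lower() for token in tokens]

    def is_flagged(self, item: str) -> bool:
        # Simplified matching: case-insensitive substring check.
        lowered = item.lower()
        return any(token in lowered for token in self.tokens)

    @abstractmethod
    def results(self) -> Iterator[Dict]:
        """Each subclass decides how to iterate over its own data."""


class ListFilter(SketchBase):
    """Hypothetical filter over an in-memory list of strings."""

    def __init__(self, items: List[str], tokens: List[str]) -> None:
        super().__init__(tokens)
        self.items = items

    def results(self) -> Iterator[Dict]:
        for item in self.items:
            yield {"data": item, "flagged": self.is_flagged(item)}


for result in ListFilter(["Lorem ipsum", "dolor sit"], tokens=["lorem"]).results():
    print(result)
```

Instantiating `SketchBase` directly raises `TypeError`, which is how the abstract method enforces that subclasses provide `results()`.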

## CSV

### Parameters

`CSV` is a subclass of `Base` and inherits all parameters.

#### `path`

`type <str>`

Path to a CSV file.

### Methods

`CSV` is a subclass of `Base` and inherits all methods.

#### `.save(path)`

Saves results to a file.

Accepts parameter `path`. `path` is the absolute path and filename of the new file.

## Text

### Parameters

`Text` is a subclass of `Base` and inherits all parameters.

#### `text`

`type <str>`

A text string.

#### `re_split`

`type <str>`

A regular expression pattern or string that will be applied to `text` with `re.split` before normalization.

### Methods

`Text` is a subclass of `Base` and inherits all methods.

#### `.save(path, endofline=" ")`

Saves results to a file.

Accepts parameters `path` and `endofline`. `path` is the absolute path and filename of the new file. `endofline` is a line delimiter that will be added to the end of every row.

## TextFile

### Parameters

`TextFile` is a subclass of `Base` and inherits all parameters.

#### `path`

`type <str>`

Path to a text file.

#### `re_split`

`type <str>`

A regular expression pattern or string that will be applied to the file's contents with `re.split` before normalization.

### Methods

`TextFile` is a subclass of `Base` and inherits all methods.

#### `.save(path, endofline=" ")`

Saves results to a file.

Accepts parameters `path` and `endofline`. `path` is the absolute path and filename of the new file. `endofline` is a line delimiter that will be added to the end of every row.
