https://github.com/reidmorrison/data_cleansing

Cleanse data received via Rails, APIs, files, or inside plain ruby objects.

data_cleansing
==============

Data Cleansing framework for Ruby.

* http://github.com/reidmorrison/data_cleansing

## Introduction

It is important to keep internal data free of unwanted escape characters, leading
or trailing blanks and even newlines.
Similarly it would be useful to be able to attach a cleansing solution to a field
in a model and have the data cleansed transparently when required.

DataCleansing is a framework that allows data cleansing to be applied to
specific attributes or fields.

## Features

* Supports global cleansing definitions that can be associated with any Ruby,
Rails, Mongoid, or other model
* Supports custom cleansing definitions that can be defined in-line
* A cleansing block can access the other attributes in the model while cleansing
the current attribute
* In a cleansing block other attributes in the model can be modified at the
same time
* Cleansers are executed in the order they are defined. As a result multiple
cleansers can be run against the same field and the order is preserved
* Multiple cleansers can be specified for a list of attributes at the same time
* Inheritance is supported. The cleansers for parent classes are run before
the child's cleansers
* Cleansers can be called outside of a model instance, for cases where fields
need to be cleansed before the model is created or looked up
* To aid troubleshooting, the before and after values of cleansed attributes
are logged. The level of detail is controlled by the log level
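
To make the ordering rule concrete, here is a minimal plain Ruby sketch of a named cleaner registry whose cleaners run in the order given. It does not use the gem and is not its implementation; `CLEANERS`, `run_cleaners`, and `add_period` are hypothetical names chosen for illustration.

```ruby
# Hypothetical stand-in for a registry of global cleaners (not the gem's code)
CLEANERS = {
  strip:  ->(value) { value.strip },
  upcase: ->(value) { value.upcase }
}

# Run the named cleaners, or inline callables, in the order they are given
def run_cleaners(value, *cleaners)
  cleaners.reduce(value) do |current, cleaner|
    callable = cleaner.is_a?(Symbol) ? CLEANERS.fetch(cleaner) : cleaner
    callable.call(current)
  end
end

# An inline cleaner that appends a period only when one is missing
add_period = ->(value) { value.end_with?('.') ? value : "#{value}." }

cleaned_title = run_cleaners(" \nmr \n", :strip, :upcase, add_period)
puts cleaned_title.inspect
```

Because each cleaner receives the output of the previous one, swapping the order changes the result, which is why the definition order is preserved.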

## ActiveRecord (ActiveModel) Features

* Passes the value of the attribute before the Rails type cast so that the
original text can be cleansed before passing back to rails for type conversion.
This is important for numeric and date fields where spaces and control characters
can have undesired effects
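
The effect of stray whitespace on conversion can be seen in plain Ruby. This sketch uses `String#to_i` only as a stand-in for a type conversion step; it is not a description of how Rails casts attributes, and the `raw` value is a hypothetical form input.

```ruby
# Hypothetical form input containing stray spaces and a newline
raw = " 1 000\n"

# String#to_i skips leading whitespace, then stops at the first non-digit
uncleansed = raw.to_i

# Removing all whitespace first lets the whole number be converted
cleansed = raw.gsub(/\s+/, '').to_i

puts uncleansed # 1
puts cleansed   # 1000
```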

## Examples

### Ruby Example
```ruby
require 'data_cleansing'

# Define a global cleaner
DataCleansing.register_cleaner(:strip) {|string| string.strip}

class User
  include DataCleansing::Cleanse

  attr_accessor :first_name, :last_name

  # Strip leading and trailing whitespace from first_name and last_name
  cleanse :first_name, :last_name, :cleaner => :strip
end

u = User.new
u.first_name = ' joe '
u.last_name = "\n black\n"
puts "Before data cleansing #{u.inspect}"
# Before data cleansing #

u.cleanse_attributes!
puts "After data cleansing #{u.inspect}"
# After data cleansing #
```

### Rails Example

```ruby
# Define a global cleaner
DataCleansing.register_cleaner(:strip) {|string| string.strip}

# 'users' table has the following columns :first_name, :last_name, :address1, :address2
class User < ActiveRecord::Base
  include DataCleansing::Cleanse

  # Use a global cleaner
  cleanse :first_name, :last_name, :cleaner => :strip

  # Define a once off cleaner
  cleanse :address1, :address2, :cleaner => Proc.new {|string| string.strip}

  # Automatically cleanse data before validation
  before_validation :cleanse_attributes!
end

# Create a User instance
u = User.new(:first_name => ' joe ', :last_name => "\n black\n", :address1 => "2632 Brown St \n")
puts "Before data cleansing #{u.attributes.inspect}"
u.validate
puts "After data cleansing #{u.attributes.inspect}"
u.save!
```

### Advanced Ruby Example

```ruby
require 'data_cleansing'

# Define global cleaners
DataCleansing.register_cleaner(:strip) {|string| string.strip}
DataCleansing.register_cleaner(:upcase) {|string| string.upcase}

class User
  include DataCleansing::Cleanse

  attr_accessor :first_name, :last_name, :title, :address1, :address2, :gender

  # Use a global cleaner
  cleanse :first_name, :last_name, :cleaner => :strip

  # Define a once off cleaner
  cleanse :address1, :address2, :cleaner => Proc.new {|string| string.strip}

  # Use multiple cleaners, and a custom block
  # The block returns the title unchanged when it already ends with a period
  cleanse :title, :cleaner => [:strip, :upcase, Proc.new {|string| string.end_with?('.') ? string : "#{string}."}]

  # Change the cleansing rule based on the value of other attributes in that instance of user
  # The 'title' is retrieved from the current instance of the user
  cleanse :gender, :cleaner => [
    :strip,
    :upcase,
    Proc.new do |gender|
      if (gender == "UNKNOWN") && (title == "MR.")
        "Male"
      else
        "Female"
      end
    end
  ]
end

u = User.new
u.first_name = ' joe '
u.last_name = "\n black\n"
u.address1 = "2632 Brown St \n"
u.title = " \nmr \n"
u.gender = " Unknown "
puts "Before data cleansing #{u.inspect}"
# Before data cleansing #

u.cleanse_attributes!
puts "After data cleansing #{u.inspect}"
# After data cleansing #
```

## After Cleansing

It is sometimes useful to read or write multiple fields as part of a cleansing, or
where attributes need to be manipulated automatically once they have been cleansed.
For this purpose instance methods on the model can be registered for invocation once
all the attributes have been cleansed according to their :cleanse specifications.
Multiple methods can be registered and they are called in the order they are registered.

```ruby
after_cleanse :method_name, :another_method_name, ...
```

Example:
```ruby
# Define a global cleanser
DataCleansing.register_cleaner(:strip) {|string| string.strip}

# 'users' table has the following columns :first_name, :last_name, :address1, :address2
class User < ActiveRecord::Base
  include DataCleansing::Cleanse

  # Use a global cleaner
  cleanse :first_name, :last_name, :cleaner => :strip

  # Define a once off cleaner
  cleanse :address1, :address2, :cleaner => Proc.new {|string| string.strip}

  # Once the above cleansing is complete call the instance method
  after_cleanse :check_address

  protected

  # Method to be called once data cleansing is complete
  def check_address
    # Move address2 to address1 if address1 is blank and address2 has a value
    if address1.blank? && !address2.blank?
      self.address1 = address2
      self.address2 = nil
    end
  end
end

# Create a User instance
u = User.new(:first_name => ' joe ', :last_name => "\n black\n", :address2 => "2632 Brown St \n")
puts "Before data cleansing #{u.attributes.inspect}"
u.cleanse_attributes!
puts "After data cleansing #{u.attributes.inspect}"
u.save!
```
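
When an `after_cleanse` method writes to an attribute, keep in mind that Ruby treats a bare `name = value` inside a method as a local variable assignment; the attribute writer is only called when the receiver is explicit, as in `self.name = value`. The following plain Ruby sketch illustrates the difference. The `Address` class and its method names are hypothetical and are not part of the gem.

```ruby
# Hypothetical plain Ruby class used only to illustrate attribute writes
class Address
  attr_accessor :address1, :address2

  # Assigns to a new local variable; the attribute is left untouched
  def promote_without_self
    address1 = address2
  end

  # Calls the address1= writer on the instance
  def promote_with_self
    self.address1 = address2
  end
end

a = Address.new
a.address2 = '2632 Brown St'

a.promote_without_self
puts a.address1.inspect # nil

a.promote_with_self
puts a.address1.inspect # "2632 Brown St"
```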

## Recommendations

`cleanse` blocks are ideal for cleansing a single attribute and for applying any
global or common cleansing algorithms.

Even though multiple attributes can be read or written in a single `cleanse`
block, it is recommended to use `after_cleanse` methods for working with multiple
attributes. It is much easier to read and understand the interactions between multiple
attributes in `after_cleanse` methods.

## Rails configuration

When DataCleansing is used in a Rails environment it can be configured using the
regular Rails configuration mechanisms. For example:

```ruby
module MyApplication
  class Application < Rails::Application

    # Data Cleansing Configuration

    # Attributes whose values are to be masked out during logging
    config.data_cleansing.register_masked_attributes :bank_account_number, :social_security_number

    # Optionally override the default log level
    # Set to :trace or :debug to log all fields modified
    # Set to :info to log only those fields which were nilled out
    # Set to :warn or higher to disable logging of cleansing actions
    config.data_cleansing.logger.level = :info

    # Register any global cleaners
    config.data_cleansing.register_cleaner(:strip) {|string| string.strip}

  end
end
```

## Logging

DataCleansing uses SemanticLogger for logging due to its excellent integration
with Rails and its ability to log data in its raw form to Mongo and to files.

If running a Rails application it is recommended to install the gem
rails_semantic_logger which replaces the default Rails logger. It is however
possible to configure the semantic_logger gem to use the existing Rails logger
in a Rails initializer as follows:

```ruby
SemanticLogger.default_level = Rails.logger.level
SemanticLogger.add_appender(logger: Rails.logger)
```

By changing the log level of DataCleansing itself the type of output for data
cleansing can be controlled:

* :trace or :debug to log all fields modified
* :info to log only those fields which were nilled out
* :warn or higher to disable logging of cleansing actions

Note:

* The logging of changes made to attributes only includes attributes cleansed
with `cleanse` blocks. Attributes modified within `after_cleanse` methods
are not logged

* It is not necessary to change the global log level to affect the logging detail
level in DataCleansing. The DataCleansing log level is changed independently

To change the log level, either use the Rails configuration approach, or set it
directly:

```ruby
DataCleansing.logger.level = :info
```

## Notes

* Cleaners are called in the order in which they are defined, so subsequent cleaners
can assume that the previous cleaners have run and can therefore access or even
modify previously cleaned attributes
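
The ordering guarantee can be illustrated with a miniature `cleanse` DSL written in plain Ruby. This is a sketch under stated assumptions, not the gem's implementation: `MiniCleanse`, `rules`, and `Person` are hypothetical names, and `instance_exec` is used here as one way to let a block read other attributes of the instance.

```ruby
# Hypothetical miniature of a cleanse DSL; not the gem's implementation
module MiniCleanse
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def rules
      @rules ||= []
    end

    # Record each rule in definition order
    def cleanse(*attributes, cleaner:)
      rules << [attributes, cleaner]
    end
  end

  # Apply the rules in the order they were defined. instance_exec runs the
  # block against this object so it can read other attributes.
  def cleanse_attributes!
    self.class.rules.each do |attributes, cleaner|
      attributes.each do |name|
        value = public_send(name)
        next if value.nil?
        public_send("#{name}=", instance_exec(value, &cleaner))
      end
    end
  end
end

class Person
  include MiniCleanse
  attr_accessor :title, :greeting

  cleanse :title, cleaner: Proc.new { |value| value.strip.upcase }
  # Defined second, so it sees the already cleansed title
  cleanse :greeting, cleaner: Proc.new { |value| "#{value.strip} #{title}" }
end

person = Person.new
person.title = " mr. \n"
person.greeting = " Hello "
person.cleanse_attributes!
puts person.greeting # Hello MR.
```

If the two `cleanse` lines were swapped, the greeting would be built from the raw title instead.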

## Installation

### Add to an existing Rails project

Add the following line to the Gemfile

```ruby
gem 'data_cleansing'
```

Install the gem with Bundler

    bundle install

## Dependencies

DataCleansing requires the following dependencies

* Ruby V1.9.3, V2 and greater
* Rails V3.2 (Active Model) or greater for Rails integration (only if Rails is being used)
* Mongoid or MongoMapper supporting Active Model V3.2 or greater (only if Mongoid or MongoMapper is being used)

## Meta

* Code: `git clone git://github.com/reidmorrison/data_cleansing.git`
* Home:
* Issues:
* Gems:

This project uses [Semantic Versioning](http://semver.org/).

## Authors

Reid Morrison :: @reidmorrison

## License

Copyright 2013, 2014, 2015, 2016 Reid Morrison

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.