https://github.com/stackbuilders/corasick-park
Server for velociraptor-speed substitutions using the Aho-Corasick algorithm.
- Host: GitHub
- URL: https://github.com/stackbuilders/corasick-park
- Owner: stackbuilders
- License: mit
- Created: 2014-09-15T03:20:47.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2014-09-29T23:01:24.000Z (over 11 years ago)
- Last Synced: 2025-03-16T17:46:25.949Z (10 months ago)
- Language: Haskell
- Homepage:
- Size: 383 KB
- Stars: 1
- Watchers: 60
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Corasick Park
Corasick Park is a server for quickly applying lots of transformations
to strings. Using a simple JSON interface, you first describe the
string replacements that should be made. You can then hit a second
JSON endpoint to quickly apply those transformations to any input
string.
Corasick Park is able to apply hundreds of thousands of transformations
in less than a second by using
[the Aho-Corasick algorithm](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm)
for finding out which transformations apply to the string you
provide. Currently the server uses the
[Aho-Corasick implementation in Haskell](http://hackage.haskell.org/package/AhoCorasick)
for the heavy lifting of finding the correct patterns to apply.
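
To give a feel for the matching step, here is a compact Python sketch of the Aho-Corasick idea: build a trie of all patterns, add failure links, then scan the input once and report every pattern occurrence. It is only an illustration and not the server's code, which relies on the Haskell package linked above. The function names `build_automaton` and `find_matches` are invented for this example, and the sketch assumes non-empty patterns.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links, and per-state output lists."""
    goto, fail, out = [{}], [0], [[]]
    for index, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(index)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_matches(text, patterns, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for index in out[state]:
            matches.append((i - len(patterns[index]) + 1, patterns[index]))
    return matches

patterns = ["he", "she", "his", "hers"]
automaton = build_automaton(patterns)
print(find_matches("ushers", patterns, automaton))
```

The input is scanned once regardless of how many patterns are loaded, which is why a large set of targets can be checked quickly.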
Since it is very quick even to precompile hundreds of thousands of
patterns, the server doesn't persist anything to disk, and uses
Haskell's MVars for safe concurrency.
Corasick Park allows you to precompile patterns in groups by a
'bucket' name so that you can easily choose which set of
transformations to apply for each string.
## Downloading and Running
In order to install this project, you should first install GHC
(Haskell compiler) and cabal (Haskell dependency management). Once you
have these dependencies installed, clone this project using git, `cd`
to the folder where this project lives, and execute the following:
```bash
cabal install --only-dependencies --enable-tests
cabal test
```
If the test command runs without error, you have correctly installed
the application.
In order to start the server, execute `cabal run`. The server will
start on port 8000, and you can test it by sending JSON requests
similar to the following.
## Interface
`POST` the transformations that you wish to apply to `/operations`:
```json
{
  "name": "testbucket",
  "operations": [
    {
      "target": {
        "text": "foo",
        "isCaseSensitive": false,
        "leftBoundaryType": "none",
        "rightBoundaryType": "none",
        "isGlobal": true
      },
      "transform": { "type": "replace", "replacement": "bar" }
    }
  ]
}
```
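
As a rough sketch, a client could build and submit such a payload from Python using only the standard library. The helper names `make_operation`, `make_bucket`, and `post_json` are made up for this example. The base URL assumes a server started locally on the default port, and sending a `Content-Type` header is an assumption rather than a documented requirement. The field names are taken from the JSON above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server on the default port

def make_operation(text, replacement, case_sensitive=False,
                   left="none", right="none", is_global=True):
    """Build one `replace` operation using the field names from the README."""
    return {
        "target": {
            "text": text,
            "isCaseSensitive": case_sensitive,
            "leftBoundaryType": left,
            "rightBoundaryType": right,
            "isGlobal": is_global,
        },
        "transform": {"type": "replace", "replacement": replacement},
    }

def make_bucket(name, operations):
    """Wrap a list of operations in a named bucket."""
    return {"name": name, "operations": operations}

def post_json(path, payload):
    """POST a JSON body and return (status, raw response bytes).

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status, response.read()

bucket = make_bucket("testbucket", [make_operation("foo", "bar")])
print(json.dumps(bucket, indent=2))
# With a server running: post_json("/operations", bucket)
```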
Then, apply the transformations you specified by submitting a `POST`
request to the `/transform` endpoint:
```json
{ "name": "testbucket", "input": "foo bar baz" }
```
Result:
```json
{
"result" : "bar bar baz"
}
```
Since Corasick Park doesn't store anything to disk, a bucket may no
longer exist when a client asks for it. Clients should first try
posting to the `/transform` endpoint for a particular bucket. If the
server returns a 404 (bucket not found), the client should post all of
the bucket's transformations again and then retry the `/transform`
request.
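
The retry flow can be sketched in Python as follows. The `post` argument is a hypothetical caller-supplied function that sends JSON to a path and returns `(status, parsed_json)`. Treating status 200 as success is an assumption, and `make_fake_server` is an in-memory stand-in used only to exercise the logic without a running server.

```python
def transform_with_retry(post, bucket, operations, text):
    """Try /transform; on a 404, re-post the operations and try once more."""
    request = {"name": bucket, "input": text}
    status, body = post("/transform", request)
    if status == 404:
        # Bucket not found: upload its operations again, then retry.
        post("/operations", {"name": bucket, "operations": operations})
        status, body = post("/transform", request)
    if status != 200:  # assumption: success is reported as 200
        raise RuntimeError(f"transform failed with status {status}")
    return body["result"]

def make_fake_server():
    """In-memory stand-in that only understands plain `replace` operations."""
    buckets = {}

    def post(path, payload):
        if path == "/operations":
            buckets[payload["name"]] = payload["operations"]
            return 200, {}
        if payload["name"] not in buckets:
            return 404, {}
        result = payload["input"]
        for op in buckets[payload["name"]]:
            result = result.replace(op["target"]["text"],
                                    op["transform"]["replacement"])
        return 200, {"result": result}

    return post

operations = [{
    "target": {"text": "foo", "isCaseSensitive": False,
               "leftBoundaryType": "none", "rightBoundaryType": "none",
               "isGlobal": True},
    "transform": {"type": "replace", "replacement": "bar"},
}]
print(transform_with_retry(make_fake_server(), "testbucket", operations, "foo bar baz"))
```

Passing the transport in as a function keeps the retry logic independent of any particular HTTP library.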
## Supported Target Patterns
Corasick Park doesn't have full support for regular
expressions. However, you can customize the patterns that are matched
for replacements by specifying case sensitivity (`isCaseSensitive`
option), global or single replacement (`isGlobal` option), and the
types of boundaries on each side of the pattern you specified. The
currently supported boundaries, which can be supplied to the
`leftBoundaryType` and `rightBoundaryType` options, are:
* `none` - there is no requirement for the type of boundary around the
string
* `word` - the match must be surrounded by word boundaries (similar to
`\b` in a Ruby regular expression)
* `line` - there must be a newline at the specified side of the match
* `input` - the input boundary must be on the specified side of the
match. To specify that the match needs to be exact, just put an
input boundary on both sides.
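
One way to read these boundary rules is sketched below in Python. This is an interpretation of the descriptions above, not the server's implementation. The names `boundary_ok` and `match_allowed` are invented for the example, `word` is modeled on regex `\b` semantics, and `line` is taken literally as "a newline character is adjacent on that side" (whether the start or end of the input also counts as a line boundary is not specified here).

```python
def is_word_char(ch):
    """Word characters in the regex sense; the empty string is not one."""
    return ch.isalnum() or ch == "_"

def boundary_ok(text, start, end, side, kind):
    """Check one boundary type on one side of the match text[start:end]."""
    pos = start if side == "left" else end
    before = text[pos - 1] if pos > 0 else ""
    after = text[pos] if pos < len(text) else ""
    outside = before if side == "left" else after  # character just outside the match
    if kind == "none":
        return True
    if kind == "word":
        return is_word_char(before) != is_word_char(after)
    if kind == "line":
        return outside == "\n"
    if kind == "input":
        return outside == ""  # nothing outside: the match touches the input edge
    raise ValueError(f"unknown boundary type: {kind}")

def match_allowed(text, start, end, left, right):
    """Combine leftBoundaryType and rightBoundaryType for one candidate match."""
    return (boundary_ok(text, start, end, "left", left)
            and boundary_ok(text, start, end, "right", right))

text = "foo food\nfoo"
print(match_allowed(text, 0, 3, "word", "word"))    # standalone "foo"
print(match_allowed(text, 4, 7, "word", "word"))    # "foo" inside "food"
print(match_allowed(text, 9, 12, "line", "input"))  # "foo" after the newline
```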
## Supported String Transformations
Corasick Park can apply a variety of transformations to input strings
after efficiently finding the applicable transformations using the
Aho-Corasick state machine. The following transformations are
currently supported:
* `replace` - Replaces matches of the target with the given replacement text
* `upcase` - Upper-cases matches of the target in input text
* `downcase` - Lower-cases matches of the target in input text
* `titleize` - Capitalizes the first letter in each word of the input
string
* `truncate trailing` - Removes the text following each match of the string
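
The effect of most of these transformations on a single match can be sketched in Python as follows. The function name `apply_transform` is invented for this example, and using the names above as the JSON `type` value is an assumption for every type other than `replace`, the only one whose JSON form is shown in this README. `titleize` is left out of the sketch.

```python
def apply_transform(text, start, end, transform):
    """Return text with one transform applied to the match at text[start:end]."""
    before, match, after = text[:start], text[start:end], text[end:]
    kind = transform["type"]
    if kind == "replace":
        return before + transform["replacement"] + after
    if kind == "upcase":
        return before + match.upper() + after
    if kind == "downcase":
        return before + match.lower() + after
    if kind == "truncate trailing":
        # Keep everything up to and including the match, drop the rest.
        return before + match
    raise ValueError(f"unsupported transform type: {kind}")

print(apply_transform("foo bar baz", 0, 3, {"type": "replace", "replacement": "bar"}))
print(apply_transform("foo bar baz", 0, 3, {"type": "upcase"}))
print(apply_transform("foo bar baz", 4, 7, {"type": "truncate trailing"}))
```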
## Configuring the maximum number of buckets
Since all of the state machines compiled by Corasick Park are held in memory,
it uses an LRU algorithm to evict older values from the cache. By default,
1,000 'buckets' are kept in memory before evicting the least recently accessed
one. You can configure the number of buckets to be held in memory with the
`MAX_BUCKETS` environment variable, which accepts an integer, or the string
"unlimited" if you do not wish to automatically evict buckets from memory.
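
The eviction policy can be illustrated with a small Python sketch. It is not the server's Haskell implementation, and the names `parse_max_buckets` and `BucketCache` are invented for the example. The default of 1,000 and the "unlimited" value come from the description above; matching "unlimited" exactly as written is an assumption.

```python
import os
from collections import OrderedDict

def parse_max_buckets(value, default=1000):
    """Interpret MAX_BUCKETS: unset -> default, "unlimited" -> None, else an integer."""
    if value is None:
        return default
    if value == "unlimited":
        return None
    return int(value)

class BucketCache:
    """LRU cache: the least recently accessed bucket is evicted first."""

    def __init__(self, max_buckets):
        self.max_buckets = max_buckets  # None means never evict
        self._buckets = OrderedDict()

    def put(self, name, machine):
        self._buckets[name] = machine
        self._buckets.move_to_end(name)  # mark as most recently used
        if self.max_buckets is not None:
            while len(self._buckets) > self.max_buckets:
                self._buckets.popitem(last=False)  # drop the oldest entry

    def get(self, name):
        if name not in self._buckets:
            return None
        self._buckets.move_to_end(name)  # a read also counts as an access
        return self._buckets[name]

cache = BucketCache(parse_max_buckets(os.environ.get("MAX_BUCKETS")))
```

A client that receives a 404 for an evicted bucket would simply post its operations again, as described in the Interface section.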
## Author
Justin Leitgeb, Stack Builders Inc.