Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/beacon-biosignals/keywordsearch.jl

Fuzzy search of documents
https://github.com/beacon-biosignals/keywordsearch.jl

Last synced: about 6 hours ago
JSON representation

Fuzzy search of documents

Awesome Lists containing this project

README

        

[![Build Status](https://github.com/beacon-biosignals/KeywordSearch.jl/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/beacon-biosignals/KeywordSearch.jl/actions?query=workflow%3ACI)
[![codecov](https://codecov.io/gh/beacon-biosignals/KeywordSearch.jl/branch/main/graph/badge.svg?token=0HRHZ1BL60)](https://codecov.io/gh/beacon-biosignals/KeywordSearch.jl)
[![Docs: stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://beacon-biosignals.github.io/KeywordSearch.jl/stable)
[![Docs: development](https://img.shields.io/badge/docs-dev-blue.svg)](https://beacon-biosignals.github.io/KeywordSearch.jl/dev)

# KeywordSearch

Given a collection of free-text documents and a set of keywords, we want to know: which keywords have fuzzy matches in which documents?

There are two types of errors:

* false positive: we flag a match when it does not exist
* false negative: we miss a match that should exist

Here, we primarily wish to minimize false negatives (ideally without having too many false positives either!). That is, we wish to error on the side of having too many matches:
in some cases where we have many haystacks and few needles, we'd rather have to manually filter out some pins (false needles 😉) than miss a needle.

In particular, we want robust to the following types of mistakes in the documents which could cause us to miss a match:

1. Misspellings
2. word conjunctions (missed space separating words)
3. hyphenated words conjoined without a hyphen
4. hyphenated words/phrase joined with spaces instead of hyphens
5. Erroneously redacted terms (e.g. `Darwin's Frog` being redacted as suspected PHI since `Darwin` could be flagged as name, whereas in this case it is the name of an animal).

We also will pay attention to one source of false positives:

6. Search terms that are subsets of other general terms (not the intended term class). E.g. "ant" could show up as part of "anteater", but that would not be a correct match if we're looking for the little insects.

We address...

* ...(1) by using fuzzy matching via an edit distance (i.e. a string matches another substring if it is the same up to `n` edits, where we have to choose `n`). Edit distances are supplied by [StringDistances.jl](https://github.com/matthieugomez/StringDistances.jl).
* ...(2) by not requiring matches between word boundaries but just between substrings (i.e. if the query is `cobra` and `kingcobra` shows up in the document, then we would have a perfect match with the substring `cobra` found in the document)
* ...(3) and (4) by first replacing hyphens with spaces, and then augmenting our search terms by taking any terms with spaces or hyphens and generating a list of terms with each possible choice of (spaces, no spaces). For example, the query "crab-eating macaque" is augmented to the query ("crab eating macaque" OR "crabeating macaque" OR "crab eatingmacaque" OR "crabeatingmacaque").
* ...(5) Here, we allow a global list of replacements (`KeywordSearch.AUTOMATIC_REPLACEMENTS`) to manually undo erroneous redaction.
* ...(6) Here, our solution to (2) has gotten us into trouble. What we can do is add spaces to our term, e.g., instead of searching for "ant" we can search for " ant " and require an exact match. This is accomplished by e.g. `word_boundary(Query("ant"))` in the language of KeywordSearch.jl.

## Example

```julia
julia> using KeywordSearch, UUIDs

julia> document = Document("""
The crabeating macacue ate a crab.
""", (; document_uuid = uuid4()))
Document starting with "The crabeating macacue…". Metadata: (document_uuid = UUID("a703302c-eeda-46ba-8755-940a7db86b63"),)

julia> query = augment(FuzzyQuery("crab-eating macaque"))
Or
├─ FuzzyQuery("crab eating macaque", DamerauLevenshtein(), 2)
├─ FuzzyQuery("crabeating macaque", DamerauLevenshtein(), 2)
├─ FuzzyQuery("crab eatingmacaque", DamerauLevenshtein(), 2)
└─ FuzzyQuery("crabeatingmacaque", DamerauLevenshtein(), 2)

julia> m = match(query, document)
QueryMatch with distance 1 at indices 5:22.

julia> explain(m)
The query "crabeating macaque" matched the text "The crabeating macacue ate a crab \n " with distance 1.

```

Currently, only the `StringDistances.DamerauLevenshtein` distance measure is supported.