https://github.com/mongodb-developer/lucene-search-analysis

Last synced: 2 months ago
JSON representation

Host: GitHub
URL: https://github.com/mongodb-developer/lucene-search-analysis
Owner: mongodb-developer
License: unlicense
Created: 2020-09-30T00:53:00.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2022-08-10T19:36:43.000Z (almost 3 years ago)
Last Synced: 2024-04-14T11:12:35.116Z (about 1 year ago)
Language: Java
Size: 550 KB
Stars: 6
Watchers: 135
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

Awesome Lists containing this project

README

        # Notice: Repository Deprecation

This repository is deprecated and no longer actively maintained. It contains outdated code examples or practices that do not align with current MongoDB best practices. While the repository remains accessible for reference purposes, we strongly discourage its use in production environments.

Users should be aware that this repository will not receive any further updates, bug fixes, or security patches. This code may expose you to security vulnerabilities, compatibility issues with current MongoDB versions, and potential performance problems. Any implementation based on this repository is at the user's own risk.

For up-to-date resources, please refer to the [MongoDB Developer Center](https://mongodb.com/developer).

# Atlas/Lucene Search Analysis

## Introduction

Atlas Search uses [Lucene Analyzers](https://docs.atlas.mongodb.com/reference/atlas-search/analyzers/) to control how the index sets search terms, e.g., where to break up word groupings and whether to consider punctuation. However, without an intimate knowledge of the various Lucene Analyzers, it can be difficult to select the appropriate analyzer for a given field when creating a search index. Inspired by the [Analysis Screen](https://lucene.apache.org/solr/guide/8_6/analysis-screen.html) in Apache Solr, this utility provides two simple ways -- a CLI and a UI -- to test how various analyzers will process a given text.

**Note:** see this [Atlas feature request](https://feedback.mongodb.com/forums/924868-atlas-search/suggestions/41501065-analzye-endpoint-or-analysis-screen)

## Build

The UI for this tool is implemented as a [Vaadin](https://vaadin.com/) web application. The CLI is implemented as a simple POJO. To build these tools, execute the command `mvn clean package` from the directory containing `pom.xml`.

## Run

Use the command `mvnw -Dspring-boot.run.jvmArguments="-Dspring.devtools.restart.enabled=false"` from the directory containing `pom.xml` to launch the Web UI on port 8080.

![Web UI](src/main/resources/vaadin_app.png)

Use the command `mvn -f pom-cli.xml exec:java -Dexec.args=""` with the appropriate options to run the CLI. Use the `-h` or `--help` options to see usage instructions.

```bash

usage: mvn exec:java -Dexec.args=""

 -h, --help               Prints this message

 -a, --analyzer      Lucene analyzer to use (defaults to 'Standard').

                         Use 'list' for supported analyzer names.

 -d, --definition    Index definition file containing custom analyzer

 -t, --text          Input text to analyze

 -f, --file          Input text file to analyze

 -l, --language      Language code (used with '--analyzer=Language'

                         only.  Use 'list' for supported language codes.

 -n, --name          Custom analyzer name

 -o, --operator      Query operator to use (defaults to 'text'). Use

                         'list' for supported operator names.

 -k, --tokenizer     Tokeniser to use with autocomplete operator

                         (defaults to 'edgeGram'). Use 'list' for

                         supported tokenizer names.

 -m, --minGrams      Minimum number of characters per indexed sequence

                         to use with autocomplete operator (defaults to

                         '2').

 -x, --maxGrams      Maximum number of characters per indexed sequence

                         to use with autocomplete operator (defaults to

                         '3').

```

You can also use `java -cp lib/ -jar /atlas-search-analysis-0.0.1.jar ` (Java 11 or later) to run the CLI.

## Examples

### Analyze text using the `lucene.simple` analyzer

```bash

mvn -f pom-cli.xml -q exec:java -Dexec.args="-a simple -t 'hello my-name.is Roy/Kiesler'"

Using org.apache.lucene.analysis.core.SimpleAnalyzer

[hello] [my] [name] [is] [roy] [kiesler]

```

### Analyze text using the `lucene.standard` analyzer

```bash

mvn -f pom-cli.xml -q exec:java -Dexec.args="--analyzer standard --text 'hello my-name.is Roy/Kiesler'"

Using org.apache.lucene.analysis.standard.StandardAnalyzer

[hello] [my] [name.is] [roy] [kiesler]

```

### Analyze text using the `lucene.whitespace` analyzer

```bash

mvn -f pom-cli.xml -q exec:java -Dexec.args="--analyzer whitespace -t 'hello my-name.is Roy/Kiesler'"

Using org.apache.lucene.analysis.core.WhitespaceAnalyzer

[hello] [my-name.is] [Roy/Kiesler]

```

### Analyze text using the `lucene.language` English analyzer

```bash

mvn -f pom-cli.xml -q exec:java -Dexec.args="--analyzer language --language en --text 'running a race'"

Using org.apache.lucene.analysis.en.EnglishAnalyzer

[run] [race]

```

### Analyze a text file using the `lucene.language` French analyzer

```bash

cat <> french.txt

bonjour je m'appelle Roy Kiesler

EOF

mvn -q exec:java -Dexec.args="-a language -l fr -f french.txt"

Using org.apache.lucene.analysis.fr.FrenchAnalyzer

[bonjou] [apel] [roy] [kiesl]

```

### Analyze text using a custom analyzer

**Sample 1**

```bash

mvn -f pom-cli.xml -q exec:java -Dexec.args="-a custom -t 'ROCKY II is better than Rocky V' -d index_roman.json -n romanAnalyzer"

Using org.apache.lucene.analysis.custom.CustomAnalyzer

[rocki] [2] [better] [rocki] [5]

```

**Sample 2**

```bash

mvn -f pom-cli.xml -q exec:java -Dexec.args="-a custom -t '
This is an HTML test' -d index_html.json -n htmlStrippingAnalyzer"

Using org.apache.lucene.analysis.custom.CustomAnalyzer

[p] [This] [is] [an] [a] [href] [foo.com] [HTML] [a] [test] [p]

```

### Analyze text using autocomplete

**Sample 1**

```bash

mvn -f pom-cli.xml exec:java -Dexec.args="-t 'Ribeira Charming Duplex' -o autocomplete"

Using org.apache.lucene.analysis.custom.CustomAnalyzer

Autocomplete - nGram, minGram(2), maxGram(3)

[Ri] [Rib] [ib] [ibe] [be] [bei] [ei] [eir] [ir] [ira] [ra] [ra ] [a ] [a C] [ C] [ Ch] [Ch] [Cha] [ha] [har] [ar] [arm] [rm] [rmi] [mi] [min] [in] [ing] [ng] [ng ] [g ] [g D] [ D] [ Du] [Du] [Dup] [up] [upl] [pl] [ple] [le] [lex] [ex]

```

**Sample 2**

```bash

mvn -f pom-cli.xml exec:java -Dexec.args="-t 'Ribeira Charming Duplex' -o autocomplete -k edgeGram -m 2 -x 15"

Using org.apache.lucene.analysis.custom.CustomAnalyzer

Autocomplete - edgeNGram, minGram(2), maxGram(15)

[Ri] [Rib] [Ribe] [Ribei] [Ribeir] [Ribeira] [Ribeira ] [Ribeira C] [Ribeira Ch] [Ribeira Cha] [Ribeira Char] [Ribeira Charm] [Ribeira Charmi] [Ribeira Charmin]

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mongodb-developer/lucene-search-analysis

Awesome Lists containing this project

README