https://github.com/raypereda/multisearch
This is a command-line program for searching text for multiple words (or phrases) in a single pass. The runtime is O(n + m + z), where n is the length of the searched text, m is the total length of all the words we are looking for, and z is the total number of occurrences of words we are looking for.
- Host: GitHub
- URL: https://github.com/raypereda/multisearch
- Owner: raypereda
- Created: 2011-10-21T06:01:13.000Z (almost 14 years ago)
- Default Branch: master
- Last Pushed: 2017-07-15T07:33:31.000Z (about 8 years ago)
- Last Synced: 2025-03-25T16:51:30.224Z (7 months ago)
- Topics: aho-corasick-algorithm, java, search-in-text, searching-algorithms
- Language: Java
- Homepage:
- Size: 8.79 KB
- Stars: 7
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.TXT
- Changelog: newsarticle1.txt
README
This is a Java program for searching for a collection of words (or phrases)
at the same time. This uses the Aho-Corasick algorithm.
See http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

Ray Pereda
raypereda (at) gmail

$ java -jar multisearch.jar
Usage: java -jar msearch.jar -p PATTERNFILENAME FILENAME1 FILENAME2 ...
Search for a list of fixed patterns in a list target files.
Example: java -jar multisearch.jar -f patterns.txt newarticle1.txt newsarticle2.txt

Required:
must specify the patterns files with -f
must specify at least one target filename

Suppose you have a list of phrases that identify things that you're interested in.
Put those phrases one per line in a file. Here's an example file:

$ cat phrases-of-interests.txt
chocolate
laptop
bicycle
caveman
paleo
simplify
genomics

Now suppose you have a list of news articles that you want to scan for all possible
matches of phrases that are interesting. Here are two example news articles.

$ cat newsarticle1.txt
This article is about the latest in bicycle races.
In here we will review the latest in eliptical gears.

$ cat newsarticle2.txt
This article is about Otzi. A caveman that lived about 10,000 years ago.
paleo-genomics leverage DNA to piece together Otzi's life.

Here's an example of multisearching for all the phrases in one pass through
the news articles:

$ java -jar multisearch.jar -p phrases-of-interests.txt newsarticle1.txt newsarticle2.txt
target file: newsarticle1.txt
location: [ 36, 43] matched: bicycle
target file: newsarticle2.txt
location: [ 30, 37] matched: caveman
location: [ 73, 78] matched: paleo
location: [ 79, 87] matched: genomics
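
The locations above are consistent with 0-based character offsets into each
file, with the end offset exclusive and the newline counted as one character.
To show how a single pass can find every phrase, here is a minimal sketch of
the Aho-Corasick technique. It is not the code from this repository: the class
name AhoCorasickSketch, its search method, and the output format are
hypothetical and chosen only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only; names and structure are assumptions,
// not the multisearch implementation.
public class AhoCorasickSketch {

    /** One trie state; output holds every pattern that ends at this state. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<String> output = new ArrayList<>();
    }

    private final Node root = new Node();

    public AhoCorasickSketch(List<String> patterns) {
        // Phase 1: insert every pattern into a trie.
        for (String p : patterns) {
            if (p.isEmpty()) continue;
            Node node = root;
            for (char c : p.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.output.add(p);
        }
        // Phase 2: breadth-first pass that sets failure links, so a mismatch
        // falls back to the longest proper suffix that is also a trie prefix.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit patterns that end at the fallback state.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    /** Scans text once; each match is "[start, end] pattern", end exclusive. */
    public List<String> search(String text) {
        List<String> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            for (String p : node.output) {
                int end = i + 1;
                matches.add(String.format("[%d, %d] %s", end - p.length(), end, p));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> phrases = List.of("chocolate", "laptop", "bicycle",
                "caveman", "paleo", "simplify", "genomics");
        // Contents of newsarticle2.txt, assuming a single '\n' line ending.
        String article =
                "This article is about Otzi. A caveman that lived about 10,000 years ago.\n"
                + "paleo-genomics leverage DNA to piece together Otzi's life.";
        for (String match : new AhoCorasickSketch(phrases).search(article)) {
            System.out.println(match);
        }
    }
}
```

Run against the text of newsarticle2.txt, this sketch reports caveman at
[30, 37], paleo at [73, 78], and genomics at [79, 87], the same offsets as the
output above. The text is scanned once regardless of how many phrases are in
the list, which is where the n term in the O(n + m + z) runtime comes from.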