https://github.com/roylee0704/indexer
Indexer, Full Text Indexing in Golang.
- Host: GitHub
- URL: https://github.com/roylee0704/indexer
- Owner: roylee0704
- Created: 2016-04-29T14:23:42.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2016-05-18T07:30:19.000Z (almost 9 years ago)
- Last Synced: 2024-12-27T23:09:21.253Z (4 months ago)
- Language: Go
- Homepage:
- Size: 9.77 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# indexer
Indexer, an experimental project in Golang.

## Functional Requirements
1. **Parse** a file!
   - text files only at the moment.
2. **Split** strings of words!
   - splitFunc(buf []byte, atEOF bool) (advance int, token []byte, err error)
   - build a hit-item for each word.
   - build the word-frequency.
   - goal: fast construction.
3. **Index** it!
   - persist it.
   - goal: fast insertion.

---
## Specification:
### SplitFunc(buf []byte, atEOF bool) (advance int, token []byte, err error)
- `advance`: how much you can ignore on the next iteration (total # of runes)
- `token`: word/term extracted, if any.
- `err`: error

This assumes that a token is surrounded by control-breaks. The leading control-breaks are ignored, and the end of the token is found when the trailing control-break is reached, i.e. "cbcbcbcb**TokenFound**cbcb".

I have concluded that the same splitTerm cannot be used to scan more than 2 languages, even by tweaking the significant/insignificant (sig/insig) characters.
In general, there are 2 cases in the function:
1. control-break found: return index(end-of-term), token, nil.
2. control-break !found:
   - !atEOF: request more data.
   - atEOF: return len(token), token, err = finalToken

#### Example: ScanWords(control-break = '/space')
**Test-Case#1**: "ABC ".
- advance: 4 (may ignore the entire buffer)
- token: "ABC"
- err: nil

**Test-Case#2**: `"  ABC"` (two leading spaces, as implied by the advance values below).
a) !atEOF:
- advance: 2 (ignore first-half of control break)
- token: ""
- err: nil

b) atEOF: return the remaining data as the token (without the first half of the control-break)
- advance: 5
- token: "ABC"
- err: nil
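
The values in these test cases can be checked against the standard library's `bufio.ScanWords`, which uses whitespace as its control-break. The sketch below assumes the second input carries two leading spaces, which is what the advance values of 2 and 5 imply; `run` is a hypothetical helper.

```go
package main

import (
	"bufio"
	"fmt"
)

// run calls bufio.ScanWords once on input and reports its results.
// Hypothetical helper for this illustration.
func run(input string, atEOF bool) (advance int, token string, err error) {
	adv, tok, err := bufio.ScanWords([]byte(input), atEOF)
	return adv, string(tok), err
}

func main() {
	cases := []struct {
		input string
		atEOF bool
	}{
		{"ABC ", false},  // Test-Case#1
		{"  ABC", false}, // Test-Case#2 a)
		{"  ABC", true},  // Test-Case#2 b)
	}
	for _, c := range cases {
		adv, tok, err := run(c.input, c.atEOF)
		fmt.Printf("%q atEOF=%t -> advance=%d token=%q err=%v\n",
			c.input, c.atEOF, adv, tok, err)
	}
	// "ABC " atEOF=false -> advance=4 token="ABC" err=<nil>
	// "  ABC" atEOF=false -> advance=2 token="" err=<nil>
	// "  ABC" atEOF=true -> advance=5 token="ABC" err=<nil>
}
```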