https://github.com/roylee0704/indexer

Indexer, Full Text Indexing in Golang.

Host: GitHub
URL: https://github.com/roylee0704/indexer
Owner: roylee0704
Created: 2016-04-29T14:23:42.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2016-05-18T07:30:19.000Z (almost 9 years ago)
Last Synced: 2024-12-27T23:09:21.253Z (4 months ago)
Language: Go
Homepage:
Size: 9.77 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md


# indexer

Indexer, an experimental project in Golang.

## Functional Requirements

1. **Parse** a file!

  - text file at the moment.

2. **Split** strings of words!

  - `splitFunc(buf []byte, atEOF bool) (advance int, token []byte, err error)`

  - build hit-item for each word.

  - build word-frequency.

  - goal: fast construction.

3. **Index** it!

  - persist it.

  - goal: fast insertion.
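
The first two requirements (parse and split) can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the `hit` type, `buildHits`, and `buildHitsFromText` are hypothetical names, and a hit-item is assumed here to be just a word plus its ordinal position. Persistence (step 3) is left out.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// hit is a hypothetical hit-item: one word occurrence and its ordinal position.
type hit struct {
	Word string
	Pos  int
}

// buildHits splits r into whitespace-separated words and returns one hit per
// word together with a word-frequency table.
func buildHits(r io.Reader) ([]hit, map[string]int, error) {
	sc := bufio.NewScanner(r)
	sc.Split(bufio.ScanWords) // the standard word splitter; an assumption here
	var hits []hit
	freq := make(map[string]int)
	for pos := 0; sc.Scan(); pos++ {
		w := sc.Text()
		hits = append(hits, hit{Word: w, Pos: pos})
		freq[w]++
	}
	return hits, freq, sc.Err()
}

// buildHitsFromText is a convenience wrapper for in-memory text.
func buildHitsFromText(text string) ([]hit, map[string]int, error) {
	return buildHits(strings.NewReader(text))
}

func main() {
	hits, freq, err := buildHitsFromText("to be or not to be")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(hits), freq["to"], freq["be"], freq["or"])
}
```

Because `buildHits` takes an `io.Reader`, the same function works on an `*os.File` for the text-file case.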

---

## Specification:

### SplitFunc(buf []byte, atEOF bool) (advance int, token []byte, err error)

- `advance`: how much of the buffer can be ignored on the next iteration (total # of runes; note that Go's own `bufio.SplitFunc` counts this in bytes)

- `token`: word/term extracted, if any.

- `err`: error

This assumes that a token is surrounded by control-breaks (cb). The first half of the control-breaks (those before the token) is ignored, and the end of the token is found when the last half (those after the token) begins, e.g. "cbcbcbcb**TokenFound**cbcb".

I have concluded that a single splitTerm cannot be used to scan more than 2 languages, even by tweaking the sig/insig (significant/insignificant) chars.

In general, there are 2 cases in the function:

1. control-break found: return index(end-of-term), token, nil.

2. control-break not found:

  - !atEOF: request more data.

  - atEOF: return len(token), token, err = finalToken
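
The two cases above can be sketched as a `bufio.SplitFunc` factory. This is an illustration under assumptions, not the repository's code: `newSplitFunc`, `isComma`, and `tokens` are hypothetical names, the control-break is modelled as a rune predicate, and the atEOF branch returns a nil error rather than a final-token error.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"unicode/utf8"
)

// newSplitFunc builds a split function for the given control-break predicate.
// The name and the predicate parameter are hypothetical.
func newSplitFunc(isBreak func(rune) bool) bufio.SplitFunc {
	return func(buf []byte, atEOF bool) (int, []byte, error) {
		// Ignore the first half of the control-breaks.
		start := 0
		for start < len(buf) {
			r, w := utf8.DecodeRune(buf[start:])
			if !isBreak(r) {
				break
			}
			start += w
		}
		// Case 1: control-break found, so the token is complete.
		for i := start; i < len(buf); {
			r, w := utf8.DecodeRune(buf[i:])
			if isBreak(r) {
				return i + w, buf[start:i], nil
			}
			i += w
		}
		// Case 2, atEOF: the remaining data is the last token.
		if atEOF && len(buf) > start {
			return len(buf), buf[start:], nil
		}
		// Case 2, !atEOF: skip the leading breaks and request more data.
		return start, nil, nil
	}
}

// isComma treats ',' as the only control-break.
func isComma(r rune) bool { return r == ',' }

// tokens runs a bufio.Scanner over text with the given control-break.
func tokens(text string, isBreak func(rune) bool) []string {
	sc := bufio.NewScanner(strings.NewReader(text))
	sc.Split(newSplitFunc(isBreak))
	var out []string
	for sc.Scan() {
		out = append(out, sc.Text())
	}
	return out
}

func main() {
	fmt.Println(tokens("a,b,,c", isComma))
}
```

Passing the control-break as a predicate keeps the scanning loop independent of which characters count as significant.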

#### Example: ScanWords(control-break = '/space')

**Test-Case#1**: "ABC ".

- advance: 4 (may ignore the entire buffer)

- token: "ABC"

- err: nil

**Test-Case#2**: "  ABC".

a) !atEOF:

  - advance: 2 (ignore first-half of control break)

  - token: ""

  - err: nil

b) atEOF: return the remaining data as the token (without the first half of the control-break)

  - advance: 5

  - token: "ABC"

  - err: nil
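
The test cases above can be checked against Go's standard `bufio.ScanWords`, which skips leading spaces and ends a word at the next space. The sketch below assumes the standard function is a fair stand-in for the project's own splitter; `scanOnce` is a hypothetical helper that calls it once on a fixed buffer.

```go
package main

import (
	"bufio"
	"fmt"
)

// scanOnce calls bufio.ScanWords a single time on input and returns the
// token as a string for easy comparison.
func scanOnce(input string, atEOF bool) (int, string, error) {
	advance, token, err := bufio.ScanWords([]byte(input), atEOF)
	return advance, string(token), err
}

func main() {
	cases := []struct {
		input string
		atEOF bool
	}{
		{"ABC ", false},  // Test-Case#1
		{"  ABC", false}, // Test-Case#2 a)
		{"  ABC", true},  // Test-Case#2 b)
	}
	for _, c := range cases {
		advance, token, err := scanOnce(c.input, c.atEOF)
		fmt.Printf("%q atEOF=%t -> advance=%d token=%q err=%v\n",
			c.input, c.atEOF, advance, token, err)
	}
}
```

The three calls return the advance, token, and err values listed in the test cases.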
