https://github.com/AydinSakar/lucene-layer

Lucene on FoundationDB
Host: GitHub
URL: https://github.com/AydinSakar/lucene-layer
Owner: AydinSakar
License: other
Created: 2014-02-28T02:57:29.000Z (about 11 years ago)
Default Branch: master
Last Pushed: 2013-09-04T22:53:37.000Z (over 11 years ago)
Last Synced: 2024-07-31T20:35:57.329Z (9 months ago)
Language: Java
Size: 473 KB
Stars: 3
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# FoundationDB Lucene Layer

This layer provides two integration points with Lucene, `FDBDirectory` and
`FDBCodec`. These are full implementations of the [Directory](https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/store/Directory.html)
and [Codec](https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/Codec.html)
interfaces, which are backed entirely by [FoundationDB](https://foundationdb.com/).

`FDBDirectory` can be used on its own with the default `Codec` doing the
interesting work. Files generated by Lucene are stored as blobs in the
database instead of the file system.

`FDBCodec`, which must be used in conjunction with `FDBDirectory`, implements
new serialization and data models for Lucene. This results in explicit keys and
values in the database instead of file-like blobs.

## _Warning: Alpha Stage_

This layer is at an early alpha stage (note the 0.0.1 version number). While
most of the stock Lucene tests pass when using `FDBDirectory`, many currently
fail when running with `FDBCodec`. There are no known _correctness_ issues at
this time, but slowness and timeout issues could easily be hiding such problems.

Please try it out and let us know how it works (e.g. on our
[community site](http://community.foundationdb.com/)), but production usage
is *not* recommended.

## FDBCodec Data Model

The [Subspace](https://foundationdb.com/documentation/data-modeling.html#subspaces-of-keys)
concept is used extensively to provide a simple, logical mapping and easy
storage and retrieval. Each directory, segment and format is identified by a
unique string. These identifier strings are then concatenated to yield the
key range associated with each logical format being stored.

For example, assume we have a `FDBDirectory` created with the path
`("lucene")` and a segment named `"_0"`. That would result in the following
[Tuples](https://foundationdb.com/documentation/data-modeling.html#tuples):

- `("lucene", "_0", "dat")` for DocValues

- `("lucene", "_0", "inf")` for FieldInfos

- `("lucene", "_0", "liv")` for LiveDocs

- _etc_

Additional keys and values exist under each of those subspaces for storing the
information associated with each format. In the documentation below, the full
subspace is the concatenation of the directory, segment and format subspaces.
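
The concatenation of directory, segment and format identifiers can be sketched in plain Java without the FoundationDB client. This is an illustration only: `SubspaceSketch` and its methods are hypothetical names, and a `List<String>` stands in for the real Tuple encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of the layer or the FoundationDB API.
final class SubspaceSketch {
    private final List<String> prefix;

    SubspaceSketch(String... parts) {
        this.prefix = new ArrayList<>(Arrays.asList(parts));
    }

    // Returns a new subspace whose prefix is this prefix plus one more identifier.
    SubspaceSketch child(String part) {
        List<String> next = new ArrayList<>(prefix);
        next.add(part);
        return new SubspaceSketch(next.toArray(new String[0]));
    }

    // True when the given key tuple starts with this subspace's prefix.
    boolean contains(List<String> key) {
        return key.size() >= prefix.size()
            && key.subList(0, prefix.size()).equals(prefix);
    }

    // Returns a copy of the identifiers that make up this subspace.
    List<String> tuple() {
        return new ArrayList<>(prefix);
    }

    public static void main(String[] args) {
        SubspaceSketch directory = new SubspaceSketch("lucene");
        SubspaceSketch segment = directory.child("_0");
        SubspaceSketch liveDocs = segment.child("liv");
        System.out.println(liveDocs.tuple()); // [lucene, _0, liv]
    }
}
```

Every key written for a format starts with that format's full prefix, so all data for one format of one segment occupies a single contiguous key range.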

### DocValuesFormat

Encodes/decodes strongly typed, per-document values. See
[DocValuesFormat](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/DocValuesFormat.html)
and
[FieldInfo.DocValuesType](https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/FieldInfo.DocValuesType.html).

The `long_BINARY`, `long_NUMERIC`, `long_SORTED` and `long_SORTED_SET` key parts
below refer to the `DocValuesType` enum `ordinal()` values.

Subspace: `("dat")`

    (str_fieldName, long_BINARY, long_doc0) => (bytes_value)

    (str_fieldName, long_BINARY, long_doc1) => (bytes_value)

    ...

    (str_fieldName, long_NUMERIC, long_doc0) => (long_value)

    (str_fieldName, long_NUMERIC, long_doc1) => (long_value)

    ...

    (str_fieldName, long_SORTED, "bytes", long_ordinal0) => (bytes_value)

    (str_fieldName, long_SORTED, "bytes", long_ordinal1) => (bytes_value)

    ...

    (str_fieldName, long_SORTED_SET, "ord", long_doc0) => (long_ordinal)

    (str_fieldName, long_SORTED_SET, "ord", long_doc1) => (long_ordinal)

    ...

    (str_fieldName, long_SORTED_SET, "bytes", long_ordinal0) => (bytes_value)

    (str_fieldName, long_SORTED_SET, "bytes", long_ordinal1) => (bytes_value)

    ...

    (str_fieldName, long_SORTED_SET, "doc_ord", long_doc0, long_ordinal0) => ()

    (str_fieldName, long_SORTED_SET, "doc_ord", long_doc0, long_ordinal1) => ()

    (str_fieldName, long_SORTED_SET, "doc_ord", long_doc1, long_ordinal0) => ()

    ...

### FieldInfosFormat

Encodes/decodes field metadata. See
[FieldInfosFormat](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/FieldInfosFormat.html)
and
[FieldInfos](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/FieldInfos.html).

Subspace: `("inf")`

    (long_field0, "name") => (string_fieldName)

    (long_field0, "has_index") => (boolean_value)

    (long_field0, "has_payloads") => (boolean_value)

    (long_field0, "has_norms") => (boolean_value)

    (long_field0, "has_vectors") => (boolean_value)

    (long_field0, "doc_values_type") => (string_docValuesType)

    (long_field0, "norms_type") => (string_normsType)

    (long_field0, "index_options") => (string_indexOptions)

    (long_field0, "attr", string_attr0) => (string_value)

    (long_field0, "attr", string_attr1) => (string_value)

    ...

    (long_field1, "name") => (string_fieldName)

    ...

### LiveDocsFormat

Encodes/decodes liveness of documents. See
[LiveDocsFormat](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/LiveDocsFormat.html).

Subspace: `("liv")`

    (long_liveGen0) => (long_totalSize)

    (long_liveGen0, long_setBitIndex0) => ()

    (long_liveGen0, long_setBitIndex1) => ()

    (long_liveGen1) => (long_totalSize)

    ...
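
The layout above stores only the set bits, one empty-valued key per bit, next to a size entry per generation. A minimal in-memory sketch of writing and reading that shape follows. It is not the layer's code: `LiveDocsSketch` is a hypothetical name, a `TreeMap` stands in for the ordered key-value store, a `List<Long>` stands in for an encoded tuple, and a set bit is assumed to mark a live document, as in Lucene's live docs `Bits`.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the "liv" subspace.
final class LiveDocsSketch {
    // Element-wise ordering; a tuple sorts before any longer tuple it prefixes.
    static final Comparator<List<Long>> TUPLE_ORDER = (a, b) -> {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int c = Long.compare(a.get(i), b.get(i));
            if (c != 0) return c;
        }
        return Integer.compare(a.size(), b.size());
    };

    final NavigableMap<List<Long>, Long> store = new TreeMap<>(TUPLE_ORDER);

    static List<Long> key(Long... parts) {
        return Arrays.asList(parts);
    }

    // Writes (gen) => (totalSize) plus one empty-valued key per set bit.
    void write(long gen, BitSet live, int totalSize) {
        store.put(key(gen), (long) totalSize);
        for (int i = live.nextSetBit(0); i >= 0; i = live.nextSetBit(i + 1)) {
            store.put(key(gen, (long) i), null);
        }
    }

    // Rebuilds the bit set for one generation with a single range scan.
    BitSet read(long gen) {
        Long totalSize = store.get(key(gen));
        if (totalSize == null) {
            throw new IllegalArgumentException("unknown generation " + gen);
        }
        BitSet live = new BitSet(totalSize.intValue());
        // Keys (gen, x) sort strictly between (gen) and (gen + 1).
        for (List<Long> k : store.subMap(key(gen), false, key(gen + 1), false).keySet()) {
            live.set(k.get(1).intValue());
        }
        return live;
    }

    public static void main(String[] args) {
        LiveDocsSketch sketch = new LiveDocsSketch();
        BitSet live = new BitSet();
        live.set(0);
        live.set(2);
        sketch.write(1, live, 3); // three documents, document 1 deleted
        System.out.println(sketch.read(1)); // {0, 2}
    }
}
```

Because each generation has its own key prefix, reading one generation never touches the keys of another.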

### NormsFormat

Encodes/decodes per-document score normalization values. See
[NormsFormat](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/NormsFormat.html).

Subspace: `("len")`

_Uses `DocValuesFormat` with a different subspace extension._

### PostingsFormat

Encodes/decodes terms, postings, and proximity data. See
[PostingsFormat](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/PostingsFormat.html).

Subspace: `("pst")`

    (long_field0, bytes_term0, "numDocs") => (littleEndianLong_value)

    (long_field0, bytes_term0, long_doc0) => (long_termDocFreq)

    (long_field0, bytes_term0, long_doc0, long_pos0) => (long_startOffset, long_endOffset, bytes_payload)

    ...

    (long_field1, bytes_term1, "numDocs") => (littleEndianLong_value)

    ...
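
The `littleEndianLong_value` stored under `"numDocs"` is named differently from the tuple-encoded `long_` values elsewhere in this listing. As a sketch of what such a value might look like on the wire, assuming a fixed 8-byte width (the listing does not state the width), the hypothetical helper below encodes and decodes a little-endian long with `java.nio.ByteBuffer`:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper; the class and method names are illustrative only.
final class LittleEndianLongSketch {
    // Encodes a long as 8 bytes, least significant byte first.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN)
            .putLong(value)
            .array();
    }

    // Decodes 8 little-endian bytes back into a long.
    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes)
            .order(ByteOrder.LITTLE_ENDIAN)
            .getLong();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(1L))); // [1, 0, 0, 0, 0, 0, 0, 0]
        System.out.println(decode(encode(42L)));         // 42
    }
}
```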

### SegmentInfoFormat

Encodes/decodes segment metadata. See
[SegmentInfoFormat](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/SegmentInfoFormat.html).

Subspace: `("si")`

    ("doc_count") => (long_docCount)

    ("is_compound_file") => (boolean_value)

    ("version") => (long_version)

    ("attr", string_attr0) => (string_value)

    ("attr", string_attr1) => (string_value)

    ...

    ("diag", string_diag0) => (string_value)

    ("diag", string_diag1) => (string_value)

    ...

    ("file", string_file0) => ()

    ("file", string_file1) => ()

    ...

### StoredFieldsFormat

Encodes/decodes per-document fields. See
[StoredFieldsFormat](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/StoredFieldsFormat.html).

The key parts `long_TYPE` and `long_DATA` below refer to constant values,
currently `0` and `1`.

Subspace: `("fld")`

    (long_doc0, long_TYPE, long_field0) => (string_typeName, long_dataIndex)

    (long_doc0, long_TYPE, long_field1) => (string_typeName, long_dataIndex)

    ...

    (long_doc0, long_DATA, long_field0, long_dataIndex, long_offset0) => (bytes_value)

    (long_doc0, long_DATA, long_field0, long_dataIndex, long_offset1) => (bytes_value)

    (long_doc0, long_DATA, long_field1, long_dataIndex, long_offset0) => (bytes_value)

    ...

    (long_doc1, long_TYPE, long_field0) => (string_typeName, long_dataIndex)

    ...
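
The `long_DATA` keys above end in `long_offset0`, `long_offset1`, and so on, which suggests that a single stored value may be split across several key-value pairs. The sketch below shows one way such chunking could work. It assumes that each offset is the starting byte offset of its chunk, which the listing does not state. `ChunkedValueSketch`, the chunk size and the `TreeMap` standing in for the database are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration of storing one value as offset-keyed chunks.
final class ChunkedValueSketch {
    // Splits a value into chunks keyed by their starting byte offset.
    static NavigableMap<Long, byte[]> split(byte[] value, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        NavigableMap<Long, byte[]> chunks = new TreeMap<>();
        for (int off = 0; off < value.length; off += chunkSize) {
            int end = Math.min(value.length, off + chunkSize);
            chunks.put((long) off, Arrays.copyOfRange(value, off, end));
        }
        return chunks;
    }

    // Concatenates the chunks in ascending offset order.
    static byte[] join(NavigableMap<Long, byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks.values()) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] value = "hello world".getBytes(StandardCharsets.UTF_8);
        NavigableMap<Long, byte[]> chunks = split(value, 4);
        System.out.println(chunks.keySet()); // [0, 4, 8]
        System.out.println(new String(join(chunks), StandardCharsets.UTF_8)); // hello world
    }
}
```

Keying chunks by offset means an ordered range read returns them already in the order needed for reassembly.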

### TermVectorsFormat

Encodes/decodes per-document term vectors. See
[TermVectorsFormat](https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/TermVectorsFormat.html).

Subspace: `("vec")`

    (long_doc0, "field", string_field0) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)

    (long_doc0, "field", string_field1) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)

    ...

    (long_doc0, "term", string_field0, bytes_term0) => (long_freq)

    (long_doc0, "term", string_field0, bytes_term0, long_pos0) => (long_startOffset, long_endOffset, bytes_payload)

    (long_doc0, "term", string_field0, bytes_term0, long_pos1) => (long_startOffset, long_endOffset, bytes_payload)

    (long_doc0, "term", string_field0, bytes_term1) => (long_freq)

    ...

    (long_doc1, "field", string_field0) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)

    ...

## Running Built-In Tests

[Maven](http://maven.apache.org/) is used for building, packaging and running
tests.

    $ mvn test

## Running Lucene and Solr Tests

1. Package fdb-lucene-layer

        $ mvn package

2. Download the Solr source

        $ curl -O http://mirror.nexcess.net/apache/lucene/solr/4.4.0/solr-4.4.0-src.tgz

        $ tar xzf solr-4.4.0-src.tgz

        $ cd solr-4.4.0/

3. Run the full test suite

        $ ant test -Dtests.codec=FDBCodec \
                   -Dtests.directory=com.foundationdb.lucene.FDBTestDirectory \
                   -lib ../target/fdb-lucene-layer-0.0.1-SNAPSHOT.jar \
                   -lib ../target/dependency/fdb-java-1.0.0.jar