https://github.com/memgraph/mgcxx


Host: GitHub
URL: https://github.com/memgraph/mgcxx
Owner: memgraph
License: mit
Created: 2023-11-20T01:20:49.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-09-06T19:20:41.000Z (almost 2 years ago)
Last Synced: 2024-09-06T22:56:05.922Z (almost 2 years ago)
Language: Rust
Size: 53.7 KB
Stars: 1
Watchers: 5
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# mgcxx (experimental)

A collection of C++ wrappers around non-C++ libraries.

The list includes:

  * full-text search enabled by [tantivy](https://github.com/quickwit-oss/tantivy)

Requirements:

  * cmake 3.15+

  * rustup toolchain 1.75.0+

## How to build and test?

```
mkdir build && cd build
cmake ..
make && ctest
```

## text_search

### TODOs

- [ ] Polish & test all error messages

- [ ] Write unit / integration tests to compare STRING vs JSON field search query syntax.

- [ ] Figure out what's the right search syntax for a property graph

- [ ] Add some notion of pagination

- [ ] Add some notion of backwards compatibility -> some help to the user

- [ ] How to:

    - [ ] search all properties

    - [ ] fuzzy search

          ```
          // let term = Term::from_field_text(data_field, &input.search_query);
          // let query = FuzzyTermQuery::new(term, 2, true);
          ```

- [ ] Add GitHub Actions

- [ ] Add benchmarks:

    - [ ] Test the tradeoff between searching STRING vs JSON TEXT; what does the query look like?

    - [ ] Search direct field vs JSON, FAST vs SLOW, String vs CxxString

    - [ ] `MATCH (n) RETURN count(n), n.deleted;`

    - [ ] search of a specific property value

    - [ ] benchmark (add|retrieve simple|complex, filtering, aggregations).

    - [ ] search of all properties

    - [ ] Benchmark (search by GID to get document_id + fetch document by document_id) vs (fetch document by document_id) on 100M nodes + 100M edges

        - [ ] Note [DocAddress](https://docs.rs/tantivy/latest/tantivy/struct.DocAddress.html) is composed of two `u32` values, but the `SegmentOrdinal` is tied to the `Searcher` -> is it possible/wise to cache the address? (`SegmentId` is a UUID)

            - [ ] A [searcher](https://docs.rs/tantivy/latest/tantivy/struct.IndexReader.html#method.searcher) per transaction -> cache `DocAddress` inside Memgraph's `ElementAccessors`?

- [ ] Implement the stress test by adding & searching to the same index concurrently + large dataset generator.

- [ ] Consider implementing a panic handler that (optionally) prevents a Rust `panic!` from crashing the calling process.
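
One possible shape for the panic-handler TODO above, as a minimal sketch rather than what mgcxx does today: wrap each exported call in `std::panic::catch_unwind` and turn the panic into an error value that the C++ side can inspect. The names `guard` and `search` are hypothetical and exist only for this illustration; `search` stands in for a wrapped library call. The sketch assumes the crate is built with the default `panic = "unwind"` setting, since `catch_unwind` cannot catch anything when panics abort.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs `f` and converts a panic into an `Err` carrying the panic message.
/// Hypothetical helper, not part of mgcxx or tantivy.
pub fn guard<T>(f: impl FnOnce() -> T) -> Result<T, String> {
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|payload| {
        // Panic payloads are usually either `&'static str` or `String`.
        if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}

/// Hypothetical wrapped operation: panics on an empty query,
/// otherwise returns the number of whitespace-separated terms.
pub fn search(query: &str) -> usize {
    if query.is_empty() {
        panic!("empty query");
    }
    query.split_whitespace().count()
}
```

With this in place, `guard(|| search(""))` yields an `Err` holding the panic message instead of unwinding into the caller. The default panic hook still prints the message to stderr, and state touched by the panicking closure may be left half-updated, which is why making the guard optional seems reasonable.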

### NOTEs

* if a field doesn't get specified in the schema, it's ignored

* `TEXT` means the field will be tokenized and indexed (required to be able to
  search)

* Tantivy's `add_json_object` accepts `serde_json::map::Map`

* C++ text-search API is snake case because it's implemented in Rust

* Writing each document and then committing (writing to disk) after every
  write will be expensive. In a standard OLTP workload that's a common case
  -> introduce some form of batching.
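
The batching idea from the last note can be sketched as follows. This is an illustration under stated assumptions, not the mgcxx implementation: `IndexSink`, `BatchingWriter` and `CountingSink` are hypothetical names, and `IndexSink` only stands in for whatever writer performs the expensive commit. Documents are staged on every `add`, but the commit runs once per `batch_size` documents (plus one final flush).

```rust
/// Hypothetical stand-in for an index writer; not the mgcxx or tantivy API.
pub trait IndexSink {
    /// Stage one document in memory (cheap).
    fn add(&mut self, doc: String);
    /// Persist everything staged so far (the expensive step).
    fn commit(&mut self);
}

/// Buffers documents and commits once per `batch_size` additions.
pub struct BatchingWriter<S: IndexSink> {
    sink: S,
    batch_size: usize,
    pending: usize,
}

impl<S: IndexSink> BatchingWriter<S> {
    pub fn new(sink: S, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { sink, batch_size, pending: 0 }
    }

    /// Stage a document; commit only when the batch is full.
    pub fn add(&mut self, doc: String) {
        self.sink.add(doc);
        self.pending += 1;
        if self.pending >= self.batch_size {
            self.flush();
        }
    }

    /// Commit any staged documents; a no-op when nothing is pending.
    pub fn flush(&mut self) {
        if self.pending > 0 {
            self.sink.commit();
            self.pending = 0;
        }
    }

    /// Flush the remainder and hand back the underlying sink.
    pub fn into_inner(mut self) -> S {
        self.flush();
        self.sink
    }
}

/// In-memory sink that counts commits, used only to show the effect.
#[derive(Default)]
pub struct CountingSink {
    pub staged: Vec<String>,
    pub committed: Vec<String>,
    pub commits: usize,
}

impl IndexSink for CountingSink {
    fn add(&mut self, doc: String) {
        self.staged.push(doc);
    }

    fn commit(&mut self) {
        self.committed.append(&mut self.staged);
        self.commits += 1;
    }
}
```

The tradeoff is visibility and durability: documents staged but not yet committed are not searchable and would be lost on a crash, so a real implementation would likely pair the size threshold with a time-based flush.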

## Resources

* https://fulmicoton.com/posts/behold-tantivy-part2

* https://stackoverflow.com/questions/37924383/combining-several-static-libraries-into-one-using-cmake

    --> decided to ship 2 separate libraries that user code has to link against
