https://github.com/memgraph/mgcxx
- Host: GitHub
- URL: https://github.com/memgraph/mgcxx
- Owner: memgraph
- License: mit
- Created: 2023-11-20T01:20:49.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-06T19:20:41.000Z (over 1 year ago)
- Last Synced: 2024-09-06T22:56:05.922Z (over 1 year ago)
- Language: Rust
- Size: 53.7 KB
- Stars: 1
- Watchers: 5
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# mgcxx (experimental)
A collection of C++ wrappers around non-C++ libraries.
The list includes:
* full-text search enabled by [tantivy](https://github.com/quickwit-oss/tantivy)
Requirements:
* cmake 3.15+
* rustup toolchain 1.75.0+
## How to build and test?
```
mkdir build && cd build
cmake ..
make && ctest
```
## text_search
### TODOs
- [ ] Polish & test all error messages
- [ ] Write unit / integration tests to compare the search query syntax for STRING vs JSON fields.
- [ ] Figure out what's the right search syntax for a property graph
- [ ] Add some notion of pagination
- [ ] Add some notion of backwards compatibility -> some help to the user
- [ ] How to:
  - [ ] search all properties
  - [ ] fuzzy search
    ```
    // let term = Term::from_field_text(data_field, &input.search_query);
    // let query = FuzzyTermQuery::new(term, 2, true);
    ```
- [ ] Add GitHub Actions
- [ ] Add benchmarks:
- [ ] Test the tradeoff between searching STRING vs JSON TEXT; what does the query look like?
- [ ] Search direct field vs JSON, FAST vs SLOW, String vs CxxString
- [ ] MATCH (n) RETURN count(n), n.deleted;
- [ ] search of a specific property value
- [ ] benchmark (add|retrieve simple|complex, filtering, aggregations).
- [ ] search of all properties
- [ ] Benchmark (search by GID to get document_id + fetch document by document_id) vs (fetch document by document_id) on 100M nodes + 100M edges
- [ ] Note [DocAddress](https://docs.rs/tantivy/latest/tantivy/struct.DocAddress.html) is composed of two `u32`s, but the `SegmentOrdinal` is tied to the `Searcher` -> is it possible/wise to cache the address (`SegmentId` is a UUID)?
- [ ] A [searcher](https://docs.rs/tantivy/latest/tantivy/struct.IndexReader.html#method.searcher) per transaction -> cache `DocAddress` inside Memgraph's `ElementAccessors`?
- [ ] Implement the stress test by adding & searching to the same index concurrently + large dataset generator.
- [ ] Consider implementing a `panic!` handler that (optionally) prevents the host process from crashing.
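
For the `panic!` handler item above, one possible shape is to wrap every call that crosses the Rust/C++ boundary in `std::panic::catch_unwind` and turn the panic payload into an error value. This is only a minimal sketch using the standard library: `BoundaryError` and `guard_panic` are hypothetical names, not part of mgcxx, and the approach assumes the crate is built with the default `panic = "unwind"` strategy (with `panic = "abort"` nothing can be caught).

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error type handed back to the caller instead of unwinding.
#[derive(Debug, PartialEq)]
pub struct BoundaryError {
    pub message: String,
}

/// Runs `f` and converts a panic into a `BoundaryError`.
/// `AssertUnwindSafe` is an explicit promise that state touched by `f`
/// is not reused in a broken condition after a panic.
pub fn guard_panic<T, F>(f: F) -> Result<T, BoundaryError>
where
    F: FnOnce() -> T,
{
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|payload| {
        // Panic payloads are typically `&'static str` or `String`.
        let message = if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        };
        BoundaryError { message }
    })
}
```

A call such as `guard_panic(|| search(&context, &input))` (with `search` standing in for any exported function) would then return `Err(BoundaryError { .. })` instead of unwinding into the caller. The default panic hook still prints the message to stderr unless it is replaced.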
### NOTEs
* if a field doesn't get specified in the schema, it's ignored
* `TEXT` means the field will be tokenized and indexed (required to be able to
search)
* Tantivy's `add_json_object` accepts a `serde_json::map::Map`
* The C++ text-search API uses snake_case because it's implemented in Rust
* Committing (writing to disk) after every single document will be
expensive. In a standard OLTP workload that's the common case -> introduce some
form of batching.
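
The batching idea from the last note could look roughly like the sketch below: documents are added immediately, but the expensive commit happens only once per `batch_size` additions. Everything here is an assumption for illustration. `DocSink`, `BatchingWriter` and `CountingSink` are hypothetical names, and the trait only mimics the add/commit shape of an index writer rather than calling Tantivy.

```rust
/// Hypothetical stand-in for an index writer with separate add/commit steps.
pub trait DocSink {
    fn add(&mut self, doc: String);
    fn commit(&mut self);
}

/// Wraps a sink and commits once every `batch_size` added documents.
pub struct BatchingWriter<S: DocSink> {
    sink: S,
    batch_size: usize,
    pending: usize,
}

impl<S: DocSink> BatchingWriter<S> {
    pub fn new(sink: S, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { sink, batch_size, pending: 0 }
    }

    /// Adds a document and commits only when the batch is full.
    pub fn add(&mut self, doc: String) {
        self.sink.add(doc);
        self.pending += 1;
        if self.pending >= self.batch_size {
            self.flush();
        }
    }

    /// Commits any pending documents; a no-op when nothing is pending.
    pub fn flush(&mut self) {
        if self.pending > 0 {
            self.sink.commit();
            self.pending = 0;
        }
    }

    /// Flushes the remainder and returns the wrapped sink.
    pub fn into_inner(mut self) -> S {
        self.flush();
        self.sink
    }
}

/// In-memory sink that counts commits, used only to demonstrate the wrapper.
#[derive(Default)]
pub struct CountingSink {
    pub docs: Vec<String>,
    pub commits: usize,
}

impl DocSink for CountingSink {
    fn add(&mut self, doc: String) {
        self.docs.push(doc);
    }
    fn commit(&mut self) {
        self.commits += 1;
    }
}
```

The tradeoff is durability and visibility: documents that are added but not yet committed are not on disk, so a real implementation would likely also flush on a timer or at transaction boundaries.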
## Resources
* https://fulmicoton.com/posts/behold-tantivy-part2
* https://stackoverflow.com/questions/37924383/combining-several-static-libraries-into-one-using-cmake
--> decided to ship 2 separate libraries that user code has to link against