{"id":39553058,"url":"https://github.com/seekstorm/seekstorm","last_synced_at":"2026-04-19T19:00:47.461Z","repository":{"id":224757529,"uuid":"764081389","full_name":"SeekStorm/SeekStorm","owner":"SeekStorm","description":"SeekStorm: vector \u0026 lexical search - in-process library \u0026 multi-tenancy server, in Rust.","archived":false,"fork":false,"pushed_at":"2026-04-19T17:46:46.000Z","size":9900,"stargazers_count":1866,"open_issues_count":18,"forks_count":66,"subscribers_count":9,"default_branch":"main","last_synced_at":"2026-04-19T18:29:10.267Z","etag":null,"topics":["ai-search","bm25","dense-retrieval","enterprise-search","faceting","full-text-search","geosearch","hybrid-search","lexical-search","neural-search","realtime","search","search-engine","search-server","search-service","semantic-search","sparse-retrieval","vector-database","vector-search","vector-search-engine"],"latest_commit_sha":null,"homepage":"https://seekstorm.com","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SeekStorm.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-02-27T12:55:08.000Z","updated_at":"2026-04-19T17:46:51.000Z","dependencies_parsed_at":"2024-03-16T14:52:45.454Z","dependency_job_id":"addf93ef-a1ab-4db7-b1b3-9fe5737f6106","html_url":"https://github.com/SeekStorm/SeekStorm","commit_stats":null,"previous_names":["seekstorm/seekstorm"],"tags_count":83,"template":false,"template_full_name":null,"purl":"pkg:git
hub/SeekStorm/SeekStorm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeekStorm%2FSeekStorm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeekStorm%2FSeekStorm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeekStorm%2FSeekStorm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeekStorm%2FSeekStorm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SeekStorm","download_url":"https://codeload.github.com/SeekStorm/SeekStorm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeekStorm%2FSeekStorm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32018764,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-search","bm25","dense-retrieval","enterprise-search","faceting","full-text-search","geosearch","hybrid-search","lexical-search","neural-search","realtime","search","search-engine","search-server","search-service","semantic-search","sparse-retrieval","vector-database","vector-search","vector-search-engine"],"created_at":"2026-01-18T06:59:07.353Z","updated_at":"2026-04-19T19:00:47.453Z","avatar_url":"https://github.com/SeekStorm.png","la
nguage":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003cimg src=\"assets/logo.png\" width=\"450\" alt=\"Logo\"\u003e\u003cbr\u003e\n[![Crates.io](https://img.shields.io/crates/v/seekstorm.svg)](https://crates.io/crates/seekstorm)\n[![Downloads](https://img.shields.io/crates/d/seekstorm.svg?style=flat-square)](https://crates.io/crates/seekstorm)\n[![Documentation](https://docs.rs/seekstorm/badge.svg)](https://docs.rs/seekstorm)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/SeekStorm/SeekStorm?tab=Apache-2.0-1-ov-file#readme)\n[![Docker](https://img.shields.io/docker/pulls/wolfgarbe/seekstorm_server)](https://hub.docker.com/r/wolfgarbe/seekstorm_server)\n[![Roadmap](https://img.shields.io/badge/Roadmap-2026-DA7F07.svg)](#roadmap)\n\u003cp\u003e\n  \u003ca href=\"https://seekstorm.com\"\u003eWebsite\u003c/a\u003e | \n  \u003ca href=\"https://seekstorm.github.io/search-benchmark-game/\"\u003eBenchmark\u003c/a\u003e | \n  \u003ca href=\"https://deephn.org/\"\u003eDemo\u003c/a\u003e | \n  \u003ca href=\"#documentation\"\u003eLibrary Docs\u003c/a\u003e | \n  \u003ca href=\"https://seekstorm.apidocumentation.com/reference\"\u003eServer Docs\u003c/a\u003e |\n  \u003ca href=\"https://github.com/SeekStorm/SeekStorm/blob/main/src/seekstorm_server/README.md\"\u003eServer Readme\u003c/a\u003e |\n  \u003ca href=\"#roadmap\"\u003eRoadmap\u003c/a\u003e | \n  \u003ca href=\"https://seekstorm.com/blog/\"\u003eBlog\u003c/a\u003e | \n  \u003ca href=\"https://x.com/seekstorm\"\u003eTwitter\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n**SeekStorm**: **sub-millisecond**, native **vector** \u0026 **lexical search** - **in-process library** \u0026 **multi-tenancy server**, in **Rust**.\n\nDevelopment started in 2015, in [production](https://seekstorm.com) since 2020, Rust port in 2023, open sourced in 2024, work in progress.\n\nSeekStorm is open source licensed under the [Apache License 
2.0](https://github.com/SeekStorm/SeekStorm?tab=Apache-2.0-1-ov-file#readme)\n\nBlog Posts: [SeekStorm is now Open Source](https://seekstorm.com/blog/sneak-peek-seekstorm-rust/) and [SeekStorm gets Faceted search, Geo proximity search, Result sorting](https://seekstorm.com/blog/faceted_search-geo-proximity-search/)\n\n### SeekStorm high-performance search library\n\n#### Hybrid search\n* Internally, SeekStorm uses [**two separate, first-class, native index architectures**](ARCHITECTURE.md#architecture) for **vector search** and **keyword search**. Two native cores, not just a retrofit, add-on layer.\n* SeekStorm doesn’t try to make one index do everything. It runs two native search engines and lets the query planner decide how to combine them.\n* Two **native** index architectures under one roof:\n  - **Lexical search**: an inverted index optimized for lexical relevance, \n  - **Vector search**: an ANN index optimized for vector similarity.\n* Both are first-class engines, integrated at the query planner level.\n  - Query planner with multiple QueryModes and FusionTypes\n  - **Per query choice** of lexical search, **vector search**, or **hybrid search**.\n* Separate storage layouts, separate indexing pipelines, separate execution paths, unified query planner and result fusion (Reciprocal Rank Fusion - RRF).\n* Two independent scorers, two independent top-k candidates: late fusion with intent, not score soup, no score normalization hell.\n* The user is fully shielded from the complexity as if it was only a single index.\n* Enables pure lexical, pure vector or hybrid search (exhaustive, not only re-ranking of preliminary candidates). 
\n\n#### Architecture\n* *Fast* sharded indexing: 35K docs/sec = 3 billion docs/day on a laptop.\n* *Fast* sharded search: [7x faster query latency, 17x faster tail latency (P99)](#benchmarks) for lexical search.\n* Billion-scale index\n* Index either in RAM or memory mapped files\n* Cross-platform (Windows, Linux, MacOS)\n* SIMD (Single Instruction, Multiple Data) hardware acceleration support,  \n  both for x86-64 (AMD64 and Intel 64) and AArch64 (ARM, Apple Silicon).\n* Single-machine scalability: serving thousands of concurrent queries with low latency from a single commodity server without needing clusters or proprietary hardware accelerators.\n* 100% human 😎 craftsmanship - No AI 🤖 was forced into vibe coding/AI slop.\n\n#### Vector Features\n* **Multi-Vector indexing**: both from multiple fields and from multiple chunks per field.\n* **Integrated inference**: Generate and index embeddings from any text document field.\n* Alternatively, import and index externally generated embeddings.\n* Multiple vector precisions: F32, I8.\n* Multiple similarity measures: Cosine similarity, Dot product, Euclidean distance.\n* **Scalar Quantization** (SQ).\n* **Chunking** that respects **sentence boundaries** and **Unicode segmentation** for multilingual text.\n* **K-Medoid clustering**: PAM (Partition Around Medoids) with actual data points as centers.\n* **Sharded and leveled IVF index**.\n* **Approximate Nearest Neighbor Search** (ANNS) in an **Leveled IVF index**.\n* All **field filters** are directly active **during vector search**, not just as post-search filtering step.\n\n#### Lexical Features\n* **BM25F** and **BM25F_Proximity** ranking\n* 6 tokenizers, including **Chinese word segmentation**.\n* **Stemming** for 38 languages.\n* Optional **stopword lists**, custom and predefined, for smaller indices and faster search.\n* **Frequent word lists**, custom and predefined, for faster phrase search by N-gram indexing.\n* Inverted index\n* **Roaring-bitmap** posting list 
compression.\n* **N-gram indexing**\n* **Block-max WAND** and **Maxscore** acceleration\n\n#### General Features\n* **True real-time search**, both for **vector search** and **lexical search**, with negligible performance impact\n* Incremental indexing\n* Unlimited field number, field length \u0026 index size\n* Compressed document store: ZStandard\n* Field filtering\n* [Faceted search](https://github.com/SeekStorm/SeekStorm/blob/main/FACETED_SEARCH.md): Counting \u0026 filtering of String \u0026 Numeric range facets (with Histogram/Bucket \u0026 Min/Max aggregation)\n* Result sorting by any field, ascending or descending, multiple fields combined by \"tie-breaking\". \n* Geo proximity search, filtering and sorting.\n* Iterator to iterate through all documents of an index, in both directions, e.g., for index export, conversion, analytics and inspection.  \n* Search with empty query, but query facets, facet filter, and result sort parameters, ascending and descending.\n* Typo tolerance / Fuzzy queries / Query spelling correction: return results if the query contains spelling errors.\n* Typo-tolerant Query Auto-Completion (QAC) and Instant search.\n* KWIC snippets, highlighting\n* One-way and multi-way synonyms\n* Language independent\n\n#### Field types\n+ U8..U64 \n+ I8..I64 \n+ F32, F64 \n+ Timestamp \n+ Bool\n+ String16, String32 \n+ StringSet16, StringSet32\n+ Text (Multi-vector: **automatically generated embeddings** for each text field)\n+ Point\n+ Json\n+ Binary (embedded images, audio, video, pdf)\n+ Vector (**externally generated embeddings**)\n\n#### Query types\n+ OR  disjunction  union\n+ AND conjunction intersection\n+ \"\"  phrase\n+ \\-   NOT\n\n#### Result types\n+ TopK\n+ Count\n+ TopKCount\n\n### SeekStorm multi-tenancy search server \n\n* Index and search via [RESTful API](https://github.com/SeekStorm/SeekStorm/blob/main/src/seekstorm_server#rest-api-endpoints) with CORS.\n* Ingest local data files in 
[CSV](https://en.wikipedia.org/wiki/Comma-separated_values), [JSON](https://en.wikipedia.org/wiki/JSON), [Newline-delimited JSON](https://github.com/ndjson/ndjson-spec) (ndjson), and [Concatenated JSON](https://en.wikipedia.org/wiki/JSON_streaming) formats via console command.  \n* Ingest local PDF files via console command (single file or all files in a directory).\n* Multi-tenancy index management.\n* API-key management.\n* [Embedded web server and web UI](https://github.com/SeekStorm/SeekStorm/blob/main/src/seekstorm_server#open-embedded-web-ui-in-browser) to search and display results from any index without coding.\n* Web UI with query auto correction, query auto-completion, instant search, keyword highlighting, histogram, date filter, faceting, result sorting, document preview (as demo, for testing, as template).\n* Code first OpenAPI generated [REST API documentation](https://seekstorm.apidocumentation.com/reference)\n* Cross-platform: runs on Linux, Windows, and macOS (other OS untested).\n* Docker file and container image at [Docker Hub](https://hub.docker.com/r/wolfgarbe/seekstorm_server)\n\n---\n\n## Why SeekStorm?\n\n**Twin-core native vector \u0026 keyword search**  \n[Two separate, first-class, native index architectures](ARCHITECTURE.md#architecture) for **vector search** and **keyword search** under one roof.  \nA query planner with 8 dedicated QueryModes and FusionTypes automatically decides how to combine the results for maximum query understanding.\n\n**Performance**  \nLower latency, higher throughput, lower cost \u0026 energy consumption, esp. for multi-field and concurrent queries.  \nLow tail latencies ensure a smooth user experience and prevent loss of customers and revenue.  
\nWhile some rely on proprietary hardware accelerators (FPGA/ASIC) or clusters to improve performance,  \nSeekStorm achieves a similar boost algorithmically on a single commodity server.\n\n**Consistency**  \nNo unpredictable query latency during and after large-volume indexing as SeekStorm doesn't require resource-intensive segment merges.  \nStable latencies - no cold start costs due to just-in-time compilation, no unpredictable garbage collection delays.  \n\n**Scaling**  \nMaintains low latency, high throughput, and low RAM consumption even for billion-scale indices.  \nUnlimited field number, field length \u0026 index size.\n\n**Relevance**  \nTerm proximity ranking provides more relevant results compared to BM25.\n\n**Real-time**  \nTrue real-time search, as opposed to NRT: every indexed document is immediately searchable, even before and during commit.\n\n## Benchmarks\n\n\u003cimg src=\"assets/search_benchmark_game1.png\" width=\"800\" alt=\"Benchmark\"\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cimg src=\"assets/search_benchmark_game2.png\" width=\"800\" alt=\"Benchmark\"\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cimg src=\"assets/search_benchmark_game3.png\" width=\"800\" alt=\"Benchmark\"\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cimg src=\"assets/ranking.jpg\" width=\"800\" alt=\"Ranking\"\u003e\n\n*the who: vanilla BM25 ranking vs. 
SeekStorm proximity ranking*\u003cbr\u003e\u003cbr\u003e\n\n**Methodology**  \nComparing different open-source search engine libraries (BM25 lexical search) using the open-source **search_benchmark_game** developed by [Tantivy](https://github.com/quickwit-oss/search-benchmark-game/) and [Jason Wolfe](https://github.com/jason-wolfe/search-index-benchmark-game).\n\n**Benefits**\n+ using a proven open-source benchmark used by other search libraries for comparability\n+ adapters written mostly by search library authors themselves for maximum authenticity and faithfulness\n+ results can be replicated by everybody on their own infrastructure\n+ detailed results per query, per query type and per result type to investigate optimization potential\n\n**Detailed benchmark results**\nhttps://seekstorm.github.io/search-benchmark-game/\n\n**Benchmark code repository**\nhttps://github.com/SeekStorm/search-benchmark-game/\n\nSee our **blog posts** for more detailed information: [SeekStorm is now Open Source](https://seekstorm.com/blog/sneak-peek-seekstorm-rust/) and [SeekStorm gets Faceted search, Geo proximity search, Result sorting](https://seekstorm.com/blog/faceted_search-geo-proximity-search/)\n\n### Why latency matters\n\n* Search speed might be good enough for a single search. Below 10 ms people can't tell latency anymore. Search latency might be small compared to internet network latency.\n* But search engine performance still matters when used in a server or service for many concurrent users and requests for maximum scaling, throughput, low processor load, and cost.\n* With performant search technology, you can serve many concurrent users at low latency with fewer servers, less cost, less energy consumption, and a lower carbon footprint.\n* It also ensures low latency even for complex and challenging queries: instant search, fuzzy search, faceted search, and union/intersection/phrase of very frequent terms.\n* Local search performance matters, e.g. 
when many local queries are spawned for reranking, fallback/refinement queries, fuzzy search, data mining or RAG before the response is transferred back over the network.\n* Besides average latencies, we also need to reduce tail latencies, which are often overlooked but can cause loss of customers, revenue, and a bad user experience.\n* It is always advisable to engineer your search infrastructure with enough performance headroom to keep those tail latencies in check, even during periods of high concurrent load.\n* Also, even if a human user might not notice the latency, it still might make a big difference in autonomous stock markets, defense applications or RAG which requires multiple queries.\n\n---\n\n## Keyword search remains a core building block in the advent of vector search and LLMs\n\nDespite what the hype-cycles https://www.bitecode.dev/p/hype-cycles want you to believe, keyword search is not dead, as NoSQL wasn't the death of SQL.\n\nYou should maintain a toolbox, and choose the best tool for your task at hand. https://seekstorm.com/blog/vector-search-vs-keyword-search1/\n\nKeyword search is just a filter for a set of documents, returning those in which certain keywords occur, usually combined with a ranking metric like BM25.\nA very basic and core functionality is very challenging to implement at scale with low latency.\nBecause the functionality is so basic, there is an unlimited number of application fields.\nIt is a component, to be used together with other components.\nThere are use cases which can be solved better today with vector search and LLMs, but for many more keyword search is still the best solution.\nKeyword search is exact, lossless, and it is very fast, with better scaling, better latency, lower cost and energy consumption.\nVector search works with semantic similarity, returning results within a given proximity and probability. 
\n\n### Keyword search (lexical search)\nIf you search for exact results like proper names, numbers, license plates, domain names, and phrases (e.g. plagiarism detection) then keyword search is your friend. Vector search, on the other hand, will bury the exact result that you are looking for among a myriad of results that are only somehow semantically related. At the same time, if you don’t know the exact terms, or you are interested in a broader topic, meaning or synonym, no matter what exact terms are used, then keyword search will fail you.\n\n```diff\n- works with text data only\n- unable to capture context, meaning and semantic similarity\n- low recall for semantic meaning\n+ perfect recall for exact keyword match \n+ perfect precision (for exact keyword match)\n+ high query speed and throughput (for large document numbers)\n+ high indexing speed (for large document numbers)\n+ incremental indexing fully supported\n+ smaller index size\n+ lower infrastructure cost per document and per query, lower energy consumption\n+ good scalability (for large document numbers)\n+ perfect for exact keyword and phrase search, no false positives\n+ perfect explainability\n+ efficient and lossless for exact keyword and phrase search\n+ works with new vocabulary out of the box\n+ works with any language out of the box\n+ works perfect with long-tail vocabulary out of the box\n+ works perfect with any rare language or domain-specific vocabulary out of the box\n+ RAG (Retrieval-augmented generation) based on keyword search offers unrestricted real-time capabilities.\n```\n\n\n### Vector search\nVector search is perfect if you don’t know the exact query terms, or you are interested in a broader topic, meaning or synonym, no matter what exact query terms are used. But if you are looking for exact terms, e.g. proper names, numbers, license plates, domain names, and phrases (e.g. plagiarism detection) then you should always use keyword search. 
Vector search will instead bury the exact result that you are looking for among a myriad of results that are only somehow related. It has a good recall, but low precision, and higher latency. It is prone to false positives, e.g., in plagiarism detection as exact words and word order get lost.\n\nVector search enables you to search not only for similar text, but for everything that can be transformed into a vector: text, images (face recognition, fingerprints), audio, enabling you to do magic things like \"queen - woman + man = king.\"\n\n```diff\n+ works with any data that can be transformed to a vector: text, image, audio ...\n+ able to capture context, meaning, and semantic similarity\n+ high recall for semantic meaning (90%)\n- lower recall for exact keyword match (for Approximate Similarity Search)\n- lower precision (for exact keyword match)\n- lower query speed and throughput (for large document numbers)\n- lower indexing speed (for large document numbers)\n- incremental indexing is expensive and requires rebuilding the entire index periodically, which is extremely time-consuming and resource intensive.\n- larger index size\n- higher infrastructure cost per document and per query, higher energy consumption\n- limited scalability (for large document numbers)\n- unsuitable for exact keyword and phrase search, many false positives\n- low explainability makes it difficult to spot manipulations, bias and root cause of retrieval/ranking problems\n- inefficient and lossy for exact keyword and phrase search\n- Additional effort and cost to create embeddings and keep them updated for every language and domain. 
Even if the number of indexed documents is small, the embeddings still have to be created from a large corpus beforehand.\n- Limited real-time capability due to limited recency of embeddings\n- works only with vocabulary known at the time of embedding creation\n- works only with the languages of the corpus from which the embeddings have been derived\n- works only with long-tail vocabulary that was sufficiently represented in the corpus from which the embeddings have been derived\n- works only with rare language or domain-specific vocabulary that was sufficiently represented in the corpus from which the embeddings have been derived\n- RAG (Retrieval-augmented generation) based on vector search offers only limited real-time capabilities, as it can't process new vocabulary that arrived after the embedding generation\n```\n\n\u003cbr\u003e\n\n\u003e **Vector search is not a replacement for keyword search, but a complementary addition** - best to be used within a hybrid solution where the strengths of both approaches are combined. **Keyword search is not outdated, but time-proven**.\n\n---\n\n## Why Rust\n\nWe have (partially) ported the SeekStorm codebase from C# to Rust\n+ Factor 2..4x performance gain vs. C# (latency and throughput)\n+ No slow first run (no cold start costs due to just-in-time compilation)\n+ Stable latencies (no garbage collection delays)\n+ Less memory consumption (no ramping up until the next garbage collection)\n+ No framework dependencies (CLR or JVM virtual machines)\n+ Ahead-of-time instead of just-in-time compilation\n+ Memory safe language https://www.whitehouse.gov/oncd/briefing-room/2024/02/26/press-release-technical-report/ \n\nRust is great for performance-critical applications 🚀 that deal with big data and/or many concurrent users. 
\nFast algorithms will shine even more with a performance-conscious programming language 🙂\n\n---\n\n## Architecture\n\nsee [ARCHITECTURE.md](https://github.com/SeekStorm/SeekStorm/blob/main/ARCHITECTURE.md) \n\n---\n\n### Building\n\n```text\ncargo build --release\n```\n\n\u0026#x26A0; **WARNING**: make sure to set the MASTER_KEY_SECRET environment variable to a secret, otherwise your generated API keys will be compromised.\n\n### Documentation\n\n[https://docs.rs/seekstorm](https://docs.rs/seekstorm)\n\n**Build documentation**\n\n```text\ncargo doc --no-deps\n```\n**Access documentation locally**\n\nSeekStorm\\target\\doc\\seekstorm\\index.html  \nSeekStorm\\target\\doc\\seekstorm_server\\index.html  \n\n### Feature Flags\n\n- **`zh` (default)**: Enables TokenizerType.UnicodeAlphanumericZH that implements Chinese word segmentation to segment continuous Chinese text into tokens for indexing and search.\n- **`pdf` (default)**: Enables PDF ingestion via `pdfium` crate.\n- **`vb`**: vb (verbose) adds additional properties to the `Result` struct:\n  - field_id\n  - chunk_id\n  - level_id\n  - shard_id\n  - cluster_id\n  - cluster_score\n  - vector_score\n  - lexical_score\n  - source: ResultSource (Lexical/Vector/Hybrid)\n\nYou can disable the SeekStorm default features by using default-features = false in the cargo.toml of your application.  
\nThis can be useful to reduce the size of your application or if there are dependency version conflicts.\n```cargo\n[dependencies]\nseekstorm = { version = \"0.12.19\", default-features = false }\n```\n\n## Usage of the library\n\n### Lexical search\n\nAdd required crates to your project\n```text\ncargo add seekstorm\ncargo add tokio\ncargo add serde_json\n```\n\nUse an asynchronous Rust runtime\n```rust\nuse std::error::Error;\n#[tokio::main]\nasync fn main() -\u003e Result\u003c(), Box\u003cdyn Error + Send + Sync\u003e\u003e {\n\n  // your SeekStorm code here\n\n   Ok(())\n}\n```\n\ncreate schema (from JSON)\n```rust\nuse seekstorm::index::SchemaField;\n\nlet schema_json = r#\"\n[{\"field\":\"title\",\"field_type\":\"Text\",\"store\":false,\"index_lexical\":false,\"dictionary_source\":true,\"completion_source\":true},\n{\"field\":\"body\",\"field_type\":\"Text\",\"store\":true,\"index_lexical\":true},\n{\"field\":\"url\",\"field_type\":\"Text\",\"store\":false,\"index_lexical\":false}]\"#;\nlet schema:Vec\u003cSchemaField\u003e=serde_json::from_str(schema_json).unwrap();\n```\n\ncreate schema (from SchemaField)\n```rust\nuse seekstorm::index::{SchemaField,FieldType};\n\nlet schema= vec![\n    SchemaField::new(\"title\".to_owned(), false, false,false, FieldType::Text, false,false, 1.0,true,true),\n    SchemaField::new(\"body\".to_owned(),true,true,false,FieldType::Text,false,true,1.0,false,false),\n    SchemaField::new(\"url\".to_owned(), false, false,false, FieldType::Text,false,false,1.0,false,false),\n];\n```\n\ncreate index\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse std::path::Path;\nuse seekstorm::index::{IndexMetaObject, Clustering, LexicalSimilarity,TokenizerType,StopwordType,FrequentwordType,AccessType,StemmerType,NgramSet,SchemaField,FieldType,SpellingCorrection,QueryCompletion,DocumentCompression,create_index};\nuse seekstorm::vector::Inference;\nuse seekstorm::vector_similarity::VectorSimilarity;\n\nlet 
index_path=Path::new(\"C:/index/\");\n\nlet schema= vec![\n    SchemaField::new(\"title\".to_owned(), false, false,false, FieldType::Text, false,false, 1.0,true,true),\n    SchemaField::new(\"body\".to_owned(),true,true,false,FieldType::Text,false,true,1.0,false,false),\n    SchemaField::new(\"url\".to_owned(), false, false, false,FieldType::Text,false,false,1.0,false,false),\n];\n\nlet meta = IndexMetaObject {\n    id: 0,\n    name: \"test_index\".into(),\n    lexical_similarity: LexicalSimilarity::Bm25f,\n    tokenizer: TokenizerType::UnicodeAlphanumeric,\n    stemmer: StemmerType::None,\n    stop_words: StopwordType::None,\n    frequent_words: FrequentwordType::English,\n    ngram_indexing: NgramSet::NgramFF as u8,\n    document_compression: DocumentCompression::Snappy,\n    access_type: AccessType::Mmap,\n    spelling_correction: Some(SpellingCorrection { max_dictionary_edit_distance: 1, term_length_threshold: Some([2,8].into()),count_threshold: 20,max_dictionary_entries:500_000 }),\n    query_completion: Some(QueryCompletion{max_completion_entries:10_000_000}),\n    clustering: Clustering::None,\n    inference: Inference::None,\n};\n\nlet segment_number_bits1=11;\nlet index_arc=create_index(index_path,meta,\u0026schema,\u0026Vec::new(),segment_number_bits1,false,None).await.unwrap();\n\n# });\n```\n\nopen index (alternatively to create index)\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse std::path::Path;\nuse seekstorm::index::open_index;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap(); \n\n# });\n```\n\nindex documents (from JSON)\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse std::path::Path;\nuse seekstorm::index::{open_index, IndexDocuments};\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap(); \n\nlet documents_json = r#\"\n[{\"title\":\"title1 
test\",\"body\":\"body1\",\"url\":\"url1\"},\n{\"title\":\"title2\",\"body\":\"body2 test\",\"url\":\"url2\"},\n{\"title\":\"title3 test\",\"body\":\"body3 test\",\"url\":\"url3\"}]\"#;\nlet documents_vec=serde_json::from_str(documents_json).unwrap();\n\nindex_arc.index_documents(documents_vec).await; \n\n# });\n```\n\nindex document (from Document)\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::index::{FileType, Document, IndexDocument, open_index};\nuse std::path::Path;\nuse serde_json::Value;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap(); \n\nlet document= Document::from([\n    (\"title\".to_string(), Value::String(\"title4 test\".to_string())),\n    (\"body\".to_string(), Value::String(\"body4 test\".to_string())),\n    (\"url\".to_string(), Value::String(\"url4\".to_string())),\n]);\n\nindex_arc.index_document(document,FileType::None).await;\n\n# });\n```\n\ncommit documents\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::commit::Commit;\nuse seekstorm::index::open_index;\nuse std::path::Path;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap(); \n\nindex_arc.commit().await;\n\n# });\n```\n\nsearch index\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::search::{Search, SearchMode, QueryType, ResultType, QueryRewriting};\nuse seekstorm::index::open_index;\nuse std::path::Path;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap(); \n\nlet query=\"test\".to_string();\nlet query_vector=None;\nlet search_mode=SearchMode::Lexical;\nlet enable_empty_query=false;\nlet offset=0;\nlet length=10;\nlet query_type=QueryType::Intersection; \nlet result_type=ResultType::TopkCount;\nlet include_uncommitted=false;\nlet field_filter=Vec::new();\nlet query_facets=Vec::new();\nlet facet_filter=Vec::new();\nlet result_sort=Vec::new();\nlet query_rewriting= 
QueryRewriting::SearchRewrite { distance: 1, term_length_threshold: Some([2,8].into()), correct:Some(2),complete: Some(3), length: Some(5) };\nlet result_object = index_arc.search(query, query_vector, query_type, search_mode, enable_empty_query, offset, length, result_type,include_uncommitted,field_filter,query_facets,facet_filter,result_sort,query_rewriting).await;\n\n// ### display results\n\nuse seekstorm::highlighter::{Highlight, highlighter};\nuse std::collections::HashSet;\n\nlet highlights:Vec\u003cHighlight\u003e= vec![\n    Highlight {\n        field: \"body\".to_string(),\n        name:String::new(),\n        fragment_number: 2,\n        fragment_size: 160,\n        highlight_markup: true,\n        ..Default::default()\n    },\n];    \n\nlet highlighter=Some(highlighter(\u0026index_arc,highlights, result_object.query_terms).await);\nlet return_fields_filter= HashSet::new();\nlet distance_fields=Vec::new();\nlet mut index=index_arc.write().await;\nfor result in result_object.results.iter() {\n  let doc=index.get_document(result.doc_id,false,\u0026highlighter,\u0026return_fields_filter,\u0026distance_fields).await.unwrap();\n  println!(\"result {} rank {} body field {:?}\" , result.doc_id,result.score, doc.get(\"body\"));\n}\n\nprintln!(\"result counts {} {} {}\",result_object.results.len(), result_object.result_count, result_object.result_count_total);\n\n// ### display suggestions\n\nprintln!(\"original query string: {} query string after correction/completion {}\",result_object.original_query, result_object.query);\n\nfor suggesion in result_object.suggestions.iter() {\n    println!(\"suggestion: {}\", suggesion);\n}\n\n# })\n```\n\n*Query operators and query type*\n\nBoolean queries are specified in the search method either via the query_type parameter or via operator chars within the query parameter.  
\nThe interpretation of operator chars within the query string (set `query_type=QueryType::Union`) lets you specify advanced search operations via a simple search box.\n\nIntersection, AND `+`\n```rust ,no_run\nuse seekstorm::search::QueryType;\nlet query_type=QueryType::Union; \nlet query=\"+red +apple\".to_string();\n```\n\n```rust ,no_run\nuse seekstorm::search::QueryType;\nlet query_type=QueryType::Intersection; \nlet query=\"red apple\".to_string();\n```\n\nUnion, OR\n```rust ,no_run\nuse seekstorm::search::QueryType;\nlet query_type=QueryType::Union; \nlet query=\"red apple\".to_string();\n```\n\nPhrase `\"\"`\n```rust ,no_run\nuse seekstorm::search::QueryType;\nlet query_type=QueryType::Union; \nlet query=\"\\\"red apple\\\"\".to_string();\n```\n\n```rust ,no_run\nuse seekstorm::search::QueryType;\nlet query_type=QueryType::Phrase; \nlet query=\"red apple\".to_string();\n```\n\nExcept, minus, NOT `-`\n```rust ,no_run\nuse seekstorm::search::QueryType;\nlet query_type=QueryType::Union; \nlet query=\"apple -red\".to_string();\n```\n\nMixed phrase and intersection\n```rust ,no_run\nuse seekstorm::search::QueryType;\nlet query_type=QueryType::Union; \nlet query=\"+\\\"the who\\\" +uk\".to_string();\n```\n\n\nmulti-threaded search\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::search::{QueryType, SearchMode, ResultType, QueryRewriting, Search};\nuse std::sync::Arc;\nuse tokio::sync::Semaphore;\nuse seekstorm::index::open_index;\nuse std::path::Path;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap(); \n\nlet query_vec=vec![\"house\".to_string(),\"car\".to_string(),\"bird\".to_string(),\"sky\".to_string()];\nlet query_vector=None;\nlet offset=0;\nlet length=10;\nlet search_mode=SearchMode::Lexical;\nlet enable_empty_query=false;\nlet query_type=QueryType::Union; \nlet result_type=ResultType::TopkCount;\n\nlet include_uncommitted=false;\nlet field_filter=Vec::new();\nlet 
query_facets=Vec::new();\nlet facet_filter=Vec::new();\nlet result_sort=Vec::new();\n\nlet thread_number = 4;\nlet permits = Arc::new(Semaphore::new(thread_number));\nfor query in query_vec {\n    let permit_thread = permits.clone().acquire_owned().await.unwrap();\n\n    let query_clone = query.clone();\n    let query_vector_clone=query_vector.clone();\n    let index_arc_clone = index_arc.clone();\n    let search_mode_clone=search_mode.clone();\n    let enable_empty_query_clone=enable_empty_query.clone();\n    let offset_clone = offset;\n    let length_clone = length;\n    let query_type_clone = query_type.clone();\n    let result_type_clone = result_type.clone();\n    let include_uncommitted_clone=include_uncommitted;\n    let field_filter_clone=field_filter.clone();\n    let query_facets_clone=query_facets.clone();\n    let facet_filter_clone=facet_filter.clone();\n    let result_sort_clone=result_sort.clone();\n\n    tokio::spawn(async move {\n        let rlo = index_arc_clone\n            .search(\n                query_clone,\n                query_vector_clone,\n                query_type_clone,\n                search_mode_clone,\n                enable_empty_query_clone,\n                offset_clone,\n                length_clone,\n                result_type_clone,\n                include_uncommitted_clone,\n                field_filter_clone,\n                query_facets_clone,\n                facet_filter_clone,\n                result_sort_clone,\n                QueryRewriting::SearchOnly\n            )\n            .await;\n\n        println!(\"result count {}\", rlo.result_count);\n        \n        drop(permit_thread);\n    });\n}\n\n# })\n```\n\nFirst, you need to create an index with a schema matching the JSON file fields to ingest:\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse std::path::Path;\nuse 
seekstorm::index::{IndexMetaObject,Clustering,LexicalSimilarity,TokenizerType,StopwordType,FrequentwordType,AccessType,StemmerType,NgramSet,SchemaField,FieldType,SpellingCorrection,QueryCompletion,DocumentCompression,create_index};\nuse seekstorm::vector::Inference;\nuse seekstorm::vector_similarity::VectorSimilarity;\n\nlet index_path=Path::new(\"C:/index/\");\n\nlet schema= vec![\n    // field, stored, indexed, field_type, facet, longest, boost\n    SchemaField::new(\"title\".to_owned(), true, true, false,FieldType::Text, false,false, 10.0,false,false),\n    SchemaField::new(\"body\".to_owned(),true,true,false,FieldType::Text,false,true,1.0,false,false),\n    SchemaField::new(\"url\".to_owned(), true, false,false, FieldType::Text,false,false,1.0,false,false),\n];\n\nlet meta = IndexMetaObject {\n    id: 0,\n    name: \"wikipedia_index\".into(),\n    lexical_similarity: LexicalSimilarity::Bm25f,\n    tokenizer: TokenizerType::UnicodeAlphanumeric,\n    stemmer: StemmerType::None,\n    stop_words: StopwordType::None,\n    frequent_words: FrequentwordType::English,\n    ngram_indexing: NgramSet::NgramFF as u8,\n    document_compression: DocumentCompression::Snappy,\n    access_type: AccessType::Mmap,\n    spelling_correction: Some(SpellingCorrection { max_dictionary_edit_distance: 1, term_length_threshold: Some([2,8].into()),count_threshold: 20,max_dictionary_entries:500_000 }),\n    query_completion: Some(QueryCompletion{max_completion_entries:10_000_000}),\n    clustering: Clustering::None,\n    inference: Inference::None,\n};\n\nlet segment_number_bits1=11;\nlet index_arc=create_index(index_path,meta,\u0026schema,\u0026Vec::new(),segment_number_bits1,false,None).await.unwrap();\n\n# });\n```\n\nThen, index JSON file in JSON, Newline-delimited JSON and Concatenated JSON format\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::ingest::IngestJson;\nuse seekstorm::index::open_index;\nuse std::path::Path;\n\nlet 
index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap(); \n\nlet file_path=Path::new(\"wiki-articles.json\");\nlet _ =index_arc.ingest_json(file_path).await;\n\n# })\n```\n\nindex all PDF files in directory and sub-directories\n- converts pdf to text and indexes it\n- extracts title from metatag, or first line of text, or from filename\n- extracts creation date from metatag, or from file creation date (Unix timestamp: the number of seconds since 1 January 1970)\n- copies all ingested PDF files to the \"files\" subdirectory in the index.\n\nFirst, you need to create an index with the following PDF specific schema (index/schema are automatically created when ingesting via the console `ingest` command):\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse std::path::Path;\nuse seekstorm::index::{IndexMetaObject,Clustering,LexicalSimilarity,TokenizerType,StopwordType,FrequentwordType,AccessType,StemmerType,NgramSet,SchemaField,FieldType,SpellingCorrection,QueryCompletion,DocumentCompression,create_index};\nuse seekstorm::vector::Inference;\nuse seekstorm::vector_similarity::VectorSimilarity;\n\nlet index_path=Path::new(\"C:/index/\");\n\nlet schema= vec![\n    // field, stored, indexed, field_type, facet, longest, boost\n    SchemaField::new(\"title\".to_owned(), true, true,false, FieldType::Text, false,false, 10.0,false,false),\n    SchemaField::new(\"body\".to_owned(),true,true,false,FieldType::Text,false,true,1.0,false,false),\n    SchemaField::new(\"url\".to_owned(), true, false,false, FieldType::Text,false,false,1.0,false,false),\n    SchemaField::new(\"date\".to_owned(), true, false,false, FieldType::Timestamp,true,false,1.0,false,false),\n];\n\nlet meta = IndexMetaObject {\n    id: 0,\n    name: \"pdf_index\".into(),\n    lexical_similarity: LexicalSimilarity::Bm25fProximity,\n    tokenizer: TokenizerType::UnicodeAlphanumeric,\n    stemmer: StemmerType::None,\n    stop_words: StopwordType::None,\n    frequent_words: 
FrequentwordType::English,\n    ngram_indexing: NgramSet::NgramFF as u8,\n    document_compression: DocumentCompression::Snappy,\n    access_type: AccessType::Mmap,\n    spelling_correction: Some(SpellingCorrection { max_dictionary_edit_distance: 1, term_length_threshold: Some([2,8].into()),count_threshold: 20,max_dictionary_entries:500_000 }),\n    query_completion: Some(QueryCompletion{max_completion_entries:10_000_000}),\n    clustering: Clustering::None,\n    inference: Inference::None,\n};\n\nlet segment_number_bits1=11;\nlet index_arc=create_index(index_path,meta,\u0026schema,\u0026Vec::new(),segment_number_bits1,false,None).await.unwrap();\n\n# });\n```\n\nThen, ingest all PDF files from a given path:\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::index::open_index;\nuse std::path::Path;\nuse seekstorm::ingest::IngestPdf;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap();\n\nlet file_path=Path::new(\"C:/Users/johndoe/Downloads\");\nlet _ =index_arc.ingest_pdf(file_path).await;\n\n# });\n```\n\nindex PDF file\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::index::open_index;\nuse std::path::Path;\nuse seekstorm::ingest::IndexPdfFile;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap();\n\nlet file_path=Path::new(\"C:/test.pdf\");\nlet _ =index_arc.index_pdf_file(file_path).await;\n\n# });\n```\n\nindex PDF file bytes\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::index::open_index;\nuse std::path::Path;\nuse std::fs;\nuse chrono::Utc;\nuse seekstorm::ingest::IndexPdfBytes;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap();\n\n//solely used as meta data if it can't be extracted from document bytes\nlet file_date=Utc::now().timestamp();\nlet file_path=Path::new(\"C:/test.pdf\");\n\nlet document = fs::read(file_path).unwrap();\nlet _ 
=index_arc.index_pdf_bytes(file_path, file_date, \u0026document).await;\n\n# });\n```\n\nget PDF file bytes\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::index::open_index;\nuse std::path::Path;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap();\n\nlet doc_id=0;\nlet _file=index_arc.read().await.get_file(doc_id).await.unwrap();\n\n# });\n```\n\nclear index\n```rust ,no_run\n# tokio_test::block_on(async {\nuse seekstorm::index::open_index;\nuse std::path::Path;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap();\n\nindex_arc.write().await.clear_index().await;\n\n# });\n```\n\ndelete index\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::index::open_index;\nuse std::path::Path;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap();\n\nindex_arc.write().await.delete_index();\n\n# });\n```\n\n\niterate through the document IDs of an index\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::{index::open_index,iterator::GetIterator};\nuse std::path::Path;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap();\n\n//display min_docid: the min_docid is NOT always 0, if the first shards are empty!\nlet iterator=index_arc.get_iterator(None,0,1,false,false,vec![]).await;\nprintln!(\"min doc_id: {}\",iterator.results.first().unwrap().doc_id);\n\n//display max_docid\nlet iterator=index_arc.get_iterator(None,0,-1,false,false,vec![]).await;\nprintln!(\"max doc_id: {}\",iterator.results.first().unwrap().doc_id);\n\n//iterate doc_id ascending, display the lowest 10 and then every 10_000th document ID\nlet mut iterator=index_arc.get_iterator(None,0,1,false,false,vec![]).await;\nlet mut i=0;\nif !iterator.results.is_empty() {println!(\"$ i: {} doc_id: {}\",i,iterator.results.first().unwrap().doc_id);}\nwhile 
!iterator.results.is_empty() {           \n    iterator=index_arc.get_iterator(Some(iterator.results.first().unwrap().doc_id),1,1,false,false,vec![]).await;                              \n    i+=1;\n    if !iterator.results.is_empty() \u0026\u0026 ( i % 10_000 ==0 || i\u003c=10 )  {println!(\"i: {} doc_id: {}\",i,iterator.results.first().unwrap().doc_id);}\n}\n\n//iterate doc_id descending, display the highest 10 and then every 10_000th document ID\nlet mut iterator=index_arc.get_iterator(None,0,-1,false,false,vec![]).await;\nlet mut i=0;\nif !iterator.results.is_empty() {println!(\"$ i: {} doc_id: {}\",i,iterator.results.first().unwrap().doc_id);}\nwhile !iterator.results.is_empty() {           \n    iterator=index_arc.get_iterator(Some(iterator.results.first().unwrap().doc_id),1,-1,false,false,vec![]).await;                              \n    i+=1;\n    if !iterator.results.is_empty() \u0026\u0026 ( i % 10_000 ==0 || i\u003c=10 )  {println!(\"i: {} doc_id: {}\",i,iterator.results.first().unwrap().doc_id);}\n}\n\nindex_arc.write().await.delete_index();\n\n# });\n```\n\n\nclose index\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse seekstorm::index::open_index;\nuse seekstorm::index::Close;\nuse std::path::Path;\n\nlet index_path=Path::new(\"C:/index/\");\nlet mut index_arc=open_index(index_path,false).await.unwrap();\n\nindex_arc.close().await;\n\n# });\n```\n\nseekstorm library version string\n```rust ,no_run\nuse seekstorm::index::version;\n\nlet version=version();\nprintln!(\"version {}\",version);\n```\n\u003cbr/\u003e\n\n---\n### Faceted search - Quick start\n\nFacets are defined in 3 different places:\n1. The facet fields are defined in the schema at create_index.\n2. The facet field values are set in index_document at index time.\n3. The query_facets/facet_filter parameters are specified at query time.  
\n   Facets are then returned in the search result object.\n\nA minimal working example of faceted indexing \u0026 search requires just 60 lines of code, but piecing it together from the documentation alone can be tedious, so we provide a quick start example here:\n\nAdd required crates to your project\n```text\ncargo add seekstorm\ncargo add tokio\ncargo add serde_json\n```\n\nUse an asynchronous Rust runtime\n```rust ,no_run\nuse std::error::Error;\n#[tokio::main]\nasync fn main() -\u003e Result\u003c(), Box\u003cdyn Error + Send + Sync\u003e\u003e {\n\n  // your SeekStorm code here\n\n   Ok(())\n}\n```\ncreate index\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse std::path::Path;\nuse std::sync::{Arc, RwLock};\nuse seekstorm::index::{IndexMetaObject, Clustering,LexicalSimilarity,TokenizerType,StopwordType,FrequentwordType,AccessType,StemmerType,NgramSet,DocumentCompression,create_index};\nuse seekstorm::vector::Inference;\nuse seekstorm::vector_similarity::VectorSimilarity;\n\nlet index_path=Path::new(\"C:/index/\");\n\nlet schema_json = r#\"\n[{\"field\":\"title\",\"field_type\":\"Text\",\"store\":false,\"index_lexical\":false},\n{\"field\":\"body\",\"field_type\":\"Text\",\"store\":true,\"index_lexical\":true},\n{\"field\":\"url\",\"field_type\":\"Text\",\"store\":true,\"index_lexical\":false},\n{\"field\":\"town\",\"field_type\":\"String16\",\"store\":false,\"index_lexical\":false,\"facet\":true}]\"#;\nlet schema=serde_json::from_str(schema_json).unwrap();\n\nlet meta = IndexMetaObject {\n    id: 0,\n    name: \"test_index\".into(),\n    lexical_similarity: LexicalSimilarity::Bm25f,\n    tokenizer: TokenizerType::AsciiAlphabetic,\n    stemmer: StemmerType::None,\n    stop_words: StopwordType::None,\n    frequent_words: FrequentwordType::English,\n    ngram_indexing: NgramSet::NgramFF as u8,\n    document_compression: DocumentCompression::Snappy,\n    access_type: AccessType::Mmap,\n    spelling_correction: None,\n    
query_completion: None,\n    clustering: Clustering::None,\n    inference: Inference::None,\n};\n\nlet synonyms=Vec::new();\n\nlet segment_number_bits1=11;\nlet index_arc=create_index(index_path,meta,\u0026schema,\u0026synonyms,segment_number_bits1,false,None).await.unwrap();\n\n# });\n```\n\nindex documents\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse std::path::Path;\nuse seekstorm::index::{IndexDocuments,open_index};\nuse seekstorm::commit::Commit;\n\nlet index_path=Path::new(\"C:/index/\");\nlet index_arc=open_index(index_path,false).await.unwrap();\n\nlet documents_json = r#\"\n[{\"title\":\"title1 test\",\"body\":\"body1\",\"url\":\"url1\",\"town\":\"Berlin\"},\n{\"title\":\"title2\",\"body\":\"body2 test\",\"url\":\"url2\",\"town\":\"Warsaw\"},\n{\"title\":\"title3 test\",\"body\":\"body3 test\",\"url\":\"url3\",\"town\":\"New York\"}]\"#;\nlet documents_vec=serde_json::from_str(documents_json).unwrap();\n\nindex_arc.index_documents(documents_vec).await; \n\n// ### commit documents\n\nindex_arc.commit().await;\n\n# });\n```\n\nsearch index\n```rust ,no_run\n# tokio_test::block_on(async {\n\nuse std::path::Path;\nuse seekstorm::index::{IndexDocuments,open_index};\nuse seekstorm::search::{Search,SearchMode,QueryType,ResultType,QueryFacet,QueryRewriting};\nuse seekstorm::highlighter::{Highlight,highlighter};\nuse std::collections::HashSet;\n\nlet index_path=Path::new(\"C:/index/\");\nlet index_arc=open_index(index_path,false).await.unwrap();\nlet query=\"test\".to_string();\nlet query_vector=None;\nlet search_mode=SearchMode::Lexical;\nlet enable_empty_query=false;\nlet offset=0;\nlet length=10;\nlet query_type=QueryType::Intersection; \nlet result_type=ResultType::TopkCount;\nlet include_uncommitted=false;\nlet field_filter=Vec::new();\n// request facet counts for the \"town\" facet field defined in the schema\nlet query_facets = vec![QueryFacet::String16 {field: \"town\".to_string(),prefix: \"\".to_string(),length:u16::MAX}];\nlet facet_filter=Vec::new();\n//let facet_filter = vec![FacetFilter::String { field: 
\"town\".to_string(),filter: vec![\"Berlin\".to_string()],}];\nlet result_sort=Vec::new();\n\nlet result_object = index_arc.search(query, query_vector, query_type, search_mode, enable_empty_query, offset, length, result_type,include_uncommitted,field_filter,query_facets,facet_filter,result_sort,QueryRewriting::SearchOnly).await;\n\n// ### display results\n\nlet highlights:Vec\u003cHighlight\u003e= vec![\n        Highlight {\n            field: \"body\".to_owned(),\n            name:String::new(),\n            fragment_number: 2,\n            fragment_size: 160,\n            highlight_markup: true,\n            ..Default::default()\n        },\n    ];    \n\nlet highlighter=Some(highlighter(\u0026index_arc,highlights, result_object.query_terms).await);\nlet return_fields_filter= HashSet::new();\nlet distance_fields=Vec::new();\nlet index=index_arc.write().await;\nfor result in result_object.results.iter() {\n  let doc=index.get_document(result.doc_id,false,\u0026highlighter,\u0026return_fields_filter,\u0026distance_fields).await.unwrap();\n  println!(\"result {} rank {} body field {:?}\" , result.doc_id,result.score, doc.get(\"body\"));\n}\nprintln!(\"result counts {} {} {}\",result_object.results.len(), result_object.result_count, result_object.result_count_total);\n\n// ### display facets\n\nprintln!(\"{}\", serde_json::to_string_pretty(\u0026result_object.facets).unwrap());\n\n# });\n```\n\n### Vector search: internal inference\n\ncreate index\n```rust ,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    use std::sync::{Arc, RwLock};\n    use seekstorm::index::{IndexMetaObject, Clustering,LexicalSimilarity,TokenizerType,StopwordType,FrequentwordType,AccessType,StemmerType,NgramSet,DocumentCompression,create_index};\n    use seekstorm::vector::{Embedding, Inference, Model, Precision, Quantization};\n    use seekstorm::vector_similarity::VectorSimilarity;\n\n    let index_path=Path::new(\"tests/index_test/\");\n\n    let schema_json = r#\"\n    
[{\"field\":\"title\",\"field_type\":\"Text\",\"store\":false,\"index_lexical\":false,\"index_vector\":true},\n    {\"field\":\"body\",\"field_type\":\"Text\",\"store\":true,\"index_lexical\":false,\"index_vector\":true},\n    {\"field\":\"url\",\"field_type\":\"Text\",\"store\":false,\"index_lexical\":false,\"index_vector\":false}]\"#;\n    let schema=serde_json::from_str(schema_json).unwrap();\n\n    let meta = IndexMetaObject {\n        id: 0,\n        name: \"test_index\".into(),\n        lexical_similarity: LexicalSimilarity::Bm25f,\n        tokenizer: TokenizerType::UnicodeAlphanumeric,\n        stemmer: StemmerType::None,\n        stop_words: StopwordType::None ,\n        frequent_words: FrequentwordType::English,\n        ngram_indexing: NgramSet::SingleTerm as u8 ,\n        document_compression: DocumentCompression::Snappy,\n        access_type: AccessType::Mmap,\n        spelling_correction: None,\n        query_completion: None,\n        clustering: Clustering::None,\n        inference: Inference::Model2Vec { model: Model::PotionBase2M, chunk_size: 1000, quantization: Quantization::I8 },\n    };\n    \n    let segment_number_bits1=11;\n    let index_arc=create_index(index_path,meta,\u0026schema,\u0026Vec::new(),segment_number_bits1,false,None).await.unwrap();\n    let index=index_arc.read().await;\n\n    let result=index.meta.id;\n    assert_eq!(result, 0);\n# });\n```\n\nindex documents/vectors\n```rust ,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    use seekstorm::index::{IndexDocuments,open_index};\n    use seekstorm::commit::Commit;\n\n    // open index\n    let index_path=Path::new(\"tests/index_test/\");\n    let index_arc=open_index(index_path,false).await.unwrap(); \n\n    // index documents\n    let documents_json = r#\"\n    [{\"title\":\"pink panther\",\"body\":\"animal from a comedy\",\"url\":\"url1\"},\n    {\"title\":\"blue whale\",\"body\":\"largest mammal in the ocean\",\"url\":\"url2\"},\n    {\"title\":\"red 
fox\",\"body\":\"small carnivorous mammal\",\"url\":\"url3\"}]\"#;\n    let documents_vec=serde_json::from_str(documents_json).unwrap();\n    index_arc.index_documents(documents_vec).await;\n\n    // wait until all index threads are finished and commit\n    index_arc.commit().await;\n\n    let result=index_arc.read().await.indexed_doc_count().await;\n    assert_eq!(result, 3);\n# });\n```\n\nquery documents/vectors\n```rust ,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    use seekstorm::index::{IndexDocuments,open_index};\n    use seekstorm::search::{Search,SearchMode,QueryType,ResultType,QueryFacet,QueryRewriting};\n    use seekstorm::vector_similarity::{AnnMode, VectorSimilarity};\n    use seekstorm::commit::Commit;\n    use seekstorm::highlighter::{Highlight,highlighter};\n    use std::collections::HashSet;\n\n   // open index\n    let index_path=Path::new(\"tests/index_test/\");\n    let index_arc=open_index(index_path,false).await.unwrap(); \n\n    let result=index_arc.read().await.indexed_doc_count().await;\n    assert_eq!(result, 3);\n\n    let query=\"rosy panther\".into();\n    let result_object = index_arc\n        .search(\n            query,\n            None,\n            QueryType::Union,\n            SearchMode::Vector { similarity_threshold: Some(0.7), ann_mode: AnnMode::All },\n            false,\n            0,\n            10,\n            ResultType::TopkCount,\n            false,\n            Vec::new(),\n            Vec::new(),\n            Vec::new(),\n            Vec::new(),\n            QueryRewriting::SearchOnly,\n        )\n        .await;\n\n    let result=result_object.results.len();\n    assert_eq!(result, 1);\n\n    let result=result_object.result_count;\n    assert_eq!(result, 1);\n\n    let result=result_object.result_count_total;\n    assert_eq!(result, 1);\n# });\n```\n\n### Vector search: external inference\n\ncreate index\n```rust ,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    
use std::sync::{Arc, RwLock};\n    use seekstorm::index::{IndexMetaObject, Clustering,LexicalSimilarity,TokenizerType,StopwordType,FrequentwordType,AccessType,StemmerType,NgramSet,DocumentCompression,create_index};\n    use seekstorm::vector::{Embedding, Inference, Model, Precision, Quantization};\n    use seekstorm::vector_similarity::VectorSimilarity;\n\n   let index_path=Path::new(\"tests/index_test/\");\n\n    let schema_json = r#\"\n    [{\"field\":\"vector\",\"field_type\":\"Json\",\"store\":false,\"index_lexical\":false,\"index_vector\":true},\n    {\"field\":\"index\",\"field_type\":\"Text\",\"store\":true,\"index_lexical\":false,\"index_vector\":false}]\"#;\n    let schema=serde_json::from_str(schema_json).unwrap();\n\n    let meta = IndexMetaObject {\n        id: 0,\n        name: \"test_index\".into(),\n        lexical_similarity: LexicalSimilarity::Bm25f,\n        tokenizer: TokenizerType::UnicodeAlphanumeric,\n        stemmer: StemmerType::None,\n        stop_words: StopwordType::None ,\n        frequent_words: FrequentwordType::English,\n        ngram_indexing: NgramSet::SingleTerm as u8 ,\n        document_compression: DocumentCompression::Snappy,\n        access_type: AccessType::Mmap,\n        spelling_correction: None,\n        query_completion: None,\n        clustering: Clustering::None,\n        inference: Inference::External { dimensions: 128, precision: Precision::F32,  quantization: Quantization::None,similarity:VectorSimilarity::Euclidean },\n    };\n    \n    let segment_number_bits1=11;\n    let index_arc=create_index(index_path,meta,\u0026schema,\u0026Vec::new(),segment_number_bits1,false,None).await.unwrap();\n    let index=index_arc.read().await;\n\n    let result=index.meta.id;\n    assert_eq!(result, 0);\n\n# });\n```\n\nindex documents/vectors\n```rust ,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    use seekstorm::index::{IndexDocuments,open_index};\n    use seekstorm::commit::Commit;\n\n    // open index\n  
  let index_path=Path::new(\"tests/index_test/\");\n    let index_arc=open_index(index_path,false).await.unwrap(); \n\n    // index documents\n    let documents_json = r#\"\n    [{\"vector\":[0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.010, 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017, 0.018, 0.019, 0.020, 0.021, 0.022, 0.023, 0.024, 0.025, 0.026, 0.027, 0.028, 0.029, 0.030, 0.031, 0.032, 0.033, 0.034, 0.035, 0.036, 0.037, 0.038, 0.039, 0.040, 0.041, 0.042, 0.043, 0.044, 0.045, 0.046, 0.047, 0.048, 0.049, 0.050, 0.051, 0.052, 0.053, 0.054, 0.055, 0.056, 0.057, 0.058, 0.059, 0.060, 0.061, 0.062, 0.063, 0.064, 0.065, 0.066, 0.067, 0.068, 0.069, 0.070, 0.071, 0.072, 0.073, 0.074, 0.075, 0.076, 0.077, 0.078, 0.079, 0.080, 0.081, 0.082, 0.083, 0.084, 0.085, 0.086, 0.087, 0.088, 0.089, 0.090, 0.091, 0.092, 0.093, 0.094, 0.095, 0.096, 0.097, 0.098, 0.099, 0.100, 0.101, 0.102, 0.103, 0.104, 0.105, 0.106, 0.107, 0.108, 0.109, 0.110, 0.111, 0.112, 0.113, 0.114, 0.115, 0.116, 0.117, 0.118, 0.119, 0.120, 0.121, 0.122, 0.123, 0.124, 0.125, 0.126, 0.127, 0.128],\"index\":\"0\"},\n    {\"vector\":[0.129, 0.130, 0.131, 0.132, 0.133, 0.134, 0.135, 0.136, 0.137, 0.138, 0.139, 0.140, 0.141, 0.142, 0.143, 0.144, 0.145, 0.146, 0.147, 0.148, 0.149, 0.150, 0.151, 0.152, 0.153, 0.154, 0.155, 0.156, 0.157, 0.158, 0.159, 0.160, 0.161, 0.162, 0.163, 0.164, 0.165, 0.166, 0.167, 0.168, 0.169, 0.170, 0.171, 0.172, 0.173, 0.174, 0.175, 0.176, 0.177, 0.178, 0.179, 0.180, 0.181, 0.182, 0.183, 0.184, 0.185, 0.186, 0.187, 0.188, 0.189, 0.190, 0.191, 0.192, 0.193, 0.194, 0.195, 0.196, 0.197, 0.198, 0.199, 0.200, 0.201, 0.202, 0.203, 0.204, 0.205, 0.206, 0.207, 0.208, 0.209, 0.210, 0.211, 0.212, 0.213, 0.214, 0.215, 0.216, 0.217, 0.218, 0.219, 0.220, 0.221, 0.222, 0.223, 0.224, 0.225, 0.226, 0.227, 0.228, 0.229, 0.230, 0.231, 0.232, 0.233, 0.234, 0.235, 0.236, 0.237, 0.238, 0.239, 0.240, 0.241, 0.242, 0.243, 0.244, 0.245, 0.246, 0.247, 0.248, 0.249, 0.250, 0.251, 0.252, 0.253, 
0.254, 0.255, 0.256],\"index\":\"1\"},\n    {\"vector\":[0.257, 0.258, 0.259, 0.260, 0.261, 0.262, 0.263, 0.264, 0.265, 0.266, 0.267, 0.268, 0.269, 0.270, 0.271, 0.272, 0.273, 0.274, 0.275, 0.276, 0.277, 0.278, 0.279, 0.280, 0.281, 0.282, 0.283, 0.284, 0.285, 0.286, 0.287, 0.288, 0.289, 0.290, 0.291, 0.292, 0.293, 0.294, 0.295, 0.296, 0.297, 0.298, 0.299, 0.300, 0.301, 0.302, 0.303, 0.304, 0.305, 0.306, 0.307, 0.308, 0.309, 0.310, 0.311, 0.312, 0.313, 0.314, 0.315, 0.316, 0.317, 0.318, 0.319, 0.320, 0.321, 0.322, 0.323, 0.324, 0.325, 0.326, 0.327, 0.328, 0.329, 0.330, 0.331, 0.332, 0.333, 0.334, 0.335, 0.336, 0.337, 0.338, 0.339, 0.340, 0.341, 0.342, 0.343, 0.344, 0.345, 0.346, 0.347, 0.348, 0.349, 0.350, 0.351, 0.352, 0.353, 0.354, 0.355, 0.356, 0.357, 0.358, 0.359, 0.360, 0.361, 0.362, 0.363, 0.364, 0.365, 0.366, 0.367, 0.368, 0.369, 0.370, 0.371, 0.372, 0.373, 0.374, 0.375, 0.376, 0.377, 0.378, 0.379, 0.380, 0.381, 0.382, 0.383, 0.384],\"index\":\"2\"}]\"#;\n    let documents_vec=serde_json::from_str(documents_json).unwrap();\n    index_arc.index_documents(documents_vec).await;\n\n    // wait until all index threads are finished and commit\n    index_arc.commit().await;\n\n    let result=index_arc.read().await.indexed_doc_count().await;\n    assert_eq!(result, 3);\n\n# });\n```\n\nquery documents/vectors\n```rust ,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    use seekstorm::index::{IndexDocuments,open_index};\n    use seekstorm::search::{Search,SearchMode,QueryType,ResultType,QueryFacet,QueryRewriting};\n    use seekstorm::vector_similarity::{AnnMode, VectorSimilarity};\n    use seekstorm::commit::Commit;\n    use seekstorm::vector::Embedding;\n    use seekstorm::highlighter::{Highlight,highlighter};\n    use std::collections::HashSet;\n\n  // open index\n    let index_path=Path::new(\"tests/index_test/\");\n    let index_arc=open_index(index_path,false).await.unwrap(); \n\n    let 
result=index_arc.read().await.indexed_doc_count().await;\n    assert_eq!(result, 3);\n\n    let query=String::new();\n    let query_vector = vec![0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.010, 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017, 0.018, 0.019, 0.020, 0.021, 0.022, 0.023, 0.024, 0.025, 0.026, 0.027, 0.028, 0.029, 0.030, 0.031, 0.032, 0.033, 0.034, 0.035, 0.036, 0.037, 0.038, 0.039, 0.040, 0.041, 0.042, 0.043, 0.044, 0.045, 0.046, 0.047, 0.048, 0.049, 0.050, 0.051, 0.052, 0.053, 0.054, 0.055, 0.056, 0.057, 0.058, 0.059, 0.060, 0.061, 0.062, 0.063, 0.064, 0.065, 0.066, 0.067, 0.068, 0.069, 0.070, 0.071, 0.072, 0.073, 0.074, 0.075, 0.076, 0.077, 0.078, 0.079, 0.080, 0.081, 0.082, 0.083, 0.084, 0.085, 0.086, 0.087, 0.088, 0.089, 0.090, 0.091, 0.092, 0.093, 0.094, 0.095, 0.096, 0.097, 0.098, 0.099, 0.100, 0.101, 0.102, 0.103, 0.104, 0.105, 0.106, 0.107, 0.108, 0.109, 0.110, 0.111, 0.112, 0.113, 0.114, 0.115, 0.116, 0.117, 0.118, 0.119, 0.120, 0.121, 0.122, 0.123, 0.124, 0.125, 0.126, 0.127, 0.128];\n    let query_embedding=Embedding::F32(query_vector);\n    let result_object = index_arc\n        .search(\n            query,\n            Some(query_embedding),\n            QueryType::Union,\n            SearchMode::Vector { similarity_threshold: None, ann_mode: AnnMode::All },\n            false,\n            0,\n            10,\n            ResultType::TopkCount,\n            false,\n            Vec::new(),\n            Vec::new(),\n            Vec::new(),\n            Vec::new(),\n            QueryRewriting::SearchOnly,\n        )\n        .await;\n\n    let result=result_object.results.len();\n    assert_eq!(result, 3);\n\n    let result=result_object.result_count;\n    assert_eq!(result, 3);\n\n    let result=result_object.result_count_total;\n    assert_eq!(result, 3);\n\n# });\n```\n\n### Vector search: SIFT1M dataset\n\n- 1 million vectors, 128 dimensions, f32 precision\n- nprobe=16 -\u003e recall@10=95%, average latency=0.18 
milliseconds \n- nprobe=33 -\u003e recall@10=99%, average latency=0.30 milliseconds \n\ncreate index\n```rust ,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    use std::sync::{Arc, RwLock};\n    use seekstorm::index::{IndexMetaObject, Clustering,LexicalSimilarity,TokenizerType,StopwordType,FrequentwordType,AccessType,StemmerType,NgramSet,DocumentCompression,create_index};\n    use seekstorm::vector::{Embedding, Inference, Model, Precision, Quantization};\n    use seekstorm::vector_similarity::VectorSimilarity;\n\n   let index_path=Path::new(\"tests/index_test/\");\n\n    let schema_json = r#\"\n    [{\"field\":\"vector\",\"field_type\":\"Json\",\"store\":false,\"index_lexical\":false,\"index_vector\":true},\n    {\"field\":\"index\",\"field_type\":\"Text\",\"store\":true,\"index_lexical\":false,\"index_vector\":false}]\"#;\n    let schema=serde_json::from_str(schema_json).unwrap();\n\n    let meta = IndexMetaObject {\n        id: 0,\n        name: \"test_index\".into(),\n        lexical_similarity: LexicalSimilarity::Bm25f,\n        tokenizer: TokenizerType::UnicodeAlphanumeric,\n        stemmer: StemmerType::None,\n        stop_words: StopwordType::None ,\n        frequent_words: FrequentwordType::English,\n        ngram_indexing: NgramSet::SingleTerm as u8 ,\n        document_compression: DocumentCompression::Snappy,\n        access_type: AccessType::Mmap,\n        spelling_correction: None,\n        query_completion: None,\n        clustering: Clustering::Auto,\n        inference: Inference::External { dimensions: 128, precision: Precision::F32, quantization: Quantization::I8, similarity:VectorSimilarity::Euclidean },\n    };\n    \n    let segment_number_bits1=11;\n    let index_arc=create_index(index_path,meta,\u0026schema,\u0026Vec::new(),segment_number_bits1,false,None).await.unwrap();\n    let index=index_arc.read().await;\n\n    let result=index.meta.id;\n    assert_eq!(result, 0);\n\n# });\n```\n\nindex documents/vectors\n```rust 
,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    use seekstorm::index::{IndexDocuments,open_index};\n    use seekstorm::commit::Commit;\n    use seekstorm::ingest::{read_fvecs,ingest_sift};\n\n    // open index\n    let index_path=Path::new(\"tests/index_test/\");\n    let index_arc=open_index(index_path,false).await.unwrap(); \n\n    // index documents \n    // download the SIFT1M data from http://corpus-texmex.irisa.fr/\n    ingest_sift(\u0026index_arc, Path::new(r\"C:\\testset\\sift_base.fvecs\"), None).await;\n\n    // sift_base.fvecs contains 1 million vectors\n    let result=index_arc.read().await.indexed_doc_count().await;\n    assert_eq!(result, 1_000_000);\n\n# });\n```\n\nquery documents/vectors\n```rust ,no_run\n# tokio_test::block_on(async {\n\n    use std::path::Path;\n    use std::collections::HashSet;\n    use std::time::Instant;\n    use indexmap::IndexMap;\n    use seekstorm::index::{IndexDocuments,open_index};\n    use seekstorm::search::{Search,SearchMode,QueryType,ResultType,QueryFacet,QueryRewriting};\n    use seekstorm::vector_similarity::{AnnMode, VectorSimilarity};\n    use seekstorm::commit::Commit;\n    use seekstorm::vector::Embedding;\n    use seekstorm::ingest::{read_fvecs, read_ivecs, ingest_sift};\n    use seekstorm::highlighter::{Highlight,highlighter};\n    use num_format::{Locale, ToFormattedString};\n    use serde_json::Value;\n\n    // open index\n    let index_path=Path::new(\"tests/index_test/\");\n    let index_arc=open_index(index_path,false).await.unwrap(); \n\n    let result=index_arc.read().await.indexed_doc_count().await;\n    assert_eq!(result, 1_000_000);\n\n    let query=\"\";\n\n    let len=10;\n    let similarity_threshold=None;\n    let field_filter=Vec::new();\n    let fields_hashset=HashSet::new();\n\n    let mut search_time_sum=0;\n    let mut results_sum=0;\n    let mut result_count_total_sum=0;\n    let mut observed_cluster_count_sum=0;\n    let mut observed_vector_count_sum=0;\n    let mut recall_count_sum=0;\n\n    if let Ok(ground_truth) = 
read_ivecs(r\"C:\\testset\\sift_groundtruth.ivecs\") {\n\n        if let Ok(queries) = read_fvecs(r\"C:\\testset\\sift_query.fvecs\") {\n\n            let queries_len=queries.len();\n\n            for (query_idx, query_embedding) in queries.into_iter().enumerate().take(queries_len) {\n\n                let ground_truth_for_query:IndexMap\u003cusize, usize\u003e = ground_truth[query_idx].iter().take(len).enumerate().map(|(i, x)| (*x as usize, i)).collect();\n\n                let query_embedding=Embedding::F32(query_embedding);\n\n                let start_time = Instant::now();\n\n                let result_object_vector = index_arc\n                .search(\n                    query.to_string(),\n                    Some(query_embedding),\n                    QueryType::Intersection,\n                    SearchMode::Vector { similarity_threshold , ann_mode: AnnMode::Nprobe(16)},\n                    false,\n                    0,\n                    len,\n                    ResultType::Topk,\n                    false,\n                    field_filter.clone(),\n                    Vec::new(),\n                    Vec::new(),\n                    Vec::new(),\n                    QueryRewriting::SearchOnly,\n                )\n                .await;\n\n                let search_time = start_time.elapsed().as_nanos() as i64;\n                search_time_sum+=search_time;\n                results_sum+=result_object_vector.results.len();\n                result_count_total_sum+=result_object_vector.result_count_total;\n                observed_cluster_count_sum+=result_object_vector.observed_cluster_count;\n                observed_vector_count_sum+=result_object_vector.observed_vector_count;\n\n                let mut recall_count=0;\n                for (i, result) in result_object_vector.results.iter().enumerate() {\n                    let doc = index_arc.read().await.get_document(result.doc_id, false,\u0026None, \u0026fields_hashset, 
\u0026Vec::new()).await.ok();\n                    let index_value= if let Some(doc) = \u0026doc { \n                          if let Some(index_field) = doc.get(\"index\") { index_field } else { \u0026Value::String(\"\".to_string()) } \n                        } \n                        else { \u0026Value::String(\"\".to_string()) };\n                    let index_string=serde_json::from_value::\u003cString\u003e(index_value.clone()).unwrap_or(index_value.to_string());\n                    let idx=index_string.parse::\u003cusize\u003e().unwrap_or(0);\n                    if ground_truth_for_query.contains_key(\u0026idx) { recall_count+=1; }\n\n                }\n\n                recall_count_sum+=recall_count;\n            }\n\n            let indexed_vector_count=index_arc.read().await.indexed_vector_count().await;\n            let indexed_cluster_count=index_arc.read().await.indexed_cluster_count().await;\n\n            println!(\"Search time: {} µs  result count {} result count total: {} clusters observed: {:.2}% ({} of {}) vectors observed: {:.2}% ({} of {}) recall: {:.2}%\", \n            (search_time_sum as usize/1000/queries_len).to_formatted_string(\u0026Locale::en), \n            results_sum.to_formatted_string(\u0026Locale::en), \n            result_count_total_sum.to_formatted_string(\u0026Locale::en),\n            (observed_cluster_count_sum as f64) / queries_len as f64 / (indexed_cluster_count as f64) * 100.0,\n            (observed_cluster_count_sum/queries_len).to_formatted_string(\u0026Locale::en), \n            indexed_cluster_count.to_formatted_string(\u0026Locale::en),\n            (observed_vector_count_sum as f64) / queries_len as f64 / (indexed_vector_count as f64) * 100.0,\n            (observed_vector_count_sum/queries_len).to_formatted_string(\u0026Locale::en), \n            indexed_vector_count.to_formatted_string(\u0026Locale::en),\n            (recall_count_sum as f64) / queries_len as f64 / (len as f64) * 100.0); \n            
println!();\n        }\n    }\n\n# });\n```\n\n---\n\n## Demo time \n\n### Build a Wikipedia search engine with the SeekStorm server\n\nA quick step-by-step tutorial on how to build a Wikipedia search engine from a Wikipedia corpus using the SeekStorm server in 5 easy steps.\n\n\u003cimg src=\"assets/wikipedia_demo.png\" width=\"800\"\u003e\n\n**Download SeekStorm**\n\n[Download SeekStorm from the GitHub repository](https://github.com/SeekStorm/SeekStorm/archive/refs/heads/main.zip)  \nUnzip it in a directory of your choice and open it in Visual Studio Code.\n\nor alternatively\n\n```text\ngit clone https://github.com/SeekStorm/SeekStorm.git\n```\n\n**Build SeekStorm**\n\nInstall Rust (if not yet present): https://www.rust-lang.org/tools/install  \n\nIn the terminal of Visual Studio Code type:\n```text\ncargo build --release\n```\n\n**Get Wikipedia corpus**\n\nPreprocessed English Wikipedia corpus (5,032,105 documents, 8.28 GB decompressed). \nAlthough wiki-articles.json has a .json extension, it is not a valid JSON file. \nIt is a text file, where every line contains a JSON object with url, title and body attributes. \nThe format is called [ndjson](https://github.com/ndjson/ndjson-spec) (\"Newline delimited JSON\").\n\n[Download Wikipedia corpus](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0)\n\nDecompress the Wikipedia corpus. 
\n\nhttps://gnuwin32.sourceforge.net/packages/bzip2.htm\n```text\nbunzip2 wiki-articles.json.bz2\n```\n\nMove the decompressed wiki-articles.json to the release directory (target/release).\n\n**Start SeekStorm server**\n```text\ncd target/release\n```\n```text\n./seekstorm_server local_ip=\"0.0.0.0\" local_port=80\n```\n\n**Indexing** \n\nType 'ingest' into the command line of the running SeekStorm server: \n```text\ningest\n```\n\nThis creates the demo index and indexes the local Wikipedia file.\n\n\u003cimg src=\"assets/server_info.png\" width=\"800\" alt=\"server info\"\u003e\n\u003cbr\u003e\n\n**Start searching within the embedded WebUI**\n\nOpen the embedded Web UI in a browser: [http://127.0.0.1](http://127.0.0.1)\n\nEnter a query into the search box.\n\n**Testing the REST API endpoints**\n\nOpen src/seekstorm_server/test_api.rest in Visual Studio Code with the \"REST Client\" extension to execute API calls and inspect responses.\n\n[interactive API endpoint examples](https://github.com/SeekStorm/SeekStorm/blob/main/src/seekstorm_server/test_api.rest)\n\nSet the 'individual API key' in test_api.rest to the API key displayed in the server console when you typed 'ingest' above.\n\n**Remove demo index**\n\nType 'delete' into the command line of the running SeekStorm server: \n```text\ndelete\n```\n\n**Shut down server**\n\nType 'quit' into the command line of the running SeekStorm server.\n```text\nquit\n```\n\n**Customizing**\n\nDo you want to use something similar for your own project?\nHave a look at the [ingest](/src/seekstorm_server/README.md#console-commands) and [web UI](/src/seekstorm_server/README.md#open-embedded-web-ui-in-browser) documentation.\n\n\n\n\n\n### Build a PDF search engine with the SeekStorm server\n\nA quick step-by-step tutorial on how to build a PDF search engine from a directory that contains PDF files using the SeekStorm server.  
\nMake all your scientific papers, ebooks, resumes, reports, contracts, documentation, manuals, letters, bank statements, invoices and delivery notes searchable - at home or in your organisation.  \n\n\u003cimg src=\"assets/pdf_search.png\" width=\"800\"\u003e\n\n**Build SeekStorm**\n\nInstall Rust (if not yet present): https://www.rust-lang.org/tools/install  \n\nIn the terminal of Visual Studio Code type:\n```text\ncargo build --release\n```\n\n**Download PDFium**\n\nDownload the PDFium library and copy it into the same folder as seekstorm_server.exe: https://github.com/bblanchon/pdfium-binaries\n\n**Start SeekStorm server**\n```text\ncd target/release\n```\n```text\n./seekstorm_server local_ip=\"0.0.0.0\" local_port=80\n```\n\n**Indexing** \n\nChoose a directory that contains PDF files you want to index and search, e.g. your documents or download directory.\n\nType 'ingest' followed by the directory path into the command line of the running SeekStorm server: \n```text\ningest C:\\Users\\JohnDoe\\Downloads\n```\n\nThis creates the pdf_index and indexes all PDF files from the specified directory, including subdirectories.\n\n**Start searching within the embedded WebUI**\n\nOpen the embedded Web UI in a browser: [http://127.0.0.1](http://127.0.0.1)\n\nEnter a query into the search box.\n\n**Remove demo index**\n\nType 'delete' into the command line of the running SeekStorm server: \n```text\ndelete\n```\n\n**Shut down server**\n\nType 'quit' into the command line of the running SeekStorm server.\n```text\nquit\n```\n\n\n\n\n\n### Online Demo: DeepHN Hacker News search\n\nFull-text search 30M Hacker News posts AND linked web pages\n\n[DeepHN.org](https://deephn.org/)\n\n\u003cimg src=\"assets/deephn_demo.png\" width=\"800\"\u003e\n\nThe DeepHN demo is still based on the SeekStorm C# codebase.  \nWe are currently porting all required missing features.  \nSee roadmap below.  \n\n---\n\n## Blog Posts\n\n- Search\n  - [N-gram index for faster phrase search: latency vs. 
size](https://seekstorm.com/blog/n-gram-indexing-for-faster-phrase-search/)\n  - [SeekStorm sharded index architecture - using a multi-core processor like a miniature data center](https://seekstorm.com/blog/SeekStorm-sharded-index-architecture/)\n  - [SeekStorm gets Faceted search, Geo proximity search, Result sorting](https://seekstorm.com/blog/faceted_search-geo-proximity-search/)\n  - [What is faceted search?](https://seekstorm.com/blog/what-is-faceted-search/)\n  - [SeekStorm is now Open Source](https://seekstorm.com/blog/sneak-peek-seekstorm-rust/)\n  - [Tail latencies and percentiles](https://seekstorm.com/blog/tail-latencies-and-percentiles/)\n- Query auto-completion\n  - [Typo-tolerant Query auto-completion (QAC) - derived from indexed documents](https://seekstorm.com/blog/query-auto-completion-(QAC)/)\n  - [The Pruning Radix Trie — a Radix Trie on steroids](https://seekstorm.com/blog/pruning-radix-trie/)\n- Query spelling correction\n  - [Sub-millisecond compound aware automatic spelling correction](https://seekstorm.com/blog/sub-millisecond-compound-aware-automatic.spelling-correction/)\n  - [SymSpell vs. BK-tree: 100x faster fuzzy string search \u0026 spell checking](https://seekstorm.com/blog/symspell-vs-bk-tree/)\n  - [1000x Faster Spelling Correction algorithm](https://seekstorm.com/blog/1000x-spelling-correction/)\n- Chinese word segmentation\n  - [Fast Word Segmentation of Noisy Text](https://seekstorm.com/blog/fast-word-segmentation-noisy-text/)\n\n---\n\n## Roadmap\n\nThe following new features are planned to be implemented.  \nAre you missing something? 
Let us know via issue or discussions.\n\n**Improvements**\n\n* Relevancy benchmarks: BeIR, MS MARCO\n\n**New features**\n\n* ✅ Native vector search\n* Geocoding, reverse geocoding, GeoJSON\n* Model Context Protocol (MCP) server for Retrieval Augmented Generation (RAG)\n* Split of storage and compute\n  * Use S3 object storage as index backend\n  * Use Distributed Key-Value store as index backend\n* Elasticity: automatic spawning and winding down of shards in the cloud depending on index size and load.\n* Distributed search cluster (currently PoC)\n* More tokenizer types (Japanese, Korean)\n* WebAssembly (Wasm)\n* Wrapper/bindings in JavaScript, Python, Java, C#, C, Go for the SeekStorm Rust library\n* Client libraries/SDK in JavaScript, Python, Java, C#, C, Go, Rust for the SeekStorm server REST API\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseekstorm%2Fseekstorm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseekstorm%2Fseekstorm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseekstorm%2Fseekstorm/lists"}