{"id":13451542,"url":"https://github.com/quickwit-oss/tantivy-cli","last_synced_at":"2025-04-05T10:09:17.966Z","repository":{"id":39750772,"uuid":"52072365","full_name":"quickwit-oss/tantivy-cli","owner":"quickwit-oss","description":null,"archived":false,"fork":false,"pushed_at":"2024-04-23T07:53:11.000Z","size":404,"stargazers_count":312,"open_issues_count":13,"forks_count":59,"subscribers_count":12,"default_branch":"main","last_synced_at":"2024-08-05T11:11:55.273Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quickwit-oss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-02-19T08:16:43.000Z","updated_at":"2024-08-05T11:12:00.668Z","dependencies_parsed_at":"2024-01-14T05:00:22.734Z","dependency_job_id":"c92349d1-4af5-4c76-89fa-a98046696bd3","html_url":"https://github.com/quickwit-oss/tantivy-cli","commit_stats":null,"previous_names":["quickwit-inc/tantivy-cli","tantivy-search/tantivy-cli"],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-cli","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-cli/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-cli/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-cli/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/quickwit-oss","download_url":"https://codeload.github.com/quickwit-oss/tantivy-cli/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247318745,"owners_count":20919484,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T07:00:55.550Z","updated_at":"2025-04-05T10:09:17.944Z","avatar_url":"https://github.com/quickwit-oss.png","language":"Rust","funding_links":[],"categories":["Rust","others"],"sub_categories":[],"readme":"[![Docs](https://docs.rs/tantivy/badge.svg)](https://docs.rs/crate/tantivy-cli/)\n[![Join the chat at https://discord.gg/MT27AG5EVE](https://shields.io/discord/908281611840282624?label=chat%20on%20discord)](https://discord.gg/MT27AG5EVE)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy-cli)\n\n\n# tantivy-cli\n\n`tantivy-cli` is the command line interface for the [tantivy](https://github.com/quickwit-inc/tantivy) search engine. 
It provides indexing and search capabilities, and is suitable for smaller projects.\n\nFor a more complete solution around tantivy, you may use\n- https://github.com/quickwit-oss/quickwit\n- https://github.com/lnx-search/lnx\n\n# Tutorial: Indexing Wikipedia with Tantivy CLI\n\n## Introduction\n\nIn this tutorial, we will create a brand new index with the articles of English wikipedia in it.\n\n## Installing the tantivy CLI.\n\nThere are a couple ways to install `tantivy-cli`.\n\nIf you are a Rust programmer, you probably have `cargo` installed and you can just:\n\n```bash\ncargo install tantivy-cli --locked\n```\n\n## Creating the index:  `new`\n \nLet's create a directory in which your index will be stored.\n\n```bash\nmkdir wikipedia-index\n```\n\nWe will now initialize the index and create its schema.\nThe [schema](https://quickwit-oss.github.io/tantivy/tantivy/schema/index.html) defines\nthe list of your fields, and for each field:\n- its name \n- its type, currently `u64`, `i64` or `str`\n- how it should be indexed.\n\nYou can find more information about the latter on \n[tantivy's schema documentation page](https://quickwit-oss.github.io/tantivy/tantivy/schema/index.html)\n\nIn our case, our documents will contain\n* a title\n* a body \n* a url\n\nWe want the title and the body to be tokenized and indexed. We also want \nto add the term frequency and term positions to our index.\n\nRunning `tantivy new` will start a wizard that will help you\ndefine the schema of the new index.\n\nLike all the other commands of `tantivy`, you will have to \npass it your index directory via the `-i` or `--index`\nparameter as follows:\n\n```bash\ntantivy new -i wikipedia-index\n```\n\nAnswer the questions as follows:\n\n```none\n\n    Creating new index \n    Let's define its schema! \n\n\n\n    New field name  ? title\n    Choose Field Type (Text/u64/i64/f64/Date/Facet/Bytes) ? Text\n    Should the field be stored (Y/N) ? Y\n    Should the field be indexed (Y/N) ? 
Y\n    Should the term be tokenized? (Y/N) ? Y\n    Should the term frequencies (per doc) be in the index (Y/N) ? Y\n    Should the term positions (per doc) be in the index (Y/N) ? Y\n    Add another field (Y/N) ? Y\n    \n    \n    \n    New field name  ? body\n    Choose Field Type (Text/u64/i64/f64/Date/Facet/Bytes) ? Text\n    Should the field be stored (Y/N) ? Y\n    Should the field be indexed (Y/N) ? Y\n    Should the term be tokenized? (Y/N) ? Y\n    Should the term frequencies (per doc) be in the index (Y/N) ? Y\n    Should the term positions (per doc) be in the index (Y/N) ? Y\n    Add another field (Y/N) ? Y\n    \n    \n    \n    New field name  ? url\n    Choose Field Type (Text/u64/i64/f64/Date/Facet/Bytes) ? Text\n    Should the field be stored (Y/N) ? Y\n    Should the field be indexed (Y/N) ? N\n    Add another field (Y/N) ? N\n\n\n    [\n    {\n        \"name\": \"title\",\n        \"type\": \"text\",\n        \"options\": {\n            \"indexing\": \"position\",\n            \"stored\": true\n        }\n    },\n    {\n        \"name\": \"body\",\n        \"type\": \"text\",\n        \"options\": {\n            \"indexing\": \"position\",\n            \"stored\": true\n        }\n    },\n    {\n        \"name\": \"url\",\n        \"type\": \"text\",\n        \"options\": {\n            \"indexing\": \"unindexed\",\n            \"stored\": true\n        }\n    }\n    ]\n\n\n```\n\nAfter the wizard has finished, a `meta.json` should exist in `wikipedia-index/meta.json`.\nIt is a fairly human readable JSON, so you can check its content.\n\nIt contains two sections:\n- segments (currently empty, but we will change that soon)\n- schema \n\n \n\n# Indexing the document: `index`\n\n\nTantivy's `index` command offers a way to index a json file.\nThe file must contain one JSON object per line.\nThe structure of this JSON object must match that of our schema definition.\n\n```json\n{\"body\": \"some text\", \"title\": \"some title\", \"url\": 
\"http://somedomain.com\"}\n```\n\nFor this tutorial, you can download a corpus with the 5 million+ English Wikipedia articles in the right format here: [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).\nMake sure to decompress the file. Alternatively, if you have `bzcat` installed, you can skip this step and read the file while it is still compressed.\n\n```bash\nbunzip2 wiki-articles.json.bz2\n```\n\nIf you are in a rush you can [download 100 articles in the right format here (11 MB)](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json).\n\nThe `index` command will index your documents.\nBy default it will use 3 threads, with a total buffer size of 1GB split\nacross these threads. \n\n\n```bash\ncat wiki-articles.json | tantivy index -i ./wikipedia-index\n```\n\nYou can change the number of threads by passing the `-t` parameter, and the total\nbuffer size used by the threads' heap by passing the `-m` parameter. Note that tantivy's memory usage\nis greater than just this buffer size parameter.\n\nOn my computer (8 core Xeon(R) CPU X3450  @ 2.67GHz), on 8 threads, indexing Wikipedia takes around 9 minutes.\n\n\nWhile tantivy is indexing, you can peek at the index directory to check what is happening.\n\n```bash\nls ./wikipedia-index\n```\n\nThe main file is `meta.json`.\n\nYou should also see a lot of files with a UUID as filename, and different extensions.\nOur index is in fact divided into segments. Each segment acts as an individual smaller index.\nIts name is simply a UUID. \n\nIf you decided to index the complete Wikipedia, you may also see some of these files disappear.\nHaving too many segments can hurt search performance, so tantivy automatically starts\nmerging segments. 
\n\n# Serve the search index: `serve`\n\nTantivy's CLI also embeds a search server.\nYou can run it with the following command.\n\n```bash\ntantivy serve -i wikipedia-index\n```\n\nBy default, it will serve on port `3000`.\n\nYou can search for the top 20 most relevant documents for the query `Barack Obama` by accessing\nthe following [url](http://localhost:3000/api/?q=barack+obama\u0026nhits=20) in your browser:\n\n    http://localhost:3000/api/?q=barack+obama\u0026nhits=20\n\nBy default this query is treated as `barack OR obama`.\nYou can also search for documents that contain both terms by adding a `+` sign before each term in your query.\n\n    http://localhost:3000/api/?q=%2Bbarack%20%2Bobama\u0026nhits=20\n    \nAlso, `-` makes it possible to exclude the documents containing a specific term.\n\n    http://localhost:3000/api/?q=-barack%20%2Bobama\u0026nhits=20\n    \nFinally, tantivy handles phrase queries.\n\n    http://localhost:3000/api/?q=%22barack%20obama%22\u0026nhits=20\n    \n\n# Search the index via the command line\n\nYou may also use the `search` command to stream all documents matching a specific query.\nThe documents are returned in an unspecified order.\n\n```bash\ntantivy search -i wikipedia-index -q \"barack obama\"\ntantivy search -i hdfs --query \"*\" --agg '{\"severities\":{\"terms\":{\"field\":\"severity_text\"}}}'\n```\n\n# Benchmark the index: `bench`\n\nTantivy's CLI provides a simple benchmark tool.\nYou can run it with the following command.\n\n```bash\ntantivy bench -i wikipedia-index -n 10 -q queries.txt\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquickwit-oss%2Ftantivy-cli","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquickwit-oss%2Ftantivy-cli","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquickwit-oss%2Ftantivy-cli/lists"}