{"id":18447090,"url":"https://github.com/zot/microfts","last_synced_at":"2025-04-08T00:32:00.027Z","repository":{"id":49348947,"uuid":"320678302","full_name":"zot/microfts","owner":"zot","description":"Small and fast FTS (full text search)","archived":false,"fork":false,"pushed_at":"2023-10-30T14:21:32.000Z","size":863,"stargazers_count":32,"open_issues_count":8,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-23T03:32:11.396Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zot.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-11T20:34:47.000Z","updated_at":"2023-10-29T18:01:45.000Z","dependencies_parsed_at":"2022-09-12T08:40:59.538Z","dependency_job_id":null,"html_url":"https://github.com/zot/microfts","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zot%2Fmicrofts","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zot%2Fmicrofts/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zot%2Fmicrofts/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zot%2Fmicrofts/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zot","download_url":"https://codeload.github.com/zot/microfts/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247755426,"owners_count":20990617,"icon_url":"https://github.com/github.png","versio
n":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T07:11:55.028Z","updated_at":"2025-04-08T00:31:59.347Z","avatar_url":"https://github.com/zot.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"* Microfts\nA small full text indexing and search tool focusing on speed and\nspace.  Initial tests seem to indicate that the database takes about\ntwice as much space as the files it indexes.\n\nMicrofts implements a trigram GIN (generalized inverted index),\nrelying on [[http://www.lmdb.tech/doc/index.html][LMDB]] for storage, an open source, embedded, NOSQL,\nkey-value store library (so it's linked into microfts, not an external\nservice). It uses [[https://github.com/AskAlexSharov/lmdb-go/lmdb][AskAlexSharov's fork]] of [[https://github.com/bmatsuo/lmdb-goto][bmatsuo's lmdb-go package]] to\nconnect to it.\n\n* LICENSE\n\nMicrofts is MIT licensed, (c) 2020 Bill Burdick. 
All rights reserved.\n\n* Building\nNote that building may generate warning messages from lmdb-go's compilation of the LMDB C code.\n#+begin_src sh\ngo build -o microfts\n#+end_src\n\n* Examples\n** Creating a database\n#+begin_src sh\n./microfts create /tmp/bubba\n#+end_src\n** Adding Text\nThis adds /tmp/tst to the database in /tmp/bubba\n#+begin_src sh\nrm -rf /tmp/bubba\n./microfts create /tmp/bubba\ncat \u003e /tmp/tst \u003c\u003chere\none\ntwo three\nfour\nfour five\none two three\none three two\nhere\n./microfts input -file /tmp/bubba /tmp/tst\n#+end_src\n** Getting Info\n#+begin_src sh\n./microfts info /tmp/bubba\n#+end_src\n** Searching\n#+begin_src sh\n./microfts search /tmp/bubba \"one two\"\n#+end_src\n** Deleting a file's information\n#+begin_src sh\n./microfts delete /tmp/bubba /tmp/tst\n#+end_src\n** Reclaiming space in the database (only really matters after deleting a large file)\n#+begin_src sh\n./microfts compact /tmp/bubba\n#+end_src\n** Finding grams for a string\n#+begin_src sh\n./microfts grams \"this is a test\"\n./microfts grams -gx \"this is a test\"\n#+end_src\n** Finding candidates for grams\n#+begin_src sh\n./microfts search -candidates -grams /tmp/bubba thi tes est\n#+end_src\n* Usage\n** Exit Codes\n1. misc error\n2. file is missing\n3. file has changed\n4. file is unreadable\n5. no entry for file in database\n6. 
database missing\n** Help text\n#+begin_example\nUsage:\n   microfts info -groups DB\n                   print information about each group in the database,\n                   whether it is missing or changed\n                   whether it is an org-mode entry\n   microfts info [-chunks] DB GROUP\n                   print info for a GROUP\n                   -chunks also prints the chunks in GROUP if it has a corresponding file\n   microfts info [-grams] DB\n                   print info for database\n                   displays any groups which do not exist as files\n                   displays any groups which refer to files that have changed\n                   -grams displays distribution information about the trigram index\n   microfts create [-s GRAMSIZE] DB\n                   create DATABASE if it does not exist\n   microfts chunk [-nx | -data D | -dx] -d DELIM DB GROUP GRAMS\n   microfts chunk [-nx | -data D | -dx] -gx DB GROUP GRAMS\n                   ADD a chunk to GROUP with GRAMS.\n                   -d means use DELIM to split GRAMS.\n                   -gx means GRAMS is hex encoded with two bytes for each gram using base 37.\n   microfts grams [-gx] CHUNK\n                   output grams for CHUNK\n   microfts input [-nx | -dx | -org] DB FILE...\n                   For each FILE, create a group with its name and add a CHUNK for each chunk of input.\n                   Chunk data is the line number, offset, and length for each chunk (starting at 1).\n                   -org means chunks are org elements, otherwise chunks are lines\n   microfts delete [-nx] DB GROUP\n                   delete GROUP, its chunks, and tag entries.\n                   NOTE: THIS DOES NOT RECLAIM SPACE! 
USE COMPACT FOR THAT\n   microfts compact DB\n                   Reclaim space for deleted groups\n   microfts search [-n | -partial | -f | -limit N | -filter REGEXP | -u] DB TEXT\n                   query with TEXT for objects\n                   -f force search to skip changed and missing files instead of exiting\n                   -filter makes search only return chunks that match the REGEXP\n                   REGEXP syntax is here: https://golang.org/pkg/regexp/syntax/\n   microfts search -candidates [-grams | -gx | -gd | -n | -f | -limit N | -dx | -u] DB TERM1 ...\n                   display all candidates with the grams for TERMS without filtering\n                   -grams indicates TERMS are grams, otherwise extract grams from TERMS\n                   -gx: grams are in hex, -gd: grams are in decimal, otherwise they are 3-char strings\n   microfts data [-nx | -dx] DB GROUP\n                   get data for each doc in GROUP\n   microfts update [-t] DB\n                   reinput files that have changed\n                   delete files that have been removed\n                   -t means do a test run, printing what would have happened\n   microfts empty DB GROUP...\n                   Create empty GROUPs, ignoring existing ones\n\n   microfts is targeted for groups of small documents, like lines in a file.\n\n  -candidates\n        return docs with grams for search\n  -chunks\n        info DB GROUP: display all of a group's chunks\n  -comp string\n        compression type to use when creating a database\n  -d string\n        delimiter for unicode tags (default \",\")\n  -data string\n        data to define for object\n  -dx\n        use hex instead of unicode for object data\n  -end-format string\n        search: Go format string for the end of a group\n        Arg to printf is the FILE\n        The default value is \"\"\n        if -sexp is provided and -end-format is not, the default is \"\\n\"\n        Not used with search -fuzzy -sort\n  -f    search: 
skip changed and missing files instead of exiting\n  -file\n        search: display files rather than chunks\n  -filter string\n        search: filter results that match REGEXP\n  -format string\n        search: Go format string for each result\n        Args to printf are FILE POSITION LINE OFFSET PERCENTAGE CHUNK\n        FILE (string) is the name of the file\n        POSITION (int) is the 1-based character position of the chunk in the file\n        LINE (int) is the 1-based line of the chunk in the file\n        OFFSET (int) is the 0-based offset of the first match in the chunk\n        PERCENTAGE (float) is the percentage of a fuzzy match\n        Note that you can place [ARGNUM] after the % to pick a particular arg to format\n        The default format is %s:%[2]s:%[5]s\\n\n        -sexp sets format to (:filename \"%s\" :line %[3]d :offset %[4]d :text \"%[6]s\" :percent %[5]f)\n          Note that this will cause all matches to be on one (potentially large) line of output (default \"%[6]s:%[2]d:%[5]s\\n\")\n  -fuzzy float\n        search: specify a percentage fuzzy match\n  -gd\n        use decimal instead of unicode for grams\n  -grams\n        get: specify tags instead of text\n        info: print gram coverage\n        search: specify grams instead of search terms\n  -groups\n        info: display information for each group\n  -gx\n        use hex instead of unicode for grams\n  -limit int\n        search: limit the number of results (default 9223372036854775807)\n  -n    only print line numbers for search\n  -org\n        index org-mode chunks instead of lines\n  -partial\n        search: allow partial matches in search\n  -prof\n        profile cpu\n  -s int\n        gram size\n  -sep\n        print candidates on separate lines\n  -sexp\n        search: output matches as an s-expression ((FILE (POS LINE OFFSET chunk) ... ) ... 
)\n        POS is the 1-based character position of the chunk in the file\n        LINE is the 1-based line of the chunk in the file\n        OFFSET is the 0-based offset of the first match in the chunk\n  -sort\n        search -fuzzy: sort all matches\n        This ignores start-format and end-format because it sorts all matches, regardless of\n        which file they come from.\n  -start-format string\n        search: Go format string for the start of a group\n        Arg to printf is the FILE\n        The default value is \"\"\n        Not used with search -fuzzy -sort\n  -t    update: do a test run, printing what would have happened\n  -u    search: update the database before searching\n  -v    verbose\n#+end_example\n* Notes\n** Grams\nOnly alphanumeric characters are represented faithfully in grams; all other characters are treated as whitespace and display as '.'. Each character of a gram is therefore one of 37 values (whitespace, 0-9, and A-Z), so a gram is a base-37 triple that just fits into 2 bytes (37^3 = 50653, which is less than 65536). This matters a great deal for space. Grams at the start of a word begin with two whitespaces and grams at the end of a word end with one whitespace. There are no grams that end with two whitespaces.\n** Groups and chunks\nThe index consists of grams for chunks that belong to groups. Groups have names, and by default file names are used as group names.\n\n*** Supported groups and chunks\nMicrofts supports using file names as groups and splitting files into chunks either by line or by org-mode element, with the chunk data being a triple of line, offset, and chunk length. Searching finds candidate chunks by intersecting gram entries and then consults the files named by the groups for the actual content.\n*** Custom groups and chunks\nIf this is not sufficient, the command also supports custom usage: you can add chunks to a group, specifying data and grams. 
Searching can return candidate chunks for a set of grams.\n** Compressed representation for unsigned integers (lexicographically orderable)\n| 7 bits  | 0                   - 127                  | 0xxxxxxx                 |\n| 12 bits | 128                 - 4095                 | 1000xxxx X               |\n| 20 bits | 4096                - 1048575              | 1001xxxx X X             |\n| 28 bits | 1048576             - 268435455            | 1010xxxx X X X           |\n| 36 bits | 268435456           - 68719476735          | 1011xxxx X X X X         |\n| 44 bits | 68719476736         - 17592186044415       | 1100xxxx X X X X X       |\n| 52 bits | 17592186044416      - 4503599627370495     | 1101xxxx X X X X X X     |\n| 60 bits | 4503599627370496    - 1152921504606846975  | 1110xxxx X X X X X X X   |\n| 64 bits | 1152921504606846976 - 18446744073709551615 | 1111---- X X X X X X X X |\n** LMDB Trees\n*** Grams: GRAM-\u003e BLOCK\nGRAM is a 2-byte value\n|----------|\n| OID LIST |\n|----------|\n*** OID LISTS\n9 lists of oids: [9][]byte.\n\nNote -- this is probably too ornate and a simple byte array and a\ncount might have the same performance and space.\n|---------------|\n| # 1-byte OIDS |\n| # 2-byte OIDS |\n| # 3-byte OIDS |\n| # 4-byte OIDS |\n| # 5-byte OIDS |\n| # 6-byte OIDS |\n| # 7-byte OIDS |\n| # 8-byte OIDS |\n| # 9-byte OIDS |\n| OIDS          |\n|---------------|\n*** Gram 0 holds the info since 0 is not a legal gram\n|-----------------|\n| next unused oid |\n| next unused gid |\n| free oids       |\n| free gids       |\n|-----------------|\n*** Chunks: OID -\u003e BLOCK\nOIDS are compressed integers\n|-------------------------|\n| GID                     |\n| data (e.g. 
line number) |\n| gram count              |\n|-------------------------|\n*** Groups: GID -\u003e BLOCK\nGIDS are compressed integers\n|-----------------------------------|\n| NAME                              |\n| oid count                         |\n| last changed timestamp            |\n| validity (valid = 0, deleted = 1) |\n| org flag (whether -org was used)  |\n|-----------------------------------|\n*** Group Names: NAME-\u003eGID\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzot%2Fmicrofts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzot%2Fmicrofts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzot%2Fmicrofts/lists"}