{"id":24160300,"url":"https://github.com/sist2app/sist2","last_synced_at":"2025-05-14T23:05:52.618Z","repository":{"id":36100717,"uuid":"211434953","full_name":"sist2app/sist2","owner":"sist2app","description":"Lightning-fast file system indexer and search tool","archived":false,"fork":false,"pushed_at":"2025-03-19T23:23:38.000Z","size":60515,"stargazers_count":1029,"open_issues_count":76,"forks_count":60,"subscribers_count":22,"default_branch":"master","last_synced_at":"2025-04-13T20:41:05.792Z","etag":null,"topics":["c","elasticsearch","sqlite","vuejs"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sist2app.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-09-28T02:54:30.000Z","updated_at":"2025-04-13T17:06:29.000Z","dependencies_parsed_at":"2023-02-14T03:31:46.573Z","dependency_job_id":"27a83317-6574-4fe4-9af9-656b33624208","html_url":"https://github.com/sist2app/sist2","commit_stats":null,"previous_names":["sist2app/sist2","simon987/sist2"],"tags_count":84,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sist2app%2Fsist2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sist2app%2Fsist2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sist2app%2Fsist2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sist2app%2Fsist2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/s
ist2app","download_url":"https://codeload.github.com/sist2app/sist2/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254243358,"owners_count":22038046,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","elasticsearch","sqlite","vuejs"],"created_at":"2025-01-12T16:01:46.592Z","updated_at":"2025-05-14T23:05:47.607Z","avatar_url":"https://github.com/sist2app.png","language":"C","readme":"![GitHub](https://img.shields.io/github/license/sist2app/sist2.svg)\n[![CodeFactor](https://www.codefactor.io/repository/github/sist2app/sist2/badge?s=05daa325188aac4eae32c786f3d9cf4e0593f822)](https://www.codefactor.io/repository/github/sist2app/sist2)\n[![Development snapshots](https://ci.simon987.net/api/badges/simon987/sist2/status.svg)](https://files.simon987.net/.gate/sist2/simon987_sist2/)\n\n**Demo**: [sist2.simon987.net](https://sist2.simon987.net/)\n\n**Community URL:** [Discord](https://discord.gg/2PEjDy3Rfs)\n\n# sist2\n\nsist2 (Simple incremental search tool)\n\n*Warning: sist2 is in early development*\n\n![search panel](docs/sist2.gif)\n\n## Features\n\n* Fast, low memory usage, multi-threaded\n* Manage \u0026 schedule scan jobs with simple web interface (Docker only)\n* Mobile-friendly Web interface\n* Extracts text and metadata from common file types \\*\n* Generates thumbnails \\*\n* Incremental scanning\n* Manual tagging from the UI and automatic tagging based on file attributes via [user scripts](docs/scripting.md)\n* Recursive scan inside archive files \\*\\*\n* OCR support with tesseract \\*\\*\\*\n* Stats 
page \u0026 disk utilisation visualization\n* Named-entity recognition (client-side) \\*\\*\\*\\*\n\n\\* See [format support](#format-support)    \n\\*\\* See [Archive files](#archive-files)    \n\\*\\*\\* See [OCR](#ocr)    \n\\*\\*\\*\\* See [Named-Entity Recognition](#NER)\n\n## Getting Started\n\n### Using Docker Compose *(Windows/Linux/Mac)*\n\n```yaml\nservices:\n  elasticsearch:\n    image: elasticsearch:7.17.9\n    restart: unless-stopped\n    volumes:\n      # This directory must have 1000:1000 permissions (or update PUID \u0026 PGID below)\n      - /data/sist2-es-data/:/usr/share/elasticsearch/data\n    environment:\n      - \"discovery.type=single-node\"\n      - \"ES_JAVA_OPTS=-Xms2g -Xmx2g\"\n      - \"PUID=1000\"\n      - \"PGID=1000\"\n  sist2-admin:\n    image: sist2app/sist2:x64-linux\n    restart: unless-stopped\n    volumes:\n      - /data/sist2-admin-data/:/sist2-admin/\n      - /\u003cpath to index\u003e/:/host\n    ports:\n      - 4090:4090\n      # NOTE: Don't expose this port publicly!\n      - 8080:8080\n    working_dir: /root/sist2-admin/\n    entrypoint: python3\n    command:\n      - /root/sist2-admin/sist2_admin/app.py\n```\n\nNavigate to http://localhost:8080/ to configure sist2-admin.\n\n### Using the executable file *(Linux/WSL only)*\n\n1. Choose search backend (See [comparison](#search-backends)):\n    * **Elasticsearch**: have an Elasticsearch (version \u003e= 6.8.X, ideally \u003e=7.14.0) instance running\n        1. Download [from official website](https://www.elastic.co/downloads/elasticsearch)\n        2. *(or)* Run using docker:\n            ```bash\n            docker run -d -p 9200:9200 -e \"discovery.type=single-node\" elasticsearch:7.17.9\n            ```\n    * **SQLite**: No installation required\n\n2. Download the [latest sist2 release](https://github.com/sist2app/sist2/releases).\n   Select the file corresponding to your CPU architecture and mark the binary as executable with `chmod +x`.\n3. 
See [usage guide](docs/USAGE.md) for command line usage.\n\nExample usage:\n\n1. Scan a directory: `sist2 scan ~/Documents --output ./documents.sist2`\n2. Prepare search index:\n    * **Elasticsearch**: `sist2 index --es-url http://localhost:9200 ./documents.sist2`\n    * **SQLite**: `sist2 sqlite-index --search-index ./search.sist2 ./documents.sist2`\n3. Start web interface: \n   * **Elasticsearch**: `sist2 web ./documents.sist2`\n   * **SQLite**: `sist2 web --search-index ./search.sist2 ./documents.sist2`\n\n## Format support\n\n| File type                                                                 | Library                                                                      | Content  | Thumbnail   | Metadata                                                                                                                               |\n|:--------------------------------------------------------------------------|:-----------------------------------------------------------------------------|:---------|:------------|:---------------------------------------------------------------------------------------------------------------------------------------|\n| pdf,xps,fb2,epub                                                          | MuPDF                                                                        | text+ocr | yes         | author, title                                                                                                                          |\n| cbz,cbr                                                                   | [libscan](https://github.com/sist2app/sist2/tree/master/third-party/libscan) | -        | yes         | -                                                                                                                                      |\n| `audio/*`                                                                 | ffmpeg                                                                       | -        | yes         | 
ID3 tags                                                                                                                               |\n| `video/*`                                                                 | ffmpeg                                                                       | -        | yes         | title, comment, artist                                                                                                                 |\n| `image/*`                                                                 | ffmpeg                                                                       | ocr      | yes         | [Common EXIF tags](https://github.com/sist2app/sist2/blob/efdde2734eca9b14a54f84568863b7ffd59bdba3/src/parsing/media.c#L190), GPS tags |\n| raw, rw2, dng, cr2, crw, dcr, k25, kdc, mrw, pef, xf3, arw, sr2, srf, erf | LibRaw                                                                       | no       | yes         | Common EXIF tags, GPS tags                                                                                                             |\n| ttf,ttc,cff,woff,fnt,otf                                                  | Freetype2                                                                    | -        | yes, `bmp`  | Name \u0026 style                                                                                                                           |\n| `text/plain`                                                              | [libscan](https://github.com/sist2app/sist2/tree/master/third-party/libscan) | yes      | no          | -                                                                                                                                      |\n| html, xml                                                                 | [libscan](https://github.com/sist2app/sist2/tree/master/third-party/libscan) | yes      | no          | -                                                                          
                                                            |\n| tar, zip, rar, 7z, ar ...                                                 | Libarchive                                                                   | yes\\*    | -           | no                                                                                                                                     |\n| docx, xlsx, pptx                                                          | [libscan](https://github.com/sist2app/sist2/tree/master/third-party/libscan) | yes      | if embedded | creator, modified_by, title                                                                                                            |\n| doc (MS Word 97-2003)                                                     | antiword                                                                     | yes      | no          | author, title                                                                                                                          |\n| mobi, azw, azw3                                                           | libmobi                                                                      | yes      | yes         | author, title                                                                                                                          |\n| wpd (WordPerfect)                                                         | libwpd                                                                       | yes      | no          | *planned*                                                                                                                              |\n| json, jsonl, ndjson                                                       | [libscan](https://github.com/sist2app/sist2/tree/master/third-party/libscan) | yes      | -           | -                                                                                                                                      |\n\n\\* *See 
[Archive files](#archive-files)*\n\n### Archive files\n\n**sist2** will scan files stored in archive files (zip, tar, 7z...) as if they were directly in the file system.\nRecursive (archives inside archives)\nscan is also supported.\n\n**Limitations**:\n\n* Support for parsing media files with formats that require *seek* (e.g. `.gif`, `.mp4` w/ fragmented metadata etc.)\n  is limited (see `--mem-buffer` option)\n* Archive files are scanned sequentially, by a single thread. On systems where\n  **sist2** is not I/O bound, scans might be faster when larger archives are split into smaller parts.\n\n### OCR\n\nYou can enable OCR support for ebook (pdf,xps,fb2,epub) or image file types with the\n`--ocr-lang \u003clang\u003e` option in combination with `--ocr-images` and/or `--ocr-ebooks`.\nDownload the language data files with your package manager (`apt install tesseract-ocr-eng`) or\ndirectly [from GitHub](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files).\n\nThe `sist2app/sist2` image comes with common languages\n(hin, jpn, eng, fra, rus, spa, chi_sim, deu, pol) pre-installed.\n\nYou can use the `+` separator to specify multiple languages. The language\nname must be identical to the `*.traineddata` file installed on your system\n(use `chi_sim` rather than `chi-sim`).\n\nExamples:\n\n```bash\nsist2 scan --ocr-ebooks --ocr-lang jpn ~/Books/Manga/\nsist2 scan --ocr-images --ocr-lang eng ~/Images/Screenshots/\nsist2 scan --ocr-ebooks --ocr-images --ocr-lang eng+chi_sim ~/Chinese-Bilingual/\n```\n\n### Search backends\n\nsist2 v3.0.7+ supports an SQLite search backend. 
The SQLite search backend has\nfewer features and generally comparable query performance for medium-size\nindices, but it uses much less memory and is easier to set up.\n\n|                                              |                       SQLite                        |                                                             Elasticsearch                                                             |\n|----------------------------------------------|:---------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------:|\n| Requires separate search engine installation |                                                     |                                                                   ✓                                                                   |\n| Memory footprint                             |                        ~20MB                        |                                                                \u003e500MB                                                                 |\n| Query syntax                                 |      [fts5](https://www.sqlite.org/fts5.html)       | [query_string](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax) |\n| Fuzzy search                                 |                                                     |                                                                   ✓                                                                   |\n| Media Types tree real-time updating          |                                                     |                                                                   ✓                                                                   |\n| Manual tagging                               |                          ✓                          |                                         
                          ✓                                                                   |\n| User scripts                                 |                          ✓                          |                                                                   ✓                                                                   |\n| Media Type breakdown for search results      |                                                     |                                                                   ✓                                                                   |\n| Embeddings search                            |                      ✓ *O(n)*                       |                                                              ✓ *O(logn)*                                                              |\n\n### NER\n\nsist2 v3.0.4+ supports named-entity recognition (NER). Simply add a supported repository URL to\n**Configuration** \u003e **Machine learning options** \u003e **Model repositories**\nto enable it.\n\nThe text processing is done in your browser, no data is sent to any third-party services.\nSee [sist2app/sist2-ner-models](https://github.com/sist2app/sist2-ner-models) for more details.\n\n#### List of available repositories:\n\n| URL                                                                                                     | Maintainer                              | Purpose |\n|---------------------------------------------------------------------------------------------------------|-----------------------------------------|---------|\n| [sist2app/sist2-ner-models](https://raw.githubusercontent.com/sist2app/sist2-ner-models/main/repo.json) | [sist2app](https://github.com/sist2app) | General |\n\n\u003cdetails\u003e\n  \u003csummary\u003eScreenshot\u003c/summary\u003e\n\n![ner](docs/ner.png)\n\n\u003c/details\u003e\n\n## Build from source\n\nYou can compile **sist2** by yourself if you don't want to use the pre-compiled binaries\n\n### Using 
Docker\n\n```bash\ngit clone --recursive https://github.com/sist2app/sist2/\ncd sist2\ndocker build . -t my-sist2-image\n# Copy the sist2 executable from the Docker image\ndocker run --rm --entrypoint cat my-sist2-image /root/sist2 \u003e sist2-x64-linux\n```\n\n### Using a Linux computer\n\n1. Install compile-time dependencies\n\n   ```bash\n   apt install gcc g++ python3 yasm ragel automake autotools-dev wget libtool libssl-dev curl zip unzip tar xorg-dev libglu1-mesa-dev libxcursor-dev libxml2-dev libxinerama-dev gettext nasm git nodejs\n   ```\n\n2. Install vcpkg using my fork: https://github.com/sist2app/vcpkg\n3. Install vcpkg dependencies\n\n    ```bash\n    vcpkg install openblas curl[core,openssl] sqlite3[core,fts5,json1] cpp-jwt pcre cjson brotli libarchive[core,bzip2,libxml2,lz4,lzma,lzo] pthread tesseract libxml2 libmupdf[ocr] gtest mongoose libmagic libraw gumbo ffmpeg[core,avcodec,avformat,swscale,swresample,webp,opus,mp3lame,vpx,zlib]\n    ```\n\n4. Build\n    ```bash\n    git clone --recursive https://github.com/sist2app/sist2/\n    cd sist2\n    (cd sist2-vue; npm install; npm run build)\n    (cd sist2-admin/frontend; npm install; npm run build)\n    cmake -DSIST_DEBUG=off -DCMAKE_TOOLCHAIN_FILE=\u003cVCPKG_ROOT\u003e/scripts/buildsystems/vcpkg.cmake .\n    make\n    ```\n","funding_links":[],"categories":["Software","C"],"sub_categories":["Search Engines"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsist2app%2Fsist2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsist2app%2Fsist2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsist2app%2Fsist2/lists"}