{"id":29133256,"url":"https://github.com/do-me/sff","last_synced_at":"2025-06-30T07:02:38.352Z","repository":{"id":298928219,"uuid":"1001557312","full_name":"do-me/sff","owner":"do-me","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-13T16:17:17.000Z","size":147,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-13T17:32:59.841Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/do-me.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-13T15:29:48.000Z","updated_at":"2025-06-13T16:16:43.000Z","dependencies_parsed_at":"2025-06-13T17:34:08.356Z","dependency_job_id":"97294bda-bb7a-4c97-a979-56b152ef7684","html_url":"https://github.com/do-me/sff","commit_stats":null,"previous_names":["do-me/sff"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/do-me/sff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fsff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fsff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fsff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fsff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/do-me","download_url":"https://codeload.github.com/do-me/sff/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/do-me%2Fsff/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262727705,"owners_count":23354665,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-30T07:02:36.474Z","updated_at":"2025-06-30T07:02:38.328Z","avatar_url":"https://github.com/do-me.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SemanticFileFinder (sff)\n\n[![crates.io](https://img.shields.io/crates/v/sff.svg)](https://crates.io/crates/sff)\n[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n[![GitHub stars](https://img.shields.io/github/stars/do-me/sff.svg?style=social)](https://github.com/do-me/sff)\n\n**sff (SemanticFileFinder)** is a command-line tool that rapidly searches for files in a given directory based on the semantic meaning of your query. It leverages sentence embeddings through `model2vec-rs` to understand content, not just keywords. It reads `.txt`, `.md`, and `.mdx` files, chunks their content, and ranks the chunks by similarity to the query to find the most relevant text snippets.\n\n## Installation \u0026 Quick Start\n\nOnce `sff` is published on crates.io, you can install it using Cargo:\n\n```bash\ncargo install sff\nsff \"project ideas for rust\"\n```\nEnsure `~/.cargo/bin` is in your system's `PATH`. By default, `sff` searches the current working directory (`--path .`).\n\nI use this tool myself to scan my personal notes. These used to be plain `.txt` files in a folder until I migrated everything to iCloud + Obsidian. 
Here is some sample output from some random notes:\n\n![My notes](sample_output.png)\n\n## Performance \n\ntl;dr: under 250ms for English-only models on ~2500 files and 10k chunks (with 20 words per chunk) on an M3 Max. If you need the best possible results and good multilingual retrieval, go for `minishlab/potion-multilingual-128M`.\nOtherwise, stick with the default, `minishlab/potion-retrieval-32M`. Keep an eye on new model2vec models here: https://huggingface.co/minishlab.\n\n| Command                                                                     | Model                    | Query      | Files | Chunks | Time (ms) |\n| --------------------------------------------------------------------------- | ------------------------ | ---------- | ----- | ------ | --------- |\n| `sff -m \"minishlab/potion-base-8M\" \"javascript\"`           | potion-base-8M           | javascript | 2537  | 10000  | 209.34    |\n| `sff -m \"minishlab/potion-retrieval-32M\" \"javascript\"`     | potion-retrieval-32M     | javascript | 2537  | 10000  | 249.95    |\n| `sff -m \"minishlab/potion-multilingual-128M\" \"javascript\"` | potion-multilingual-128M | javascript | 2537  | 10000  | 1001.69   |\n\n## Features\n\n*   **Semantic Search:** Finds files based on meaning, not just exact keyword matches.\n*   **Supported Files:** Scans `.txt`, `.md`, and `.mdx` files.\n*   **Content Chunking:** Breaks down documents into smaller, manageable chunks for precise matching.\n*   **Embedding Powered:** Uses `model2vec-rs` to generate text embeddings. 
Models are typically downloaded from Hugging Face Hub.\n*   **Fast \u0026 Parallelized:** Utilizes Rayon for parallel processing of file discovery, embedding generation, and similarity calculation.\n*   **Customizable:**\n    *   Specify search directory.\n    *   Define your semantic query.\n    *   Choose the embedding model (Hugging Face Hub or local path).\n    *   Limit the number of results.\n    *   Enable recursive search through subdirectories.\n*   **Verbose Mode:** Offers detailed timing information for performance analysis.\n*   **Clickable File Paths:** Output paths are formatted for easy opening in most terminals.\n\n## Usage\n\nThe basic command structure is:\n\n```bash\nsff [OPTIONS] \u003cQUERY\u003e...\n```\n\n**Examples:**\n\n*   Search in the current directory for \"machine learning techniques\":\n    ```bash\n    sff \"machine learning techniques\"\n    ```\n\n*   Search recursively in `~/Documents/notes` for \"project ideas for rust\":\n    ```bash\n    sff -p ~/Documents/notes -r \"project ideas for rust\"\n    ```\n\n*   Use a different model and limit results to 5:\n    ```bash\n    sff -m \"minishlab/potion-multilingual-128M\" -l 5 \"benefits of parallel computing\"\n    ```\n\n**All Options:**\n\nYou can view all available options with `sff --help`:\n\n```\nsff: Fast semantic file finder\n\nUsage: sff [OPTIONS] \u003cQUERY\u003e...\n\nArguments:\n  \u003cQUERY\u003e...\n          The semantic search query\n\nOptions:\n  -p, --path \u003cPATH\u003e\n          The directory to search in\n          [default: .]\n\n  -m, --model \u003cMODEL\u003e\n          Model to use for embeddings, from Hugging Face Hub or local path\n          [default: minishlab/potion-retrieval-32M]\n\n  -l, --limit \u003cLIMIT\u003e\n          Number of top results to display\n          [default: 10]\n\n  -r, --recursive\n          Search recursively through all subdirectories\n\n  -v, --verbose\n          Enable verbose mode to print detailed timings for nerds\n\n  
-h, --help\n          Print help (see more with '--help')\n\n  -V, --version\n          Print version\n```\n\n## Models\n\n`sff` uses `model2vec-rs`, which typically downloads models from the [Hugging Face Hub](https://huggingface.co/models). The default model is `minishlab/potion-retrieval-32M`. You can specify any compatible sentence transformer model available on the Hub or a local path to a model. The first time you use a new model, it will be downloaded, which might take some time.\n\n## Roadmap \n\n### Missing Args \n- batch size - currently 128 texts of 20 words each are embedded per batch\n- filetypes - currently only .txt, .md, .mdx but should be customizable as args\n\n### Chunker Options\n- For now, add more arguments like number of words for chunking\n- In the long run, add https://github.com/benbrandt/text-splitter as chunker and allow the user to customize chunking\n\n### Output Options\n- Add multiple export options for the output table like JSON, CSV, Parquet and markdown (for potential LLM-pipelines). Possibly I'd just add polars or similar as dependency and use their exporter https://docs.pola.rs/api/python/dev/reference/io.html\n\nPRs are always welcome!\n\n## FAQ \n\n### macOS: Search folders in iCloud\nIf you want to search a folder in iCloud (e.g. your Obsidian vault), you need to grant Full Disk Access to your terminal app (e.g. iTerm2) in the system settings:\n\n![image](https://github.com/user-attachments/assets/ed059474-7f58-443d-8f04-477506715411)\n\nReopen the shell and the problem should be fixed.\n\n## License\n\n* MIT\n\n---\nBuilt by Dominik Weckmüller. If you like semantic search, check out my other work on [GitHub](https://github.com/do-me) e.g. 
[SemanticFinder](https://github.com/do-me/SemanticFinder)!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Fsff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdo-me%2Fsff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Fsff/lists"}