{"id":13590823,"url":"https://github.com/arunsupe/semantic-grep","last_synced_at":"2025-05-16T06:03:20.586Z","repository":{"id":250251100,"uuid":"833920230","full_name":"arunsupe/semantic-grep","owner":"arunsupe","description":"grep for words with similar meaning to the query","archived":false,"fork":false,"pushed_at":"2024-08-19T05:50:44.000Z","size":6040,"stargazers_count":1155,"open_issues_count":0,"forks_count":27,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-08T15:13:58.540Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arunsupe.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-26T03:37:34.000Z","updated_at":"2025-04-04T15:30:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"8580f1b4-fd2d-45cf-82d4-72f1616dbb25","html_url":"https://github.com/arunsupe/semantic-grep","commit_stats":null,"previous_names":["arunsupe/semantic-grep"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arunsupe%2Fsemantic-grep","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arunsupe%2Fsemantic-grep/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arunsupe%2Fsemantic-grep/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arunsupe%2Fsemantic-grep/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arunsupe","download_url":"https://
codeload.github.com/arunsupe/semantic-grep/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254478160,"owners_count":22077675,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:00:50.831Z","updated_at":"2025-05-16T06:03:20.515Z","avatar_url":"https://github.com/arunsupe.png","language":"Go","funding_links":[],"categories":["Text Processing","Go","Files and Directories","Template Engines","文本处理","\u003ca name=\"text-search\"\u003e\u003c/a\u003eText search (alternatives to grep)"],"sub_categories":["Utility/Miscellaneous","Search","实用程序/Miscellaneous"],"readme":"# w2vgrep - Semantic Grep\n\nw2vgrep is a command-line tool that performs semantic searches on text input using word embeddings. It's designed to find semantically similar matches to the query, going beyond simple string matching. Supports multiple languages. The experience is designed to be similar to grep. 
\n\n\n## Example Usage\n\nSearch for words similar to \"death\" in Hemingway's \"The Old Man and the Sea\" with context and line numbers:\n\n```bash\ncurl -s 'https://gutenberg.ca/ebooks/hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-t.txt' \\\n    | w2vgrep -C 2 -n --threshold=0.55 death\n```\n\nOutput:\n![alt text](demo/image.png)\n\nThis command:\n\n    - Fetches the text of \"The Old Man and the Sea\" from Project Gutenberg Canada\n    - Pipes the text to w2vgrep\n    - Searches for words semantically similar to \"death\"\n    - Uses a similarity threshold of 0.55 (--threshold=0.55)\n    - Displays 2 lines of context before and after each match (-C 2)\n    - Shows line numbers (-n)\n\nThe output will show matches with their similarity scores, highlighted words, context, and line numbers.\n\n## Features\n\n- Semantic search using word embeddings \n- Configurable similarity threshold\n- Context display (before and after matching lines)\n- Color-coded output \n- Support for multiple languages \n- Read from files or stdin\n- Configurable via JSON file and command-line arguments\n\n## Installation\n\nTwo files are required: \n1. the w2vgrep binary\n2. the vector embedding model file\n3. (Optionally, a config.json file to tell w2vgrep where the embedding model is)\n\n**Using install script**:\n\n```bash\n# clone\ngit clone https://github.com/arunsupe/semantic-grep.git\ncd semantic-grep\n\n# run install:\n#   compiles using the local go compiler, installs in user/bin, \n#   downloads the model to $HOME/.config/semantic-grep\n#   makes config.json\nbash install.sh\n``` \n**Binary**:\n\n1. Download the latest binary release\n2. Download a vector embedding model (see below)\n3. 
Optionally, download the config.json to configure the model location there (or do this from the command line)\n\n**From source (linux/osx)**:\n\n```bash\n# clone\ngit clone https://github.com/arunsupe/semantic-grep.git\ncd semantic-grep\n\n# build\ngo build -o w2vgrep\n\n# download a word2vec model using this helper script (see \"Word Embedding Model\" below)\nbash download-model.sh\n```\n\n## Usage\n\nBasic usage:\n\n./w2vgrep [options] \u003cquery\u003e [file]\n\nIf no file is specified, w2vgrep reads from standard input.\n\n### Command-line Options\n```\n-m, --model_path=     Path to the Word2Vec model file. Overrides config file\n-t, --threshold=      Similarity threshold for matching (default: 0.7)\n-A, --before-context= Number of lines before matching line\n-B, --after-context=  Number of lines after matching line\n-C, --context=        Number of lines before and after matching line\n-n, --line-number     Print line numbers\n-i, --ignore-case     Ignore case. \n-o, --only-matching   Output only matching words\n-l, --only-lines      Output only matched lines without similarity scores\n-f, --file=           Match patterns from file, one pattern per line. Like grep -f.\n```\n\n## Configuration\n\n`w2vgrep` can be configured using a JSON file. By default, it looks for `config.json` in the current directory, \"$HOME/.config/semantic-grep/config.json\" and \"/etc/semantic-grep/config.json\".\n\n\n## Word Embedding Model\n\n### Quick start:\n`w2vgrep` requires a word embedding model in __binary__ format. The default model loader uses the model file's extension to determine the type (.bin, .8bit.int). A few compatible model files are provided in this repo ([models/](models/)). Download one of the .bin files from the `models/` directory and update the path in config.json.\n\nNote: `git clone` will not download the large binary model files unless git lfs is installed on your machine. 
If you do not want to install git-lfs, just manually download the model .bin file and place it in the correct folder.\n\n\n### Support for multiple languages:\nFacebook's fasttext group has published word vectors in [157 languages](https://fasttext.cc/docs/en/crawl-vectors.html) - an amazing resource. I would like to host these files on my GitHub account, but alas, they are too big and too costly to host. Therefore, I have provided a small Go program, [fasttext-to-bin](model_processing_utils/), that can make `w2vgrep`-compatible binary models from them. (Note: use the text files with the \"__.vec.gz__\" extension, not the binary \".bin.gz\" files.)\n\n```bash\n# e.g., for a French model:\ncurl -s 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fr.300.vec.gz' | gunzip -c | ./fasttext-to-bin -input - -output models/fasttext/cc.fr.300.bin\n\n# use it like so:\n# curl -s 'https://www.gutenberg.org/cache/epub/17989/pg17989.txt' \\\n#    | w2vgrep -C 2 -n -t 0.55 \\\n#           --model_path model_processing_utils/cc.fr.300.bin 'château'\n```\n\n### Roll your own:\nAlternatively, you can use pre-trained models (like Google's Word2Vec) or train your own using tools like gensim. Note though that there does not seem to be a standardized binary format (Google's differs from Facebook's fasttext and from gensim's default _save()_). For `w2vgrep`, because efficiently loading the large model is key for performance, I have elected to keep the simplest format. \n\n\n### Testing the model by finding synonyms\nTo help troubleshoot the model, I added a `synonym-finder.go` to `./model_processing_utils/`. 
This program finds the words in the model whose similarity to the query word is at or above a given threshold.\n\n```bash\n# build\ncd model_processing_utils\ngo build synonym-finder.go\n\n# run\nsynonym-finder -model_path path/to/cc.zh.300.bin -threshold 0.6 合理性\n\n# Output\nWords similar to '合理性' with similarity \u003e= 0.60:\n科学性 0.6304\n合理性 1.0000\n正当性 0.6018\n公允性 0.6152\n不合理性 0.6094\n合法性 0.6219\n有效性 0.6374\n必要性 0.6499\n```\n\n\n## Decreasing the size of the model files\nThe model files are large (gigabytes). Each word is typically represented by a 300-dimensional vector of 32-bit floating-point values. Reducing dimensionality to 100 or 150 dimensions can produce smaller, faster, more memory-efficient models with minimal loss of accuracy (possibly even a gain). In `model_processing_utils/reduce-model-size`, I have written a program to reduce model dimensions. This can be used to reduce the size of any word2vec binary model used by w2vgrep. Use this like so:\n\n```bash\n# build\ncd model_processing_utils/reduce-model-size\ngo build .\n\n# run on large GoogleNews-vectors-negative300-SLIM.bin model (346MB) to make smaller\n# GoogleNews-vectors-negative100-SLIM.bin model (117MB)\n./reduce-pca -input ../../models/googlenews-slim/GoogleNews-vectors-negative300-SLIM.bin -output ../../models/googlenews-slim/GoogleNews-vectors-negative100-SLIM.bin\n\n# use this smaller model in w2vgrep like so\ncurl -s 'https://gutenberg.ca/ebooks/hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-t.txt' | bin/w2vgrep.linux.amd64 -n -t 0.5 -m models/googlenews-slim/GoogleNews-vectors-negative100-SLIM.bin death\n\n```\n\n\n## A word about performance of the different embedding models\nDifferent models define \"similarity\" differently ([explanation](https://machinelearninginterview.com/topics/natural-language-processing/what-is-the-difference-between-word2vec-and-glove/)). However, for practical purposes, they seem equivalent enough.\n\n\n## Contributing\nContributions are welcome! 
Please feel free to submit a Pull Request.\n\n\n## License and attribution:\nThe code in this project is licensed under the MIT [License](LICENSE). \n\n**go-flags package:**\n\nThe go-flags package, used by the code in this project, is distributed under the BSD-3-Clause license. Please see the license information at https://github.com/jessevdk/go-flags.\n\n**Word2Vec Model**:\n\nThis project uses a mirrored version of the word2vec-slim model, which is stored in the `models/googlenews-slim` directory. This model is distributed under the Apache License 2.0. For more information about the model, its original authors, and the license, please see the `models/googlenews-slim/ATTRIBUTION.md` file.\n\n**GloVe word vectors**:\n\nThis project uses a processed version of the GloVe word vectors, which is stored in the `models/glove` directory. This work is distributed under the Public Domain Dedication and License v1.0. For more information about the model, its original authors, and the license, please see the `models/glove/ATTRIBUTION.md` file.\n\n**Fasttext word vectors**:\n\nThis project uses a processed version of the fasttext word vectors, which is stored in the `models/fasttext` directory. This work is distributed under the Creative Commons Attribution-Share-Alike License 3.0. For more information about the model, its original authors, and the license, please see the `models/fasttext/ATTRIBUTION.md` file.\n\n\n## Sources of models on the web\n- Google's Word2Vec: from https://github.com/mmihaltz/word2vec-GoogleNews-vectors\n- A slim version of the above: GoogleNews-vectors-negative300-SLIM.bin.gz model from https://github.com/eyaler/word2vec-slim/\n- Stanford NLP group's Global Vectors for Word Representation (glove) model [source](https://nlp.stanford.edu/projects/glove/): a binary version is mirrored in [models/glove/](models/glove/).  
\n- Facebook fasttext vectors: https://fasttext.cc/docs/en/crawl-vectors.html","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farunsupe%2Fsemantic-grep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farunsupe%2Fsemantic-grep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farunsupe%2Fsemantic-grep/lists"}