{"id":24283618,"url":"https://github.com/codercooke/vectorfilesearch","last_synced_at":"2026-04-20T01:04:11.081Z","repository":{"id":271628150,"uuid":"914063170","full_name":"CoderCookE/vectorFileSearch","owner":"CoderCookE","description":"Crawl files system and search file contents with pg_vector","archived":false,"fork":false,"pushed_at":"2025-01-08T22:10:06.000Z","size":7,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-05T15:51:52.203Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CoderCookE.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-08T21:56:44.000Z","updated_at":"2025-01-09T18:20:38.000Z","dependencies_parsed_at":"2025-01-08T23:34:18.928Z","dependency_job_id":null,"html_url":"https://github.com/CoderCookE/vectorFileSearch","commit_stats":null,"previous_names":["codercooke/vectorfilesearch"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CoderCookE/vectorFileSearch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderCookE%2FvectorFileSearch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderCookE%2FvectorFileSearch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderCookE%2FvectorFileSearch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderCookE%2FvectorFileSearch/manifests","owner_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CoderCookE","download_url":"https://codeload.github.com/CoderCookE/vectorFileSearch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderCookE%2FvectorFileSearch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32028550,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"ssl_error","status_checked_at":"2026-04-20T00:17:31.068Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-16T04:17:07.246Z","updated_at":"2026-04-20T01:04:11.055Z","avatar_url":"https://github.com/CoderCookE.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# File Embedding Processor\n\nThis project allows you to generate embeddings for files in a specified directory, store them in a PostgreSQL database, and later search the database based on a query to retrieve relevant file paths and their corresponding embeddings. 
The project uses HuggingFace and PyTorch for generating embeddings and `psycopg2` for database interaction.\n\n## Features\n- **Embedding Generation**: Generates file embeddings using HuggingFace's transformer models or a custom embedding generation function.\n- **Database Storage**: Embeddings are stored in a PostgreSQL database, allowing for fast retrieval.\n- **Search**: Query the database using a user-defined question, retrieve the most relevant file embeddings, and display the corresponding file paths.\n\n## Prerequisites\n\n- **Python 3.7+**: This project uses Python for scripting.\n- **PostgreSQL**: You will need a running PostgreSQL instance to store embeddings.\n- **CUDA (optional)**: If you have a GPU, PyTorch will automatically use it for faster embedding generation.\n\n## Requirements\n\n### Python Dependencies\n\nThe following Python packages are required:\n\n- `psycopg2`: PostgreSQL database adapter for Python.\n- `numpy`: Package for numerical operations (used for embedding normalization).\n- `torch`: PyTorch, used for running HuggingFace models.\n- `transformers`: HuggingFace library for pretrained transformer models.\n- `argparse`: Command-line argument parser (part of the Python standard library, so it needs no separate install).\n- `langchain`: For embedding models (if using the HuggingFaceEmbeddings class from `langchain`).\n\nInstall these packages using the `requirements.txt` file:\n\n```bash\npip install -r requirements.txt\n```\n\n## Setup\n\n### Step 1: Set up PostgreSQL Database\n\n1. Install PostgreSQL on your machine, if it's not already installed.\n2. Create a new database (`file_vector`) to store the embeddings.\n3. Run the following SQL to create the required table:\n\n```sql\nCREATE TABLE embeddings_table (\n    file_path TEXT PRIMARY KEY,\n    embedding FLOAT8[]\n);\n```\n\nAlternatively, you can use the provided `setup_database.sh` script to automatically set up the database:\n\n```bash\nsh setup_database.sh\n```\n\n### Step 2: Set up Virtual Environment\n\n1. 
Create a virtual environment:\n\n   ```bash\n   python -m venv venv\n   ```\n\n2. Activate the virtual environment:\n   - On macOS/Linux:\n     ```bash\n     source venv/bin/activate\n     ```\n\n3. Install required Python dependencies:\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n### Step 3: Configuration\n\n- Update the database connection parameters in both `main.py` and `question.py` to match your PostgreSQL setup:\n\n```python\nDB_HOST = \"localhost\"\nDB_PORT = \"5432\"\nDB_NAME = \"file_vector\"\nDB_USER = \"your_database_user\"\nDB_PASSWORD = \"your_database_password\"\n```\n\n### Step 4: Prepare Ignore List (Optional)\n\nIf you want to exclude certain files or directories from processing, create a text file (e.g., `ignore_list.txt`) containing newline-separated file or directory paths to ignore.\n\nExample `ignore_list.txt`:\n```\npath/to/ignore1\nfile_to_ignore.txt\n```\n\n### Step 5: Run the Embedding Generation Script\n\nThe `main.py` script processes files in the specified directory, generates embeddings, and stores them in the database. You can pass an optional ignore list file.\n\nTo run the embedding generation:\n\n```bash\npython main.py --path /path/to/start/directory --ignore-list ignore_list.txt\n```\n\nWhere:\n- `/path/to/start/directory` is the path to the directory you want to process.\n- `ignore_list.txt` is an optional file containing a list of files or directories to ignore.\n\n### Step 6: Perform a Search\n\nThe `question.py` script allows you to search the database using a query string. 
It generates an embedding for the query and retrieves the most relevant files based on vector similarity.\n\nTo run a query:\n\n```bash\npython question.py --question \"your search query\"\n```\n\nThis will generate an embedding for the search query and return the top 5 matching files with their embedding values.\n\n## Example Usage\n\n### Generate Embeddings for Files:\n\n```bash\npython main.py --path /path/to/files --ignore-list ignore_list.txt\n```\n\n### Search for Relevant Files Based on Query:\n\n```bash\npython question.py --question \"What are the important files related to data processing?\"\n```\n\n### Database Query Example\n\nIf you want to manually query the database, you can use the following SQL query to find the most similar embeddings to a given query embedding:\n\n```sql\nSELECT file_path, embedding\nFROM embeddings_table\nORDER BY embedding \u003c=\u003e %s::vector  -- Ensure proper cast to vector type\nLIMIT 5;\n```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodercooke%2Fvectorfilesearch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodercooke%2Fvectorfilesearch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodercooke%2Fvectorfilesearch/lists"}