{"id":51091116,"url":"https://github.com/xarantolus/search-engine","last_synced_at":"2026-06-24T02:01:48.062Z","repository":{"id":345073446,"uuid":"1110166669","full_name":"xarantolus/search-engine","owner":"xarantolus","description":"Custom search engine for all kinds of documents and storage services","archived":false,"fork":false,"pushed_at":"2026-03-17T14:12:06.000Z","size":269,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-18T04:45:14.788Z","etag":null,"topics":["apache-tika","github","gitlab","indexing","meilisearch","nas","network-drive","search-engine","self-hosted"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xarantolus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-04T20:03:28.000Z","updated_at":"2026-03-17T14:12:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/xarantolus/search-engine","commit_stats":null,"previous_names":["xarantolus/search-engine"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/xarantolus/search-engine","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xarantolus%2Fsearch-engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xarantolus%2Fsearch-engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/xarantolus%2Fsearch-engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xarantolus%2Fsearch-engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xarantolus","download_url":"https://codeload.github.com/xarantolus/search-engine/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xarantolus%2Fsearch-engine/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34713791,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-24T02:00:07.484Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-tika","github","gitlab","indexing","meilisearch","nas","network-drive","search-engine","self-hosted"],"created_at":"2026-06-24T02:01:46.448Z","updated_at":"2026-06-24T02:01:48.052Z","avatar_url":"https://github.com/xarantolus.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Search Engine\nThis search engine allows you to index all types of documents, no matter where they are stored. It supports both traditional text search and a user-customizable use of text embeddings.\n\nI personally use it to index millions of documents of different types, which results in \u003e50GB of raw text. 
Despite this, most search queries are pretty fast.\n\nMultiple document stores can be searched with this tool:\n- Anything that can be mounted via a network mount\n- Anything that can be synced to a local directory, like Sharepoint/OneDrive\n- GitLab repositories, including wikis, PRs and issues\n- Confluence Spaces\n- Public Websites\n\nBasically, the tool extracts text from all documents in these places, makes them searchable and links to them.\n\n## Setup instructions\n1. Create a GitLab Application. This is used to log people in and check if they are in the correct group, as specified by the `ALLOWED_GITLAB_GROUP_ID` variable.\n2. Create the configuration and `.env` file.\n3. Copy over the Docker Compose file and make any adjustments deemed necessary.\n\n\u003cdetails\u003e\n\u003csummary\u003eClick here for more details\u003c/summary\u003e\n\n### 1. GitLab Application (login provider)\nCreate an application with `read_api`, `read_user` and `openid` permission on GitLab (`https://gitlab.example.com/-/user_settings/applications`) (Preferences -\u003e Applications -\u003e Add new application).\nIf your server will expose the service on `https://search.example:8090`, you should add `https://search.example:8090/callback` to the allowed URLs in the GitLab application configuration.\n\nThen put the host info and credentials you get after creating the application into the `.env` file:\n\n```\nHOST_EXTERNAL_URL=http://\u003cmy-server-url\u003e:8090\nGITLAB_INSTANCE_URL=https://\u003cgitlab-instance-url\u003e\nGITLAB_APPLICATION_ID=\nGITLAB_APPLICATION_SECRET=\n# GitLab Group ID of users which are allowed to log in (it's an integer)\nALLOWED_GITLAB_GROUP_ID=12345\n```\n\n### 2. Configuration\nFirst, we need a configuration file. There is [an example configuration file](example-config.yml) with a lot of comments that explain how to use it. 
Basically, the configuration file defines which places are searchable, and how to access them.\n\nAdditionally, set up an `.env` file like this:\n\n```\n# Master Key that is used for logging into Meilisearch\n# Must have sufficient complexity, otherwise Meilisearch just rejects it.\nMEILI_MASTER_KEY=\n# GitLab API Key that is used for cloning repositories and indexing issues/PRs.\n# It requires the read_api, api, and ai_features scopes.\nGITLAB_API_KEY=glpat-...\n\n# GitLab OAuth settings. Explained in GitLab Application section in README\nHOST_EXTERNAL_URL=http://search.example:8090\nGITLAB_INSTANCE_URL=https://gitlab.example.com\nGITLAB_APPLICATION_ID=\nGITLAB_APPLICATION_SECRET=\n# Users must be in this group to access the search tool, otherwise they are denied access. This should be a number, not the group name.\nALLOWED_GITLAB_GROUP_ID=\n\n# Custom Environment variables that will also be available\n# when evaluating e.g. the mount commands in the config file\nNAS_USER=\nNAS_PW=\n```\n\n### 3. Docker Setup\nNow put the `docker-compose.yml` file and `.env` file in the same directory on your server that will host the service.\n\nEnsure you are logged into the container registry:\n\nThen, to start the server, run this:\n```\ndocker compose up\n```\n\nNow the server should be available. It will take some time to index stuff.\n\n## Permissions\nThis search engine is built with permission management in mind. It will index all available documents, but at search time, only the ones a user has access to will be returned.\n\nThe way this works is the following: every indexing item (e.g. a network mount, a git repo etc.) is associated with a \"permission tag\" (e.g. `ORG-Gitlab`).\n\nThen, we define permission groups that have multiple tags associated with them. 
A user can be part of a group, and their group memberships define the tags they have access to.\n\nWe can give one permission group to a user by default, so any newly logged in user has access to a few basic resources (e.g. repositories in the GitLab group that is required for logging in).\n\nTo edit user permissions, an admin user (those that have their numeric GitLab User ID in `admin_gitlab_ids` in the config) can go to `http://search.example:8090/admin` and edit permissions.\n\n\n\u003c/details\u003e\n\n## Development\nTo add a new source of documents, please add it to the scraper package, and then initialize your scraper from the [`indexer/main.go`](indexer/main.go) file with values from the configuration.\n\n### Design Justifications\nIf you are reasonable, you will be surprised by the number of different services defined in the [`docker-compose.yml`](docker-compose.yaml) file. Let me explain why this is necessary.\n\nTL;DR: `indexer` and `searcher` are split to reduce attack surface.\n\nServices:\n- `meilisearch` \u0026 `tika`: services maintained by their own teams that we use unmodified\n- `indexer`: indexes network mounts, GitLab instances etc. Initially, this was a \"background thread\" of the search service, however, it was split out due to security considerations: if there was some kind of path traversal vulnerability in our user-facing backend code, an attacker might be able to access any file on a NAS, as that is mounted into the same container. If we have a separate indexer that is not exposed to the outside world, the attack surface is reduced.\n- `searcher`: Takes in search requests and forwards them to Meilisearch. It also does some post-processing on the search results to reduce bandwidth used (as in: only send back the most relevant section)\n- `embedder`: Generates text embeddings if enabled, needs a GPU\n\n\n### Updating Meilisearch\nIf there is a new Meilisearch version, it is possible that the index format is no longer supported. 
You could [migrate it via a dump](https://www.meilisearch.com/docs/learn/update_and_migration/updating), or just ignore that and remove the old data (as in, just `rm -rf meili_data`).\n\nSince text extraction is usually cached, only the reindexing of the content is required.\nAlso, don't just update Meilisearch and assume the search tool will still work - likely, the client library needs to be updated as well.\n\n### [License](LICENSE)\nThis is free as in freedom software. Do whatever you like with it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxarantolus%2Fsearch-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxarantolus%2Fsearch-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxarantolus%2Fsearch-engine/lists"}