{"id":28946283,"url":"https://github.com/amadeusitgroup/docs2vecs","last_synced_at":"2026-03-05T09:01:15.040Z","repository":{"id":279768994,"uuid":"906295208","full_name":"AmadeusITGroup/docs2vecs","owner":"AmadeusITGroup","description":"CLI that helps with docs splitting, embedding and exposing them in a seamless manner","archived":false,"fork":false,"pushed_at":"2026-02-12T08:59:58.000Z","size":3627,"stargazers_count":6,"open_issues_count":15,"forks_count":8,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-02-12T16:52:32.523Z","etag":null,"topics":["azure-ai","chromadb","cli-tool","data-ingestion","docker","document-processing","embeddings","llm","mongodb","natural-language-processing","python","rag","semantic-search","text-embedding","vector-database"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AmadeusITGroup.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-12-20T15:24:13.000Z","updated_at":"2026-02-12T08:54:23.000Z","dependencies_parsed_at":"2026-03-05T09:01:04.861Z","dependency_job_id":null,"html_url":"https://github.com/AmadeusITGroup/docs2vecs","commit_stats":null,"previous_names":["amadeusitgroup/docs2vecs"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/AmadeusITGroup/docs2vecs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmadeusITGroup%2Fdocs2vecs","tags_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmadeusITGroup%2Fdocs2vecs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmadeusITGroup%2Fdocs2vecs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmadeusITGroup%2Fdocs2vecs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AmadeusITGroup","download_url":"https://codeload.github.com/AmadeusITGroup/docs2vecs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmadeusITGroup%2Fdocs2vecs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30117470,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T08:19:04.902Z","status":"ssl_error","status_checked_at":"2026-03-05T08:17:37.148Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure-ai","chromadb","cli-tool","data-ingestion","docker","document-processing","embeddings","llm","mongodb","natural-language-processing","python","rag","semantic-search","text-embedding","vector-database"],"created_at":"2025-06-23T08:05:35.653Z","updated_at":"2026-03-05T09:01:15.024Z","avatar_url":"https://github.com/AmadeusITGroup.png","language":"Python","readme":"# Overview\nThis tool, `docs2vecs` is a library/cli that allows you to 
vectorize your data, enabling you to create RAG-powered applications.\n\n![data_ingestion](./docs/readme/vectorize.gif)\n\n\nFor these applications, `docs2vecs` simplifies the entire process:\n* Data ingestion: Use the `indexer` to run the data ingestion pipeline: data retrieval, chunking, embedding, and storing the resulting vectors in a Vector DB.\n* Build proof of concepts: `docs2vecs` allows you to quickly create a RAG prototype by using a local ChromaDB as a vector store and a `server` mode to chat with your data.\n\n\nThe `docs2vecs` project is managed with [uv](https://docs.astral.sh/uv/).\n\n# Usage\nYou can use `docs2vecs` in three ways:\n1. Install from PyPI\n2. Run locally from source\n3. Run from a Docker/Podman image\n\n## Install from PyPI\nYou can install `docs2vecs` from PyPI using pip:\n```sh\npip install docs2vecs\n```\nor\n```sh\npip install 'docs2vecs[all]'\n```\nto install all the extra dependencies (the quotes keep shells like zsh from expanding the brackets).\n\n## Run locally from source\n```sh\ngh repo clone AmadeusITGroup/docs2vecs\ncd docs2vecs\nuv run --directory src docs2vecs --help\n```\n\n## Run from Docker image\n\n```sh\nexport OCI_ENGINE=podman # or docker\nexport DOCS2VECS_VERSION=latest # or a specific version\n${OCI_ENGINE} run -it --rm \\\n    ghcr.io/amadeusitgroup/docs2vecs:${DOCS2VECS_VERSION} \\\n    --help # or any other valid command that can be run with docs2vecs\n```\n\n# Documentation\n\n\u003cdetails\u003e\u003csummary\u003eExpand me if you would like to find out how to vectorize your data\u003c/summary\u003e\n\n## Indexer sub-command\n\nThe `indexer` sub-command runs an indexer pipeline configured in a config file. 
This is usually used when you have a lot of data to vectorize and want to run it as a batch.\n\n```bash\nuv run --directory src docs2vecs indexer --help\n\nusage: docs2vecs indexer [-h] --config CONFIG [--env ENV]\noptions:\n--config CONFIG  Path to the YAML configuration file.\n--env ENV        Environment file to load.\n```\n\nThe `indexer` takes two arguments: a **mandatory** config file and an **optional** environment file.\n\nIn the config file, you'll need to define a list of skills, a skillset, and an indexer. Note that you may define many skills, but only those listed in the skillset will be executed, in sequence.\n\nExample:\n\n```bash\nuv run --directory src docs2vecs indexer --config ~/Downloads/sw_export_temp/config/confluence_process.yml --env ~/indexer.env\n```\n\n**Please check the [detailed skills documentation](docs/readme/indexer-skills.md).**\n\nThe config YAML file is validated against [this schema](./src/docs2vecs/subcommands/indexer/config/config_schema.yaml).\n\nPlease check [sample config file 1](docs/readme/sample-config-file-1.yml) and [sample config file 2](docs/readme/sample-config-file-2.yml) for reference.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003eExpand me if you would like to find out how to chat with your data\u003c/summary\u003e\n\n## Server sub-command\n\nIf you previously indexed your data (refer to the previous section) and stored the resulting embeddings in a local ChromaDB, you can chat with your data using the `server` sub-command.\n\n```bash\nuv run --directory src docs2vecs server --help\n\nusage: docs2vecs server [-h] [--host HOST] [--port PORT] [--model MODEL] [--cache_dir CACHE_DIR] [--path PATH]\n                        [--workers WORKERS] [--log_level LOG_LEVEL] [--env ENV]\n\noptions:\n  -h, --help            show this help message and exit\n  --host HOST           A host for the server.\n  --port PORT           A port for the server.\n  --model MODEL         A name of the 
embedding model(as per huggingface coordinates).\n  --cache_dir CACHE_DIR\n                        A path to the cache directory.\n  --path PATH           A path for the server.\n  --workers WORKERS     Number of workers for the server.\n  --log_level LOG_LEVEL\n                        Log level for the server.\n  --env ENV             Environment file to load.\n```\nBy default, the host is `localhost` and the port is `8008`.\n\nExample:\n```bash\nuv run --directory src docs2vecs server --path path/to/where/your/chroma/db/is\n```\nThen, by opening `http://localhost:8008/` in your browser, you should be able to see the embedding collections stored in your vector store and perform a k-NN search based on a user query. You can adjust K, the number of nearest neighbours returned by the semantic search.\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\u003csummary\u003eExpand me if you would like to find out how to create an integrated vectorization in Azure\u003c/summary\u003e\n\n## Integrated Vectorization sub-command\nThe `integrated_vec` sub-command runs an integrated vectorization pipeline configured in a config file.\n\n```bash\nuv run --directory src docs2vecs integrated_vec --help\n\nusage: docs2vecs integrated_vec [-h] --config CONFIG [--env ENV]\noptions:\n--config CONFIG  Path to the YAML configuration file.\n--env ENV        Environment file to load.\n```\n\nExample:\n\n```bash\nuv run --directory src docs2vecs integrated_vec --config ~/Downloads/sw_export_temp/config/config.yaml --env ~/integrated_vec.env\n```\n\nThe config YAML file is validated against [this schema](./src/docs2vecs/subcommands/integrated_vec/config/config_schema.yaml).\n\nSample config `yml` file:\n\n```yaml\n---\nintegrated_vec:\n    id: AzureAISearchIndexer\n    skill:\n        type: integrated_vec\n        name: AzureAISearchIntegratedVectorization\n        params:\n            search_ai_api_key: env.AZURE_AI_SEARCH_API_KEY\n            search_ai_endpoint: http://replace.me.with.your.endpoint\n            
embedding_endpoint: http://replace.me.with.your.endpoint\n            index_name: your_index_name\n            indexer_name: new_indexer_name\n            skillset_name: new_skillset_name\n            data_source_connection_string: ResourceId=/subscriptions/your_subscription_id/resourceGroups/resource_group_name/providers/Microsoft.Storage/storageAccounts/storage_account_name;\n            data_source_connection_name: new_connection_name\n            encryption_key: env.AZURE_AI_SEARCH_ENCRYPTION_KEY\n            container_name: your_container_name\n\n```\n\u003c/details\u003e\n\n## Important note:\nPlease note that **API keys** should **NOT** be stored in config files, and should **NOT** be committed to `git`. Therefore, when you build your config file, use the `env.` prefix for the `api_key` parameter. For example: `api_key: env.AZURE_AI_SEARCH_API_KEY`.\n\nMake sure you export the environment variables before you run the indexer. For convenience, you can use the `--env` argument to supply your own `.env` file.\n\nGenerate and use Scroll Word Exporter API tokens from the Personal Settings section of your Confluence profile.\n\n## Experimental features\n\u003cdetails\u003e\u003csummary\u003eTracker\u003c/summary\u003e\n\n### Tracker\n\nThe tracker feature allows you to monitor and manage the status of documents processed by the indexer. This is particularly useful for tracking failed documents and retrying their processing.\n\nTo achieve this, the tracker needs a `MongoDB` connection, which can be defined in the input config file.\n\nEach document in `MongoDB` has a `chunk` part with a `document_id`, which is the hash of that chunk's content. So, as long as the content is the same, the hash will stay the same. 
Besides this, there is a `status` property that keeps track of whether the upload to the vector store was successful.\n\nIf you'd like to use a different database to keep track of this, you'll have to write your own \"driver\" similar to the existing [mongodb](./src/docs2vecs/subcommands/indexer/db/mongodb.py). Then you need to add it to the [DBFactory](./src/docs2vecs/subcommands/indexer/skills/factory.py).\n\u003c/details\u003e\n\n# Development\n\nTo run tests with pytest:\n\n    uv python install 3.11\n    uv sync --all-extras --dev\n    uv run pytest tests\n\n\nIt is also possible to use tox:\n\n    uv pip install tox\n    uv run tox\n\nNote: to combine the coverage data from all the tox environments, run:\n\n| OS      | Command                            |\n| :---    | :---                                |\n| Windows | `set PYTEST_ADDOPTS=--cov-append && tox`   |\n| Other   | `PYTEST_ADDOPTS=--cov-append tox`       |\n\n# Releasing\nTo release a new version of the package, you can create a pre-release from the main branch using the GitHub UI, which will then trigger the release workflow. Alternatively, you can use the `gh` command line tool to create a release:\n\n```bash\ngh release create v[a.b.c] --prerelease --title \"Kick starting the release\" --target main\n```\n\n# Contributing\nWe welcome contributions to the `docs2vecs` project! If you have an idea for a new feature, bug fix, or improvement, please open an issue or submit a pull request. Before contributing, please read our [contributing guidelines](./CONTRIBUTING.md).","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famadeusitgroup%2Fdocs2vecs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famadeusitgroup%2Fdocs2vecs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famadeusitgroup%2Fdocs2vecs/lists"}