{"id":22788463,"url":"https://github.com/astrabert/sentrev","last_synced_at":"2025-04-16T01:34:33.858Z","repository":{"id":264481382,"uuid":"893508815","full_name":"AstraBert/SenTrEv","owner":"AstraBert","description":"Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs","archived":false,"fork":false,"pushed_at":"2025-01-20T22:24:18.000Z","size":2644,"stargazers_count":25,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-12T14:05:19.147Z","etag":null,"topics":["embedders","evaluation-framework","python","python-package","qdrant","semantic-search","sentence-transformers","text-embedding","vector-database"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/sentrev/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraBert.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-24T16:19:08.000Z","updated_at":"2025-04-11T05:33:36.000Z","dependencies_parsed_at":"2025-01-20T23:21:41.403Z","dependency_job_id":"3968c199-ed87-4508-99a8-9efd53e3227f","html_url":"https://github.com/AstraBert/SenTrEv","commit_stats":null,"previous_names":["astrabert/sentrev"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FSenTrEv","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FSenTrEv/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraB
ert%2FSenTrEv/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FSenTrEv/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AstraBert","download_url":"https://codeload.github.com/AstraBert/SenTrEv/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249182579,"owners_count":21226091,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embedders","evaluation-framework","python","python-package","qdrant","semantic-search","sentence-transformers","text-embedding","vector-database"],"created_at":"2024-12-12T01:31:29.915Z","updated_at":"2025-04-16T01:34:33.840Z","avatar_url":"https://github.com/AstraBert.png","language":"Python","funding_links":["https://github.com/sponsors/AstraBert"],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\r\n\u003ch1\u003eSenTrEv\u003c/h1\u003e\r\n\u003ch2\u003eSimple evaluation for dense and sparse retrieval on your documents\u003c/h2\u003e\r\n\u003c/div\u003e\r\n\u003cbr\u003e\r\n\u003cdiv align=\"center\"\u003e\r\n    \u003cimg src=\"https://raw.githubusercontent.com/AstraBert/SenTrEv/main/logo.png\" alt=\"SenTrEv Logo\"\u003e\r\n    \u003cbr\u003e\r\n    \u003cbr\u003e\r\n    \u003ca href=\"https://doi.org/10.5281/zenodo.14583071\"\u003e\u003cimg src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.14583071.svg\" alt=\"DOI\"\u003e\u003c/a\u003e\r\n\u003c/div\u003e\r\n\u003cbr\u003e\r\n\r\n**SenTrEv** (**Sen**tence **Tr**ansformers **Ev**aluator) is a python package that is aimed at running 
simple evaluation tests to help you choose the best embedding model for Retrieval Augmented Generation (RAG) with your text-based documents.\r\n\r\n### Applicability\r\n\r\nSenTrEv works with:\r\n\r\n- **Dense** text encoders/embedders loaded through the class `SentenceTransformer` in the Python package [`sentence_transformers`](https://sbert.net/)\r\n- **Sparse** text encoders/embedders loaded through the class `SparseTextEmbedding` in the Python package [`fastembed`](https://pypi.org/project/fastembed)\r\n- PDF, PPTX, DOCX, HTML, CSV and XML documents (single and multiple uploads supported)\r\n- [Qdrant](https://qdrant.tech) vector databases (both local and on the cloud)\r\n\r\n### Installation\r\n\r\nYou can install the package using `pip` (**easier, but no customization**):\r\n\r\n```bash\r\npython3 -m pip install sentrev\r\n```\r\n\r\nOr you can build it from the source code (**more difficult, but customizable**):\r\n\r\n```bash\r\n# clone the repo\r\ngit clone https://github.com/AstraBert/SenTrEv.git\r\n# access the repo\r\ncd SenTrEv\r\n# build the package\r\npython3 -m build\r\n# install the package locally in editable mode\r\npython3 -m pip install -e .\r\n```\r\n\r\n### Evaluation process\r\n\r\nSenTrEv applies a very simple evaluation workflow:\r\n\r\n1. After the text extraction and chunking phase, each chunk is reduced to an (optionally) user-defined percentage of its length (default is 25%), extracted at a random position within the chunk.\r\n2. The reduced chunks are mapped to their original ones in a dictionary.\r\n3. Each model encodes the original chunks and uploads the vectors to the Qdrant vector storage.\r\n4. The reduced chunks are then used as queries for retrieval.\r\n5. Starting from the retrieval results, accuracy, time and carbon emissions statistics are calculated and plotted.\r\n\r\nSee the figure below for a visualization of the workflow.\r\n\r\n![workflow](https://raw.githubusercontent.com/AstraBert/SenTrEv/main/workflow.png)\r\n\r\nThe metrics used to evaluate performance are:\r\n\r\n- **Success rate**: the number of retrieval operations in which the correct context ranked first among all the retrieved contexts, out of the total number of retrieval operations:\r\n\r\n  $SR = \\frac{N_{correct}}{N_{tot}}$ (eq.1)\r\n\r\n- **Mean Reciprocal Ranking (MRR)**: MRR measures how high in the ranking the correct context is placed among the retrieved results. MRR@10 was used, meaning that for each retrieval operation 10 items were returned and the ranking of the correct context was evaluated, then normalized between 0 and 1 (already implemented in SenTrEv). An MRR of 1 means that the correct context was ranked first, whereas an MRR of 0 means that it wasn't retrieved. MRR is calculated with the following general equation:\r\n\r\n  $MRR = \\frac{N_{retrieved} - ranking + 1}{N_{retrieved}}$ (eq.2)\r\n\r\n  When the correct context is not retrieved, MRR is automatically set to 0. MRR is calculated for each retrieval operation, then the average and standard deviation are calculated and reported.\r\n- **Precision**: the number of relevant documents out of the total number of retrieved documents. Relevance is evaluated based on the \"page\" metadata entry: if a retrieved document comes from the same page as the query, it is considered relevant.\r\n- **Non-Relevant Ratio**: the number of non-relevant documents out of the total number of retrieved documents. 
Relevance is evaluated as explained in the previous point.\r\n- **Time performance**: for each retrieval operation, the retrieval time in seconds is measured; the average and standard deviation are then reported.\r\n- **Carbon emissions**: carbon emissions are calculated in gCO2eq (grams of CO2 equivalent) through the Python library [`codecarbon`](https://codecarbon.io/) and were evaluated for the Austrian region. They are reported for the global computational load of all the retrieval operations.\r\n\r\n### Use cases\r\n\r\n#### 1. Local Qdrant\r\n\r\nYou can easily run Qdrant locally with Docker:\r\n\r\n```bash\r\ndocker pull qdrant/qdrant:latest\r\ndocker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:latest\r\n```\r\n\r\nNow your vector database is listening at `http://localhost:6333`.\r\n\r\nLet's say we have several text-based files (`~/data/attention_is_all_you_need.pdf`, `~/data/generative_adversarial_nets.pdf`, `~/data/narration.docx`, `~/data/call-to-action.html`, `~/data/test.xml`) and we want to test dense retrieval with three different encoders (`sentence-transformers/all-mpnet-base-v2`, `sentence-transformers/all-MiniLM-L12-v2`, `sentence-transformers/LaBSE`) and sparse retrieval with three others (`Qdrant/bm25`, `prithivida/Splade_PP_en_v1`, `Qdrant/bm42-all-minilm-l6-v2-attentions`).\r\n\r\nWe can do it with this very simple code:\r\n\r\n```python\r\nfrom sentence_transformers import SentenceTransformer\r\nfrom qdrant_client import QdrantClient\r\nfrom fastembed import SparseTextEmbedding\r\nfrom sentrev.evaluator import evaluate_dense_retrieval, evaluate_sparse_retrieval\r\nimport os\r\n\r\n# Load all the dense embedding models\r\nencoder1 = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device=\"cuda\")\r\nencoder2 = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2', device=\"cuda\")\r\nencoder3 = SentenceTransformer('sentence-transformers/LaBSE', device=\"cuda\")\r\n\r\n# Create a list of the dense encoders\r\nencoders = [encoder1, encoder2, encoder3]\r\n\r\n# Create a dictionary that maps each encoder to its name\r\nencoder_to_names = {\r\n    encoder1: 'all-mpnet-base-v2',\r\n    encoder2: 'all-MiniLM-L12-v2',\r\n    encoder3: 'LaBSE',\r\n}\r\n\r\n# Collect data\r\npdfs = [\"~/data/attention_is_all_you_need.pdf\", \"~/data/generative_adversarial_nets.pdf\", \"~/data/narration.docx\", \"~/data/call-to-action.html\", \"~/data/test.xml\"]\r\n\r\n# Create Qdrant client\r\nclient = QdrantClient(\"http://localhost:6333\")\r\n\r\n# Set distances\r\ndistances = [\"cosine\", \"dot\", \"euclid\", \"manhattan\"]\r\n\r\n# Loop through different chunking_size, text_percentage and distance options\r\nfor chunking_size in range(500, 2000, 500):\r\n    for text_percentage in range(40, 100, 20):\r\n        perc = text_percentage / 100\r\n        for distance in distances:\r\n            os.makedirs(f\"dense_eval/{chunking_size}_{text_percentage}_{distance}/\")\r\n            csv_path = f\"dense_eval/{chunking_size}_{text_percentage}_{distance}/stats.csv\"\r\n            evaluate_dense_retrieval(pdfs, encoders, encoder_to_names, client, csv_path, chunking_size, text_percentage=perc, distance=distance, mrr=10, carbon_tracking=\"AUT\", plot=True)\r\n\r\n# Load all the sparse embedding models\r\nsparse_encoder1 = SparseTextEmbedding(\"Qdrant/bm25\")\r\nsparse_encoder2 = SparseTextEmbedding(\"prithivida/Splade_PP_en_v1\")\r\nsparse_encoder3 = SparseTextEmbedding(\"Qdrant/bm42-all-minilm-l6-v2-attentions\")\r\n\r\n# Create a list of the sparse encoders\r\nsparse_encoders = [sparse_encoder1, sparse_encoder2, sparse_encoder3]\r\n\r\n# Create a dictionary that maps each sparse encoder to its name\r\nsparse_encoder_to_names = {\r\n    sparse_encoder1: 'BM25',\r\n    sparse_encoder2: 'Splade',\r\n    sparse_encoder3: 'BM42',\r\n}\r\n\r\n# Loop through different chunking_size, text_percentage and distance options\r\nfor chunking_size in range(500, 2000, 500):\r\n    for text_percentage in range(40, 100, 20):\r\n        perc = text_percentage / 100\r\n        for distance in distances:\r\n            os.makedirs(f\"sparse_eval/{chunking_size}_{text_percentage}_{distance}/\")\r\n            csv_path = f\"sparse_eval/{chunking_size}_{text_percentage}_{distance}/stats.csv\"\r\n            evaluate_sparse_retrieval(pdfs, sparse_encoders, sparse_encoder_to_names, client, csv_path, chunking_size, text_percentage=perc, distance=distance, mrr=10, carbon_tracking=\"AUT\", plot=True)\r\n```\r\n\r\nYou can tune the evaluation through several arguments:\r\n\r\n- `chunking_size` controls the chunking of your documents\r\n- `text_percentage` sets the share of each chunk used to test retrieval\r\n- `distance` selects the distance metric used for retrieval\r\n- `mrr` sets the number of retrieved items (10 in this case)\r\n- `plot=True` saves evaluation plots in the same folder as the CSV file\r\n- `carbon_tracking`, followed by the three-letter ISO code of the country you are in, turns on carbon emissions tracking\r\n\r\n#### 2. On-cloud Qdrant\r\n\r\nYou can also use Qdrant's cloud database solutions (more about them [here](https://qdrant.tech)). You just need your Qdrant cluster URL and the API key to access it:\r\n\r\n```python\r\nfrom qdrant_client import QdrantClient\r\n\r\nclient = QdrantClient(url=\"YOUR-QDRANT-URL\", api_key=\"YOUR-API-KEY\")\r\n```\r\n\r\nThis is the only change you need to make to the code from the previous example.\r\n\r\n#### 3. 
Upload PDFs to Qdrant\r\n\r\nYou can also use SenTrEv to chunk, vectorize and upload your PDFs to a Qdrant database.\r\n\r\n```python\r\nfrom sentence_transformers import SentenceTransformer\r\nfrom qdrant_client import QdrantClient\r\nfrom sentrev.evaluator import upload_pdfs\r\n\r\nencoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\r\npdfs = ['~/pdfs/instructions.pdf', '~/pdfs/history.pdf', '~/pdfs/info.pdf']\r\nclient = QdrantClient(\"http://localhost:6333\")\r\n\r\nupload_pdfs(pdfs=pdfs, encoder=encoder, client=client)\r\n```\r\n\r\nAs before, you can also play around with the `chunking_size` argument (default is 1000) and with the `distance` argument (default is cosine).\r\n\r\nYou can also upload PDFs to a sparse collection:\r\n\r\n```python\r\nfrom qdrant_client import QdrantClient\r\nfrom fastembed import SparseTextEmbedding\r\nfrom sentrev.evaluator import upload_pdfs_sparse\r\n\r\nsparse_encoder1 = SparseTextEmbedding(\"Qdrant/bm25\")\r\npdfs = ['~/pdfs/instructions.pdf', '~/pdfs/history.pdf', '~/pdfs/info.pdf']\r\nclient = QdrantClient(\"http://localhost:6333\")\r\n\r\nupload_pdfs_sparse(pdfs=pdfs, encoder=None, sparse_encoder=sparse_encoder1, client=client)\r\n```\r\n\r\nYou can also upload documents that are not PDFs by converting them to PDF first:\r\n\r\n```python\r\nfrom sentrev.evaluator import to_pdf\r\n\r\nfiles = ['~/pdfs/instructions.md', '~/pdfs/history.docx', '~/pdfs/info.html', '~/pdfs/info.xml']\r\npdfs = to_pdf(files)\r\n```\r\n\r\n#### 4. 
Implement semantic search on a Qdrant collection\r\n\r\nYou can also search existing collections in a Qdrant database with SenTrEv:\r\n\r\n```python\r\nfrom sentence_transformers import SentenceTransformer\r\nfrom qdrant_client import QdrantClient\r\nfrom sentrev.utils import NeuralSearcher\r\n\r\nencoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\r\ncollection_name = 'customer_help'\r\nclient = QdrantClient(\"http://localhost:6333\")\r\n\r\nsearcher = NeuralSearcher(client=client, model=encoder, collection_name=collection_name)\r\nres = searcher.search(\"Is it possible to pay online with my credit card?\", limit=5)\r\n```\r\n\r\nIf your collection is of a sparse type, you can use this code:\r\n\r\n```python\r\nfrom qdrant_client import QdrantClient\r\nfrom sentrev.utils import NeuralSearcher\r\nfrom fastembed import SparseTextEmbedding\r\n\r\nsparse_encoder1 = SparseTextEmbedding(\"Qdrant/bm25\")\r\n\r\ncollection_name = 'customer_help'\r\nclient = QdrantClient(\"http://localhost:6333\")\r\n\r\nsearcher = NeuralSearcher(client=client, model=None, collection_name=collection_name)\r\nres = searcher.search_sparse(\"Is it possible to pay online with my credit card?\", sparse_encoder1, limit=5)\r\n```\r\n\r\nThe results are returned as a list of payloads (the metadata you uploaded to the Qdrant collection along with the vector points).\r\n\r\nIf you used SenTrEv's `upload_pdfs`/`upload_pdfs_sparse` functions, you can access the results in this way:\r\n\r\n```python\r\ntext = res[0][\"text\"]\r\nsource = res[0][\"source\"]\r\npage = res[0][\"page\"]\r\n```\r\n\r\n### Case Study\r\n\r\nYou can refer to the test case reported [here](https://github.com/AstraBert/SenTrEv/tree/main/CaseStudy.pdf).\r\n\r\n### Reference\r\n\r\nFind a reference for all the functions and classes [here](https://github.com/AstraBert/SenTrEv/tree/main/REFERENCE.md).\r\n\r\n### Contributing\r\n\r\nContributions are always welcome!\r\n\r\nFind contribution guidelines at [CONTRIBUTING.md](https://github.com/AstraBert/SenTrEv/tree/main/CONTRIBUTING.md).\r\n\r\n### License, Citation and 
Funding\r\n\r\nThis project is open-source and provided under the [MIT License](https://github.com/AstraBert/SenTrEv/tree/main/LICENSE).\r\n\r\nIf you used **SenTrEv**, please cite:\r\n\r\n_Bertelli, A. C. (2024). SenTrEv - Simple evaluation for dense and sparse retrieval on your documents (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14583071_\r\n\r\nIf you found it useful, please consider [funding it](https://github.com/sponsors/AstraBert).\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrabert%2Fsentrev","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrabert%2Fsentrev","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrabert%2Fsentrev/lists"}