{"id":20478691,"url":"https://github.com/do-me/cordis-semantic-search","last_synced_at":"2026-05-28T21:31:23.762Z","repository":{"id":199498891,"uuid":"703020429","full_name":"do-me/cordis-semantic-search","owner":"do-me","description":"A simple semantic search application for CORDIS running entirely in the browser","archived":false,"fork":false,"pushed_at":"2023-11-25T10:31:07.000Z","size":39056,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-05T15:14:05.254Z","etag":null,"topics":["cordis","semantic-search","semanticsearch","transformers"],"latest_commit_sha":null,"homepage":"https://do-me.github.io/cordis-semantic-search/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/do-me.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-10T12:52:46.000Z","updated_at":"2024-06-25T11:35:29.000Z","dependencies_parsed_at":"2023-11-25T11:36:59.105Z","dependency_job_id":null,"html_url":"https://github.com/do-me/cordis-semantic-search","commit_stats":null,"previous_names":["do-me/cordis-semantic-search"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/do-me/cordis-semantic-search","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fcordis-semantic-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fcordis-semantic-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fcordis-semantic-search/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fcordis-semantic-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/do-me","download_url":"https://codeload.github.com/do-me/cordis-semantic-search/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fcordis-semantic-search/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33627934,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cordis","semantic-search","semanticsearch","transformers"],"created_at":"2024-11-15T15:38:42.612Z","updated_at":"2026-05-28T21:31:23.748Z","avatar_url":"https://github.com/do-me.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CORDIS semantic search\n\n### Intro\nA basic semantic search app based on 133.952 public pdfs (~400GB) from [CORDIS](https://cordis.europa.eu/search/en) chunked and indexed (mean embedding of all chunks) in a ~38MB gzipped json with [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).\nApp loads ~50Mb of resources of data and scripts. 
The data cutoff is 2022.\n\n### Architecture \nThe app loads a gzipped JSON file in which each record holds a filename referring to a PDF downloaded from CORDIS and its mean embedding, a vector of 384 dimensions: \n\n| | filename | mean_embedding |\n|-:|:-|:-|\n| 0 | project_rcn_229984_projectDeliverable_webLinkId_c314060ff50aa63cf69787e20ae3776e.pdf | [-0.02, ..., -0.03] |\n| 1 | project_rcn_211323_projectDeliverable_webLinkId_973e210a8393dd1e82ab26ae5f1fcc55.pdf | [0.02, ..., 0.0] |\n| 2 | project_rcn_211567_projectDeliverable_webLinkId_92ee89e81e18ca78c510f7d3a41a0cef.pdf | [-0.04, ..., -0.02] |\n| 3 | project_rcn_206371_projectDeliverable_webLinkId_18c997f51b451d2653e5b4e821ce2b8f.pdf | [-0.04, ..., 0.02] |\n| 4 | project_rcn_229098_projectDeliverable_webLinkId_e67766b20e28a7215683a66666933a64.pdf | [0.01, ..., 0.02] |\n\n- The static web app parses each filename and translates it into a URL where possible.\n- The floats in each vector are trimmed to 2 decimal places, a choice based on empirical trials. The search is not intended to deliver accurately ranked results but rather to return the most related ones, e.g. the top 20, which works well in practice. The same file with 3 decimal places per float would be ~80MB, while one with all decimal places (the default precision of sentence transformers) would be 1.2GB, which isn't feasible for a static web app. An alternative approach using product quantization is being explored.\n- Uses IndexedDB to cache the ~38MB gzipped JSON in the browser, so subsequent visits are fast. 
\n\n### Packages used \n- [transformers.js](https://github.com/xenova/transformers.js) for in-browser inference on the user query\n- [pako.js](https://github.com/nodeca/pako) for decompressing the gzipped JSON\n- [bootstrap](https://getbootstrap.com/) for basic styling\n\n### Data inspection \nIf you'd like to inspect the data, pandas decompresses the gzipped JSON automatically:\n\n```python\nimport pandas as pd\ndf = pd.read_json(\"filename_mean_embedding_prec_2_records.json.gz\")\ndf\n```\n\n### Future ideas \n- Use better embedding models from the MTEB leaderboard, such as bge-base\n- Use Parquet instead of gzipped JSON, which might improve read times\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Fcordis-semantic-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdo-me%2Fcordis-semantic-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Fcordis-semantic-search/lists"}