{"id":22261666,"url":"https://github.com/marplex/mcdse","last_synced_at":"2025-10-13T12:31:46.579Z","repository":{"id":259730629,"uuid":"877189741","full_name":"Marplex/mcdse","owner":"Marplex","description":"Multilingual model for OCR-free document retrieval","archived":false,"fork":false,"pushed_at":"2024-10-28T09:37:52.000Z","size":54,"stargazers_count":4,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-06T10:36:55.026Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Marplex.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-23T08:40:50.000Z","updated_at":"2025-01-23T17:41:18.000Z","dependencies_parsed_at":"2025-01-30T12:37:25.859Z","dependency_job_id":null,"html_url":"https://github.com/Marplex/mcdse","commit_stats":null,"previous_names":["marplex/mcdse"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Marplex/mcdse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marplex%2Fmcdse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marplex%2Fmcdse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marplex%2Fmcdse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marplex%2Fmcdse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Marplex","download_url":"https://codeload.github.com/Marplex/mcdse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marplex%2Fmcdse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279015056,"owners_count":26085643,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-03T09:13:44.091Z","updated_at":"2025-10-13T12:31:46.260Z","avatar_url":"https://github.com/Marplex.png","language":"Python","readme":"![](art/cover_wide.png)\n\n**mcdse-2b-v1** is a new experimental multilingual model for OCR-free document retrieval.\n\nThis model allows you to embed page/slide screenshots and query them using natural language. Tables, graphs, charts, schemas, images and text are \"automagically\" encoded for you into a single embedding vector. No need to worry about OCR, document layout analysis, reading order detection, table/formula extraction...\n\n- **Understands 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German**\n\n- **Matryoshka Representation Learning:** shrink embeddings from 1536 to 256 dimensions while maintaining 95% of the quality. A 6x reduction with negligible impact on performance!\n\n- **Top-tier Binarization**: 768-dimensional binary vectors retain 99% retrieval quality of the original 1536-dimensional float vectors. With binary vectors, you can encode **100 million multilingual pages in just 10GB**.\n\n- **Fast vLLM inference:** run inference on vLLM and efficiently serve embeddings at scale, production ready.\n\nFor more information about this model or how it was trained, visit the [announcement blogpost](https://huggingface.co/blog/marco/announcing-mcdse-2b-v1).\n\n## Evaluations\nGiven the scarcity of publicly available datasets for multilingual document image retrieval, the model has been evaluated using a custom-built dataset. This eval dataset was specifically designed to benchmark the model's performance across various languages.\n\n### NDCG@5 (float)\n|                     | Average    | English    | Italian    | Spanish    | French     | German     |\n|---------------------|------------|------------|------------|------------|------------|------------|\n| **1536 dimensions** |            |            |            |            |            |            |\n| dse-qwen2-2b-mrl-v1 |       79.5 |       79.2 |       80.2 |       77.9 |       80.6 |       79.6 |\n| mcdse-2b-v1         |   **82.2** |   **80.8** |   **81.2** |   **80.7** |   **84.5** |   **83.8** |\n|                     | **+3.28%** | **+1.98%** | **+1.23%** | **+3.47%** | **+4.62%** | **+5.01%** |\n| **1024 dimensions** |            |            |            |            |            |            |\n| dse-qwen2-2b-mrl-v1 |       78.3 |       78.8 |       78.5 |       76.5 |         80 |       77.5 |\n| mcdse-2b-v1         |   **81.7** |     **80** |   **80.2** |   **80.1** |     **84** |   **84.3** |\n|                     | **+4.23%** | **+1.75%** | **+2.12%** | **+4.49%** | **+4.76%** | **+8.07%** |\n| **768 dimensions**  |            |            |            |            |            |            |\n| dse-qwen2-2b-mrl-v1 |       77.8 |       78.4 |       78.3 |       75.6 |       80.8 |       75.9 |\n| mcdse-2b-v1         |   **81.1** |   **79.6** |   **79.9** |   **79.2** |   **83.3** |   **83.3** |\n|                     | **+4.02%** | **+1.51%** | **+2.00%** | **+4.55%** | **+3.00%** | **+8.88%** |\n| **512 dimensions**  |            |            |            |            |            |            |\n| dse-qwen2-2b-mrl-v1 |       76.2 |       77.6 |       75.9 |       73.1 |       79.2 |       75.2 |\n| mcdse-2b-v1         |   **79.3** |   **78.5** |   **79.1** |   **75.8** |   **81.4** |   **81.7** |\n|                     | **+3.91%** | **+1.15%** | **+4.05%** | **+3.56%** | **+2.70%** | **+7.96%** |\n| **384 dimensions**  |            |            |            |            |            |            |\n| dse-qwen2-2b-mrl-v1 |       75.7 |       76.2 |       75.5 |       74.6 |       78.4 |         74 |\n| mcdse-2b-v1         |   **78.8** |   **77.5** |   **78.5** |   **76.1** |   **80.4** |   **81.4** |\n|                     | **+3.86%** | **+1.68%** | **+3.82%** | **+1.97%** | **+2.49%** | **+9.09%** |\n| **256 dimensions**  |            |            |            |            |            |            |\n| dse-qwen2-2b-mrl-v1 |       73.5 |       74.5 |       73.6 |       70.6 |       74.8 |       73.8 |\n| mcdse-2b-v1         |   **78.1** |   **78.5** |   **77.6** |   **76.2** |   **80.1** |   **77.9** |\n|                     | **+5.89%** | **+5.10%** | **+5.15%** | **+7.35%** | **+6.62%** | **+5.26%** |\n\n### NDCG@5 (binary)\n|                     | Average     | English     | Italian     | Spanish     | French      | German      |\n|---------------------|-------------|-------------|-------------|-------------|-------------|-------------|\n| **1536 dimensions** |             |             |             |             |             |             |\n| dse-qwen2-2b-mrl-v1 |        75.0 |        75.8 |        75.4 |        72.4 |        78.1 |        73.2 |\n| mcdse-2b-v1         |    **80.6** |    **79.5** |    **76.9** |    **81.9** |    **83.7** |    **80.8** |\n|                     |  **+6.93%** |  **+4.65%** |  **+1.95%** | **+11.60%** |  **+6.69%** |  **+9.41%** |\n| **1024 dimensions** |             |             |             |             |             |             |\n| dse-qwen2-2b-mrl-v1 |        72.2 |        74.8 |          71 |        70.8 |        74.6 |        69.6 |\n| mcdse-2b-v1         |    **79.3** |    **78.4** |    **75.4** |    **80.8** |    **82.6** |    **79.5** |\n|                     |  **+9.05%** |  **+4.59%** |  **+5.84%** | **+12.38%** |  **+9.69%** | **+12.45%** |\n| **768 dimensions**  |             |             |             |             |             |             |\n| dse-qwen2-2b-mrl-v1 |        70.1 |        71.7 |        69.3 |        69.8 |        73.7 |        65.9 |\n| mcdse-2b-v1         |    **78.8** |    **77.1** |    **75.4** |      **80** |      **83** |    **78.5** |\n|                     | **+11.07%** |  **+7.00%** |  **+8.09%** | **+12.75%** | **+11.20%** | **+16.05%** |\n| **512 dimensions**  |             |             |             |             |             |             |\n| dse-qwen2-2b-mrl-v1 |        66.5 |          70 |        65.4 |        63.7 |        70.2 |          63 |\n| mcdse-2b-v1         |    **76.6** |    **74.8** |    **74.2** |    **77.7** |    **80.9** |    **75.3** |\n|                     | **+13.21%** |  **+6.42%** | **+11.86%** | **+18.02%** | **+13.23%** | **+16.33%** |\n| **384 dimensions**  |             |             |             |             |             |             |\n| dse-qwen2-2b-mrl-v1 |        61.1 |        62.7 |        58.5 |        58.6 |        65.1 |        60.8 |\n| mcdse-2b-v1         |    **74.3** |    **74.5** |    **71.4** |    **77.2** |    **75.2** |      **73** |\n|                     | **+17.67%** | **+15.84%** | **+18.07%** | **+24.09%** | **+13.43%** | **+16.71%** |\n| **256 dimensions**  |             |             |             |             |             |             |\n| dse-qwen2-2b-mrl-v1 |        54.3 |          59 |        56.5 |        53.6 |          53 |        49.6 |\n| mcdse-2b-v1         |    **70.9** |    **72.6** |    **66.4** |    **73.5** |    **72.6** |    **69.2** |\n|                     | **+23.31%** | **+18.73%** | **+14.91%** | **+27.07%** | **+27.00%** | **+28.32%** |\n\n\n\n## vLLM\nThis repo implements a new model class `Qwen2VLForEmbeddingGeneration` to support embedding generation with Qwen2VL models.\n\n### Download mcdse-2b-v1 for local inference\n```python\nfrom huggingface_hub import snapshot_download\nsnapshot_download(repo_id=\"marco/mcdse-2b-v1\", local_dir=\"/path/to/model/mcdse-2b-v1\")\n```\n\n### Edit config.json\nReplace `Qwen2VLForConditionalGeneration` with `Qwen2VLForEmbeddingGeneration`\n```bash\nsed -i -e 's/Qwen2VLForConditionalGeneration/Qwen2VLForEmbeddingGeneration/g' /path/to/model/mcdse-2b-v1/config.json\n```\n\n### Open `vllm/main.py` for usage instructions","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarplex%2Fmcdse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarplex%2Fmcdse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarplex%2Fmcdse/lists"}