https://github.com/marplex/mcdse

Multilingual model for OCR-free document retrieval
Host: GitHub
URL: https://github.com/marplex/mcdse
Owner: Marplex
License: mit
Created: 2024-10-23T08:40:50.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-10-28T09:37:52.000Z (over 1 year ago)
Last Synced: 2025-04-06T10:36:55.026Z (11 months ago)
Language: Python
Size: 52.7 KB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

![](art/cover_wide.png)

**mcdse-2b-v1** is a new experimental multilingual model for OCR-free document retrieval.

This model allows you to embed page/slide screenshots and query them using natural language. Tables, graphs, charts, schemas, images and text are "automagically" encoded for you into a single embedding vector. No need to worry about OCR, document layout analysis, reading order detection, table/formula extraction...

- **Understands 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German**

- **Matryoshka Representation Learning:** shrink embeddings from 1536 to 256 dimensions while maintaining 95% of the quality. A 6x reduction with negligible impact on performance!

- **Top-tier Binarization:** 768-dimensional binary vectors retain 99% of the retrieval quality of the original 1536-dimensional float vectors. With binary vectors, you can encode **100 million multilingual pages in just 10GB**.

- **Fast vLLM inference:** run inference on vLLM and serve embeddings efficiently at scale in production.
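The Matryoshka and binarization points above can be sketched with NumPy. This is a minimal illustration, not the repository's own code: the random array stands in for real model output, the helper names are hypothetical, and re-normalizing after truncation and thresholding at zero are common conventions assumed here rather than confirmed by this README.

```python
import numpy as np

def truncate_embeddings(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and L2-normalize each row (assumed convention)."""
    cut = emb[:, :dims]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)

def binarize_embeddings(emb: np.ndarray) -> np.ndarray:
    """Map each component to one bit (1 if positive) and pack 8 bits per byte."""
    return np.packbits(emb > 0, axis=1)

def hamming_distances(query_bits: np.ndarray, doc_bits: np.ndarray) -> np.ndarray:
    """Count differing bits between one packed query and every packed document."""
    xor = np.bitwise_xor(doc_bits, query_bits)
    return np.unpackbits(xor, axis=1).sum(axis=1)

# Stand-in for 1536-dimensional float embeddings of four pages (not real model output).
rng = np.random.default_rng(0)
page_embeddings = rng.standard_normal((4, 1536)).astype(np.float32)

small = truncate_embeddings(page_embeddings, 256)                       # 6x fewer floats
packed = binarize_embeddings(truncate_embeddings(page_embeddings, 768))

bytes_per_page = packed.shape[1]                # 768 bits / 8 = 96 bytes
total_gb = 100_000_000 * bytes_per_page / 1e9   # storage for 100 million pages
print(small.shape, packed.shape, total_gb)
```

At 96 bytes per page, 100 million pages take about 9.6 GB, which is where the "10GB" figure comes from. Packed vectors can then be compared with `hamming_distances(packed[0], packed)`, where a smaller distance means a closer match.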

For more information about this model or how it was trained, visit the [announcement blogpost](https://huggingface.co/blog/marco/announcing-mcdse-2b-v1).

## Evaluations

Given the scarcity of publicly available datasets for multilingual document image retrieval, the model has been evaluated using a custom-built dataset. This eval dataset was specifically designed to benchmark the model's performance across various languages.
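For readers unfamiliar with the metric reported below, here is a minimal sketch of NDCG@5 using the standard linear-gain definition. How the custom dataset grades relevance is not described in this README, so the binary relevance labels and the function names are assumptions made for illustration.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k results (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    `ranked_relevances` holds the relevance of every candidate page,
    in the order the retriever returned them.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical query with a single relevant page, retrieved at rank 3.
print(ndcg_at_k([0, 0, 1, 0, 0]))
```

A relevant page at rank 1 scores 1.0, and one that falls outside the top five scores 0.0. The tables report the metric scaled to 0-100.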

### NDCG@5 (float)

|                     | Average    | English    | Italian    | Spanish    | French     | German     |
|---------------------|------------|------------|------------|------------|------------|------------|
| **1536 dimensions** |            |            |            |            |            |            |
| dse-qwen2-2b-mrl-v1 |       79.5 |       79.2 |       80.2 |       77.9 |       80.6 |       79.6 |
| mcdse-2b-v1         |   **82.2** |   **80.8** |   **81.2** |   **80.7** |   **84.5** |   **83.8** |
|                     | **+3.28%** | **+1.98%** | **+1.23%** | **+3.47%** | **+4.62%** | **+5.01%** |
| **1024 dimensions** |            |            |            |            |            |            |
| dse-qwen2-2b-mrl-v1 |       78.3 |       78.8 |       78.5 |       76.5 |         80 |       77.5 |
| mcdse-2b-v1         |   **81.7** |     **80** |   **80.2** |   **80.1** |     **84** |   **84.3** |
|                     | **+4.23%** | **+1.75%** | **+2.12%** | **+4.49%** | **+4.76%** | **+8.07%** |
| **768 dimensions**  |            |            |            |            |            |            |
| dse-qwen2-2b-mrl-v1 |       77.8 |       78.4 |       78.3 |       75.6 |       80.8 |       75.9 |
| mcdse-2b-v1         |   **81.1** |   **79.6** |   **79.9** |   **79.2** |   **83.3** |   **83.3** |
|                     | **+4.02%** | **+1.51%** | **+2.00%** | **+4.55%** | **+3.00%** | **+8.88%** |
| **512 dimensions**  |            |            |            |            |            |            |
| dse-qwen2-2b-mrl-v1 |       76.2 |       77.6 |       75.9 |       73.1 |       79.2 |       75.2 |
| mcdse-2b-v1         |   **79.3** |   **78.5** |   **79.1** |   **75.8** |   **81.4** |   **81.7** |
|                     | **+3.91%** | **+1.15%** | **+4.05%** | **+3.56%** | **+2.70%** | **+7.96%** |
| **384 dimensions**  |            |            |            |            |            |            |
| dse-qwen2-2b-mrl-v1 |       75.7 |       76.2 |       75.5 |       74.6 |       78.4 |         74 |
| mcdse-2b-v1         |   **78.8** |   **77.5** |   **78.5** |   **76.1** |   **80.4** |   **81.4** |
|                     | **+3.86%** | **+1.68%** | **+3.82%** | **+1.97%** | **+2.49%** | **+9.09%** |
| **256 dimensions**  |            |            |            |            |            |            |
| dse-qwen2-2b-mrl-v1 |       73.5 |       74.5 |       73.6 |       70.6 |       74.8 |       73.8 |
| mcdse-2b-v1         |   **78.1** |   **78.5** |   **77.6** |   **76.2** |   **80.1** |   **77.9** |
|                     | **+5.89%** | **+5.10%** | **+5.15%** | **+7.35%** | **+6.62%** | **+5.26%** |
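The README does not say how the percentage rows are derived. Judging from the published numbers, they appear to be the gap between the two models divided by the mcdse-2b-v1 score. The sketch below encodes that inferred formula; treat it as an assumption, and note that `relative_gain` is a name introduced only for this example.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Gap between two scores as a percentage of the improved score (inferred formula)."""
    return round((improved - baseline) / improved * 100, 2)

# 1536-dimension float scores, Average column: dse-qwen2-2b-mrl-v1 vs mcdse-2b-v1.
print(relative_gain(79.5, 82.2))
```

This reproduces the +3.28% cell for the 1536-dimension Average column. Other cells can differ by a few hundredths, presumably because the displayed scores are rounded.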

### NDCG@5 (binary)

|                     | Average     | English     | Italian     | Spanish     | French      | German      |
|---------------------|-------------|-------------|-------------|-------------|-------------|-------------|
| **1536 dimensions** |             |             |             |             |             |             |
| dse-qwen2-2b-mrl-v1 |        75.0 |        75.8 |        75.4 |        72.4 |        78.1 |        73.2 |
| mcdse-2b-v1         |    **80.6** |    **79.5** |    **76.9** |    **81.9** |    **83.7** |    **80.8** |
|                     |  **+6.93%** |  **+4.65%** |  **+1.95%** | **+11.60%** |  **+6.69%** |  **+9.41%** |
| **1024 dimensions** |             |             |             |             |             |             |
| dse-qwen2-2b-mrl-v1 |        72.2 |        74.8 |          71 |        70.8 |        74.6 |        69.6 |
| mcdse-2b-v1         |    **79.3** |    **78.4** |    **75.4** |    **80.8** |    **82.6** |    **79.5** |
|                     |  **+9.05%** |  **+4.59%** |  **+5.84%** | **+12.38%** |  **+9.69%** | **+12.45%** |
| **768 dimensions**  |             |             |             |             |             |             |
| dse-qwen2-2b-mrl-v1 |        70.1 |        71.7 |        69.3 |        69.8 |        73.7 |        65.9 |
| mcdse-2b-v1         |    **78.8** |    **77.1** |    **75.4** |      **80** |      **83** |    **78.5** |
|                     | **+11.07%** |  **+7.00%** |  **+8.09%** | **+12.75%** | **+11.20%** | **+16.05%** |
| **512 dimensions**  |             |             |             |             |             |             |
| dse-qwen2-2b-mrl-v1 |        66.5 |          70 |        65.4 |        63.7 |        70.2 |          63 |
| mcdse-2b-v1         |    **76.6** |    **74.8** |    **74.2** |    **77.7** |    **80.9** |    **75.3** |
|                     | **+13.21%** |  **+6.42%** | **+11.86%** | **+18.02%** | **+13.23%** | **+16.33%** |
| **384 dimensions**  |             |             |             |             |             |             |
| dse-qwen2-2b-mrl-v1 |        61.1 |        62.7 |        58.5 |        58.6 |        65.1 |        60.8 |
| mcdse-2b-v1         |    **74.3** |    **74.5** |    **71.4** |    **77.2** |    **75.2** |      **73** |
|                     | **+17.67%** | **+15.84%** | **+18.07%** | **+24.09%** | **+13.43%** | **+16.71%** |
| **256 dimensions**  |             |             |             |             |             |             |
| dse-qwen2-2b-mrl-v1 |        54.3 |          59 |        56.5 |        53.6 |          53 |        49.6 |
| mcdse-2b-v1         |    **70.9** |    **72.6** |    **66.4** |    **73.5** |    **72.6** |    **69.2** |
|                     | **+23.31%** | **+18.73%** | **+14.91%** | **+27.07%** | **+27.00%** | **+28.32%** |

## vLLM

This repo implements a new model class `Qwen2VLForEmbeddingGeneration` to support embedding generation with Qwen2VL models.

### Download mcdse-2b-v1 for local inference

```python
from huggingface_hub import snapshot_download

# Download the model files into a local directory that vLLM can load from
snapshot_download(repo_id="marco/mcdse-2b-v1", local_dir="/path/to/model/mcdse-2b-v1")
```

### Edit config.json

Replace `Qwen2VLForConditionalGeneration` with `Qwen2VLForEmbeddingGeneration`

```bash
sed -i -e 's/Qwen2VLForConditionalGeneration/Qwen2VLForEmbeddingGeneration/g' /path/to/model/mcdse-2b-v1/config.json
```
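If `sed` is not available (for example on Windows), the same edit can be sketched in Python with the standard library only. This assumes the class name is listed under the `architectures` key, as is usual for Hugging Face `config.json` files. The helper `switch_architecture` is a hypothetical name for this example, not part of the repository.

```python
import json
from pathlib import Path

def switch_architecture(
    config_path,
    old="Qwen2VLForConditionalGeneration",
    new="Qwen2VLForEmbeddingGeneration",
):
    """Rewrite the model class listed in config.json and return the updated list."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # Assumption: the class name lives under the "architectures" key.
    config["architectures"] = [
        new if name == old else name for name in config.get("architectures", [])
    ]
    path.write_text(json.dumps(config, indent=2))
    return config["architectures"]
```

Call it as `switch_architecture("/path/to/model/mcdse-2b-v1/config.json")`. Unlike the `sed` command, which replaces every occurrence of the string in the file, this version only touches the `architectures` list.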

### Open `vllm/main.py` for usage instructions