{"id":26512929,"url":"https://github.com/roffys/markeverythingdown","last_synced_at":"2025-06-27T07:33:21.650Z","repository":{"id":282659378,"uuid":"949270265","full_name":"RoffyS/MarkEverythingDown","owner":"RoffyS","description":"Convert files (PDF, image, Word, PPT, Excel, notebooks, code snippets) to markdown using powerful multimodal LLM","archived":false,"fork":false,"pushed_at":"2025-05-08T22:26:14.000Z","size":5057,"stargazers_count":232,"open_issues_count":6,"forks_count":22,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-05-08T23:29:33.900Z","etag":null,"topics":["image-processing","markdown","markdownconversion","microsoft-office","multimodal","pdf","pdftomarkdown","qwen"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RoffyS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-16T03:59:18.000Z","updated_at":"2025-05-08T22:26:18.000Z","dependencies_parsed_at":"2025-04-06T02:33:30.502Z","dependency_job_id":null,"html_url":"https://github.com/RoffyS/MarkEverythingDown","commit_stats":null,"previous_names":["roffys/markeverythingdown"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/RoffyS/MarkEverythingDown","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoffyS%2FMarkEverythingDown","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoffyS%2FMarkEverythingDown/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoffyS%2F
MarkEverythingDown/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoffyS%2FMarkEverythingDown/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RoffyS","download_url":"https://codeload.github.com/RoffyS/MarkEverythingDown/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoffyS%2FMarkEverythingDown/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262212784,"owners_count":23275986,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-processing","markdown","markdownconversion","microsoft-office","multimodal","pdf","pdftomarkdown","qwen"],"created_at":"2025-03-21T04:19:07.098Z","updated_at":"2025-06-27T07:33:21.631Z","avatar_url":"https://github.com/RoffyS.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MarkEverythingDown\n[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/RoffyS/MarkEverythingDown)\n+ **MarkEverythingDown** - Your all-in-one document-to-Markdown converter! 🚀\n  Convert PDF, Office, image, code, and other files into clearly structured Markdown in one click, designed and optimized for LLMs. Combined with the Qwen2.5 VL vision model, it can intelligently parse even scanned documents!\n\n## ✨ Advantages\n✅ **AI superpowers** - Deep integration with the Qwen2.5 VL model, perfectly preserving emojis and image descriptions  \n✅ **Full format coverage** - Handles everything from WeChat screenshots to academic papers  \n✅ **Dual-mode processing** - Switch freely between local and cloud, balancing privacy and performance  \n✅ **Beginner-friendly** - No coding required; drag and drop files to convert them immediately  \n✅ **Smart batching** - Optimized handling of large PDF documents with automatic batch-size adjustment\n\n**MarkEverythingDown** is a versatile document conversion tool that transforms various file formats into clean, structured markdown. 
Whether you're working with PDFs, Office documents, images, code files, or notebooks, MarkEverythingDown provides a unified interface to convert them all.\n\nThe tool is specifically designed to leverage **Qwen2.5 VL** models through OpenAI-compatible APIs, supporting both local inference engines like LMStudio and cloud API providers like DashScope. This design enables high-quality processing of visual content while maintaining flexibility in deployment options.\n\nI developed this tool to streamline the conversion of documents into markdown format, which is both LLM-friendly and easy for humans to read. The goal is to make document processing as seamless as possible, allowing users to easily convert their files for RAG applications or SFT dataset preparation.\n\n## Roadmap\n\n### Recently Implemented (April 2025)\n\n#### Enhanced Processing Options\n- ✅ **Temperature Control**: Added temperature parameter (0.0-1.0) for controlling the determinism of AI output\n- ✅ **Max Tokens Setting**: Implemented customizable token limits for generation\n- ✅ **Multi-Page Processing**: Added support for processing multiple PDF pages in a single API call\n- ✅ **Dynamic Batch Sizing**: Implemented intelligent adjustment of batch sizes based on page complexity\n- ✅ **Optimized Token Management**: Added max_tokens_per_batch option to prevent token limit issues\n\n#### Improved Document Support\n- ✅ **Enhanced Table Handling in Word Documents**: Better preservation of table structure and formatting in DOCX files\n- ✅ **Excel Spreadsheet Support**: Full support for XLSX files with proper table formatting\n- ✅ **Better Visual Elements Preservation**: Improved handling of emojis and image descriptions\n\n#### Interface Improvements\n- ✅ **Enhanced UI Tooltips**: Clearer explanations of processing options\n- ✅ **Improved Error Handling**: Better feedback for processing issues\n- ✅ **Progress Indicators**: Added visual feedback during processing\n\n### Planned Features\n\n#### 
Near-term\n- 🔜 **CSV and TSV Support**: Native support for tabular data files\n- 🔜 **Custom Templates**: User-defined output formats for different document types\n- 🔜 **Batch Processing Improvements**: Enhanced management of large document collections\n\n#### Long-term\n- 🔜 **Multi-model Support**: Integration with additional vision-language models\n- 🔜 **Advanced Document Analysis**: Improved extraction of complex structures like footnotes and citations\n- 🔜 **API Mode**: Headless operation for integration with other applications\n- 🔜 **Collaborative Editing**: Real-time collaborative editing of converted documents\n\n## Features\n\n- **Multi-format support**: Convert PDFs, DOCX, PPTX, XLSX, images, code files, notebooks, and markdown variants\n- **Intelligent processing**: Automatically selects the appropriate processor for each file type\n- **Vision AI support**: Optimized for Qwen2.5 VL models with OpenAI-compatible interface\n- **Dual processing options**: Support local inference APIs and cloud APIs\n- **Batch processing**: Process multiple files at once with a simple interface\n- **User-friendly UI**: Easy-to-use web UI with Gradio and helpful tooltips\n- **Command line interface**: Quick conversions from the terminal\n\n## Supported Formats\n\n| Category | Formats |\n|----------|---------|\n| Documents | PDF, DOCX, PPTX, XLSX |\n| Images | PNG, JPG, JPEG, BMP |\n| Code | Python, R, and other programming languages |\n| Notebooks | Jupyter Notebooks (ipynb) |\n| Markdown | MD, RMD (R Markdown) |\n| Text | TXT |\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/RoffyS/MarkEverythingDown.git\ncd MarkEverythingDown\n\n# Set up a virtual environment\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Install dependencies\npip install -r requirements.txt\n```\n\n## Usage\n\n### Web UI (Recommended)\n\n```bash\n# Launch the web interface\npython main.py --ui\n```\n\n![GUI](ui/GUI.png)\n\nThe 
MarkEverythingDown web interface provides an intuitive way to convert your documents to markdown:\n\n1. **Upload Files**: Drag and drop single or multiple files into the upload area\n2. **Configure Output**: Specify where you want your converted markdown files to be saved\n3. **Processing Options**: \n   - **Concurrent Processing**: Control how many API calls are made at once\n   - **Pages Per Batch**: Set the maximum number of PDF pages to send in each API call\n   - **Dynamic Batching**: Enable automatic adjustment of `Pages Per Batch` based on page complexity and token limits\n   - **Temperature**: Adjust the creativity level of the AI (0.0 for deterministic results)\n   - **Max Tokens**: Set token limits for generation (blank uses model default)\n\n4. **API Configuration**: Configure API settings for your vision model:\n   - **API Key**: Your API key (default: \"lmstudio\" for local inference)\n   - **API URL**: The base URL for your API endpoint (default: \"http://localhost:1234/v1\")\n   - **Model Name**: The model to use for processing (default: \"qwen2.5-vl-7b-instruct\")\n\n### Command Line\n\n```bash\n# The first argument (sample_pdf.pdf) is the path to the input file\npython main.py sample_pdf.pdf \\\n    --api-key lm_studio \\\n    --base-url http://localhost:1234/v1 \\\n    --model qwen2.5-vl-32b-instruct \\\n    --force-vision \\\n    --max-concurrent 1 \\\n    --output test \\\n    --images-per-batch 1 \\\n    --dynamic-batching \\\n    --max-tokens-per-batch 8192\n```\n\n#### Command Line Options\n\n| Option | Description | Default |\n|--------|-------------|---------|\n| `--output`, `-o` | Output directory for markdown files | `output` |\n| `--ui` | Launch the graphical user interface | - |\n| `--force-vision` | Use vision model for PDFs instead of text extraction | `False` |\n| `--max-concurrent` | Maximum concurrent workers for PDF page processing | `2` |\n| `--images-per-batch` | Maximum number of PDF pages per API call | `1` |\n| `--dynamic-batching` | Automatically adjust images-per-batch 
based on page complexity and maximum tokens per batch | `True` |\n| `--no-dynamic-batching` | Disable dynamic batching | - |\n| `--max-tokens-per-batch` | Maximum tokens per batch for dynamic batching | `4000` |\n| `--temperature` | Temperature for generation (0.0-1.0) | `0.0` |\n| `--max-tokens` | Maximum tokens for generation | Model default |\n| `--api-key` | API key for vision processor | `lmstudio` |\n| `--base-url` | Base URL for API endpoint | `http://localhost:1234/v1` |\n| `--model` | Model name to use | `qwen2.5-vl-7b-instruct` |\n\n## Example Use Cases\n\nBelow are several examples of converting images, PDFs, and Office documents into markdown format. You are welcome to try it out with your own documents, either through the web UI or the command line. You can also play around with the prompt templates in ***processors/vision/vision_processor.py*** to customize the output format of PDFs and images.\n\n### 1. Image Processing\n\n#### Course Slides with Images \n\n**Input**: ![test_image1.png - A slide about the Turing Award](test_docs/test_image1.png)\n\n**Output** (`test_output/test_image1.md`):\n```markdown\n# 2018 Turing Award for deep learning\n\nThe most prestigious technical award, given to individuals who have made major \ncontributions of lasting importance to computing.\n\n## Recipients\n\n- **Geoffrey Hinton**\n- **Yoshua Bengio**\n- **Yann LeCun**\n\n## Lecture Details\n- **Lecture 1 - Slide 27**\n- **Date:** April 4, 2023\n- **Presenters:** Fei-Fei Li, Yunzhu Li, Ruohan Gao\n```\n\n#### Course Slides with Code\n\n**Input**: ![test_image3.png - A slide about basic R programming](test_docs/test_image3.png)\n\n**Output** (`test_output/test_image3.md`):\n```markdown\n# Basic Data Types in R: Numeric\n\n## Numeric: Default Data Type in R Representing Decimal Values\n\n- **Numeric:** The default data type in R for representing decimal values.\n  - Assign a decimal value:\n    ```R\n    x \u003c- 3.14\n    ```\n  - Print the value of `x`:\n    ```R\n  
  x\n    # [1] 3.14\n    ```\n  - Print the class name of `x`:\n    ```R\n    class(x)\n    # [1] \"numeric\"\n    ```\n  - Assign an integer value:\n    ```R\n    k \u003c- 3\n    ```\n  - Print the value of `k`:\n    ```R\n    k\n    # [1] 3\n    ```\n  - Print the class name of `k`:\n    ```R\n    class(k)\n    # [1] \"numeric\"\n    ```\n- Even integer values are stored as numeric unless explicitly declared:\n    ```R\n    class(k)\n    # [1] \"numeric\"\n    ```\n  - Check if `k` is an integer:\n    ```R\n    is.integer(k)\n    # [1] FALSE\n    ```\n\n## Try it Yourself:\n- [Link to Practice](https://campus.datacamp.com/courses/r-short-and-sweet/hello-r?ex=2)\n```\n\n#### WeChat Screenshot\n\n**Input**: ![test_image2.png - A WeChat screenshot](test_docs/test_image2.png)\n\n**Output** (`test_output/test_image2.md`):\n```markdown\n# WeChat Transcript\n\n**Sender:** User 1  \n\u003e Can't believe I'm using a random WeChat history generator to create a test case\n\n**Sender:** User 2  \n\u003e Guess they will never know\n\n**Sender:** User 1  \n\u003e yea alright\n```\n\n\n### 2. PDF Processing (Two Methods)\n\n#### Text Extraction (Default)\n\n**Input**: [sample_pdf.pdf](test_docs/sample_pdf.pdf)\n\n**Output** (`test_output/sample_pdf_noVision.md`):\n\nAs you can tell, PDF is a really tricky format to process. The output is not very clean, and the formatting is not preserved.\n\n```markdown\n## Page 1\n\nMarch 5, 2025\nQwen2.5-VL Technical Report\nQwen Team, Alibaba Group\nhttps://chat.qwenlm.aihttps://huggingface.co/Qwen\nhttps://modelscope.cn/organization/qwenhttps://github.com/QwenLM/Qwen2.5-VL\nAbstract\nWe introduce Qwen2.5-VL, the latest ﬂagship model of Qwen vision-language series,\nwhich demonstrates signiﬁcant advancements in both foundational capabilities and\ninnovative functionalities. 
Qwen2.5-VL achieves a major leap forward in understanding\nand interacting with the world through enhanced visual recognition, precise object local-\nization, robust document parsing, and long-video comprehension. A standout feature of\nQwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It\nprovides robust structured data extraction from invoices, forms, and tables, as well as\ndetailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-\nVL introduces dynamic resolution processing and absolute time encoding, enabling it\nto process images of varying sizes and videos of extended durations (up to hours) with\nsecond-level event localization. This allows the model to natively perceive spatial scales\nand temporal dynamics without relying on traditional normalization techniques. By\ntraining a native dynamic-resolution Vision Transformer (ViT) from scratch and incorpo-\nrating Window Attention, we have signiﬁcantly reduced computational overhead while\nmaintaining native resolution. As a result, Qwen2.5-VL excels not only in static image\nand document understanding but also as an interactive visual agent capable of reasoning,\ntool usage, and task execution in real-world scenarios such as operating computers and\nmobile devices. The model achieves strong generalization across domains without requir-\ning task-speciﬁc ﬁne-tuning. Qwen2.5-VL is available in three sizes, addressing diverse\nuse cases from edge AI to high-performance computing. The ﬂagship Qwen2.5-VL-72B\nmodel matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly\nexcelling in document and diagram understanding. The smaller Qwen2.5-VL-7B and\nQwen2.5-VL-3B models outperform comparable competitors, offering strong capabilities\neven in resource-constrained environments. 
Additionally, Qwen2.5-VL maintains robust\nlinguistic performance, preserving the core language competencies of the Qwen2.5 LLM.\n1arXiv:2502.13923v1  [cs.CV]  19 Feb 2025\n\n## Page 2\n\n1Introduction\nLarge vision-language models ( LVLMs ) ( OpenAI ,2024;Anthropic ,2024a ;Team et al. ,2023;Wang et al. ,\n2024f ) represent a pivotal breakthrough in artiﬁcial intelligence, signaling a transformative approach to\nmultimodal understanding and interaction. By seamlessly integrating visual perception with natural\nlanguage processing, these advanced models are fundamentally reshaping how machines interpret and\nanalyze complex information across diverse domains. Despite signiﬁcant advancements in multimodal\nlarge language models, the current capabilities of these models can be likened to the middle layer of a\nsandwich cookie—competent across various tasks but falling short of exceptional performance. Fine-\ngrained visual tasks form the foundational layer of this analogy. In this iteration of Qwen2.5-VL, we\nare committed to exploring ﬁne-grained perception capabilities, aiming to establish a robust foundation\nfor LVLMs and create an agentic ampliﬁer for real-world applications. The top layer of this framework\nis multi-modal reasoning, which is enhanced by leveraging the latest Qwen2.5 LLM and employing\nmulti-modal QA data construction.\nA spectrum of works have promoted the development of multimodal large models, characterized by\narchitectural design, visual input processing, and data curation. One of the primary drivers of progress\nin LVLMs is the continuous innovation in architecture. The studies presented in ( Alayrac et al. ,2022;\nLi et al. ,2022a ;2023b ;Liu et al. ,2023b ;a;Wang et al. ,2024i ;Zhang et al. ,2024b ;Wang et al. ,2023) have\nincrementally shaped the current paradigm, which typically consists of a visual encoder, a cross-modal\nprojector, and LLM. Fine-grained perception models have emerged as another crucial area. Models like\n(Xiao et al. 
,2023;Liu et al. ,2023c ;Ren et al. ,2024;Zhang et al. ,2024a ;d;Peng et al. ,2023;Deitke et al. ,\n2024) have pushed the boundaries of what is possible in terms of detailed visual understanding. The\narchitectures of Omni ( Li et al. ,2024g ;2025b ;Ye et al. ,2024) and MoE ( Riquelme et al. ,2021;Lee et al. ,\n2024;Li et al. ,2024h ;c;Wu et al. ,2024b ) also inspire the future evolution of LVLMs. Enhancements in\nvisual encoders ( Chen et al. ,2023;Liu et al. ,2024b ;Liang et al. ,2025) and resolution scaling ( Li et al. ,\n2023c ;Ye et al. ,2023;Li et al. ,2023a ) have played a pivotal role in improving the quality of practical\nvisual understanding. Curating data with more diverse scenarios and higher-quality is an essential step\nin training advanced LVLMs. The efforts proposed in ( Guo et al. ,2024;Chen et al. ,2024d ;Liu et al. ,2024a ;\nChen et al. ,2024a ;Tong et al. ,2024;Li et al. ,2024a ) are highly valuable contributions to this endeavor.\nHowever, despite their remarkable progress, vision-language models currently face developmental\nbottlenecks, including computational complexity, limited contextual understanding, poor ﬁne-grained\nvisual perception, and inconsistent performance across varied sequence length.\nIn this report, we introduce the latest work Qwen2.5-VL, which continues the open-source philosophy of\nthe Qwen series, achieving and even surpassing top-tier closed-source models on various benchmarks.\nTechnically, our contributions are four-folds: (1) We implement window attention in the visual encoder to\noptimize inference efﬁciency; (2) We introduce dynamic FPS sampling, extending dynamic resolution to\nthe temporal dimension and enabling comprehensive video understanding across varied sampling rates;\n(3) We upgrade MRoPE in the temporal domain by aligning to absolute time, thereby facilitating more\nsophisticated temporal sequence learning; (4) We make signiﬁcant efforts in curating high-quality data\nfor both pre-training and 
supervised ﬁne-tuning, further scaling the pre-training corpus from 1.2 trillion\ntokens to 4.1 trillion tokens.\nThe sparkling characteristics of Qwen2.5-VL are as follows:\n•Powerful document parsing capabilities: Qwen2.5-VL upgrades text recognition to omni-\ndocument parsing, excelling in processing multi-scene, multilingual, and various built-in (hand-\nwriting, tables, charts, chemical formulas, and music sheets) documents.\n•Precise object grounding across formats: Qwen2.5-VL unlocks improved accuracy in detecting,\npointing, and counting objects, accommodating absolute coordinate and JSON formats for\nadvanced spatial reasoning.\n•Ultra-long video understanding and ﬁne-grained video grounding: Our model extends native\ndynamic resolution to the temporal dimension, enhancing the ability to understand videos lasting\nhours while extracting event segments in seconds.\n•Enhanced agent Functionality for computer and mobile devices: Leverage advanced grounding,\nreasoning, and decision-making abilities, boosting the model with superior agent functionality\non smartphones and computers.\n2\n```\n\n#### Vision Processing (For Scanned Documents)\n\n**Output** (`test_output/sample_pdf_vision.md`):\n\nWith the superb document parsing capability of Qwen2.5 VL, the output is much cleaner, and the original structure is preserved.\n\n```markdown\n# Qwen2.5-VL Technical Report\n\n**Qwen Team, Alibaba Group**\n\n🔗 https://chat.qwenlm.ai  \n🤖 https://huggingface.co/Qwen  \n🌐 https://modelscope.cn/organization/qwen  \n🐙 https://github.com/QwenLM/Qwen2.5-VL\n\n## Abstract\n\nWe introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. 
A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we have significantly reduced computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. The model achieves strong generalization across domains without requiring task-specific fine-tuning. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. The smaller Qwen2.5-VL-7B and Qwen2.5-VL-3B models outperform comparable competitors, offering strong capabilities even in resource-constrained environments. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.\n\n![Performance comparison of Qwen2.5-VL models against other leading models](image.png)\n\nThe figure above shows a comparative analysis of the performance metrics for various Qwen2.5-VL models alongside other prominent models. 
Each slice represents different evaluation criteria, highlighting the superior performance of Qwen2.5-VL across multiple dimensions.\n\n# 1 Introduction\n\nLarge vision-language models (LVLMs) (OpenAI, 2024; Anthropic, 2024a; Team et al., 2023; Wang et al., 2024f) represent a pivotal breakthrough in artificial intelligence, signaling a transformative approach to multimodal understanding and interaction. By seamlessly integrating visual perception with natural language processing, these advanced models are fundamentally reshaping how machines interpret and analyze complex information across diverse domains. Despite significant advancements in multimodal large language models, the current capabilities of these models can be likened to the middle layer of a sandwich cookie—competent across various tasks but falling short of exceptional performance. Fine-grained visual tasks form the foundational layer of this analogy. In this iteration of Qwen2.5-VL, we are committed to exploring fine-grained perception capabilities, aiming to establish a robust foundation for LVLMs and create an agentic amplifier for real-world applications. The top layer of this framework is multi-modal reasoning, which is enhanced by leveraging the latest Qwen2.5 LLM and employing multi-modal QA data construction.\n\nA spectrum of works has promoted the development of multimodal large models, characterized by architectural design, visual input processing, and data curation. One of the primary drivers of progress in LVLMs is the continuous innovation in architecture. The studies presented in (Alayrac et al., 2022; Li et al., 2022a; 2023b; Liu et al., 2023b;a; Wang et al., 2024; Zhang et al., 2024b; Wang et al., 2023) have incrementally shaped the current paradigm, which typically consists of a visual encoder, a cross-modal projector, and LLM. Fine-grained perception models have emerged as another crucial area. 
Models like (Xiao et al., 2023; Liu et al., 2023c; Ren et al., 2024; Zhang et al., 2024a;d; Peng et al., 2023; Deitke et al., 2024) have pushed the boundaries of what is possible in terms of detailed visual understanding. The architectures of Omni (Li et al., 2024g; 2025b; Ye et al., 2024) and MoE (Riquelme et al., 2021; Lee et al., 2024; Li et al., 2024h;c; Wu et al., 2024b) also inspire the future evolution of LVLMs. Enhancements in visual encoders (Chen et al., 2023; Liu et al., 2024b; Liang et al., 2025) and resolution scaling (Li et al., 2023c; Ye et al., 2023; Li et al., 2023a) have played a pivotal role in improving the quality of practical visual understanding. Curating data with more diverse scenarios and higher-quality is an essential step in training advanced LVLMs. The efforts proposed in (Guo et al., 2024; Chen et al., 2024d; Liu et al., 2024a; Chen et al., 2024a; Tong et al., 2024; Li et al., 2024a) are highly valuable contributions to this endeavor.\n\nHowever, despite their remarkable progress, vision-language models currently face developmental bottlenecks, including computational complexity, limited contextual understanding, poor fine-grained visual perception, and inconsistent performance across varied sequence length.\n\nIn this report, we introduce the latest work Qwen2.5-VL, which continues the open-source philosophy of the Qwen series, achieving and even surpassing top-tier closed-source models on various benchmarks. 
Technically, our contributions are four-folds: (1) We implement window attention in the visual encoder to optimize inference efficiency; (2) We introduce dynamic FPS sampling, extending dynamic resolution to the temporal dimension and enabling comprehensive video understanding across varied sampling rates; (3) We upgrade MRoPE in the temporal domain by aligning to absolute time, thereby facilitating more sophisticated temporal sequence learning; (4) We make significant efforts in curating high-quality data for both pre-training and supervised fine-tuning, further scaling the pre-training corpus from 1.2 trillion tokens to 4.1 trillion tokens.\n\nThe sparkling characteristics of Qwen2.5-VL are as follows:\n\n- **Powerful document parsing capabilities:** Qwen2.5-VL upgrades text recognition to omni-document parsing, excelling in processing multi-scene, multilingual, and various built-in (handwriting, tables, charts, chemical formulas, and music sheets) documents.\n- **Precise object grounding across formats:** Qwen2.5-VL unlocks improved accuracy in detecting, pointing, and counting objects, accommodating absolute coordinate and JSON formats for advanced spatial reasoning.\n- **Ultra-long video understanding and fine-grained video grounding:** Our model extends native dynamic resolution to the temporal dimension, enhancing the ability to understand videos lasting hours while extracting event segments in seconds.\n- **Enhanced agent Functionality for computer and mobile devices:** Leverage advanced grounding, reasoning, and decision-making abilities, boosting the model with superior agent functionality on smartphones and computers.\n```\n\n### 3. 
Office Document Processing\n\n#### Excel Spreadsheets (XLSX)\n**Input**: [sample_excel.xlsx](test_docs/sample_excel.xlsx)\n\n**Output** (`test_output/sample_excel.md`):\n```markdown\n# Excel Document: sample_excel\n\n\n## Sheet: sample_excel (9 rows × 28 columns)\n\n|   gvkey | datadate            |   fyear | indfmt   | consol   | popsrc   | datafmt   | tic   |   ajex | curcd   |   fyr |   apdedate |   fdate |   pdate |     act |      at |     che |    csho |     dlc |    dltt |     lct |      ni |   oancf |   utfdoc |    cik | costat   |   prcc_f |      gsubind |\n|--------:|:--------------------|--------:|:---------|:---------|:---------|:----------|:------|-------:|:--------|------:|-----------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|---------:|-------:|:---------|---------:|-------------:|\n|    1000 | 1977-12-31 00:00:00 |    1977 | INDL     | C        | D        | STD       | AE.2  |      1 | USD     |    12 |        nan |     nan |     nan |  23.548 |  44.025 |   1.303 |   2.226 |   0.533 |  18.116 |   8.236 |   1.928 |     nan |      nan |    nan | I        |    9.25  | nan          |\n|    1001 | 1978-12-31 00:00:00 |    1978 | INDL     | C        | D        | STD       | AMFD. |      1 | USD     |    12 |        nan |     nan |     nan | nan     | nan     | nan     | nan     | nan     | nan     | nan     | nan     |     nan |      nan | 723576 | I        |  nan     |   2.5301e+07 |\n|    1001 | 1979-12-31 00:00:00 |    1979 | INDL     | C        | D        | STD       | AMFD. |      1 | USD     |    12 |        nan |     nan |     nan | nan     | nan     | nan     | nan     | nan     | nan     | nan     | nan     |     nan |      nan | 723576 | I        |  nan     |   2.5301e+07 |\n|    1001 | 1980-12-31 00:00:00 |    1980 | INDL     | C        | D        | STD       | AMFD. 
|      1 | USD     |    12 |        nan |     nan |     nan | nan     | nan     | nan     | nan     | nan     | nan     | nan     | nan     |     nan |      nan | 723576 | I        |  nan     |   2.5301e+07 |\n|    1001 | 1981-12-31 00:00:00 |    1981 | INDL     | C        | D        | STD       | AMFD. |      1 | USD     |    12 |        nan |     nan |     nan | nan     | nan     | nan     | nan     | nan     | nan     | nan     | nan     |     nan |      nan | 723576 | I        |  nan     |   2.5301e+07 |\n|    1001 | 1982-12-31 00:00:00 |    1982 | INDL     | C        | D        | STD       | AMFD. |      1 | USD     |    12 |        nan |     nan |     nan | nan     | nan     | nan     | nan     | nan     | nan     | nan     | nan     |     nan |      nan | 723576 | I        |  nan     |   2.5301e+07 |\n|    1001 | 1983-12-31 00:00:00 |    1983 | INDL     | C        | D        | STD       | AMFD. |      1 | USD     |    12 |        nan |     nan |     nan |   4.807 |  14.08  |   4.28  |   3.568 |   0.52  |   4.344 |   1.913 |   1.135 |     nan |      nan | 723576 | I        |    7.25  |   2.5301e+07 |\n|    1001 | 1984-12-31 00:00:00 |    1984 | INDL     | C        | D        | STD       | AMFD. |      1 | USD     |    12 |        nan |     nan |     nan |   2.789 |  16.267 |   1.986 |   3.568 |   0.597 |   4.181 |   2.767 |   1.138 |     nan |      nan | 723576 | I        |    3.75  |   2.5301e+07 |\n|    1001 | 1985-12-31 00:00:00 |    1985 | INDL     | C        | D        | STD       | AMFD. 
|      1 | USD     |    12 |        nan |     nan |     nan |   3.852 |  39.495 |   2.787 |   3.988 |   8.336 |  11.908 |  13.922 |   2.576 |     nan |      nan | 723576 | I        |   10.125 |   2.5301e+07 |\n```\n\n#### Word Documents (DOCX)\n\n**Input**: [sample_docx.docx](test_docs/sample_docx.docx)\n\n**Output** (`test_output/sample_docx.md`):\n```markdown\n# This is a Level 1 Heading\n\n## This is a Level 2 Heading\n\n### This is a Level 3 Heading\n\nThis is normal text with a simple table below:\n\n| Col1 | Col2 | Col3 |\n| --- | --- | --- |\n| abc | abc | abc |\n```\n\n#### PowerPoint Presentations (PPTX)\n\n**Input**: [sample_pptx.pptx](test_docs/sample_pptx.pptx)\n\n**Output** (`test_output/sample_pptx.md`):\n```markdown\n# sample\n\n## Slide 1\n\n### This is a Sample Slide Deck\n\nHail the Almighty MarkEverythingDown\n---\n```\n\n\n## Project Structure\n\n```\nMarkEverythingDown/\n├── main.py               # Entry point\n├── ui/\n│   └── app.py            # Gradio UI implementation\n├── processors/\n│   ├── base.py           # Base processor classes\n│   ├── text/             # Text-based processors\n│   └── vision/           # Image/PDF vision processors\n├── test_docs/            # Example documents\n│   ├── sample.pdf\n│   ├── sample.docx\n│   └── ...\n├── test_output/          # Example processed results\n│   ├── sample_pdf_vision.md\n│   ├── sample_docx.md\n│   └── ...\n└── requirements.txt      # Dependencies\n```\n\n\n## License\n\nMIT License\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request or open an Issue.\n\n## Acknowledgements\n\nI was inspired to create this project when randomly browsing Andrej Karpathy's X and I saw this tweet:\n\n```text\nIt's 2025 and most content is still written for humans instead of LLMs. 99.9% of attention is about to be LLM attention, not human attention.\n\nE.g. 
99% of libraries still have docs that basically render to some pretty .html static pages assuming a human will click through them. In 2025 the docs should be a single your_project.md text file that is intended to go into the context window of an LLM.\n\nRepeat for everything.\n```\n\nSo I thought, why not create a tool that can convert any document into an LLM-friendly format? And here we are!\n\nIn addition, this project would not be possible without the amazing work of the Qwen team and the open-source community. Special thanks to the developers of the Qwen2.5 VL models and the various libraries used in this project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froffys%2Fmarkeverythingdown","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Froffys%2Fmarkeverythingdown","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froffys%2Fmarkeverythingdown/lists"}