{"id":32541116,"url":"https://github.com/fluidinference/mlx-mdx","last_synced_at":"2025-10-28T15:57:45.088Z","repository":{"id":319401256,"uuid":"1078540668","full_name":"FluidInference/mlx-mdx","owner":"FluidInference","description":"Website, pdfs, images, documents to Markdown. Powered by MLX models. Completely local and free ","archived":false,"fork":false,"pushed_at":"2025-10-19T02:03:34.000Z","size":2626,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-19T02:07:31.328Z","etag":null,"topics":["markdown","mlx","parser","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FluidInference.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-17T22:53:49.000Z","updated_at":"2025-10-19T02:03:38.000Z","dependencies_parsed_at":"2025-10-19T02:07:38.122Z","dependency_job_id":"c94f0a70-44e9-4ae4-9352-14861bc72f9c","html_url":"https://github.com/FluidInference/mlx-mdx","commit_stats":null,"previous_names":["fluidinference/mlx-mdx"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/FluidInference/mlx-mdx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2Fmlx-mdx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2Fmlx-mdx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/FluidInference%2Fmlx-mdx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2Fmlx-mdx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FluidInference","download_url":"https://codeload.github.com/FluidInference/mlx-mdx/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FluidInference%2Fmlx-mdx/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281467277,"owners_count":26506462,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-28T02:00:06.022Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["markdown","mlx","parser","web-crawler"],"created_at":"2025-10-28T15:57:40.331Z","updated_at":"2025-10-28T15:57:45.067Z","avatar_url":"https://github.com/FluidInference.png","language":"Python","readme":"\u003cp align=\"left\"\u003e\n  \u003cimg src=\"banner.png\" alt=\"mlx-markdown banner\" width=\"360\" height=\"240\"\u003e\n\u003c/p\u003e\n\n# mlx-markdown\n\n[![Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289da.svg)](https://discord.gg/WNsvaCtmDe)\n\n`mlx-markdown` is a modern take on tools like `turndown`, making it easy to convert documents and websites into Markdown that LLMs love. 
`mlx-markdown` is powered by small MLX models that run on any Apple Silicon device, and it's flexible enough to swap in other MLX models or even PyTorch models. Everything runs locally, so you can customize prompts, swap in different models, and adapt the flow for your publishing targets. Contributions and experiments are welcome—the CLI is intentionally small and hackable.\n\nThis will not match cloud-level performance, but it gets you roughly 70-80% of the way there for quick use cases.\n\nThe resulting Markdown stays LLM-ready—headings, tables, and figure takeaways are explicit—so you can feed it straight into retrieval pipelines or even wire `mlx-mdx` up as an MCP tool to auto-parse documents on demand. This is particularly useful because not all coding agents can read or understand PDFs or deal with raw HTML.\n\n**Before:**\n\n\u003cimg src=\"./mermaid-example.png\" alt=\"Power of AI slide\" width=\"360\"\u003e\n\n**After:**\n\n```mermaid\ngraph LR\n    A[NVIDIA GPU VRAM] --\u003e B[Scenario 1: Both CUDA and Mac devices have enough memory]\n    A --\u003e C[Scenario 2: CUDA device has insufficient memory, but Mac device has enough memory]\n    A --\u003e D[Scenario 3: Mac device has enough memory but quite near the capacity while CUDA device has insufficient memory]\n    B --\u003e E[Apple Silicon Unified Memory]\n    C --\u003e E\n    D --\u003e E\n```\n\nThe graph in a PDF becomes structured text that mermaid can render.\n\n## Key Capabilities\n\n- `crawl`: Render websites with Playwright, isolate the readable article, and rewrite it with `mlx-community/jinaai-ReaderLM-v2`.\n- `document`: Transcribe PDFs, page images, and photo scans with `mlx-community/Nanonets-OCR2-3B-4bit`.\n- Outputs stay portable: assets are downloaded, validated, and relinked alongside 
YAML-front-matter Markdown.\n\n## Requirements\n\n- macOS on Apple Silicon (MLX requirement).\n- Python 3.12 (recommended).\n- [uv](https://docs.astral.sh/uv/) 0.4+ for packaging and tooling.\n- Playwright Chromium binaries for the `crawl` subcommand (install once with `uvx --from playwright python -m playwright install chromium`).\n\n## Quick Start\n\nChoose the path that fits how you want to run the CLI.\n\nIf you're not on [`uv`](https://docs.astral.sh/uv/getting-started/installation/) yet, you're missing out:\n\n`curl -LsSf https://astral.sh/uv/install.sh | sh`\n\n### Run without installing with `uvx`\n\n```bash\nuvx --from git+https://github.com/FluidInference/mlx-mdx.git@v0.0.4 mlx-mdx --help\n```\n\n`uvx` (an alias for `uv tool run`) clones the repository into uv's cache, builds it, and launches the `mlx-mdx` entry point—handy for trying the pipelines without installing anything permanently.\n\n### Install as a uv tool\n\n```bash\nuv tool install --from git+https://github.com/FluidInference/mlx-mdx.git@v0.0.4 mlx-mdx\n\nuv tool run mlx-mdx -- crawl https://ml-explore.github.io/mlx/build/html/index.html --output output/mlx-docs --verbose\n\nuv tool run mlx-mdx -- document examples/2501.14925v2.pdf --output output/mlx-docs --verbose\n```\n\n`uv tool run` ensures the tool executes inside the managed environment even if your shell `PATH` is unaware of `~/.local/bin`. 
Swap `crawl` for `document` to run the OCR pipeline.\n\n## MCP integrations\n\nmlx-mdx can participate in MCP workflows in two ways:\n\n- **One-shot CLI streaming** — run the existing CLI with `--mcp` to emit Markdown on stdout.\n- **Persistent MCP server** — launch the bundled `mlx-mdx-mcp` command so clients can discover the `crawl` and `document` tools without extra arguments.\n\n### Option 1: Stream with the CLI\n\n```bash\nuv tool run mlx-mdx -- crawl \"{{url}}\" --mcp --wait 2.0\nuv tool run mlx-mdx -- document /path/to/file-or-folder --mcp --verbose\n```\n\nConfigure your MCP client to execute the appropriate command (replace `crawl` with `document` for the OCR pipeline) and it will receive the generated Markdown over stdout.\n\n### Option 2: Run the MCP server\n\n```bash\nuv run mlx-mdx-mcp\n```\n\n(This runs the server from the current checkout; once a release includes the new entry point, you can also `uv tool install` the package and call `uv tool run mlx-mdx-mcp`.)\n\nThe server keeps memory usage low by loading MLX models only while a request is active. It exposes two tools:\n\n- `crawl(url)` — render a URL with Playwright and rewrite it as Markdown using default settings.\n- `document(path)` — transcribe PDFs, standalone images, or directories of page captures using default settings.\n\nEach client below supports registering a local stdio command as a custom MCP server. Point the command at `uv run --project /ABS/PATH/TO/mlx-mdx mlx-mdx-mcp` (swap in `uv tool run mlx-mdx-mcp` once a release ships) and both tools will appear automatically. 
For custom parameters (alternative models, longer timeouts, etc.), keep using the CLI streaming mode from option 1.\n\n- **Codex CLI** — in `~/.codex/config.toml`:\n\n  ```toml\n  [mcp_servers.mlx_mdx]\n  command = \"uv\"\n  args = [\"run\", \"--project\", \"/ABS/PATH/TO/mlx-mdx\", \"mlx-mdx-mcp\"]\n  ```\n\n- **Claude Code** — extend `~/Library/Application Support/Claude/claude_desktop_config.json`:\n\n  ```jsonc\n  \"mcpServers\": {\n    \"mlx-mdx\": {\n      \"command\": \"uv\",\n      \"args\": [\"run\", \"--project\", \"/ABS/PATH/TO/mlx-mdx\", \"mlx-mdx-mcp\"]\n    }\n  }\n  ```\n\n- **Cursor** — edit `~/.cursor/mcp.json`:\n\n  ```json\n  {\n    \"mcpServers\": {\n      \"mlx-mdx\": {\n        \"command\": \"uv\",\n        \"args\": [\"run\", \"--project\", \"/ABS/PATH/TO/mlx-mdx\", \"mlx-mdx-mcp\"]\n      }\n    }\n  }\n  ```\n\n- **Zed** — add a custom server in `~/.config/zed/settings.json` (see [Zed’s MCP guide](https://raw.githubusercontent.com/zed-industries/zed/main/docs/src/ai/mcp.md)):\n\n  ```json\n  {\n    \"context_servers\": {\n      \"mlx-mdx\": {\n        \"source\": \"custom\",\n        \"command\": \"uv\",\n        \"args\": [\"run\", \"--project\", \"/ABS/PATH/TO/mlx-mdx\", \"mlx-mdx-mcp\"]\n      }\n    }\n  }\n  ```\n\n- **OpenCode (Factory CLI)** — update `~/.factory/mcp.json`:\n\n  ```json\n  {\n    \"mcpServers\": {\n      \"mlx-mdx\": {\n        \"command\": \"uv\",\n        \"args\": [\"run\", \"--project\", \"/ABS/PATH/TO/mlx-mdx\", \"mlx-mdx-mcp\"]\n      }\n    }\n  }\n  ```\n\nAdd a second entry pointing at the CLI streaming command if you need fine-grained control (custom tokens, timeouts, etc.) alongside the persistent server. Replace `/ABS/PATH/TO/mlx-mdx` with the absolute path to this repository. Once a packaged release is available, you can switch the args back to `\"tool\", \"run\", \"mlx-mdx-mcp\"`.\n\n\n## Usage\n\nThe CLI exposes two focused subcommands. 
For backward compatibility, calling `mlx-mdx \u003curl\u003e` still routes to `crawl`.\n\n### Crawl websites\n\n```bash\nuv tool run mlx-mdx -- crawl https://example.com --output output/example --verbose\n```\n\nKey options:\n\n- `--output` — destination directory for Markdown and images (default: `./output`).\n- `--wait` — seconds to wait after Playwright reports the page is idle (default: `1.0`).\n- `--timeout` — navigation timeout in seconds (default: `30`).\n- `--model` — MLX model identifier (default: `mlx-community/jinaai-ReaderLM-v2`).\n- `--max-html-chars` / `--max-text-chars` — trim limits before passing content to the model.\n- `--verbose` — emit detailed logging, including readability decisions.\n\n### OCR documents or images\n\n```bash\nuv tool run mlx-mdx -- document examples/2501.14925v2.pdf --output output/docs --verbose\n```\n\nAccepts PDFs, standalone images, or directories of page images. Each input becomes `output/documents/\u003cslug\u003e/index.md`.\n\nThe VLM also looks at embedded charts and figures. 
Captions such as “Figure 4 …” are followed by a short `Figure insight:` summary so the Markdown captures the visual takeaway even when the image is absent.\n\nUseful flags:\n\n- `--model` — VLM identifier (default: `mlx-community/Nanonets-OCR2-3B-4bit`).\n- `--max-tokens` — limit Markdown length per page (default: `2048`).\n- `--temperature` — sampling temperature (default: `0.0`).\n- `--pdf-dpi` — render PDFs at this DPI before OCR (default: `200`).\n- `--max-image-side` — clamp the longest edge of page images (default: `2048`).\n- `--system-prompt` — override the OCR system instructions.\n- `--verbose` — emit per-page timings once transcription completes.\n\n## Outputs\n\nWebsite crawls produce:\n\n```text\n\u003coutput\u003e/\u003cdomain\u003e/\u003cslug\u003e/\n  ├─ index.md          # Markdown with YAML front matter\n  └─ images/           # Downloaded images (if any)\n```\n\nDocument transcription produces:\n\n```text\n\u003coutput\u003e/documents/\u003cslug\u003e/\n  └─ index.md          # Markdown assembled from per-page OCR\n```\n\n## Examples\n\nBrowse `examples/` for sample outputs. For a larger knowledge base that uses these pipelines, see [möbius](https://github.com/FluidInference/mobius) and adapt the prompts or publishing recipes for your own content.\n\n### Work from a local checkout\n\n1. `uv sync` — creates `.venv/` with the runtime dependencies.\n2. `uv run python -m playwright install chromium` — downloads the browser used by `crawl`.\n3. `uv run mlx-mdx crawl https://ml-explore.github.io/mlx/build/html/index.html --output output/mlx-docs --verbose`\n4. 
_(Optional)_ `uv tool run ty check --python .venv/bin/python` — mirror the CI static type check.\n\nUse `uv run` for development tasks inside the synced virtual environment.\n\n## Pipeline Overview\n\n```text\ncrawl input (URL)\n  │\n  ▼\nPlaywright renders the page in Chromium (handles client-side rendering)\n  │\n  ▼\nReadability extracts the main article HTML\n  │\n  ▼\nReaderLM (MLX) rewrites the content as clean Markdown (no YAML in generation)\n  │\n  ▼\nImage pipeline downloads, validates, and relinks assets\n  │\n  ▼\nOutputs saved to \u003coutput\u003e/\u003cdomain\u003e/\u003cslug\u003e/index.md (with optional images/)\n```\n\n```text\ndocument input (PDFs, images, or directories)\n  │\n  ▼\nOptional PDF rendering with pypdfium2\n  │\n  ▼\nNanonets OCR (MLX VLM) transcribes each page to structured Markdown\n  │\n  ▼\nMarkdown composer stitches pages with metadata and downloads referenced assets\n  │\n  ▼\nOutputs saved to \u003coutput\u003e/documents/\u003cslug\u003e/index.md\n```\n\n## Operational Notes\n\n- First runs of new models download weights from Hugging Face; subsequent runs reuse the cache.\n- Only common web image formats under 10 MB are saved; files smaller than 512 bytes are skipped.\n- Remove the output directory manually (`rm -rf output`) to clear prior runs.\n- The crawler only processes URLs you provide—it does not follow links or recurse through sites.\n- Document OCR relies on [`mlx-vlm`](https://pypi.org/project/mlx-vlm/). PDF rendering uses `pypdfium2`; if it is missing, reinstall with extras or provide page images directly.\n\n## Disclaimer\n\nPlease follow each site's terms of service. 
You are responsible for respecting rate limits and avoiding bans.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffluidinference%2Fmlx-mdx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffluidinference%2Fmlx-mdx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffluidinference%2Fmlx-mdx/lists"}