{"id":24600091,"url":"https://github.com/jparkerweb/down-craft","last_synced_at":"2025-03-18T06:41:34.080Z","repository":{"id":270045508,"uuid":"909138857","full_name":"jparkerweb/down-craft","owner":"jparkerweb","description":"📑 npm package to Craft files into Markdown with ease","archived":false,"fork":false,"pushed_at":"2025-01-03T14:48:21.000Z","size":18256,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-28T23:50:44.085Z","etag":null,"topics":["converter","docx","markdown","nodejs","npm","ocr","pdf","pptx","vllm","xlsx"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/down-craft","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jparkerweb.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-27T20:43:10.000Z","updated_at":"2025-02-08T18:21:00.000Z","dependencies_parsed_at":"2024-12-28T01:17:49.715Z","dependency_job_id":"9f0446da-989b-4901-a6d0-99ed8b75036a","html_url":"https://github.com/jparkerweb/down-craft","commit_stats":null,"previous_names":["jparkerweb/down-craft"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jparkerweb%2Fdown-craft","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jparkerweb%2Fdown-craft/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jparkerweb%2Fdown-craft/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/jparkerweb%2Fdown-craft/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jparkerweb","download_url":"https://codeload.github.com/jparkerweb/down-craft/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244173494,"owners_count":20410295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["converter","docx","markdown","nodejs","npm","ocr","pdf","pptx","vllm","xlsx"],"created_at":"2025-01-24T13:18:55.461Z","updated_at":"2025-03-18T06:41:34.052Z","avatar_url":"https://github.com/jparkerweb.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 📑 Down Craft\n\nNode.js package to simplify the process of converting documents (PDF, DOCX, PPTX, and XLSX) into Markdown format. \nIt uses `tesseract.js`, `mammoth`, `pdf.js`, and `turndown` to convert documents to Markdown format. 
For PDFs, it also provides an option to use vLLMs (Vision Large Language Models) for advanced OCR capabilities (using the OpenAI API).\n\n![down-craft](https://raw.githubusercontent.com/jparkerweb/down-craft/main/down-craft.jpg)\n\n## Online Web Demo\nhttps://down-craft.dyndns.org/\n\n## Installation\n\n```bash\nnpm install down-craft\n```\n\n## Usage\n\n```javascript\nimport { downCraft } from 'down-craft';\nimport fs from 'fs/promises';\n\nasync function example() {\n  // Read file buffer\n  const fileBuffer = await fs.readFile('document.docx');\n  \n  // Convert to markdown (pass file buffer and file type)\n  const markdown = await downCraft(fileBuffer, 'docx');\n  \n  console.log(markdown);\n}\n```\n\n## Supported File Types\n\n- PDF (.pdf)\n- Microsoft Word (.docx)\n- Microsoft PowerPoint (.pptx)\n- Microsoft Excel (.xlsx)\n\n## API\n\n### downCraft(fileBuffer, fileType?, options?)\n\nConverts a document buffer to markdown format.\n\n- `fileBuffer` (Buffer): The document buffer to convert\n- `fileType` (string, optional): File type ('pdf', 'docx', 'pptx', 'xlsx'). If not provided, the package attempts to auto-detect the file type.\n- `options` (Object, optional): Conversion options\n  - `pdfConverterType` (string, optional): Converter to use for PDF files ('standard' | 'llm' | 'ocr'). 
Default: 'standard'\n  - `llmParams` (Object, required for 'llm' converter): LLM configuration\n    - `baseURL` (string): Base URL for the LLM API\n    - `apiKey` (string): API key for the LLM service\n    - `model` (string): Model to use for OCR\n    - `systemPrompt` (string, optional): System prompt for the LLM (see `.env.example` for the default)\n    - `userPrompt` (string, optional): User prompt for the LLM (see `.env.example` for the default)\n    - `temperature` (number, optional): Temperature for the LLM (default: 0)\n\nReturns: Promise\u003cstring\u003e - The markdown content\n\n\n#### PDF Conversions\n\n- **Standard**: Extracts text using standard techniques (images are ignored).\n- **vLLM**: Uses a vLLM-based OCR model to extract text from PDFs (high fidelity, but much slower and requires an LLM API endpoint).\n- **OCR**: Uses Tesseract.js for OCR (results are less accurate, but faster than using vLLM).\n\n## Special Features\n\n### vLLM-based PDF Conversion\n\nFor PDFs that require advanced OCR capabilities, you can use the vLLM converter:\n\n```javascript\nconst markdown = await downCraft(pdfBuffer, 'pdf', {\n  pdfConverterType: 'llm',\n  llmParams: {\n    baseURL: 'https://api.llm-service.com',\n    apiKey: 'your-api-key',\n    model: 'your-model-name'\n  }\n});\n```\n\nThis converter:\n- Extracts embedded images from the PDF\n- Converts PDF pages to high-quality images\n- Uses vLLM-based OCR for accurate text extraction\n- Automatically cleans up temporary files\n\nThe `llmParams` object will attempt to read `baseURL`, `apiKey`, and `model` from environment variables if you have them defined.\nSee the `.env.example` file for sample settings, including how to define your own user/system prompts and values for various LLM providers and models.\n\n## License\n\nThis package is licensed under the Apache 2.0 license.  
\nSee LICENSE for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjparkerweb%2Fdown-craft","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjparkerweb%2Fdown-craft","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjparkerweb%2Fdown-craft/lists"}