{"id":13858723,"url":"https://github.com/michaelfranzl/docx_converter","last_synced_at":"2025-07-11T04:31:06.657Z","repository":{"id":12839073,"uuid":"15514610","full_name":"michaelfranzl/docx_converter","owner":"michaelfranzl","description":"Ruby gem converting Word docx files into html or LaTeX via the Kramdown syntax","archived":false,"fork":false,"pushed_at":"2021-08-30T12:54:00.000Z","size":19,"stargazers_count":34,"open_issues_count":1,"forks_count":16,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-09-17T16:45:23.527Z","etag":null,"topics":["docx-to-markdown","kramdown","ruby"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michaelfranzl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-12-29T20:35:18.000Z","updated_at":"2024-07-10T23:05:26.000Z","dependencies_parsed_at":"2022-08-30T11:30:12.911Z","dependency_job_id":null,"html_url":"https://github.com/michaelfranzl/docx_converter","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelfranzl%2Fdocx_converter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelfranzl%2Fdocx_converter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelfranzl%2Fdocx_converter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelfranzl%2Fdocx_converter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michaelfranzl","download_url":"https://codeload.github.
com/michaelfranzl/docx_converter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225674898,"owners_count":17506272,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docx-to-markdown","kramdown","ruby"],"created_at":"2024-08-05T03:02:18.792Z","updated_at":"2024-11-21T04:59:24.356Z","avatar_url":"https://github.com/michaelfranzl.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"docx-converter\n=================\n\nThis Ruby library (gem) parses and translates `.docx` Word documents into kramdown syntax, which allows for easy subsequent translation into `html` or `TeX` code via the excellent `kramdown` library. `kramdown` is a superset of `Markdown`. See http://kramdown.gettalong.org/ for more details.\n\nA `.docx` file as written by modern versions of Microsoft Office is just a `.zip` file in disguise. It contains a directory tree containing XML files. Parsing of these compressed XML trees is rather straightforward, thanks to the `zip` and `nokogiri` Ruby libraries.\n\n`docx-converter` contains a parser which translates all common Word document entities into corresponding `kramdown` syntax. It extracts images and converts them into `.jpg` files with a maximum width or height of 800 pixels.\n\nOutput files and directories will be created according to the `webgen` conventions. This is useful when you want to generate a static website with the `webgen` gem after you have converted your `.docx` file into `html`. 
The file naming is in the format `ss.nnnn.ll.page`, where `ss` is a 2-digit sort number, `nnnn` is the main file name, `ll` is the language code. For more information on `webgen` see http://webgen.gettalong.org/\n\nSupported Word elements:\n\n* Paragraph\n* Line break\n* Page break\n* Bold\n* Italic\n* Paragraph styles \"Heading1\", \"Heading2\" and \"Title\"\n* Character styles \"Strong\" and \"Quote\"\n* Footnotes\n* Tables\n* Images including captions\n* Non-breaking spaces\n\nInstallation\n----------\n\nOn Debian Linux:\n\n`apt-get install libmagic-dev`\n`apt-get install libmagickwand-dev`\n`gem install docx_converter`\n\nLook into the .gemspec file to see all gem dependencies.\n\nInstallation may vary on other operating systems.\n\nUsage\n----------\n\nFrom the command line:\n\n`docx-converter` `inputfile` `format` `output_directory`\n\n`format` can be either `kramdown`, `html` or `latex`. For example:\n\n`docx-converter` `~/Downloads/testdoc1.docx` `latex` `/tmp/docxoutput`\n\n`output_directory` will be created if it doesn't exist. A subdirectory `/src` will be created by default, which is merely a convention to be identical with the `webgen` file system standard.\n\nIf you want to use `docx_converter` from a Ruby script, you can use the API like this:\n\n    r = DocxConverter::Render.new(options)\n    rendered_filepaths = r.render(:html)\n    \n`options` is a hash with the following keys\n\n* `:output_dir`: The directory to be created for the output files. A subdirectory `/src` will be created by default, which is merely a convention to be identical with the `webgen` file system standard.\n* `:inputfile`: The path to the `.docx` file to be parsed\n* `:image_subdir_filesystem`: The subdirectory name into which images will be put. It will be created below the `/src` subdirectory.\n* `:image_subdir_kramdown`: Usually this is identical to `:image_subdir_filesystem` and should only be different when you do further manual postprocessing with the kramdown output. 
This string will be added as a prefix for images in the final kramdown output. An example: `![image description](/image_subdir_kramdown/imagename.jpg)`.\n* `:language`: The language to be used for the generated file names. See `webgen` conventions above.\n* `:split_chapters`: when `true`, the output files will be split between headings which have the Word paragraph style \"Heading1\". This is useful for large documents. When `false`, no splitting is done and all content will be output to the file `01.chapter01.ll.page`. Footnotes will be split correctly into the various chapters.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelfranzl%2Fdocx_converter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichaelfranzl%2Fdocx_converter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelfranzl%2Fdocx_converter/lists"}