https://github.com/michaelfranzl/docx_converter
Ruby gem converting Word docx files into html or LaTeX via the Kramdown syntax
- Host: GitHub
- URL: https://github.com/michaelfranzl/docx_converter
- Owner: michaelfranzl
- Created: 2013-12-29T20:35:18.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2021-08-30T12:54:00.000Z (about 3 years ago)
- Last Synced: 2024-07-17T06:05:21.816Z (4 months ago)
- Topics: docx-to-markdown, kramdown, ruby
- Language: Ruby
- Homepage:
- Size: 18.6 KB
- Stars: 34
- Watchers: 4
- Forks: 16
- Open Issues: 1
Metadata Files:
- Readme: README.md
docx-converter
=================

This Ruby library (gem) parses and translates `.docx` Word documents into kramdown syntax, which allows for easy subsequent translation into `html` or `TeX` code via the excellent `kramdown` library. `kramdown` is a superset of `Markdown`. See http://kramdown.gettalong.org/ for more details.
A `.docx` file as written by modern versions of Microsoft Office is just a `.zip` file in disguise. It contains a directory tree of XML files. Parsing these compressed XML trees is rather straightforward, thanks to the `zip` and `nokogiri` Ruby libraries.
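To make the XML structure concrete, here is a minimal sketch that pulls paragraph text out of a WordprocessingML fragment. It is not the gem's parser: it uses REXML from the standard Ruby distribution as a stand-in for `nokogiri` so that it runs without extra gems, and the XML string is a hypothetical, heavily trimmed substitute for the `word/document.xml` entry that would normally be read out of the archive. The names `DOCUMENT_XML`, `W_NS` and `paragraph_texts` are made up for this illustration.

```ruby
require 'rexml/document'

# Hypothetical, trimmed stand-in for word/document.xml inside a .docx archive.
DOCUMENT_XML = <<~XML
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
      <w:p>
        <w:r><w:t xml:space="preserve">Second </w:t></w:r>
        <w:r><w:t>paragraph.</w:t></w:r>
      </w:p>
    </w:body>
  </w:document>
XML

# Namespace mapping for the "w:" prefix used in the XPath expressions.
W_NS = { 'w' => 'http://schemas.openxmlformats.org/wordprocessingml/2006/main' }.freeze

# Returns one string per <w:p> element, joining the text of all its <w:t> runs.
def paragraph_texts(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//w:p', W_NS).map do |paragraph|
    REXML::XPath.match(paragraph, './/w:t', W_NS).map(&:text).join
  end
end

puts paragraph_texts(DOCUMENT_XML)
```

A paragraph (`w:p`) holds runs (`w:r`), and each run holds text (`w:t`); a converter walks this tree and emits the matching kramdown for each element it recognizes.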
`docx-converter` contains a parser which translates all common Word document entities into corresponding `kramdown` syntax. It extracts images and converts them into `.jpg` files with a maximum width or height of 800 pixels.
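The 800-pixel limit can be illustrated with a little arithmetic. The sketch below only computes target dimensions; it does not resize anything and is not taken from the gem. The helper name `fit_within` is hypothetical, and leaving smaller images unchanged is an assumption, since the text above does not say whether small images are scaled up.

```ruby
MAX_EDGE = 800

# Returns [width, height] scaled so that the longer edge is at most max_edge,
# preserving the aspect ratio. Images that already fit are returned unchanged
# (an assumption made for this sketch).
def fit_within(width, height, max_edge = MAX_EDGE)
  longest = [width, height].max
  return [width, height] if longest <= max_edge

  [(width * max_edge / longest.to_f).round,
   (height * max_edge / longest.to_f).round]
end

p fit_within(1600, 1200) # landscape image, longer edge is the width
p fit_within(600, 2400)  # portrait image, longer edge is the height
p fit_within(640, 480)   # already small enough
```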
Output files and directories will be created according to the `webgen` conventions. This is useful when you want to generate a static website with the `webgen` gem after you have converted your `.docx` file into `html`. The file naming is in the format `ss.nnnn.ll.page`, where `ss` is a 2-digit sort number, `nnnn` is the main file name, `ll` is the language code. For more information on `webgen` see http://webgen.gettalong.org/
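As a small illustration of the `ss.nnnn.ll.page` pattern, the following sketch assembles such a file name. The helper `webgen_page_name` is hypothetical and not part of the gem's API; it only shows how the three parts fit together.

```ruby
# Builds a file name of the form ss.nnnn.ll.page:
#   ss   = 2-digit, zero-padded sort number
#   nnnn = main file name
#   ll   = language code
def webgen_page_name(sort_number, name, language)
  format('%02d.%s.%s.page', sort_number, name, language)
end

puts webgen_page_name(1, 'chapter01', 'en')  # prints 01.chapter01.en.page
puts webgen_page_name(12, 'appendix', 'de')  # prints 12.appendix.de.page
```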
Supported Word elements:
* Paragraph
* Line break
* Page break
* Bold
* Italic
* Paragraph styles "Heading1", "Heading2" and "Title"
* Character styles "Strong" and "Quote"
* Footnotes
* Tables
* Images including captions
* Non-breaking spaces

Installation
----------

On Debian Linux:

`apt-get install libmagic-dev`

`apt-get install libmagickwand-dev`

`gem install docx_converter`

Look into the .gemspec file to see all gem dependencies.
Installation may vary on other operating systems.
Usage
----------

From the command line:
`docx-converter inputfile format output_directory`

`format` can be either `kramdown`, `html` or `latex`. For example:

`docx-converter ~/Downloads/testdoc1.docx latex /tmp/docxoutput`
`output_directory` will be created if it doesn't exist. A subdirectory `/src` will be created inside it by default, which merely follows the `webgen` file system convention.
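When the command line tool is driven from another Ruby script, the argument order above can be captured in a small helper. This is only a sketch: `docx_converter_command` and `FORMATS` are hypothetical names, and the helper merely builds the argument list for `system` without running anything. Note that `~` is expanded by the shell, so the sketch expands the input path itself, because the array form of `system` bypasses the shell.

```ruby
# The three formats listed above.
FORMATS = %w[kramdown html latex].freeze

# Returns the argument list for the command line tool in the documented
# order: inputfile, format, output_directory.
def docx_converter_command(inputfile, format, output_directory)
  unless FORMATS.include?(format)
    raise ArgumentError, "format must be one of #{FORMATS.join(', ')}"
  end

  ['docx-converter', File.expand_path(inputfile), format, output_directory]
end

command = docx_converter_command('/tmp/testdoc1.docx', 'latex', '/tmp/docxoutput')
puts command.join(' ')

# To actually run the conversion (requires the gem to be installed):
# system(*command)
```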
If you want to use `docx_converter` from a Ruby script, you can use the API like this:

    r = DocxConverter::Render.new(options)
    rendered_filepaths = r.render(:html)

`options` is a hash with the following keys:

* `:output_dir`: The directory to be created for the output files. A subdirectory `/src` will be created inside it by default, which merely follows the `webgen` file system convention.
* `:inputfile`: The path to the `.docx` file to be parsed
* `:image_subdir_filesystem`: The subdirectory name into which images will be put. It will be created below the `/src` subdirectory.
* `:image_subdir_kramdown`: Usually this is identical to `:image_subdir_filesystem` and should only be different when you do further manual postprocessing with the kramdown output. This string will be added as a prefix for images in the final kramdown output. An example: `![image description](/image_subdir_kramdown/imagename.jpg)`.
* `:language`: The language to be used for the generated file names. See `webgen` conventions above.
* `:split_chapters`: when `true`, the output files will be split between headings which have the Word paragraph style "Heading1". This is useful for large documents. When `false`, no splitting is done and all content will be output to the file `01.chapter01.ll.page`. Footnotes will be split correctly into the various chapters.
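Putting the keys above together, a complete `options` hash might look like the following sketch. Only the keys and the `DocxConverter::Render` calls come from this README; every value, the helper name `converter_options`, and its defaults are hypothetical. The call into the gem is guarded so that the snippet also runs when the gem has not been loaded.

```ruby
# Builds an options hash with all six documented keys.
# All values here are illustrative assumptions, not defaults of the gem.
def converter_options(inputfile, output_dir, language: 'en', split_chapters: true)
  {
    inputfile: inputfile,
    output_dir: output_dir,
    image_subdir_filesystem: 'images',
    image_subdir_kramdown: 'images', # usually identical to the filesystem name
    language: language,
    split_chapters: split_chapters
  }
end

options = converter_options('/tmp/testdoc1.docx', '/tmp/docxoutput')

# Only call into the gem if it has already been loaded by the calling script.
if defined?(DocxConverter)
  renderer = DocxConverter::Render.new(options)
  rendered_filepaths = renderer.render(:html)
  puts rendered_filepaths
end
```

With `image_subdir_kramdown` set to `images`, image references in the kramdown output would carry that prefix, following the pattern shown for `:image_subdir_kramdown` above.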