https://github.com/judy2k/doc-to-md
Create clean Markdown (including code blocks!) from exported Google Docs.
https://github.com/judy2k/doc-to-md
contentstack markdown python
Last synced: 2 months ago
JSON representation
Create clean Markdown (including code blocks!) from exported Google Docs.
- Host: GitHub
- URL: https://github.com/judy2k/doc-to-md
- Owner: judy2k
- License: apache-2.0
- Created: 2023-12-21T14:51:34.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-24T15:07:45.000Z (about 2 years ago)
- Last Synced: 2026-04-27T22:30:09.895Z (2 months ago)
- Topics: contentstack, markdown, python
- Language: Python
- Homepage:
- Size: 2.86 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Doc-to-Markdown
This is a small command-line utility for converting Google Docs documents to
Markdown that's suitable for pasting into ContentStack.
## What does it actually do?
`doc2md` can convert HTML exported from Google Docs into clean Markdown documents for pasting into ContentStack, with the following features:
- Code!
- Firstly, the script will identify any code blocks (formatted using Fira Code, Roboto Mono, Source Code Pro, or Courier New) and will mark them as code blocks in the resulting Markdown.
- Some basic heuristics are used to annotate code blocks as python code
- Inline code can be correctly identified using backticks (the same as Markdown itself) or formatting (any spans marked with a code font).
- **Code building blocks** are now supported!
- Empty paragraphs are removed
- Hyperlinks are correctly extracted from Google's nasty tracking links.
- Bold and italic formatting is maintained where possible.
- Hide (optionally?) CSS parse warnings.
- Supports tables!
## Not (currently) Supported
- Images - I can't currently think of a good way to make image upload into ContentStack more seamless, without API access to ContentStack itself.
- See the [To Do](#to-do) section.
## Installation
```
python -m pip install --upgrade git+https://github.com/judy2k/doc-to-md.git
# Check that it worked:
doc2md --help
```
## Usage
The tool doesn't have many options, so using it is relatively straightforward.
First, download your Google Doc as a Web Page.

Unzip the archive, and then in the command-line, run something like the following:
```
# Create a new Markdown file from an existing Google Docs HTML file:
doc2md /PATH/TO/INPUT.HTML /PATH/TO/OUTPUT.MD
```
This should produce a clean, formatted Markdown file, suitable for copying into ContentStack.
You will, sadly, still have to import all your images and insert them in the correct locations yourself.
### Tables
Tables are supported by doc2md, and are exported to GFM table format.
This hasn't been _widely_ tested.
If your table doesn't have a header row, then a blank one will be inserted,
which is probably not what you want.
To mark a header row in Google Docs,
hover over the row and click on the pin icon that appears to the left.

## To-Do
- Improve the code that identifies and merges code blocks. ([#3](https://github.com/judy2k/doc-to-md/issues/3)).
- ContentStack doesn't support `--` and `---` so replace them (outside of code blocks!) with n-dash and m-dash characters.
- Resulting Markdown occasionally includes a backslash followed by line-break character. Need to identify why it's happening and fix.
- Is there a way to manage images better?
- Can captions in the doc automatically be applied to the associated image?
- Ensure backticks aren't messed up inside code blocks.
--------
Made with 💚 for my colleagues at MongoDB, by Judy2k.