{"id":23456393,"url":"https://github.com/dfop02/html4docx","last_synced_at":"2026-02-28T01:55:39.824Z","repository":{"id":220834736,"uuid":"746849702","full_name":"dfop02/html4docx","owner":"dfop02","description":"Convert html to docx","archived":false,"fork":false,"pushed_at":"2024-08-26T14:43:55.000Z","size":39,"stargazers_count":7,"open_issues_count":4,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-03T09:20:56.967Z","etag":null,"topics":["docx","html","html2docx","python","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dfop02.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-22T19:46:51.000Z","updated_at":"2024-08-14T19:33:03.000Z","dependencies_parsed_at":"2024-12-24T04:30:00.667Z","dependency_job_id":"125ce56f-8f41-48de-a4ef-888afd27a318","html_url":"https://github.com/dfop02/html4docx","commit_stats":null,"previous_names":["dfop02/html4docx"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfop02%2Fhtml4docx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfop02%2Fhtml4docx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfop02%2Fhtml4docx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfop02%2Fhtml4docx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dfop02","download_url":"https://codeload.gi
thub.com/dfop02/html4docx/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248809040,"owners_count":21164893,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docx","html","html2docx","python","python3"],"created_at":"2024-12-24T04:28:43.006Z","updated_at":"2026-02-28T01:55:39.799Z","avatar_url":"https://github.com/dfop02.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HTML FOR DOCX\n![Tests](https://github.com/dfop02/html4docx/actions/workflows/tests.yml/badge.svg)\n[![PyPI Downloads](https://static.pepy.tech/personalized-badge/html-for-docx?period=total\u0026units=INTERNATIONAL_SYSTEM\u0026left_color=BLACK\u0026right_color=GREEN\u0026left_text=downloads)](https://pepy.tech/projects/html-for-docx)\n![Version](https://img.shields.io/pypi/v/html-for-docx.svg)\n![Supported Versions](https://img.shields.io/pypi/pyversions/html-for-docx.svg)\n\nConvert HTML to DOCX. This project is a fork of the discontinued [pqzx/html2docx](https://github.com/pqzx/html2docx).\n\n## How to install\n\n`pip install html-for-docx`\n\n## Usage\n\n#### The basic usage\n\nAdd HTML-formatted content to an existing `.docx` document:\n\n```python\nfrom html4docx import HtmlToDocx\n\nparser = HtmlToDocx()\nhtml_string = '\u003ch1\u003eHello world\u003c/h1\u003e'\nparser.add_html_to_document(html_string, filename_docx)\n```\n\nYou can use `python-docx` to manipulate the file directly. Here is an example:\n\n```python\nfrom docx import Document\nfrom html4docx import HtmlToDocx\n\ndocument = Document()\nparser = 
HtmlToDocx()\n\nhtml_string = '\u003ch1\u003eHello world\u003c/h1\u003e'\nparser.add_html_to_document(html_string, document)\n\ndocument.save('your_file_name.docx')\n```\n\nOr incrementally add new HTML to the document and save it when finished. New content is always appended at the end:\n\n```python\nfrom docx import Document\nfrom html4docx import HtmlToDocx\n\ndocument = Document()\nparser = HtmlToDocx()\n\nfor part in ['First', 'Second', 'Third']:\n    parser.add_html_to_document(f'\u003ch1\u003e{part} Part\u003c/h1\u003e', document)\n\nparser.save('your_file_name.docx')\n```\n\nWhen you pass a `Document` object, you can use either `document.save()` from python-docx or `parser.save()` from html4docx; both work well.\n\nBoth support saving in memory, using `BytesIO`.\n\n```python\nfrom io import BytesIO\nfrom docx import Document\nfrom html4docx import HtmlToDocx\n\nbuffer = BytesIO()\ndocument = Document()\nparser = HtmlToDocx()\n\nhtml_string = '\u003ch1\u003eHello world\u003c/h1\u003e'\nparser.add_html_to_document(html_string, document)\n\n# Save the document to the in-memory buffer\nparser.save(buffer)\n\n# If you need to read from the buffer again after saving,\n# you might need to reset its position to the beginning\nbuffer.seek(0)\n```\n\n#### Convert files directly\n\n```python\nfrom html4docx import HtmlToDocx\n\nparser = HtmlToDocx()\nparser.parse_html_file(input_html_file_path, output_docx_file_path)\n# You can also specify an encoding; the default is utf-8\nparser.parse_html_file(input_html_file_path, output_docx_file_path, 'utf-8')\n```\n\n#### Convert files from a string\n\n```python\nfrom html4docx import HtmlToDocx\n\nparser = HtmlToDocx()\ndocx = parser.parse_html_string(input_html_file_string)\n```\n\n#### Change table styles\n\nTables are not styled by default. Use the `table_style` attribute on the parser to set a table style before converting the HTML. 
The style is used for all tables.\n\n```python\nfrom html4docx import HtmlToDocx\n\nparser = HtmlToDocx()\nparser.table_style = 'Light Shading Accent 4'\ndocx = parser.parse_html_string(input_html_file_string)\n```\n\nTo add borders to tables, use the `Table Grid` style:\n\n```python\nparser.table_style = 'Table Grid'\n```\n\nAll table styles we support can be found [here](https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html#table-styles-in-default-template).\n\n#### Options\n\nThere are 7 options that you can use to customize the conversion:\n- Disable Images: Ignore all images.\n- Disable Tables: Render only the raw table content.\n- Disable Styles: Ignore all CSS styles. Also disables Style-Map.\n- Disable Fix-HTML: Skip the BeautifulSoup pass that fixes possibly missing HTML tags.\n- Disable Style-Map: Ignore the CSS class to Word style mapping.\n- Disable Tag-Override: Ignore HTML tag overrides.\n- Disable HTML-Comments: Ignore all \"\u003c!-- ... --\u003e\" comments from HTML.\n\nThis is how you can disable them:\n\n```python\nfrom html4docx import HtmlToDocx\n\nparser = HtmlToDocx()\nparser.options['images'] = False # Default True\nparser.options['tables'] = False # Default True\nparser.options['styles'] = False # Default True\nparser.options['fix-html'] = False # Default True\nparser.options['html-comments'] = False # Default False\nparser.options['style-map'] = False # Default True\nparser.options['tag-override'] = False # Default True\ndocx = parser.parse_html_string(input_html_file_string)\n```\n\n## Extended Styling Features\n\n### CSS Class to Word Style Mapping\n\nMap HTML CSS classes to Word document styles:\n\n```python\nfrom html4docx import HtmlToDocx\n\nstyle_map = {\n    'code-block': 'Code Block',\n    'numbered-heading-1': 'Heading 1 Numbered',\n    'finding-critical': 'Finding Critical'\n}\n\nparser = HtmlToDocx(style_map=style_map)\nparser.add_html_to_document(html, document)\n```\n\n### Tag Style 
Overrides\n\nOverride default tag-to-style mappings:\n\n```python\ntag_overrides = {\n    'h1': 'Custom Heading 1',  # All \u003ch1\u003e use this style\n    'pre': 'Code Block'        # All \u003cpre\u003e use this style\n}\n\nparser = HtmlToDocx(tag_style_overrides=tag_overrides)\n```\n\n**Custom styles from a Word template:** Use a document created from a `.docx` that already defines the styles (e.g. \"Code Block\", \"Custom Markdown\"). Pass **that same document** to the parser and save it so the custom styles are preserved:\n\n```python\nfrom docx import Document\nfrom html4docx import HtmlToDocx\n\ndoc = Document(\"path/to/template.docx\")  # template has Code Block, Custom Markdown, etc.\nparser = HtmlToDocx(tag_style_overrides={\"code\": \"Custom Markdown\", \"pre\": \"Code Block\"})\nparser.add_html_to_document(html, doc)\ndoc.save(\"output.docx\")  # save the template-based doc so custom styles are preserved\n```\n\nIf you save a different document (for example, by creating a new `Document()` instead of loading your template), the output file will **not** contain the template’s custom styles.\n\nIf a referenced custom style does not exist in the document at generation time, a warning will be logged to help you detect the missing style.\n\n### Default Paragraph Style\n\nSet custom default paragraph style:\n\n```python\n# Use 'Body' as default (default behavior)\nparser = HtmlToDocx(default_paragraph_style='Body')\n\n# Use Word's default 'Normal' style\nparser = HtmlToDocx(default_paragraph_style=None)\n```\n\n### Inline CSS Styles\n\nFull support for inline CSS styles on any element:\n\n```html\n\u003cp style=\"color: red; font-size: 14pt\"\u003eRed 14pt paragraph\u003c/p\u003e\n\u003cspan style=\"font-weight: bold; color: blue\"\u003eBold blue text\u003c/span\u003e\n```\n\nSupported CSS properties:\n\n- color\n- font-size\n- font-weight (bold)\n- font-style (italic)\n- text-decoration (underline, line-through)\n- font-family\n- text-align\n- 
background-color\n- Border (for tables)\n- Vertical Align (for tables)\n\n### !important Flag Support\n\nProper CSS precedence with !important:\n\n```html\n\u003cspan style=\"color: gray\"\u003e\n  Gray text with \u003cspan style=\"color: red !important\"\u003ered important\u003c/span\u003e.\n\u003c/span\u003e\n```\n\nA declaration marked !important takes the highest priority.\n\n### Style Precedence Order\n\nStyles are applied in this order (lowest to highest priority):\n\n1. Base HTML tag styles (`\u003cb\u003e`, `\u003cem\u003e`, `\u003ccode\u003e`)\n2. Parent span styles\n3. CSS class-based styles (from `style_map`)\n4. Inline CSS styles (from `style` attribute)\n5. !important inline CSS styles (highest priority)\n\n#### Metadata\n\nYou can read or set docx metadata:\n\n```python\nfrom docx import Document\nfrom html4docx import HtmlToDocx\n\ndocument = Document()\nparser = HtmlToDocx()\nparser.set_initial_attrs(document)\nmetadata = parser.metadata\n\n# You can get metadata as a dict\nmetadata_json = metadata.get_metadata()\nprint(metadata_json['author']) # e.g. Jane\n# or just print all metadata if you want\nmetadata.get_metadata(print_result=True)\n\n# Set new metadata\nmetadata.set_metadata(author=\"Jane\", created=\"2025-07-18T09:30:00\")\ndocument.save('your_file_name.docx')\n```\n\nYou can find all available metadata attributes [here](https://python-docx.readthedocs.io/en/latest/dev/analysis/features/coreprops.html).\n\n### Why\n\nMy goal in forking and fixing/updating this package was to complete my current task at work, which involves converting HTML to DOCX. The original package lacked a few features and had some bugs, preventing me from completing the task. 
Instead of creating a new package from scratch, I preferred to update this one.\n\n### Differences (fixes and new features)\n\n**Fixes**\n- Fix `table_style` not working | [Dfop02](https://github.com/dfop02) from [Issue](https://github.com/dfop02/html4docx/issues/11)\n- Handle missing run for leading br tag | [dashingdove](https://github.com/dashingdove) from [PR](https://github.com/pqzx/html2docx/pull/53)\n- Fix base64 images | [djplaner](https://github.com/djplaner) from [Issue](https://github.com/pqzx/html2docx/issues/28#issuecomment-1052736896)\n- Handle img tag without src attribute | [johnjor](https://github.com/johnjor) from [PR](https://github.com/pqzx/html2docx/pull/63)\n- Fix bug when any style has `!important` | [Dfop02](https://github.com/dfop02)\n- Fix 'style lookup by style_id is deprecated.' | [Dfop02](https://github.com/dfop02)\n- Fix `background-color` not working | [Dfop02](https://github.com/dfop02)\n- Fix crashes when img or bookmark is created without a paragraph | [Dfop02](https://github.com/dfop02)\n- Fix Ordered and Unordered Lists | [TaylorN15](https://github.com/TaylorN15) from [PR](https://github.com/dfop02/html4docx/pull/16)\n- Fix styles only being applied to span tags. | [Dfop02](https://github.com/dfop02) from [Issue](https://github.com/dfop02/html4docx/issues/40)\n- Fix bug in style parsing when a style contains multiple colons. | [Dfop02](https://github.com/dfop02)\n- Fix highlighting a single word | [Lynuxen](https://github.com/Lynuxen)\n- Fix color parsing failing due to invalid colors, falling back to black. 
| [dfop02](https://github.com/dfop02) from [Issue](https://github.com/dfop02/html4docx/issues/53)\n\n**New Features**\n- Add Width/Height style to images | [maifeeulasad](https://github.com/maifeeulasad) from [PR](https://github.com/pqzx/html2docx/pull/29)\n- Support px, cm, pt, in, rem, em, mm, pc and % units for styles | [Dfop02](https://github.com/dfop02)\n- Improve performance on large tables | [dashingdove](https://github.com/dashingdove) from [PR](https://github.com/pqzx/html2docx/pull/58)\n- Support for HTML Pagination | [Evilran](https://github.com/Evilran) from [PR](https://github.com/pqzx/html2docx/pull/39)\n- Support Table style | [Evilran](https://github.com/Evilran) from [PR](https://github.com/pqzx/html2docx/pull/39)\n- Support alternative encoding | [HebaElwazzan](https://github.com/HebaElwazzan) from [PR](https://github.com/pqzx/html2docx/pull/59)\n- Support colors by name | [Dfop02](https://github.com/dfop02)\n- Support text values for font_size, e.g. small, medium, etc. | [Dfop02](https://github.com/dfop02)\n- Support for internal links (Anchor) | [Dfop02](https://github.com/dfop02)\n- Support for rowspan and colspan in tables. | [Dfop02](https://github.com/dfop02) from [Issue](https://github.com/dfop02/html4docx/issues/25)\n- Support for 'vertical-align' in table cells. | [Dfop02](https://github.com/dfop02)\n- Support for metadata | [Dfop02](https://github.com/dfop02)\n- Add support for table cell styles (border, background-color, width, height, margin) | [Dfop02](https://github.com/dfop02)\n- Support inline images in the same paragraph. 
| [Dfop02](https://github.com/dfop02)\n- Refactor tests to be more consistent and rely less on 'human validation' | [Dfop02](https://github.com/dfop02)\n- Support for common CSS properties for text | [Lynuxen](https://github.com/Lynuxen)\n- Support for CSS classes to Word Styles | [raithedavion](https://github.com/raithedavion)\n- Support for HTML tag style overrides | [raithedavion](https://github.com/raithedavion)\n\n## To-Do\n\nThese are the ideas I'm planning to work on in the future to make this project even better:\n\n- Add support for the `\u003cstyle\u003e` tag, including all classes, and ensure they are correctly applied throughout the file.\n- Add support for the `\u003clink\u003e` tag to load external CSS and apply it properly across the file.\n\n## Known Issues\n\n- **Maximum Nesting Depth:** Ordered lists support up to 3 nested levels. Any additional depth beyond level 3 will be treated as level 3.\n- **Counter Reset Behavior:**\n  - At level 1, starting a new ordered list will reset the counter.\n  - At levels 2 and 3, the counter will continue from the previous item unless explicitly reset.\n\n## Project Guidelines\n\nThis project is primarily designed for compatibility with Microsoft Word, but it currently works well with LibreOffice and Google Docs, based on our testing. The goal is to maintain this cross-platform harmony while continuing to implement fixes and updates.\n\n\u003e ⚠️ However, please note that Microsoft Word is the priority. 
Bugs or issues specific to other editors (e.g., LibreOffice or Google Docs) may be considered, but fixing them is secondary to maintaining full compatibility with Word.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdfop02%2Fhtml4docx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdfop02%2Fhtml4docx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdfop02%2Fhtml4docx/lists"}