https://github.com/netwrix/flarewell

Say goodbye to MadCap Flare and convert your project to markdown!

# HTML to Markdown Converter - Claude Instructions

A Python tool that converts HTML documentation (particularly from MadCap Flare) to Markdown format while preserving folder structure and centralizing images with intelligent deduplication.

## Core Functionality

- **Input**: HTML files (`.html`, `.htm`, `.xhtml`)
- **Output**: Markdown files (`.md`)
- **Directory Structure**: Preserved except for images
- **Image Handling**: Centralized in `static/img/{productname}` directory
- **Filename Convention**: All lowercase with underscores replacing spaces
- **Path References**: Absolute paths from parent output directory
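
The filename convention above can be illustrated with a small helper. This is a minimal sketch, not the tool's actual code: the names `normalize_name` and `markdown_path` are hypothetical, and it assumes only the file name (not the directory names) is lowercased, which matches the capitalized `Product1/` folders shown in the output structure.

```python
import re
from pathlib import PurePosixPath


def normalize_name(name: str) -> str:
    """Lowercase a name and replace each run of whitespace with one underscore."""
    return re.sub(r"\s+", "_", name.strip()).lower()


def markdown_path(html_rel_path: str) -> str:
    """Map a relative HTML path to its Markdown path, keeping the folders as-is."""
    path = PurePosixPath(html_rel_path)
    # Only the final component changes: normalized stem plus the .md extension.
    return str(path.with_name(normalize_name(path.stem) + ".md"))


print(markdown_path("Product1/guide/Getting Started.htm"))
```

Whether directory names should also be normalized is a design choice this sketch leaves open.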

## Key Features

### Image Deduplication

- Detects identical images using content hashing
- Stores only one copy of duplicate images
- Tracks usage in `image-manifest.json`

### Link Conversion

- Updates all internal `.html` links to `.md`
- Maintains anchor links between documents
- Resolves cross-file references automatically

### Centralized Image Storage

- All images stored in `/static/img/{mirror_doc_directory}`
- One image folder per product
- Only referenced images are copied
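
Content-hash deduplication can be sketched as follows. The hash algorithm is not stated in this README, so SHA-256 is an assumption here, and the function names `file_digest` and `deduplicate` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes (algorithm is an assumption)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large images are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(images: list[Path]) -> dict[str, list[Path]]:
    """Group image paths by content hash; each group holds byte-identical files."""
    groups: dict[str, list[Path]] = {}
    for img in images:
        groups.setdefault(file_digest(img), []).append(img)
    return groups
```

In a layout like this, the first path in each group could serve as the single stored copy, and the remaining paths could be recorded as additional usages in a manifest.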

## Installation & Setup

```bash
# 1. Clone repository
git clone [repository_url]

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install beautifulsoup4 markdownify
```

## Usage

Basic conversion:

```bash
python app.py /path/to/html/docs /path/to/output
```

With detailed progress output:

```bash
python app.py /path/to/html/docs /path/to/output --verbose
```

## Output Structure

```
output/                          # Specified output directory
├── Product1/                    # Markdown files (structure preserved)
│   ├── guide/
│   │   └── intro.md
│   └── api/
│       └── reference.md
└── Product2/
    └── docs/
        └── overview.md

static/                          # Parallel to output directory
└── img/                         # Centralized images (not 'images')
    ├── image-manifest.json      # Deduplication tracking
    ├── Product1/
    │   ├── guide/
    │   │   └── screenshot.png
    │   └── api/
    │       └── diagram.png
    └── Product2/
        └── docs/
            └── logo.png
```

## Implementation Details

### Pass 1: Analysis

- Scan for images and build reference map
- Create anchor mappings for cross-references
- Build deduplication hash table

### Pass 2: Conversion

- Convert HTML to Markdown
- Update all link references
- Copy unique images to static directory
- Generate `image-manifest.json`
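
The "update all link references" step might look roughly like the sketch below. It is an illustration under stated assumptions rather than the tool's implementation: it operates on already-converted Markdown with a regular expression, the name `rewrite_links` is hypothetical, and it does not apply the lowercase/underscore filename convention to link targets.

```python
import re

# Matches the target of an inline Markdown link ending in .html/.htm/.xhtml,
# with an optional #anchor, e.g. "](guide/intro.htm#setup)".
HTML_LINK = re.compile(
    r"\]\(([^)#\s]+?)\.(?:html|htm|xhtml)(#[^)\s]*)?\)", re.IGNORECASE
)
URL_SCHEME = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def rewrite_links(markdown: str) -> str:
    """Point internal HTML links at .md files, keeping any #anchor intact."""

    def repl(match: re.Match) -> str:
        target, anchor = match.group(1), match.group(2) or ""
        if URL_SCHEME.match(target):
            return match.group(0)  # external URL (https:, mailto:, ...): leave as-is
        return f"]({target}.md{anchor})"

    return HTML_LINK.sub(repl, markdown)


print(rewrite_links("See [Intro](guide/intro.htm#setup) and [Ext](https://example.com/a.html)."))
```

Rewriting links on the parsed HTML before conversion would be an equally valid design; the regex approach is shown only because it is compact.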

## Critical Requirements

- Never modify source files
- Preserve all internal links
- Handle MadCap Flare-specific HTML structures

- Maintain readable Markdown output
- Optimize image storage through deduplication
- Generate comprehensive image manifest

## Error Handling

### Missing Images

- Log warning but continue processing
- Record in `image-manifest.json`
- Preserve image reference in Markdown

### Malformed HTML

- Attempt best-effort conversion
- Log parsing errors with file path
- Continue with next file

### File Conflicts

- Check for existing files
- Option to overwrite or skip
- Log conflicts
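
The "log and continue" behavior described above can be sketched as a per-file loop. Everything here is illustrative: `convert_all` and its `convert` callback are hypothetical names, and for brevity the sketch writes every output file directly into `out_dir` rather than mirroring the source folder structure.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable

log = logging.getLogger("converter")  # logger name is an assumption


def convert_all(
    files: Iterable[Path],
    convert: Callable[[Path], str],
    out_dir: Path,
    overwrite: bool = False,
) -> dict[str, list[Path]]:
    """Convert each file, skipping conflicts and continuing past failures."""
    results: dict[str, list[Path]] = {"converted": [], "skipped": [], "failed": []}
    for src in files:
        dest = out_dir / (src.stem + ".md")
        if dest.exists() and not overwrite:
            log.warning("Skipping existing file: %s", dest)
            results["skipped"].append(src)
            continue
        try:
            text = convert(src)
        except Exception as exc:  # best effort: one bad file must not stop the run
            log.error("Failed to convert %s: %s", src, exc)
            results["failed"].append(src)
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(text, encoding="utf-8")
        results["converted"].append(src)
    return results
```

Returning the three lists makes it easy to print a summary at the end of a run instead of relying on the log alone.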

## Performance Considerations

- **Expected Speed**: ~1-2 seconds per file
- **Memory Usage**: Scales with image deduplication table
- **Disk Usage**: Reduced through image deduplication
- **Large Documentation Sets**: Two-pass processing for efficiency

## Troubleshooting Guide

### Missing Images

Likely cause: the image is not referenced in the HTML or is missing from the source.

1. Verify image exists in source
2. Check if referenced in HTML
3. Review `image-manifest.json`
4. Confirm `static/img` structure

### Broken Cross-References

Likely cause: cross-reference anchors not found.

1. Check anchor mappings in verbose output
2. Verify target document exists
3. Confirm anchor ID consistency
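
When checking anchor ID consistency by hand, it can help to list the anchors a target HTML file actually defines. The following standalone sketch uses only the standard library; `AnchorCollector` and `anchors_in` are hypothetical helpers and are not part of the tool.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every `id` attribute and every `<a name="...">` value."""

    def __init__(self) -> None:
        super().__init__()
        self.anchors: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def anchors_in(html: str) -> set[str]:
    """Return the set of anchor names defined in an HTML string."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.anchors


print(sorted(anchors_in('<h2 id="setup">Setup</h2><a name="top"></a>')))
```

Comparing this set against the `#fragment` part of a failing link shows quickly whether the anchor is missing or merely spelled differently.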

## Command Reference

| Option | Type | Description | Default |
|--------|------|-------------|---------|
| `input_dir` | Required | Source HTML directory | - |
| `output_dir` | Required | Destination for Markdown | - |
| `--verbose, -v` | Flag | Show detailed progress | False |
| `--overwrite` | Flag | Overwrite existing files | False |
| `--skip-images` | Flag | Convert without copying images | False |
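
For reference, the options in the table map naturally onto an `argparse` definition. This is a sketch of how such a command line could be declared, assuming `argparse` is used; the README does not show how `app.py` actually parses its arguments, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the positional arguments and flags listed in the table above."""
    parser = argparse.ArgumentParser(
        prog="app.py", description="Convert HTML documentation to Markdown."
    )
    parser.add_argument("input_dir", help="Source HTML directory")
    parser.add_argument("output_dir", help="Destination for Markdown")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Show detailed progress")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing files")
    parser.add_argument("--skip-images", action="store_true",
                        help="Convert without copying images")
    return parser


args = build_parser().parse_args(["docs", "out", "-v", "--skip-images"])
print(args.input_dir, args.output_dir, args.verbose, args.overwrite, args.skip_images)
```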

## Testing Checklist

- [ ] Basic HTML to Markdown conversion
- [ ] Image deduplication across multiple files
- [ ] Cross-file link resolution
- [ ] MadCap Flare specific elements
- [ ] Large documentation set performance
- [ ] Edge cases (empty files, broken HTML)

## Future Enhancements

- Support for custom CSS preservation
- Batch processing with progress bar
- Configuration file support
- Plugin system for custom transformations