https://github.com/jsade/atlas-markdown

Atlas Markdown is a Python-based tool that downloads and converts Atlassian product documentation into clean Markdown files optimized for tools like Obsidian.

Host: GitHub
URL: https://github.com/jsade/atlas-markdown
Owner: jsade
License: gpl-3.0
Created: 2025-06-13T06:55:58.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2026-04-26T01:07:54.000Z (2 months ago)
Last Synced: 2026-04-30T16:34:27.477Z (about 2 months ago)
Topics: atlassian, atlassian-confluence, atlassian-jira, atlassian-utilities, confluence, documentation, jira, markdown, obsidian, obsidian-md
Language: Python
Homepage: https://github.com/jsade/atlas-markdown
Size: 1.98 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

# Atlas Markdown

⭐ A robust command-line tool for transforming [Atlassian online product documentation](https://support.atlassian.com) into a clean, organized Markdown site on your local filesystem.

Built specifically for use with [Obsidian](https://obsidian.md/), though any markdown viewer will work.

## Features

- **Autonomous Operation** - Set it and forget it. The script handles everything from discovery to final output
- **Resume Capability** - Interrupt anytime and continue where you left off
- **Image Handling** - Downloads all images and updates references automatically
- **Accurate Content Extraction** - Handles React SPAs and dynamic content with Playwright
- **Clean Markdown** - Converts HTML to linted and well-formatted Markdown
- **Managed Safeguards** - Rate limiting, circuit breakers, and configurable constraints
- **Auto-Tagging** - Automatically generates contextual tags using semantic content analysis for better organization in Obsidian

## Requirements

- Python 3.11+
- macOS (optimized for) or Linux. Windows might work :shrug:

## Quick Start

> [!CAUTION]
> Pre-release. Expect iterative changes.

Download the wheel file from the [latest release](https://github.com/jsade/atlas-markdown/releases/latest) (e.g., `atlas_markdown-0.1.1-py3-none-any.whl`).

```bash
# 1. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# 2. Install from wheel
pip install atlas_markdown-*.whl

# 3. Install browser for web scraping
playwright install chromium

# 4. Run the tool (choose one of the following methods):

# Option A: Using command-line argument
atlas-markdown -u "https://support.atlassian.com/confluence-cloud"

# Option B: Using environment variable
export ATLAS_MD_BASE_URL="https://support.atlassian.com/confluence-cloud"
atlas-markdown

# Valid product endpoints:
# - jira-service-management-cloud
# - jira-software-cloud
# - confluence-cloud
# - jira-work-management
# - trello
# - bitbucket-cloud
# - statuspage
```

For development installation options, see [CONTRIBUTING.md](CONTRIBUTING.md#installation-options).

## How It Works

The script operates in 7 distinct phases:

1. **Discovery** - Extracts page hierarchy from React state or sitemap
2. **Page Fetching** - Downloads pages using Playwright for JavaScript rendering
3. **Image Download** - Fetches all referenced images asynchronously
4. **Retry Failed** - Attempts to re-fetch any failed pages
5. **Index Generation** - Creates navigation index of all content
6. **Link Resolution** - Converts wiki-style links to file references
7. **Markdown Linting** - Cleans up and standardizes formatting

Visual representation of the phases:

```mermaid
flowchart LR
    subgraph Phase1["1 Discovery"]
        D1[React State<br/>Extraction]
        D2[Sitemap<br/>Parsing]
    end

    subgraph Phase2["2 Page Scraping"]
        PS1[Playwright<br/>Rendering]
        PS2[Content<br/>Extraction]
    end

    subgraph Phase3["3 Image Download"]
        ID1[Async<br/>Fetching]
        ID2[Path<br/>Rewriting]
    end

    subgraph Phase4["4 Retry Failed"]
        RF1[Error<br/>Detection]
        RF2[Smart<br/>Retries]
    end

    subgraph Phase5["5 Index Generation"]
        IG1[Navigation<br/>Structure]
        IG2[TOC<br/>Creation]
    end

    subgraph Phase6["6 Link Resolution"]
        LR1[Wiki-style<br/>Links]
        LR2[File<br/>References]
    end

    subgraph Phase7["7 Markdown Linting"]
        ML1[Format<br/>Cleanup]
        ML2[Standardization]
    end

Start([🚀]) --> Phase1
Phase1 --> Phase2
Phase2 --> Phase3
Phase3 --> Phase4
Phase4 --> Phase5
Phase5 --> Phase6
Phase6 --> Phase7
Phase7 --> End([✅])

%% Styling
classDef phaseBox fill:#030,stroke:#060,stroke-width:1px,color:#fff
classDef innerNode fill:#020,stroke:#2c5aa0,stroke-width:1px,color:#fff,font-size:11px
classDef startEnd fill:#2ecc71,stroke:#27ae60,stroke-width:3px,color:#fff

class Phase1,Phase2,Phase3,Phase4,Phase5,Phase6,Phase7 phaseBox
class D1,D2,PS1,PS2,ID1,ID2,RF1,RF2,IG1,IG2,LR1,LR2,ML1,ML2 innerNode
class Start,End startEnd
```

Example site output:

```plaintext
output/
├── index.md # Main navigation index
├── docs/ # Documentation pages
│ ├── Getting started.md
│ ├── Administration/
│ │ ├── Overview.md
│ │ └── User management.md
│ └── ...
├── resources/ # Resource pages (if enabled)
│ └── ...
├── images/ # Downloaded images
│ └── [organized by page]
└── linting_report.md # Markdown formatting report
```

### Auto-Tagging

Each generated markdown file includes enhanced frontmatter with automatically generated tags based on the page's hierarchical position in the documentation. This helps with organization and navigation in tools like Obsidian.

Example frontmatter:
```yaml
---
url: https://support.atlassian.com/jira-service-management-cloud/docs/manage-users/
scrape_date: 2025-01-25T10:30:00
tags:
- jira-service-management-cloud
- user-management
atlas_md_version: 0.4.0
atlas_md_url: https://github.com/jsade/atlas-markdown
atlas_md_product: jira-service-management-cloud
atlas_md_category: Administration
atlas_md_section: User Management
---
```
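
To consume this metadata from your own scripts, the frontmatter can be read with a few lines of Python. The following is a minimal sketch that handles only the flat `key: value` pairs and simple `- item` lists shown in the example above; `parse_frontmatter` is a hypothetical helper, not part of Atlas Markdown, and a full YAML parser is the safer choice for anything more complex.

```python
def parse_frontmatter(markdown: str) -> dict:
    """Parse a flat frontmatter block (scalars and simple lists) into a dict."""
    lines = markdown.splitlines()
    # Frontmatter must open with '---' on the first line.
    if not lines or lines[0].strip() != "---":
        return {}
    data = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter ends the block
            break
        if stripped.startswith("- ") and current_key is not None:
            # A list item belongs to the most recent key (e.g. 'tags').
            if not isinstance(data[current_key], list):
                data[current_key] = []
            data[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            # Split on the first colon only, so URL values stay intact.
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            data[current_key] = value.strip()
    return data
```

All values are returned as strings (or lists of strings); dates and version numbers are not converted.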

The auto-tagging feature:
- Intelligently categorizes pages based on content (e.g., user-management, api, security, troubleshooting)
- Includes the product name as the first tag
- Generates 2-3 focused tags rather than long page slugs
- Can be disabled by setting `ATLAS_MD_DISABLE_TAGS=true`
- Always includes Atlas Markdown metadata (`atlas_md_*` fields), regardless of tag settings

Common tag categories:
- `getting-started` - Overview and introduction pages
- `administration` - Configuration and settings pages
- `user-management` - User, team, and permission pages
- `api` - API reference and integration pages
- `security` - Authentication and security pages
- `automation` - Workflow and automation pages
- `troubleshooting` - Error and problem-solving pages
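
The idea behind the categories above can be illustrated with a simple keyword scorer. This is a sketch under stated assumptions: the keyword map and the `suggest_tags` function are hypothetical and much cruder than whatever analysis the tool actually performs. It only shows how a product tag plus a small number of focused category tags might be derived from page text.

```python
import re

# Hypothetical keyword map, loosely based on the category descriptions above.
CATEGORY_KEYWORDS = {
    "getting-started": ("overview", "introduction", "get started"),
    "administration": ("configure", "settings", "admin"),
    "user-management": ("user", "team", "permission"),
    "api": ("api", "integration", "webhook"),
    "security": ("authentication", "security", "sso"),
    "automation": ("workflow", "automation", "rule"),
    "troubleshooting": ("error", "troubleshoot", "problem"),
}


def suggest_tags(product: str, text: str, max_tags: int = 3) -> list[str]:
    """Return the product tag followed by the best-scoring category tags."""
    lowered = text.lower()
    scores = {}
    for tag, keywords in CATEGORY_KEYWORDS.items():
        # Count occurrences of each keyword at the start of a word.
        hits = sum(
            len(re.findall(r"\b" + re.escape(kw), lowered)) for kw in keywords
        )
        if hits:
            scores[tag] = hits
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [product] + ranked[: max_tags - 1]
```

The product name always comes first, and `max_tags` caps the total so a page gets a few focused tags instead of a long slug.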

### State Management

The script uses SQLite to track:

- Page fetching status and metadata
- Image download progress
- Failed pages for retry
- Session information

This enables seamless resumption after interruptions.
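
As an illustration of how SQLite-backed state makes resumption possible, here is a minimal sketch using Python's standard `sqlite3` module. The table layout, column names, and helper functions are assumptions made for this example; they do not describe the actual schema of `atlas_md_fetch_state.db`.

```python
import sqlite3


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a state database with a hypothetical 'pages' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url TEXT PRIMARY KEY,
               status TEXT NOT NULL DEFAULT 'pending',
               retries INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def add_page(conn: sqlite3.Connection, url: str) -> None:
    # INSERT OR IGNORE keeps existing rows, so re-discovery on resume is harmless.
    conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    conn.commit()


def mark(conn: sqlite3.Connection, url: str, status: str) -> None:
    # Each failure increments the retry counter; other statuses leave it alone.
    bump = 1 if status == "failed" else 0
    conn.execute(
        "UPDATE pages SET status = ?, retries = retries + ? WHERE url = ?",
        (status, bump, url),
    )
    conn.commit()


def remaining(conn: sqlite3.Connection, max_retries: int = 3) -> list[str]:
    """Pages still to fetch: never attempted, or failed with retries left."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE status = 'pending' "
        "OR (status = 'failed' AND retries < ?) ORDER BY url",
        (max_retries,),
    )
    return [row[0] for row in rows]
```

Because every status change is committed immediately, a process that is interrupted can reopen the same file and call `remaining()` to pick up where it stopped.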

### Markdown Linting Rules

The script includes an automatic markdown linter that fixes common formatting issues. The linter generates a report (`linting_report.md`) summarizing all fixes applied across your documentation.

The following rules are applied during the linting phase (can be skipped with `--no-lint`):

- Content Structure
  - Remove content before H1 - Ensures documents start with a proper H1 heading
  - Fix multi-line wiki links - Consolidates wiki links that span multiple lines into single-line format
- Link Formatting
  - Convert internal links to wiki-style - Internal markdown links `[text](file.md)` become `[[file|text]]`
  - Preserve external links - HTTP/HTTPS links remain in standard markdown format `[text](url)`
- Panel Conversion
  - Panel admonitions will be transformed to Obsidian [callout](https://help.obsidian.md/callouts) format
  - Supports types "info", "warning", "error", "note", and "success"
- Table Formatting
  - Add missing table headers - Inserts header rows for tables that lack them
  - Fix table separators - Ensures proper table formatting with separator lines
- Heading Rules
  - Enforce heading spacing - Adds blank lines before and after headings for consistency
  - Preserve heading hierarchy - Maintains proper H1-H6 structure
- Whitespace Management
  - Remove trailing whitespace - Cleans line endings
  - Reduce multiple blank lines - Replaces 3+ consecutive blank lines with 2
  - Ensure final newline - Adds newline at end of file if missing
- HTML Conversion
  - Convert inline HTML - Transforms common HTML tags to markdown:
    - `<br>` → line break
    - `<strong>`, `<b>` → `**bold**`
    - `<em>`, `<i>` → `*italic*`
    - `<code>` → `` `code` ``
- List Formatting
  - Fix list indentation - Ensures list items start at column 0
  - Remove empty lines between list items - Creates compact lists
  - Fix numbered list sequences - Renumbers lists to be consecutive (1, 2, 3...)
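
A few of these rules are simple enough to sketch in Python. The function below is an illustration only: `lint_markdown` and its regular expressions are assumptions written for this example, not the tool's actual linter. It covers four of the rules listed above: trailing whitespace, internal link conversion, blank-line reduction, and the final newline.

```python
import re


def lint_markdown(text: str) -> str:
    """Apply a small subset of the linting rules to a markdown string."""
    # Remove trailing whitespace from every line.
    text = "\n".join(line.rstrip() for line in text.splitlines())

    # Convert internal links [text](file.md) to [[file|text]].
    # Image links (![...]) and http/https targets are left untouched.
    def to_wiki(match: re.Match) -> str:
        label, target = match.group(1), match.group(2)
        return f"[[{target[:-3]}|{label}]]"  # drop the '.md' suffix

    text = re.sub(
        r"(?<!!)\[([^\]]+)\]\((?!https?://)([^)]+\.md)\)", to_wiki, text
    )

    # Replace 3+ consecutive blank lines (4+ newlines) with 2 blank lines.
    text = re.sub(r"\n{4,}", "\n\n\n", text)

    # Ensure the file ends with exactly one newline.
    return text.rstrip("\n") + "\n"
```

The order matters here: trailing whitespace is stripped first so that lines containing only spaces count as blank when consecutive blank lines are collapsed.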

### Helpful Logging

- Supports various log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- Timestamped log output written to `ATLAS_MD_LOG_DIR`

![Logs](docs/images/logging.png)

## Usage

```bash
atlas-markdown
```

### Command Line Options

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--output` | `-o` | Output directory for documentation | `./output` |
| `--workers` | `-w` | Number of concurrent workers | `5` |
| `--delay` | `-d` | Delay between requests (seconds) | `1.5` |
| `--base-url` | `-u` | Base URL for documentation (overrides env var) | From `ATLAS_MD_BASE_URL` |
| `--resume` | | Resume from previous state | `False` |
| `--dry-run` | | Preview without downloading | `False` |
| `--no-lint` | | Skip markdown linting phase | `False` |
| `--exclude-resources` | | Exclude `/resources/` pages (only fetch `/docs/`) | `False` |
| `--create-redirect-stubs` | | Create stub files for redirected URLs | `False` |
| `--no-h1-headings` | | Remove H1 headings from markdown output | `False` |
| `--verbose` | `-V` | Enable verbose output | `False` |
| `--version` | `-v` | Print version and exit | |

### Configuration

The script uses environment variables for configuration. All variables are prefixed with `ATLAS_MD_` to avoid conflicts with other applications.

Add these to your shell profile (`~/.zshrc`, `~/.bashrc`, etc.):

```bash
# REQUIRED: Base URL for the documentation to fetch
# Must start with "https://support.atlassian.com/" and include a specific product endpoint
export ATLAS_MD_BASE_URL="https://support.atlassian.com/confluence-cloud"

# Optional: Override default settings as needed
export ATLAS_MD_OUTPUT_DIR="./output" # Where to save documentation
export ATLAS_MD_WORKERS="5" # Number of concurrent fetchers
export ATLAS_MD_REQUEST_DELAY="1.5" # Seconds between requests
```

After adding these variables, reload your shell configuration:
```bash
source ~/.zshrc # or ~/.bashrc
```

#### Environment Variable Reference

| Variable | Description | Default | Valid Values |
|----------|-------------|---------|-------------|
| **Required Configuration** |
| `ATLAS_MD_BASE_URL` | The Atlassian documentation URL to fetch. Must include a specific product endpoint. | **Required - No default** | `https://support.atlassian.com/{product}` |
| **Output Directory** |
| `ATLAS_MD_OUTPUT_DIR` | Directory where the fetched documentation will be saved | `./output` | Any valid directory path. |
| **Performance Settings** |
| `ATLAS_MD_WORKERS` | Number of concurrent scraping workers | `5` | `1-50` (higher = faster but more resource intensive) |
| `ATLAS_MD_REQUEST_DELAY` | Delay between requests in seconds | `1.5` | `0.1-60` (lower = faster but may trigger rate limits) |
| `ATLAS_MD_USER_AGENT` | User agent string for HTTP requests | Mozilla/5.0... | Any valid user agent string |
| **Logging Configuration** |
| `ATLAS_MD_LOG_LEVEL` | Logging verbosity level | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
| `ATLAS_MD_LOG_ENABLED` | Enable file logging | `false` | `true`, `false` |
| `ATLAS_MD_LOG_DIR` | Directory for log files | `logs/` | Any valid directory path |
| **Safety Constraints** |
| `ATLAS_MD_MAX_CRAWL_DEPTH` | Maximum depth to follow links from entry point | `5` | `0-10` (0 = unlimited, not recommended) |
| `ATLAS_MD_MAX_PAGES` | Maximum total pages to fetch | `1500` | `0+` (0 = unlimited, use with caution) |
| `ATLAS_MD_MAX_RUNTIME_MINUTES` | Maximum runtime in minutes before stopping | `120` | `0+` (0 = unlimited) |
| `ATLAS_MD_MAX_FILE_SIZE_MB` | Skip files larger than this size | `50` | `1+` MB |
| `ATLAS_MD_DOMAIN_RESTRICTION` | Control which URLs to follow | `product` | See below |
| `ATLAS_MD_MAX_RETRIES` | Maximum retry attempts for failed pages | `3` | `0-10` |
| `ATLAS_MD_MAX_CONSECUTIVE_FAILURES` | Stop after this many consecutive failures | `20` | `5+` |
| `ATLAS_MD_DRY_RUN_DEFAULT` | Enable dry run mode by default | `false` | `true`, `false` |
| **Content Processing** |
| `ATLAS_MD_NO_H1_HEADINGS` | Remove H1 headings from output | `false` | `true`, `false` |
| `ATLAS_MD_DISABLE_TAGS` | Disable automatic tag generation | `false` | `true`, `false` |
| `ATLAS_MD_ENABLE_CONTENT_ANALYSIS` | Enable semantic content analysis for enhanced tagging | `true` | `true`, `false` |
| `ATLAS_MD_MAX_TAGS` | Maximum tags per page | `10` | `1-20` |
| `ATLAS_MD_MIN_TERM_FREQUENCY` | Minimum occurrences for technical terms | `3` | `1-10` |
| `ATLAS_MD_TECHNICAL_PATTERNS` | Enable technical pattern detection | `true` | `true`, `false` |

#### Domain Restriction Modes

- **`product`** (default) - Only fetch URLs under your specific product. For example, if `ATLAS_MD_BASE_URL` is set to `.../jira-service-management-cloud/`, only URLs starting with that path will be fetched.
- **`any-atlassian`** - Allow fetching any `support.atlassian.com` URL. Warning: This may result in fetching documentation from other Atlassian products.
- **`off`** - No domain restrictions (not recommended). The crawler could potentially follow links to external sites.
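
The three modes can be expressed as a small URL filter. The following sketch uses only the standard library; `is_allowed` is a hypothetical function written for this example and is not taken from the tool's source.

```python
from urllib.parse import urlsplit

ATLASSIAN_HOST = "support.atlassian.com"


def is_allowed(url: str, base_url: str, mode: str = "product") -> bool:
    """Decide whether a discovered URL may be fetched under the given mode."""
    if mode not in ("product", "any-atlassian", "off"):
        raise ValueError(f"unknown domain restriction mode: {mode}")
    if mode == "off":
        return True  # no restriction at all
    target = urlsplit(url)
    if target.hostname != ATLASSIAN_HOST:
        return False
    if mode == "any-atlassian":
        return True
    # 'product' mode: the path must sit under the base URL's product path.
    # Trailing slashes are normalized so 'foo' does not match 'foo-bar'.
    prefix = urlsplit(base_url).path.rstrip("/") + "/"
    return (target.path.rstrip("/") + "/").startswith(prefix)
```

Comparing normalized paths with a trailing slash avoids a subtle pitfall: a plain string prefix check would treat a product whose name merely begins with the same characters as part of the configured product.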

#### Supported Atlassian Products

- `jira-service-management-cloud`
- `jira-software-cloud`
- `confluence-cloud`
- `jira-work-management`
- `trello`
- `bitbucket-cloud`
- `statuspage`

### Usage Examples

```bash
# Fetch Confluence documentation using command-line option
atlas-markdown -u "https://support.atlassian.com/confluence-cloud" -o ./confluence-docs

# Fetch Jira documentation with custom settings
atlas-markdown -u "https://support.atlassian.com/jira-software-cloud" \
--output ./jira-docs \
--workers 3 \
--delay 2.0

# Resume interrupted fetch
atlas-markdown --resume

# Dry run to preview what would be fetched
atlas-markdown -u "https://support.atlassian.com/trello" --dry-run

# Using environment variable instead of command-line option
export ATLAS_MD_BASE_URL="https://support.atlassian.com/bitbucket-cloud"
atlas-markdown --output ./bitbucket-docs
```

## Troubleshooting

**Script hangs on "Initializing browser"**
- Run `playwright install chromium` to ensure the browser is installed

**"Too many requests" errors**
- Increase `ATLAS_MD_REQUEST_DELAY` environment variable
- Reduce `ATLAS_MD_WORKERS` count

**Out of memory errors**
- Reduce `ATLAS_MD_WORKERS` count
- Enable verbose mode to identify memory-heavy pages

**Resume not working**
- Ensure `atlas_md_fetch_state.db` exists and is not corrupted
- Check file permissions on output directory

## Responsible Use

This tool is designed for legitimate documentation archival and offline access. To use it responsibly:

1. **Respect Rate Limits** - The default 1.5 second delay between requests is configured to be respectful
2. **Check robots.txt** - Atlassian's robots.txt currently allows scraping of documentation pages
3. **Personal Use** - Use downloaded content for personal reference, not redistribution

## License

This project is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE) file for details.

## Disclaimer

This project is an independent open-source tool and is not affiliated with, endorsed by, or sponsored by Atlassian Corporation Pty Ltd.

_All Atlassian product names, logos, and brands mentioned in this repository (including but not limited to Jira, Confluence, Bitbucket, Trello, and Statuspage) are the property of Atlassian Corporation Pty Ltd. All product and company names are trademarks™ or registered® trademarks of their respective holders. Use of them does not imply any affiliation with or endorsement by them._

_This tool is designed for personal use to create offline copies of publicly available documentation. Users are responsible for complying with Atlassian's Terms of Service and any applicable usage policies when accessing their documentation._
