{"id":26038881,"url":"https://github.com/twardoch/twars-url2md","last_synced_at":"2025-04-09T23:52:27.783Z","repository":{"id":275619162,"uuid":"926648536","full_name":"twardoch/twars-url2md","owner":"twardoch","description":"Rust tool to fetch multiple URLs and save them into Markdown files","archived":false,"fork":false,"pushed_at":"2025-04-07T17:19:03.000Z","size":169,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-09T23:52:22.671Z","etag":null,"topics":["html2markdown","html2md","llm-context","markdown","rust","rust-lang","url2markdown"],"latest_commit_sha":null,"homepage":"https://crates.io/crates/twars-url2md","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/twardoch.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-03T16:18:46.000Z","updated_at":"2025-04-08T08:41:43.000Z","dependencies_parsed_at":"2025-02-03T17:34:21.175Z","dependency_job_id":"8a98d81a-d7c6-4a9b-b447-9b53f705a597","html_url":"https://github.com/twardoch/twars-url2md","commit_stats":null,"previous_names":["twardoch/twars-url2md"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twardoch%2Ftwars-url2md","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twardoch%2Ftwars-url2md/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twardoch%2Ftwars-url2md/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/twardoch%2Ftwars-url2md/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/twardoch","download_url":"https://codeload.github.com/twardoch/twars-url2md/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248131454,"owners_count":21052819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html2markdown","html2md","llm-context","markdown","rust","rust-lang","url2markdown"],"created_at":"2025-03-07T10:38:03.278Z","updated_at":"2025-04-09T23:52:27.772Z","avatar_url":"https://github.com/twardoch.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# twars-url2md\n\n[![Crates.io](https://img.shields.io/crates/v/twars-url2md)](https://crates.io/crates/twars-url2md)\n![GitHub Release Date](https://img.shields.io/github/release-date/twardoch/twars-url2md)\n![GitHub commits since latest release](https://img.shields.io/github/commits-since/twardoch/twars-url2md/latest)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n**`twars-url2md`** is a fast and robust command-line tool written in Rust that fetches web pages, cleans up their HTML content, and converts them into clean Markdown.\n\nYou can drop a text that contains URLs onto the app, and it will find all the URLs and save Markdown versions of the pages in a logical folder structure. The output is not perfect, but the tool is fast and robust.\n\n## 1. 
Table of Contents\n\n- [Features](#features)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Examples](#examples)\n- [Output Organization](#output-organization)\n- [Development](#development)\n- [License](#license)\n- [Author](#author)\n\n## 2. Features\n\n### 2.1. Powerful Web Content Conversion\n\n- Extracts clean web content using Monolith\n- Converts web pages to Markdown efficiently\n- Handles complex URL and encoding scenarios\n\n### 2.2. Smart URL Handling\n\n- Extracts URLs from various text formats\n- Resolves and validates URLs intelligently\n- Supports base URL and relative link processing\n- **NEW**: Processes local HTML files in addition to remote URLs\n\n### 2.3. Flexible Input \u0026 Output\n\n- Multiple input methods (file, stdin, CLI)\n- Organized Markdown file generation\n- Cross-platform compatibility\n- **NEW**: Option to pack all Markdown outputs into a single combined file\n\n### 2.4. Advanced Processing\n\n- Parallel URL processing\n- Robust error handling\n- Exponential backoff retry mechanism for network requests\n\n## 3. Installation\n\n### 3.1. Download Pre-compiled Binaries\n\nThe easiest way to get started is to download the pre-compiled binary for your platform.\n\n1. Visit the [releases page](https://github.com/twardoch/twars-url2md/releases)\n2. Download the appropriate file for your system:\n   - **macOS**: `twars-url2md-macos-universal.tar.gz` (works on both Intel and Apple Silicon)\n   - **Windows**: `twars-url2md-windows-x86_64.exe.zip`\n   - **Linux**: `twars-url2md-linux-x86_64.tar.gz`\n3. Extract the archive:\n   - **macOS/Linux**: `tar -xzf twars-url2md-*.tar.gz`\n   - **Windows**: Extract the zip file using Explorer or any archive utility\n4. Make the binary executable (macOS/Linux only): `chmod +x twars-url2md`\n5. Move the binary to a location in your PATH:\n   - **macOS/Linux**: `sudo mv twars-url2md /usr/local/bin/` or `mv twars-url2md ~/.local/bin/`\n   - **Windows**: Move to a folder in your PATH or add the folder to your PATH\n\n### 3.2. 
Install from Crates.io\n\nIf you have Rust installed (version 1.70.0 or later), you can install directly from crates.io:\n\n```bash\ncargo install twars-url2md\n```\n\n### 3.3. Build from Source\n\nFor the latest version or to customize the build:\n\n```bash\n# Clone the repository\ngit clone https://github.com/twardoch/twars-url2md.git\ncd twars-url2md\n\n# Build and install\ncargo build --release\nmv target/release/twars-url2md /usr/local/bin/  # or any location in your PATH\n```\n\n## 4. Usage\n\n### 4.1. Command Line Options\n\n```\nUsage: twars-url2md [OPTIONS]\n\nOptions:\n  -i, --input \u003cFILE\u003e       Input file containing URLs or local file paths (one per line)\n  -o, --output \u003cDIR\u003e       Output directory for markdown files\n      --stdin              Read URLs from standard input\n      --base-url \u003cURL\u003e     Base URL for resolving relative links\n  -p, --pack \u003cFILE\u003e        Output file to pack all markdown files together\n  -v, --verbose            Enable verbose output\n  -h, --help               Print help\n  -V, --version            Print version\n```\n\n### 4.2. Input Options\n\nThe tool accepts URLs and local file paths from:\n\n- A file specified with `--input`\n- Standard input with `--stdin`\n- **Note:** Either `--input` or `--stdin` must be specified\n\n### 4.3. Output Options\n\n- `--output \u003cDIR\u003e`: Create individual Markdown files in this directory\n- `--pack \u003cFILE\u003e`: Combine all Markdown files into a single output file\n- You can use both options together\n\n### 4.4. Processing Local Files\n\nYou can now include local HTML files in your input:\n\n- Absolute paths: `/path/to/file.html`\n- File URLs: `file:///path/to/file.html`\n- Mix of local files and remote URLs in the same input\n\n## 5. Examples\n\n### 5.1. 
Basic Usage\n\n```bash\n# Process a single URL and print to stdout\necho \"https://example.com\" | twars-url2md --stdin\n\n# Process URLs from a file with specific output directory\ntwars-url2md --input urls.txt --output ./markdown_output\n\n# Process piped URLs with base URL for relative links\ncat urls.txt | twars-url2md --stdin --base-url \"https://example.com\" --output ./output\n\n# Show verbose output\ntwars-url2md --input urls.txt --output ./output --verbose\n```\n\n### 5.2. Using the Pack Option\n\n```bash\n# Process URLs and create a combined Markdown file\ntwars-url2md --input urls.txt --pack combined.md\n\n# Both individual files and a combined file\ntwars-url2md --input urls.txt --output ./output --pack combined.md\n```\n\n### 5.3. Processing Local Files\n\n```bash\n# Create a test HTML file\necho \"\u003chtml\u003e\u003cbody\u003e\u003ch1\u003eTest\u003c/h1\u003e\u003cp\u003eContent\u003c/p\u003e\u003c/body\u003e\u003c/html\u003e\" \u003e test.html\n\n# Process a local HTML file\necho \"$PWD/test.html\" \u003e local_paths.txt\ntwars-url2md --input local_paths.txt --output ./output\n\n# Mix local and remote content\ncat \u003e mixed.txt \u003c\u003c EOF\nhttps://example.com\nfile://$PWD/test.html\nEOF\ntwars-url2md --input mixed.txt --pack combined.md\n```\n\n### 5.4. Batch Processing\n\n```bash\n# Extract and process links from a webpage\ncurl \"https://en.wikipedia.org/wiki/Rust_(programming_language)\" | twars-url2md --stdin --output rust_wiki/\n\n# Process multiple files\nfind ./html_files -name \"*.html\" \u003e files_to_process.txt\ntwars-url2md --input files_to_process.txt --output ./markdown_output --pack all_content.md\n```\n\n## 6. 
Output Organization\n\nThe tool organizes output into a directory structure based on the URLs:\n\n```\noutput/\n├── example.com/\n│   ├── index.md       # from https://example.com/\n│   └── articles/\n│       └── page.md    # from https://example.com/articles/page\n└── another-site.com/\n    └── post/\n        └── article.md # from https://another-site.com/post/article\n```\n\nFor local files, the directory structure mirrors the file path.\n\n## 7. Development\n\n### 7.1. Running Tests\n\n```bash\n# Run all tests\ncargo test\n\n# Run with all features enabled\ncargo test --all-features\n\n# Run a specific test\ncargo test test_name\n```\n\n### 7.2. Code Quality Tools\n\n- **Formatting**: `cargo fmt`\n- **Linting**: `cargo clippy --all-targets --all-features`\n\n### 7.3. Publishing\n\nTo publish a new release of twars-url2md:\n\n#### 7.3.1. Prepare for Release\n\n```bash\n# Update version in Cargo.toml (e.g. from 1.3.6 to 1.3.7)\n# Ensure everything works\ncargo test\ncargo clippy --all-targets --all-features\ncargo fmt --check\n```\n\n#### 7.3.2. Build Locally\n\n```bash\n# Build in release mode\ncargo build --release\n\n# Test the binary\n./target/release/twars-url2md --help\n```\n\n#### 7.3.3. Publish to Crates.io\n\n```bash\n# Login to crates.io (if not already logged in)\ncargo login\n\n# Verify the package\ncargo package\n\n# Publish\ncargo publish\n```\n\n#### 7.3.4. Create GitHub Release\n\n```bash\n# Create and push a tag matching your version\ngit tag -a v1.3.7 -m \"Release v1.3.7\"\ngit push origin v1.3.7\n```\n\nThe configured GitHub Actions workflow (`.github/workflows/ci.yml`) will automatically:\n- Run tests on the tag\n- Create a GitHub Release\n- Build binaries for macOS, Windows, and Linux\n- Upload the binaries to the release\n- Publish to crates.io\n\n#### 7.3.5. Manual Release (Alternative)\n\nIf GitHub Actions fails, you can create the release manually:\n\n1. Go to GitHub repository → Releases → Create a new release\n2. Select your tag\n3. 
Build platform-specific binaries:\n\n```bash\n# macOS universal binary\ncargo build --release --target x86_64-apple-darwin\ncargo build --release --target aarch64-apple-darwin\nlipo \"target/x86_64-apple-darwin/release/twars-url2md\" \"target/aarch64-apple-darwin/release/twars-url2md\" -create -output \"target/twars-url2md\"\ntar czf twars-url2md-macos-universal.tar.gz -C target twars-url2md\n\n# Linux\ncargo build --release --target x86_64-unknown-linux-gnu\ntar czf twars-url2md-linux-x86_64.tar.gz -C target/x86_64-unknown-linux-gnu/release twars-url2md\n\n# Windows\ncargo build --release --target x86_64-pc-windows-msvc\ncd target/x86_64-pc-windows-msvc/release\n7z a ../../../twars-url2md-windows-x86_64.zip twars-url2md.exe\n```\n\n4. Upload these files to your GitHub release\n\n#### 7.3.6. Verify the Release\n\n- Check that the release appears on GitHub\n- Verify that binary files are attached to the release\n- Confirm the new version appears on crates.io\n- Try installing the new version: `cargo install twars-url2md`\n\n## 8. License\n\nMIT License - see [LICENSE](LICENSE) for details.\n\n## 9. Author\n\nAdam Twardoch ([@twardoch](https://github.com/twardoch))\n\n---\n\nFor bug reports, feature requests, or general questions, please open an issue on the [GitHub repository](https://github.com/twardoch/twars-url2md/issues).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwardoch%2Ftwars-url2md","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftwardoch%2Ftwars-url2md","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwardoch%2Ftwars-url2md/lists"}