https://github.com/mazzasaverio/url2md4ai
Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
- Host: GitHub
- URL: https://github.com/mazzasaverio/url2md4ai
- Owner: mazzasaverio
- License: other
- Created: 2025-06-29T09:20:05.000Z (11 months ago)
- Default Branch: master
- Last Pushed: 2025-07-01T13:42:26.000Z (11 months ago)
- Last Synced: 2025-07-01T13:56:51.743Z (11 months ago)
- Topics: html-to-markdown, html-to-markdown-converter, openai, playwright, text-extraction, trafilatura
- Language: Python
- Homepage:
- Size: 213 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# url2md4ai
**Lean Python tool for extracting clean, LLM-optimized markdown from web pages**
Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. Combines **Playwright** for JavaScript rendering with **Trafilatura** for intelligent content extraction, delivering markdown specifically optimized for LLM processing and information extraction.
## Why url2md4ai?
**Traditional tools** extract everything: ads, cookie banners, navigation menus, social media widgets...
**url2md4ai** extracts only what matters: clean, structured content ready for LLM processing.
```bash
# Example: Extract job posting from Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata
# Result: about 82% smaller output (from 51KB to 9KB)
# ✅ Clean job title, description, requirements, benefits
# ❌ No cookie banners, ads, or navigation clutter
```
**Perfect for:**
- AI content analysis workflows
- LLM-based information extraction
- Web scraping for research and analysis
- Content preprocessing for RAG systems
- Automated content monitoring
## Features
### **LLM-Optimized Text Extraction**
- **Smart Content Extraction**: Powered by Trafilatura for intelligent text extraction
- **Dynamic Content Support**: Full JavaScript rendering with Playwright for SPAs and dynamic sites
- **Clean Output**: Removes ads, cookie banners, navigation, and other noise for pure content
- **Maximum Information Density**: Optimized markdown specifically designed for LLM processing
### **Lean & Efficient**
- **Focused Purpose**: Built specifically for AI/LLM text extraction workflows
- **Fast Processing**: Optional non-JavaScript mode for static content (3x faster)
- **CLI-First**: Simple command-line interface for batch processing and automation
- **Python API**: Clean programmatic access for integration into AI pipelines
### **Production Ready**
- **Smart Filenames**: Generate unique, deterministic filenames using URL hashes
- **Batch Processing**: Parallel processing support for multiple URLs
- **Configurable**: Extensive configuration options for different content types
- **Reliable**: Built-in retry logic and error handling
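The retry logic mentioned under **Reliable** is not described in detail in this README. As a rough illustration only, a retry wrapper with exponential backoff might look like the sketch below. `with_retries`, its parameters, and the backoff schedule are hypothetical and are not taken from url2md4ai's source.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying up to max_retries times with exponential backoff.

    Hypothetical helper for illustration; not part of url2md4ai.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            # Out of retries: re-raise the last error to the caller.
            if attempt == max_retries:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test, because a test can record the delays instead of actually waiting.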
## Quick Start
### Using uv (Recommended)
```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync
# Install Playwright browsers
uv run playwright install chromium
# Convert your first URL
uv run url2md4ai convert "https://example.com"
```
### Using pip
```bash
pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"
```
### Using Docker
```bash
# Build the image
docker build -t url2md4ai .
# Run with URL conversion
docker run --rm \
  -v "$(pwd)/output:/app/output" \
  url2md4ai \
  convert "https://example.com"
```
## Usage
### CLI Commands
#### Basic Conversion
```bash
# Convert a single URL (with metadata)
url2md4ai convert "https://example.com" --show-metadata
# Convert with custom output file
url2md4ai convert "https://example.com" -o my_page.md
# Convert without JavaScript (3x faster for static content)
url2md4ai convert "https://example.com" --no-js
# Raw extraction (no LLM optimization)
url2md4ai convert "https://example.com" --raw
# Get both HTML and Markdown
url2md4ai convert "https://example.com" --raw --save-html --output-dir raw_content # Get raw HTML
url2md4ai convert "https://example.com" --clean --output-dir clean_content # Get clean markdown
```
#### Batch Processing
```bash
# Convert multiple URLs with parallel processing
url2md4ai batch "https://site1.com" "https://site2.com" "https://site3.com" --concurrency 5
# Continue processing even if some URLs fail
url2md4ai batch "https://site1.com" "https://site2.com" --continue-on-error
# Custom output directory
url2md4ai batch "https://example.com" -d /path/to/output
```
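How `--concurrency` and `--continue-on-error` are implemented is not spelled out here. One common way to get the same behavior, shown as a sketch rather than the tool's actual code, is to cap in-flight work with an `asyncio.Semaphore` and collect failures instead of aborting. `convert_many` and the stand-in `fake_convert` coroutine are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Union


async def convert_many(
    urls: List[str],
    convert: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
    continue_on_error: bool = True,
) -> Dict[str, Union[str, BaseException]]:
    """Run convert(url) for every URL with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url: str) -> str:
        async with semaphore:
            return await convert(url)

    # return_exceptions=True stores failures as results instead of raising.
    outcomes = await asyncio.gather(
        *(worker(u) for u in urls), return_exceptions=continue_on_error
    )
    return dict(zip(urls, outcomes))


async def fake_convert(url: str) -> str:
    # Stand-in for a real conversion call; fails for one URL on purpose.
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(f"cannot convert {url}")
    return f"# Markdown for {url}"


results = asyncio.run(
    convert_many(["https://site1.com", "https://bad.example"], fake_convert, concurrency=2)
)
```

With `continue_on_error=True`, the failed URL maps to its exception object while the other URL still maps to its markdown.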
#### Preview and Utilities
```bash
# Preview conversion without saving
url2md4ai preview "https://example.com" --show-content
# Test different extraction methods
url2md4ai test-extraction "https://example.com" --method both --show-diff
# Generate hash filename for URL
url2md4ai hash "https://example.com"
# Show current configuration
url2md4ai config-info --format json
```
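The `hash` command maps a URL to a deterministic filename, but the exact algorithm is not stated in this README. The sketch below assumes a SHA-256 digest truncated to 16 hex characters. `url_to_filename` is a hypothetical helper, not the library's `URLHasher`, and its output will not necessarily match the CLI's.

```python
import hashlib


def url_to_filename(url: str, length: int = 16, extension: str = ".md") -> str:
    """Return a stable filename derived from the URL's SHA-256 digest."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{digest[:length]}{extension}"


# The same URL always yields the same name, so reruns overwrite the
# earlier file instead of creating duplicates.
name = url_to_filename("https://example.com")
print(name)
```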
### Python API
```python
import asyncio

from url2md4ai import URLToMarkdownConverter, Config

# Initialize converter
config = Config.from_env()
converter = URLToMarkdownConverter(config)

# Convert URL synchronously (perfect for LLM pipelines)
result = converter.convert_url_sync("https://example.com")
if result.success:
    print(f"Title: {result.title}")
    print(f"Saved as: {result.filename}")
    print(f"Size: {result.file_size:,} characters")
    print(f"Method: {result.extraction_method}")
    print(f"Processing time: {result.processing_time:.2f}s")
    # Use extracted content for LLM processing
    llm_ready_content = result.markdown
    print("LLM-ready content extracted successfully!")
else:
    print(f"Error: {result.error}")

# Convert URL asynchronously
async def convert_url():
    result = await converter.convert_url("https://example.com")
    return result

result = asyncio.run(convert_url())

# Get both HTML and Markdown from a URL
async def get_html_and_markdown():
    # Initialize converter with raw HTML option
    config = Config(
        clean_content=False,         # Get raw HTML
        llm_optimized=False,         # No extra processing
        wait_for_network_idle=True,  # Wait for dynamic content
        page_wait_timeout=2000,      # Wait 2s for dynamic content
    )
    converter = URLToMarkdownConverter(config)
    # Get raw HTML first
    result = await converter.convert_url(
        "https://example.com",
        save_to_file=False,  # Don't save to file
    )
    raw_html = result.html
    # Now get clean markdown with optimizations
    config.clean_content = True
    config.llm_optimized = True
    converter = URLToMarkdownConverter(config)
    result = await converter.convert_url(
        "https://example.com",
        save_to_file=True,  # Save markdown to file
    )
    clean_markdown = result.markdown
    return {
        "html": raw_html,
        "markdown": clean_markdown,
        "title": result.title,
        "metadata": result.metadata,
    }

# Use the function
result = asyncio.run(get_html_and_markdown())
print(f"HTML size: {len(result['html']):,} characters")
print(f"Markdown size: {len(result['markdown']):,} characters")
print(f"Title: {result['title']}")
```
#### Advanced Usage
```python
import asyncio

from url2md4ai import URLToMarkdownConverter, Config, URLHasher

# Custom configuration for specific content types
config = Config(
    timeout=60,
    wait_for_network_idle=True,  # Wait for dynamic content
    page_wait_timeout=2000,      # Wait 2s for dynamic content
    clean_content=True,          # Remove ads/banners
    llm_optimized=True,          # Optimize for LLM processing
    remove_cookie_banners=True,
    remove_navigation=True,
    remove_ads=True,
    remove_social_media=True,
    remove_comments=True,
    output_dir="ai_content",
    user_agent="MyAI/1.0",
)
converter = URLToMarkdownConverter(config)

async def main():
    # Convert with maximum cleaning for LLM processing.
    # `await` is only valid inside a coroutine, hence the wrapper.
    result = await converter.convert_url(
        url="https://example.com",
        use_trafilatura=True,     # Use intelligent extraction
        use_javascript=True,      # Handle dynamic content
        favor_precision=True,     # Prefer precision over recall
        include_tables=True,      # Include table content
        include_images=False,     # Exclude image references
        include_formatting=True,  # Preserve text formatting
    )
    if result.success:
        # Perfect for feeding into LLMs
        clean_content = result.markdown
        metadata = result.metadata
        print(f"Extraction method: {result.extraction_method}")
        print(f"Content size: {result.file_size:,} chars")
        print("Cleaned and ready for LLM processing!")

asyncio.run(main())

# Generate deterministic filenames
hash_value = URLHasher.generate_hash("https://example.com")
filename = URLHasher.generate_filename("https://example.com")
print(f"Hash: {hash_value}, Filename: {filename}")
```
## Extraction Quality Examples
### Before vs After: Real-World Results
```bash
# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata
```
**Before (Raw HTML):** 51KB, 797 lines
- ❌ Cookie consent banners
- ❌ Website navigation
- ❌ Social media widgets
- ❌ Advertising content
- ❌ Footer links and legal text
**After (url2md4ai):** 9KB, 69 lines
- ✅ Job title and description
- ✅ Key requirements
- ✅ Company benefits
- ✅ Application process
- ✅ **About 82% smaller by size (51KB to 9KB) and 91% fewer lines (797 to 69)**
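For reference, a reduction percentage can be computed directly from the before and after figures quoted above. This is plain arithmetic, not output from the tool, and `reduction_percent` is a name chosen for this example.

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


size_cut = reduction_percent(51, 9)    # sizes in KB
line_cut = reduction_percent(797, 69)  # line counts
print(f"size: {size_cut:.1f}%, lines: {line_cut:.1f}%")  # size: 82.4%, lines: 91.3%
```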
### Content Types Optimized for LLM
| Content Type | Extraction Quality | Best Settings |
|--------------|-------------------|---------------|
| **News Articles** | ⭐⭐⭐⭐⭐ | `--no-js` (faster) |
| **Job Postings** | ⭐⭐⭐⭐⭐ | `--force-js` (complete) |
| **Product Pages** | ⭐⭐⭐⭐ | `--clean` (essential) |
| **Documentation** | ⭐⭐⭐⭐⭐ | `--raw` (preserve structure) |
| **Blog Posts** | ⭐⭐⭐⭐⭐ | default settings |
| **Social Media** | ⭐⭐⭐ | `--force-js` required |
## Configuration
### Environment Variables
```bash
# Content Extraction Settings
export URL2MD_CLEAN_CONTENT=true
export URL2MD_LLM_OPTIMIZED=true
export URL2MD_USE_TRAFILATURA=true
# Dynamic Content Settings
export URL2MD_WAIT_NETWORK=true
export URL2MD_PAGE_TIMEOUT=2000
export URL2MD_HEADLESS=true
# Content Filtering
export URL2MD_REMOVE_COOKIES=true
export URL2MD_REMOVE_NAV=true
export URL2MD_REMOVE_ADS=true
export URL2MD_REMOVE_SOCIAL=true
export URL2MD_REMOVE_COMMENTS=true
# Advanced Settings
export URL2MD_FAVOR_PRECISION=true
export URL2MD_INCLUDE_TABLES=true
export URL2MD_INCLUDE_IMAGES=false
export URL2MD_INCLUDE_FORMATTING=true
# Output Settings
export URL2MD_OUTPUT_DIR="output"
export URL2MD_USE_HASH_FILENAMES=true
# Performance & Reliability
export URL2MD_TIMEOUT=30
export URL2MD_MAX_RETRIES=3
export URL2MD_USER_AGENT="url2md4ai/1.0"
```
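How these variables are parsed is not shown in this README. As a sketch only, boolean and integer settings could be read from the environment as below. `env_bool`, `env_int`, and `load_settings` are hypothetical helpers. Only the variable names come from the list above, and the fallback values are assumptions copied from the example values shown there.

```python
import os
from typing import Mapping, Optional

TRUE_VALUES = {"1", "true", "yes", "on"}


def _source(env: Optional[Mapping[str, str]]) -> Mapping[str, str]:
    # Fall back to the real process environment when no mapping is given.
    return os.environ if env is None else env


def env_bool(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    raw = _source(env).get(name)
    return default if raw is None else raw.strip().lower() in TRUE_VALUES


def env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    raw = _source(env).get(name)
    return default if raw is None else int(raw)


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical loader; fallbacks mirror the example values shown above."""
    return {
        "clean_content": env_bool("URL2MD_CLEAN_CONTENT", True, env),
        "page_wait_timeout": env_int("URL2MD_PAGE_TIMEOUT", 2000, env),
        "max_retries": env_int("URL2MD_MAX_RETRIES", 3, env),
        "output_dir": _source(env).get("URL2MD_OUTPUT_DIR", "output"),
    }


settings = load_settings({"URL2MD_CLEAN_CONTENT": "false", "URL2MD_PAGE_TIMEOUT": "5000"})
print(settings)
```

Accepting an explicit mapping makes the loader testable without touching the real process environment.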
### Configuration Options
| Option | Default | Description |
|--------|---------|-------------|
| **Content Extraction** | | |
| `clean_content` | true | Remove ads, banners, navigation |
| `llm_optimized` | true | Post-process for LLM consumption |
| `use_trafilatura` | true | Use intelligent text extraction |
| **Dynamic Content** | | |
| `wait_for_network_idle` | true | Wait for network activity to finish |
| `page_wait_timeout` | 2000 | Wait time for dynamic content (ms) |
| `browser_headless` | true | Run browser in headless mode |
| **Content Filtering** | | |
| `remove_cookie_banners` | true | Remove cookie consent UI |
| `remove_navigation` | true | Remove nav menus and headers |
| `remove_ads` | true | Remove advertising content |
| `remove_social_media` | true | Remove social sharing widgets |
| `remove_comments` | true | Remove user comments |
| **Advanced Settings** | | |
| `favor_precision` | true | Prefer precision over recall |
| `include_tables` | true | Include table content |
| `include_images` | false | Include image references |
| `include_formatting` | true | Preserve text formatting |
| **Output Settings** | | |
| `output_dir` | "output" | Default output directory |
| `use_hash_filenames` | true | Generate deterministic filenames |
## Docker Usage
**See [DOCKER_USAGE.md](DOCKER_USAGE.md) for comprehensive Docker usage examples and troubleshooting.**
### Quick Start with Docker
```bash
# Build the image
docker build -t url2md4ai .
# Convert single URL with LLM optimization
docker run --rm \
  -v "$(pwd)/output:/app/output" \
  url2md4ai \
  convert "https://example.com" --show-metadata
# Convert dynamic content with JavaScript rendering
docker run --rm \
  -v "$(pwd)/output:/app/output" \
  url2md4ai \
  convert "https://spa-app.com" --force-js --show-metadata
# Batch processing with parallel workers
docker run --rm \
  -v "$(pwd)/output:/app/output" \
  url2md4ai \
  batch "https://site1.com" "https://site2.com" --concurrency 5 --show-metadata
```
### Using Docker Compose (Recommended)
```bash
# Start with compose for easier management
docker compose run --rm url2md4ai convert "https://example.com" --show-metadata
# Development mode with full environment
docker compose run --rm dev
# Batch processing example
docker compose run --rm url2md4ai \
  batch "https://news.site.com/article1" "https://blog.site.com/post2" \
  --concurrency 3 --continue-on-error --show-metadata
```
### Custom Configuration
```bash
# Override LLM optimization settings
docker run --rm \
  -v "$(pwd)/output:/app/output" \
  -e URL2MD_CLEAN_CONTENT=false \
  -e URL2MD_LLM_OPTIMIZED=false \
  url2md4ai \
  convert "https://example.com" --raw
# Disable JavaScript for faster processing
docker run --rm \
  -v "$(pwd)/output:/app/output" \
  -e URL2MD_JAVASCRIPT=false \
  url2md4ai \
  convert "https://static-site.com" --no-js
```
## Development
### Setup Development Environment
```bash
# Clone repository
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
# Install with uv
uv sync
# Install Playwright browsers
uv run playwright install
# Run tests
uv run pytest
# Run linting
uv run ruff check
uv run black --check .
```
### Running Tests
```bash
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src/url2md4ai
# Run specific test
uv run pytest tests/test_converter.py
```
## Output Format
The tool generates clean, LLM-optimized markdown with:
- ✅ Preserved heading structure
- ✅ Clean link formatting
- ✅ Removed navigation, footer, and sidebar content
- ✅ Optimized whitespace and line breaks
- ✅ Title and metadata preservation
- ✅ Support for complex layouts
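The whitespace optimization listed above is not described further. One plausible form of it, sketched here under that assumption, strips trailing spaces and collapses runs of blank lines. `normalize_whitespace` is a hypothetical function, not the tool's own post-processor.

```python
import re


def normalize_whitespace(markdown: str) -> str:
    """Strip trailing spaces and collapse 3+ newlines into one blank line."""
    # Remove spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+$", "", markdown, flags=re.MULTILINE)
    # Collapse runs of blank lines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Ensure exactly one trailing newline.
    return text.strip() + "\n"


cleaned = normalize_whitespace("# Title   \n\n\n\nBody text.\t\n\n\n- item\n")
print(cleaned)
```

Two trailing spaces are a hard line break in Markdown, so a real implementation may want to preserve them rather than strip every trailing space as this sketch does.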
### Example Output
```markdown
# Page Title
Main content paragraph with [links](https://example.com) preserved.
## Section Heading
- List items preserved
- Proper formatting maintained
**Bold text** and *italic text* converted correctly.
> Blockquotes maintained
```code blocks preserved```
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
### Development Guidelines
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### Code Quality
- Use `black` for code formatting
- Use `ruff` for linting
- Add type hints for all functions
- Write tests for new features
- Update documentation as needed
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- [Trafilatura](https://trafilatura.readthedocs.io/) for intelligent content extraction and web scraping
- [Playwright](https://playwright.dev/) for JavaScript rendering and dynamic content handling
- [html2text](https://github.com/Alir3z4/html2text) for HTML to Markdown conversion
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing and content cleaning
- [Click](https://click.palletsprojects.com/) for the powerful CLI interface
- [Loguru](https://github.com/Delgan/loguru) for elegant logging
## Roadmap
- [ ] Support for more output formats (PDF, DOCX)
- [ ] Custom CSS selector filtering
- [ ] Integration with popular LLM APIs
- [ ] Web UI
- [ ] Plugin system for custom processors
- [ ] Support for authentication-required pages
---