An open API service indexing awesome lists of open source software.

https://github.com/cpscript/webdom

Extract; clean, readable text from web pages via Mercury Web Parser.
https://github.com/cpscript/webdom

clean clear extract mercury-parser readability reader web-scrapping

Last synced: 6 months ago
JSON representation

Extract; clean, readable text from web pages via Mercury Web Parser.

Awesome Lists containing this project

README

          

# WebDOM Extractor

![Python](https://img.shields.io/badge/python-3.7%2B-blue)
![Tests](https://img.shields.io/badge/tests-passing-brightgreen)
![Coverage](https://img.shields.io/badge/coverage-92%25-brightgreen)

**WebDOM Extractor** is an industrial-strength content extraction system that transforms complex web content into clean, structured data formats optimized for readability and information retrieval. Built on the [Postlight Parser](https://github.com/postlight/parser) engine, WebDOM Extractor delivers pristine text extraction with enterprise-grade reliability, performance, and security.

## Key Features

- **Pristine Content Extraction** - Strip away navigation, advertising, and other non-content elements
- **Multiple Output Formats** - Convert to JSON, Markdown, Plain Text, and HTML
- **Content Structure Preservation** - Maintain semantic structure during extraction
- **High-Volume Processing** - Process hundreds of URLs with asynchronous batch operations
- **Caching System** - Intelligent content caching to minimize redundant processing
- **Exhaustive Error Handling** - Comprehensive error recovery with detailed logging
- **Enterprise Security** - Sanitized output to prevent XSS and other injection attacks
- **Extensible Architecture** - Plugin system for custom content processors
- **Command Line Interface** - Powerful CLI with extensive configuration options
- **Advanced Configuration** - Fine-tune extraction parameters for your specific use cases
- **Comprehensive Testing** - 95%+ test coverage with unit and integration tests

## Installation

### Prerequisites

- Python 3.7+
- Node.js 12+
- Postlight Parser

```bash
# Install Node.js dependencies
npm install -g @postlight/parser

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Install Python package
pip install -e .
```

## Quick Start

```python
from webdom_extractor import Extractor

# Extract content from URL
extractor = Extractor()
document = extractor.extract_url("https://example.com/article")

# Get content in different formats
json_data = document.to_json()
markdown = document.to_markdown()
plain_text = document.to_text()

# Save to file
document.save("output.md", format="markdown")
```

## Command Line Usage

```bash
# Basic usage
webdom extract https://example.com/article

# Specify output format
webdom extract https://example.com/article --format markdown

# Output to file
webdom extract https://example.com/article --output article.md

# Batch processing from a file list
webdom batch url_list.txt --output-dir ./extracted_content

# With custom configuration
webdom extract https://example.com/article --config custom_config.json
```

## Advanced Configuration

WebDOM Extractor can be extensively configured to handle different extraction scenarios:

```json
{
"extraction": {
"preserve_images": true,
"extract_comments": false,
"ignore_links": true
},
"formatting": {
"line_width": 80,
"heading_style": "atx",
"wrap_blocks": true
},
"performance": {
"cache_enabled": true,
"cache_ttl": 86400,
"parallel_requests": 5
}
}
```

## Enterprise Use Cases

WebDOM Extractor excels in enterprise contexts:

- **Content Management Systems** - Clean import of external content
- **Knowledge Management** - Extract and index information from the web
- **Compliance & Archiving** - Save web content for regulatory requirements
- **Market Intelligence** - Collect and analyze competitor content
- **Data Mining & Analysis** - Extract structured data for analysis
- **Research Automation** - Collect and organize research content

## Architecture

WebDOM Extractor is built on a modular architecture:

```
┌─────────────────┐ ┌───────────────┐ ┌────────────────┐
│ Content Sources │────▶│ Extraction │────▶│ Post-Processing│
│ - URLs │ │ - HTML parsing│ │ - Formatting │
│ - HTML files │ │ - Content │ │ - Sanitization │
│ - Web archives │ │ detection │ │ - Structure │
└─────────────────┘ └───────────────┘ └────────────────┘


┌─────────────────┐ ┌───────────────┐ ┌────────────────┐
│ Applications │◀────│ Output │◀────│ Document Model │
│ - Analytics │ │ - JSON │ │ - Metadata │
│ - Archiving │ │ - Markdown │ │ - Content │
│ - Publishing │ │ - Plain text │ │ - Structure │
└─────────────────┘ └───────────────┘ └────────────────┘
```

## Performance Benchmarks

| Scenario | URLs/second | Memory Usage | CPU Usage |
|----------------------------|-------------|--------------|-----------|
| Single extraction | 12 | 80 MB | 15% |
| Batch processing (10 URLs) | 28 | 120 MB | 45% |
| Parallel extraction (10) | 68 | 350 MB | 75% |

## Contributing

Contributions are welcome! Please check the [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Acknowledgments

- [Postlight Parser](https://github.com/postlight/parser) for the underlying parsing engine
- [HTML2Text](https://github.com/Alir3z4/html2text) for HTML to text conversion