https://github.com/oliverwebdev/webarchiver
A powerful desktop application to download, archive, and manage web pages locally with full resource support, built with Python and PyQt6.
- Host: GitHub
- URL: https://github.com/oliverwebdev/webarchiver
- Owner: Oliverwebdev
- Created: 2025-03-13T20:25:19.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-13T20:54:20.000Z (3 months ago)
- Last Synced: 2025-03-13T21:37:11.417Z (3 months ago)
- Topics: archive-management, content-archiving, data-preservation, desktop-application, html-editor, html-parsing, offline-browsing, playwright, pyqt6, python, python-gui, selenium, sqlite, tagging-system, web-archiving, web-content-management, web-scraping, website-archiver, website-backup
- Language: Python
- Homepage:
- Size: 159 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Website Archiver
**Preserve, manage, and customize web content offline with this powerful archiving tool.**
[License: MIT](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/downloads/)
[PyQt6](https://www.riverbankcomputing.com/software/pyqt/)
## 🚀 Key Features
- **✨ Smart Web Capturing** - Download complete websites with all resources (images, CSS, JavaScript, fonts)
- **🔄 Multiple Engines** - Choose between standard requests, Selenium, or Playwright for perfect captures
- **📚 Bulk Archive** - Download multiple websites at once with the batch processor
- **🔍 Content Search** - Find exactly what you need with full-text search across your archives
- **🏷️ Tagging System** - Organize websites with custom tags for efficient categorization
- **📝 Notes & Annotations** - Add context with your own notes for each saved website
- **✏️ Built-in Editor** - Modify archived content directly within the application
- **📦 Import/Export** - Share your archives with others or back them up securely
## 🔧 Installation
### Prerequisites
- Python 3.7 or higher
- PyQt6
- Internet connection for downloading websites
### Method 1: Using pip (Recommended)
```bash
# Install from PyPI
pip install website-archiver
# Launch the application
website-archiver
```
### Method 2: From Source
```bash
# Clone the repository
git clone https://github.com/Oliverwebdev/WebArchiver
cd WebArchiver
# Create and activate virtual environment (recommended)
python -m venv venv
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
# Install requirements
pip install -r requirements.txt
# Run the application
python main.py
```
### Optional Dependencies
For the best archiving experience, install additional engines:
```bash
# For Playwright support (recommended for complex websites)
pip install playwright
playwright install chromium
# For Selenium support
pip install selenium
```
## 📖 User Guide
### Archiving Your First Website
1. Launch Website Archiver
2. Go to the **Download** tab
3. Enter the URL you want to archive
4. Select your preferred download options
5. Click **Download**
6. Your archived website will appear in the **Home** tab
### Managing Your Archives
- **Search**: Use the search bar to find websites by title, URL, or content
- **Filter by Tags**: Select a tag from the dropdown to filter related websites
- **Edit Website**: Click "Edit" to modify the website's content, tags, or properties
- **Add Notes**: Record your thoughts or context about why you archived the site
- **Export**: Share your archives with others using the export functionality
### Customizing Your Experience
Visit the **Settings** tab to configure:
- Storage location for your archives
- Default download engine
- Resource options (images, CSS, JS, fonts)
- Timeout and concurrency settings
- And much more!
## ⚙️ Technical Details
Website Archiver intelligently captures web content using a multi-step process:
1. **Analysis**: Evaluates the target website structure
2. **Download**: Retrieves HTML content using the selected engine
3. **Resource Collection**: Gathers linked resources (images, styles, scripts)
4. **Path Rewriting**: Modifies resource paths to work offline
5. **Storage**: Organizes content in a structured filesystem
6. **Indexing**: Catalogs the archive in the searchable database

The application architecture includes:
- `config_manager.py`: Manages application configuration
- `database_manager.py`: Handles SQLite database operations
- `scraper.py`: Core web scraping functionality
- `session_manager.py`: Manages application state between sessions
- `ui/`: PyQt6-based user interface components
## 🛠️ Development
Want to contribute to Website Archiver? Great! We welcome contributions of all kinds.
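For contributors who want a feel for the capture pipeline before reading `scraper.py`, the path rewriting step listed under Technical Details can be sketched roughly as below. This is an illustrative sketch, not the project's actual code: the names `RESOURCE_DIR`, `local_path_for`, and `rewrite_paths` are hypothetical, the on-disk layout is an assumption, and a regular expression stands in for the Beautiful Soup parsing the project credits so that the example needs only the standard library.

```python
from __future__ import annotations

import hashlib
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Hypothetical folder for downloaded assets; the project's real layout may differ.
RESOURCE_DIR = "resources"

# Matches src="..." and href="..." attributes (double-quoted values only, for brevity).
ATTR_PATTERN = re.compile(r'\b(src|href)="([^"]+)"')

# Extensions treated as page resources rather than navigation links (assumed list).
RESOURCE_EXTENSIONS = {
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".woff", ".woff2",
}


def local_path_for(resource_url: str) -> str:
    """Map an absolute resource URL to a stable relative path inside RESOURCE_DIR."""
    name = posixpath.basename(urlparse(resource_url).path) or "index"
    # A short hash of the full URL keeps same-named files from different hosts apart.
    digest = hashlib.sha256(resource_url.encode("utf-8")).hexdigest()[:8]
    return f"{RESOURCE_DIR}/{digest}_{name}"


def rewrite_paths(html: str, page_url: str) -> tuple[str, dict[str, str]]:
    """Return the rewritten HTML and a map of absolute resource URL -> local path."""
    downloads: dict[str, str] = {}

    def replace(match: re.Match[str]) -> str:
        attr, value = match.group(1), match.group(2)
        # Resolve relative references against the URL of the page being archived.
        absolute = urljoin(page_url, value)
        extension = posixpath.splitext(urlparse(absolute).path)[1].lower()
        if extension not in RESOURCE_EXTENSIONS:
            return match.group(0)  # leave ordinary links untouched
        downloads[absolute] = local_path_for(absolute)
        return f'{attr}="{downloads[absolute]}"'

    return ATTR_PATTERN.sub(replace, html), downloads


if __name__ == "__main__":
    page = '<link href="/static/site.css"><img src="img/logo.png"><a href="/about">About</a>'
    rewritten, to_fetch = rewrite_paths(page, "https://example.com/blog/post.html")
    print(rewritten)
    for url, path in to_fetch.items():
        print(url, "->", path)
```

`rewrite_paths` returns the rewritten HTML together with a map of absolute URLs to local paths, which a separate download step could then fetch. Because the pattern only handles double-quoted attributes, a real implementation would more likely walk the parsed document instead.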
### Setting Up Development Environment
1. Fork the repository
2. Clone your fork: `git clone https://github.com/<your-username>/WebArchiver`
3. Create a virtual environment: `python -m venv venv`
4. Activate it and install dev dependencies: `pip install -r requirements-dev.txt`
5. Make your changes and submit a pull request!
## 📜 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgements
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [PyQt](https://riverbankcomputing.com/software/pyqt/) for the GUI framework
- [Requests](https://requests.readthedocs.io/) for HTTP functionality
- [Selenium](https://selenium-python.readthedocs.io/) and [Playwright](https://playwright.dev/) for browser automation
- All the open source contributors who made this project possible
## 🤝 Support
If you find Website Archiver useful, please consider:
- Starring the repository on GitHub
- Reporting issues or suggesting features
- Contributing code or documentation improvements
- Sharing the project with others

Website Archiver - Because the web is too important to lose.