https://github.com/kgruiz/webscraper
A Python-based web scraping tool that extracts HTML content and converts it into LaTeX format, with additional features for downloading web pages as PDFs and flattening directory structures.
- Host: GitHub
- URL: https://github.com/kgruiz/webscraper
- Owner: kgruiz
- Created: 2024-11-29T01:50:12.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-03-21T19:22:20.000Z (2 months ago)
- Last Synced: 2025-03-21T20:26:20.479Z (2 months ago)
- Topics: beautifulsoup, html-to-latex, pdf-conversion, python, requests, weasyprint, web-scraping
- Language: HTML
- Homepage:
- Size: 19.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# WebScraper-Old
> **Deprecated Notice:**
> This project has been deprecated. Please check out the improved version of the scraper at [WebScraper](https://github.com/kgruiz/WebScraper).

A Python-based web scraping tool designed to extract and convert HTML content into LaTeX format for seamless integration into documents.
## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
## Installation
1. **Clone the repository:**
```bash
git clone https://github.com/kgruiz/WebScraper-Old.git
```
2. **Navigate to the project directory:**
```bash
cd WebScraper-Old
```
3. **Install the required dependencies:**
```bash
pip install requests beautifulsoup4 tqdm pypandoc weasyprint
```
## Usage
1. **Convert a single HTML file to LaTeX:**
```bash
python HTMLtoLatex.py path/to/input.html
```
2. **Download web pages as PDFs:**
```bash
python Downloader.py urlList.json
```
3. **Flatten directory structure:**
```bash
python Scraper.py
```
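The README does not describe how the directory flattening step works internally. As an illustration only, here is a minimal standard-library sketch of one way to flatten a directory tree. The function name `flatten_directory`, the `__` separator, and the copy-rather-than-move behavior are all assumptions made for this example and are not taken from `Scraper.py`.

```python
import shutil
from pathlib import Path


def flatten_directory(root: Path, dest: Path) -> list[Path]:
    """Copy every file under `root` into the single directory `dest`.

    Hypothetical helper, not the project's implementation. It assumes
    `dest` lies outside `root`, so copied files are never re-scanned.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # Materialize the file list before copying anything.
    for src in sorted(p for p in root.rglob("*") if p.is_file()):
        # Encode the relative path in the file name so that files with the
        # same name in different subdirectories do not overwrite each other.
        flat_name = "__".join(src.relative_to(root).parts)
        target = dest / flat_name
        shutil.copy2(src, target)  # copy2 also preserves file metadata
        copied.append(target)
    return copied
```

With this sketch, a file at `site/docs/api/ref.html` would be copied to `flat/docs__api__ref.html` when called as `flatten_directory(Path("site"), Path("flat"))`. Copying instead of moving leaves the original tree intact, which seems the safer default for scraped data.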