https://github.com/kgruiz/webscraper

A Python-based web scraping tool that extracts HTML content and converts it into LaTeX format, with additional features for downloading web pages as PDFs and flattening directory structures.

beautifulsoup html-to-latex pdf-conversion python requests weasyprint web-scraping

# WebScraper-Old

> **Deprecation Notice:**
> This project is deprecated. Please use the improved version of the scraper at [WebScraper](https://github.com/kgruiz/WebScraper) instead.

A Python-based web scraping tool designed to extract and convert HTML content into LaTeX format for seamless integration into documents.

## Table of Contents

- [Installation](#installation)
- [Usage](#usage)

## Installation

1. **Clone the repository:**

```bash
git clone https://github.com/kgruiz/WebScraper-Old.git
```

2. **Navigate to the project directory:**

```bash
cd WebScraper-Old
```

3. **Install the required dependencies:**

```bash
pip install requests beautifulsoup4 tqdm pypandoc weasyprint
```

## Usage

1. **Convert a single HTML file to LaTeX:**

```bash
python HTMLtoLatex.py path/to/input.html
```

2. **Download web pages as PDFs:**

```bash
python Downloader.py urlList.json
```

3. **Flatten directory structure:**

```bash
python Scraper.py
```
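
The README does not describe what flattening does internally, so the following is only a minimal sketch of the general idea, not the code in `Scraper.py`. It assumes that flattening means moving every nested file up into the top-level directory and deleting the emptied subdirectories. The function name `flatten_directory` and the collision policy (joining the relative path parts with underscores, then appending a counter) are hypothetical choices made for illustration.

```python
import shutil
from pathlib import Path


def flatten_directory(root):
    """Move every nested file under `root` into `root` and remove emptied subdirectories.

    Illustrative sketch only; the naming policy below is an assumption.
    """
    root = Path(root)
    moved = []

    # Snapshot the nested files first so that moving them does not disturb traversal.
    nested = sorted(p for p in root.rglob("*") if p.is_file() and p.parent != root)

    for src in nested:
        # Encode the relative path in the new name, e.g. a/b/x.txt -> a_b_x.txt.
        base = "_".join(src.relative_to(root).parts)
        stem, suffix = Path(base).stem, Path(base).suffix
        dest = root / base

        # If that name is already taken, append a counter instead of overwriting.
        counter = 1
        while dest.exists():
            dest = root / f"{stem}_{counter}{suffix}"
            counter += 1

        shutil.move(str(src), str(dest))
        moved.append(dest)

    # Remove subdirectories deepest-first; they no longer contain regular files.
    subdirs = sorted(
        (p for p in root.rglob("*") if p.is_dir()),
        key=lambda p: len(p.parts),
        reverse=True,
    )
    for sub in subdirs:
        sub.rmdir()

    return moved
```

Calling `flatten_directory("downloads")` on a tree containing `downloads/a/b/x.txt` would, under these assumptions, leave `downloads/a_b_x.txt` and no subdirectories. Encoding the original path in the file name keeps files from different folders distinguishable after the move.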