https://github.com/kgruiz/webscraper

A Python-based web scraping tool that extracts HTML content and converts it into LaTeX format, with additional features for downloading web pages as PDFs and flattening directory structures.

beautifulsoup html-to-latex pdf-conversion python requests weasyprint web-scraping

Host: GitHub
URL: https://github.com/kgruiz/webscraper
Owner: kgruiz
Created: 2024-11-29T01:50:12.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-03-21T19:22:20.000Z (2 months ago)
Last Synced: 2025-03-21T20:26:20.479Z (2 months ago)
Topics: beautifulsoup, html-to-latex, pdf-conversion, python, requests, weasyprint, web-scraping
Language: HTML
Homepage:
Size: 19.4 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# WebScraper-Old

> **Deprecation Notice:**
> This project is deprecated. Use the improved version of the scraper, [WebScraper](https://github.com/kgruiz/WebScraper), instead.

A Python-based web scraping tool designed to extract and convert HTML content into LaTeX format for seamless integration into documents.

## Table of Contents

- [Installation](#installation)
- [Usage](#usage)

## Installation

1. **Clone the repository:**

```bash
git clone https://github.com/kgruiz/WebScraper-Old.git
```

2. **Navigate to the project directory:**

```bash
cd WebScraper-Old
```

3. **Install the required dependencies:**

```bash
pip install requests beautifulsoup4 tqdm pypandoc weasyprint
```
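
After installing, it can help to confirm that the packages are importable before running any of the scripts. The snippet below is a minimal sketch and not part of the repository: the `find_missing` helper and the `REQUIRED_MODULES` mapping are hypothetical names introduced here for illustration. It only checks the Python side; `pypandoc` additionally needs a pandoc executable available on the system.

```python
import importlib.util

# Maps each pip package from the install command to its import name.
# Note that beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "tqdm": "tqdm",
    "pypandoc": "pypandoc",
    "weasyprint": "weasyprint",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```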

## Usage

1. **Convert a single HTML file to LaTeX:**

```bash
python HTMLtoLatex.py path/to/input.html
```

2. **Download web pages as PDFs:**

```bash
python Downloader.py urlList.json
```

3. **Flatten a directory structure:**

```bash
python Scraper.py
```
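
The README does not describe how `Scraper.py` flattens a directory, so the following is only an illustrative sketch of what flattening generally means: every file in a nested tree is copied into a single folder. The function name `flatten_directory`, its parameters, and the underscore-joined naming scheme are assumptions made for this example, not the repository's implementation.

```python
import shutil
from pathlib import Path


def flatten_directory(root, dest):
    """Copy every file under `root` into `dest`, without subdirectories.

    Hypothetical helper: the relative path is folded into the file name
    (a/b/page.html -> a_b_page.html) so that files with the same name in
    different folders are less likely to overwrite each other.
    """
    root, dest = Path(root), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() materializes the listing before any file is copied.
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Skip anything already inside dest, in case dest lives under root.
        if dest in path.parents:
            continue
        flat_name = "_".join(path.relative_to(root).parts)
        target = dest / flat_name
        shutil.copy2(path, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

Calling `flatten_directory("output", "output_flat")` would copy the tree under a hypothetical `output` folder into `output_flat`, leaving the original files in place.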
