{"id":22895504,"url":"https://github.com/kgruiz/webscraper","last_synced_at":"2025-03-31T22:47:54.951Z","repository":{"id":265385182,"uuid":"895826801","full_name":"kgruiz/WebScraper-Old","owner":"kgruiz","description":"A Python-based web scraping tool that extracts HTML content and converts it into LaTeX format, with additional features for downloading web pages as PDFs and flattening directory structures.","archived":false,"fork":false,"pushed_at":"2025-03-21T19:22:20.000Z","size":20332,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-21T20:26:20.479Z","etag":null,"topics":["beautifulsoup","html-to-latex","pdf-conversion","python","requests","weasyprint","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kgruiz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-29T01:50:12.000Z","updated_at":"2025-03-21T19:22:36.000Z","dependencies_parsed_at":"2024-11-29T06:30:05.825Z","dependency_job_id":"2b67e41d-085c-46c7-a04b-1b985097ed95","html_url":"https://github.com/kgruiz/WebScraper-Old","commit_stats":null,"previous_names":["kgruiz/webscraper","kgruiz/webscraper-old"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2FWebScraper-Old","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2FWebScraper-Old/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2FWebScraper-Old/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2FWebScraper-Old/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kgruiz","download_url":"https://codeload.github.com/kgruiz/WebScraper-Old/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246552886,"owners_count":20795837,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","html-to-latex","pdf-conversion","python","requests","weasyprint","web-scraping"],"created_at":"2024-12-13T23:29:55.809Z","updated_at":"2025-03-31T22:47:54.947Z","avatar_url":"https://github.com/kgruiz.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WebScraper-Old\n\n\u003e **Deprecated Notice:**\n\u003e This project has been deprecated. Please check out the improved version of the scraper at [WebScraper](https://github.com/kgruiz/WebScraper).\n\nA Python-based web scraping tool designed to extract and convert HTML content into LaTeX format for seamless integration into documents.\n\n## Table of Contents\n\n- [Installation](#installation)\n- [Usage](#usage)\n\n## Installation\n\n1. **Clone the repository:**\n\n   ```bash\n   git clone https://github.com/kgruiz/WebScraper-Old.git\n   ```\n\n2. **Navigate to the project directory:**\n\n   ```bash\n   cd WebScraper-Old\n   ```\n\n3. **Install the required dependencies:**\n\n   ```bash\n   pip install requests beautifulsoup4 tqdm pypandoc weasyprint\n   ```\n\n## Usage\n\n1. **Convert a single HTML file to LaTeX:**\n\n   ```bash\n   python HTMLtoLatex.py path/to/input.html\n   ```\n\n2. **Download web pages as PDFs:**\n\n   ```bash\n   python Downloader.py urlList.json\n   ```\n\n3. **Flatten directory structure:**\n\n   ```bash\n   python Scraper.py\n   ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkgruiz%2Fwebscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkgruiz%2Fwebscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkgruiz%2Fwebscraper/lists"}