https://github.com/caimeng2/uniscraper
A universal scraper that grabs text from multiple types of webpages.
- Host: GitHub
- URL: https://github.com/caimeng2/uniscraper
- Owner: caimeng2
- Created: 2021-03-04T23:13:45.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2023-07-13T15:33:20.000Z (over 1 year ago)
- Last Synced: 2023-07-13T16:31:51.947Z (over 1 year ago)
- Topics: text-mining, web-scraper
- Language: Jupyter Notebook
- Homepage:
- Size: 93.3 MB
- Stars: 3
- Watchers: 2
- Forks: 30
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
# UniScraper
## Description
UniScraper is a universal scraper that collects text from multiple types of webpages. It currently supports HTML (including dynamic webpages that use JavaScript), online PDFs, Word documents, presentation slides, and spreadsheets.
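To illustrate how a scraper might route each URL to a format-specific extractor, here is a minimal sketch that guesses the document kind from the URL's file extension. The names `EXTENSION_KINDS` and `guess_kind` and the extension table are hypothetical and are not UniScraper's API.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical extension table for illustration; not taken from UniScraper.
EXTENSION_KINDS = {
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "slides",
    ".xlsx": "spreadsheet",
    ".csv": "spreadsheet",
}


def guess_kind(url: str) -> str:
    """Guess the document kind from the URL path, defaulting to HTML."""
    # urlparse drops the query string and fragment from the path component.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTENSION_KINDS.get(suffix, "html")


if __name__ == "__main__":
    for url in ("https://example.com/report.pdf", "https://example.com/news"):
        print(url, guess_kind(url))
```

Extensions are only a heuristic: many documents are served from URLs without one, so a real scraper would probably also need to inspect the `Content-Type` response header.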
## Installation instructions
Clone the git repo:
git clone https://github.com/caimeng2/UniScraper.git
Set up a conda environment by running the following commands:
conda env create --prefix ./envs --file environment.yml
conda activate ./envs
## Dependencies
`bs4` `webdriver_manager` `pandas` `selenium` `requests` `python-docx` `python-pptx` `pdfminer`
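To confirm that the active environment provides these packages, a small check along the following lines may help. It is a sketch, not part of the project: the helper `missing_packages` is hypothetical, and the import names are assumptions (`python-docx` is imported as `docx` and `python-pptx` as `pptx`).

```python
from importlib.util import find_spec

# Package name -> import name. The import names are assumptions for this sketch.
IMPORT_NAMES = {
    "bs4": "bs4",
    "webdriver_manager": "webdriver_manager",
    "pandas": "pandas",
    "selenium": "selenium",
    "requests": "requests",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "pdfminer": "pdfminer",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the sorted package names whose top-level module cannot be found."""
    return sorted(
        package
        for package, module in import_names.items()
        if find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a top-level module without importing it, so the check is fast but does not verify package versions.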
## Example usage
Open and run the `example.ipynb` notebook to see example usage.