https://github.com/caimeng2/uniscraper


# UniScraper

## Description

UniScraper is a universal scraper that collects text from multiple types of webpages. It currently supports HTML (including dynamic webpages that use JavaScript), online PDFs, Word documents, presentation slides, and spreadsheets.
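Supporting several document types usually means deciding which extractor to use for a given URL. The sketch below illustrates one simple way to do that, by looking at the file extension in the URL path. It is not taken from UniScraper: the `guess_kind` function, the `EXTENSION_KINDS` mapping, and the kind labels are all hypothetical names chosen for illustration, and a real scraper would likely also inspect the HTTP `Content-Type` header.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to document kind.
# These names are illustrative assumptions, not UniScraper's API.
EXTENSION_KINDS = {
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "slides",
    ".xlsx": "spreadsheet",
    ".xls": "spreadsheet",
    ".csv": "spreadsheet",
}


def guess_kind(url: str) -> str:
    """Guess the document kind from the URL path, defaulting to HTML."""
    # urlparse separates the path from the query string and fragment,
    # so "report.pdf?download=1" still yields a ".pdf" extension.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return EXTENSION_KINDS.get(extension, "html")


if __name__ == "__main__":
    for url in (
        "https://example.com/report.pdf?download=1",
        "https://example.com/talk/slides.pptx",
        "https://example.com/news/article",
    ):
        print(url, "->", guess_kind(url))
```

Anything without a recognized extension falls back to `"html"`, which matches the common case of an ordinary webpage.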

## Installation instructions

Clone the git repo:

`git clone https://github.com/caimeng2/UniScraper.git`

From the repository root, create and activate a conda environment by running the following commands:

`conda env create --prefix ./envs --file environment.yml`

`conda activate ./envs`

## Dependencies

`bs4` `webdriver_manager` `pandas` `selenium` `requests` `python-docx` `python-pptx` `pdfminer`
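After activating the environment, it can be useful to confirm that these packages are importable. The following is a minimal sketch using only the standard library; the `missing_modules` helper and the `REQUIRED_MODULES` list are hypothetical names, not part of UniScraper. Note that `python-docx` and `python-pptx` are imported as `docx` and `pptx`.

```python
from importlib.util import find_spec

# Import names for the packages listed above. python-docx and python-pptx
# install modules named `docx` and `pptx`, so those names are checked here.
REQUIRED_MODULES = [
    "bs4",
    "webdriver_manager",
    "pandas",
    "selenium",
    "requests",
    "docx",
    "pptx",
    "pdfminer",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates each top-level module without importing it, so the check is fast and does not start a browser driver or load heavy libraries.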

## Example usage

Run the `example.ipynb` notebook to see example usage.