https://github.com/lorey/mlscraper-experiments



# mlscraper-experiments

This repository tries out ideas for extending my main library, [mlscraper](https://github.com/lorey/mlscraper).

Features:

* scraping arbitrary items (dicts, lists, lists of dicts, etc.)
* smart scraper selection
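
To illustrate what "arbitrary items" can mean, here is a minimal sketch of how a list of dicts might be scraped by composing small scrapers. It is not mlscraper's API: `value_scraper`, `dict_scraper`, `list_scraper`, and the sample page are hypothetical names made up for this example, and the standard-library XML parser stands in for a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (assumption for this sketch).
PAGE = """
<ul>
  <li><span class="name">Ada</span><span class="year">1815</span></li>
  <li><span class="name">Alan</span><span class="year">1912</span></li>
</ul>
""".strip()


def value_scraper(css_class):
    """Build a scraper returning the text of the first node with the given class."""
    def scrape(node):
        for candidate in node.iter():
            if candidate.get("class") == css_class:
                return (candidate.text or "").strip()
        return None  # nothing found below this node
    return scrape


def dict_scraper(scraper_per_key):
    """Build a scraper returning a dict by running one scraper per key."""
    def scrape(node):
        return {key: scraper(node) for key, scraper in scraper_per_key.items()}
    return scrape


def list_scraper(tag, item_scraper):
    """Build a scraper returning a list by running item_scraper on every <tag> node."""
    def scrape(node):
        return [item_scraper(child) for child in node.iter(tag)]
    return scrape


# Compose the pieces into a scraper for a list of dicts.
person = dict_scraper({"name": value_scraper("name"), "year": value_scraper("year")})
people = list_scraper("li", person)

result = people(ET.fromstring(PAGE))
print(result)
```

Because every scraper is just a function from a node to a value, nesting works for other shapes as well, for example a dict that contains a list.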

## Structure
This class diagram shows the basic relationships.

![class diagram](docs/classes.png)

## Terminology
* Scraper: turns a page into an item by scraping its HTML
* Sample: one item on a page (to be scraped later), i.e. what the user provides as input
* Match: one possible occurrence of a sample, i.e. the nodes in which the sample occurs
* Extractor: gets the value out of a DOM node
* Selector: an algorithm for selecting nodes
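
The terms above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the code in this repository: `Match`, `find_matches`, `select`, `scrape`, and the sample page are hypothetical, and the standard-library XML parser is used in place of an HTML parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable

# Hypothetical, well-formed sample page (assumption for this sketch).
PAGE = """
<html><body>
  <h1 class="name">Ada Lovelace</h1>
  <a class="link" href="/ada">profile</a>
</body></html>
""".strip()

# Extractor: gets the value out of a DOM node.
Extractor = Callable[[ET.Element], str]


def text_extractor(node):
    return (node.text or "").strip()


def attribute_extractor(name):
    return lambda node: node.get(name, "")


@dataclass
class Match:
    """One possible occurrence of a sample: a node plus the extractor that yields it."""
    node: ET.Element
    extractor: Extractor


def find_matches(root, sample_value):
    """Collect every node whose text or attribute equals the sample value."""
    matches = []
    for node in root.iter():
        if text_extractor(node) == sample_value:
            matches.append(Match(node, text_extractor))
        for attr_name, attr_value in node.attrib.items():
            if attr_value == sample_value:
                matches.append(Match(node, attribute_extractor(attr_name)))
    return matches


def select(root, tag, css_class):
    """Selector: here simply 'tag with class'; real selectors can be smarter."""
    return [node for node in root.iter(tag) if node.get("class") == css_class]


def scrape(root, tag, css_class, extractor):
    """Scraper: select a node, then extract its value."""
    nodes = select(root, tag, css_class)
    return extractor(nodes[0]) if nodes else None


root = ET.fromstring(PAGE)
name_matches = find_matches(root, "Ada Lovelace")  # sample given by the user
link_matches = find_matches(root, "/ada")

print(scrape(root, "h1", "name", text_extractor))
print(scrape(root, "a", "link", attribute_extractor("href")))
```

In this sketch a sample is just the string the user expects, and each match records where it occurs and how to read it, which is the information a scraper needs to find the same value on other pages.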

## Does mlscraper support...
- scraping arbitrary items? yes
- scraping dicts with missing values? yes
- detecting specific pages that have no results? no