https://github.com/lorey/mlscraper-experiments
- Host: GitHub
- URL: https://github.com/lorey/mlscraper-experiments
- Owner: lorey
- Created: 2021-03-25T20:37:33.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2021-08-06T10:58:45.000Z (over 3 years ago)
- Last Synced: 2023-02-28T15:21:57.802Z (over 1 year ago)
- Language: HTML
- Size: 475 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# mlscraper-experiment
Trying some ideas to extend my main library [mlscraper](https://github.com/lorey/mlscraper).
Features:
* scraping arbitrary items (dict, lists, list of dicts, etc.)
* smart scraper selection

## Structure
This class diagram shows the basic relationships.

![class diagram](docs/classes.png)
## Terminology
* Scraper: turns a page into an item by scraping its HTML
* Sample: one item on a page (to be scraped later), i.e. what the user provides as input
* Match: one possible occurrence of a sample, i.e. the nodes in which the sample occurs
* Extractor: gets the value out of a DOM node
* Selector: an algorithm to select nodes

## Does mlscraper support?
- scraping arbitrary items? yes
- scraping dicts with missing values? yes
- detecting specific pages that have no results? no
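
## Example sketch
To make the terminology concrete, here is a minimal sketch of how a selector, an extractor, and nested scrapers could fit together, including a dict with a missing value. It uses only the Python standard library. All class and method names (`PathSelector`, `TextExtractor`, `ValueScraper`, `DictScraper`, `ListScraper`, `scrape`) are hypothetical and chosen for illustration; they are not the mlscraper API. Learning scrapers from samples and matches is left out, so the selectors are written by hand.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PathSelector:
    """Selector: an algorithm to select nodes (here: an ElementTree path)."""
    path: str

    def select_all(self, root: ET.Element) -> List[ET.Element]:
        return root.findall(self.path)


class TextExtractor:
    """Extractor: get the value out of a DOM node (here: its text)."""

    def extract(self, node: ET.Element) -> Optional[str]:
        text = "".join(node.itertext()).strip()
        return text or None


@dataclass
class ValueScraper:
    """Scraper for a single value: select a node, then extract from it."""
    selector: PathSelector
    extractor: TextExtractor

    def scrape(self, root: ET.Element) -> Optional[str]:
        nodes = self.selector.select_all(root)
        # A missing node yields None instead of raising an error.
        return self.extractor.extract(nodes[0]) if nodes else None


@dataclass
class DictScraper:
    """Scraper for a dict: one sub-scraper per key."""
    scrapers: Dict[str, ValueScraper]

    def scrape(self, root: ET.Element) -> Dict[str, Any]:
        return {key: s.scrape(root) for key, s in self.scrapers.items()}


@dataclass
class ListScraper:
    """Scraper for a list: apply an item scraper to every selected node."""
    selector: PathSelector
    item_scraper: DictScraper

    def scrape(self, root: ET.Element) -> List[Dict[str, Any]]:
        return [self.item_scraper.scrape(n) for n in self.selector.select_all(root)]


# Well-formed markup so the standard-library XML parser can read it.
PAGE = """
<html><body>
  <div class="item"><h2>Apple</h2><span class="price">1.20</span></div>
  <div class="item"><h2>Pear</h2></div>
</body></html>
""".strip()

scraper = ListScraper(
    selector=PathSelector(".//div[@class='item']"),
    item_scraper=DictScraper({
        "name": ValueScraper(PathSelector(".//h2"), TextExtractor()),
        "price": ValueScraper(PathSelector(".//span[@class='price']"), TextExtractor()),
    }),
)

items = scraper.scrape(ET.fromstring(PAGE))
print(items)
```

The second item has no price node, so its `price` is `None` while the rest of the dict is still scraped. Composing scrapers this way is one possible route to arbitrary items such as lists of dicts.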