https://github.com/lorey/mlscraper-experiments
- Host: GitHub
- URL: https://github.com/lorey/mlscraper-experiments
- Owner: lorey
- Created: 2021-03-25T20:37:33.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2021-08-06T10:58:45.000Z (almost 5 years ago)
- Last Synced: 2025-03-15T08:55:05.343Z (about 1 year ago)
- Language: HTML
- Size: 475 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# mlscraper-experiment
Trying some ideas to extend my main library [mlscraper](https://github.com/lorey/mlscraper).
Features:
* scraping arbitrary items (dict, lists, list of dicts, etc.)
* smart scraper selection
## Structure
This class diagram shows the basic relationships.

## Terminology
* Scraper: turns a page into an item by scraping its HTML
* Sample: one item on a page (to be scraped later), i.e. what the user provides as input
* Match: one possible occurrence of a sample, i.e. the nodes in which the sample occurs
* Extractor: gets the value out of a DOM node
* Selector: an algorithm that selects nodes
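
To make the terms above concrete, here is a minimal sketch of how they might relate in code. It is not mlscraper's actual API: every class and function name below (`Sample`, `Match`, `Scraper`, `text_extractor`, `find_matches`) is hypothetical, and the standard library's `xml.etree.ElementTree` stands in for a real HTML parser, so the page must be well-formed.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical names throughout; this only illustrates the terminology.


@dataclass
class Sample:
    """One item on a page, as supplied by the user."""
    html: str
    value: str


@dataclass
class Match:
    """One node in which the sample value occurs."""
    node: ET.Element


# Extractor: gets the value out of a DOM node.
Extractor = Callable[[ET.Element], str]

# Selector: an algorithm that selects nodes from a parsed page.
Selector = Callable[[ET.Element], List[ET.Element]]


def text_extractor(node: ET.Element) -> str:
    """Extract the node's own text, stripped of whitespace."""
    return (node.text or "").strip()


def find_matches(sample: Sample, extractor: Extractor) -> List[Match]:
    """Collect every node whose extracted value equals the sample value."""
    root = ET.fromstring(sample.html)
    return [Match(node) for node in root.iter() if extractor(node) == sample.value]


@dataclass
class Scraper:
    """Turns a page into an item: select nodes, then extract a value."""
    selector: Selector
    extractor: Extractor

    def scrape(self, html: str) -> str:
        nodes = self.selector(ET.fromstring(html))
        return self.extractor(nodes[0])


page = "<html><body><h1>Ada Lovelace</h1><p class='bio'>Mathematician</p></body></html>"
sample = Sample(html=page, value="Ada Lovelace")
matches = find_matches(sample, text_extractor)

scraper = Scraper(selector=lambda root: root.findall(".//h1"), extractor=text_extractor)
result = scraper.scrape(page)
```

In this sketch the selector is written by hand; presumably the point of the library is to derive it from the matches instead.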
## Does mlscraper support ...?
- scraping arbitrary items? yes
- scraping dicts with missing values? yes
- detecting specific pages that have no results? no
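
The first two answers can be illustrated with a small sketch of what "arbitrary items" and "missing values" could look like in practice. This is an assumption-laden illustration, not the library's interface: the `scrape` function and the template format (dicts for fields, a `(path, sub_template)` tuple for lists, strings for ElementTree paths) are invented here, and `xml.etree.ElementTree` again stands in for an HTML parser.

```python
import xml.etree.ElementTree as ET


def scrape(template, node):
    """Recursively fill a hypothetical template from a parsed node.

    - dict: scrape each field from the same node
    - tuple (path, sub_template): apply sub_template to every node matching path
    - str: an ElementTree path; yields the text, or None if nothing matches
    """
    if isinstance(template, dict):
        return {key: scrape(sub, node) for key, sub in template.items()}
    if isinstance(template, tuple):
        path, sub = template
        return [scrape(sub, child) for child in node.findall(path)]
    text = node.findtext(template)
    return None if text is None else text.strip()


PAGE = """
<ul>
  <li><b>Ada</b><i>London</i></li>
  <li><b>Alan</b></li>
</ul>
"""

# A dict containing a list of dicts; the second entry has no <i> element.
template = {"people": (".//li", {"name": "b", "city": "i"})}
items = scrape(template, ET.fromstring(PAGE))
```

Here a missing field simply becomes `None` rather than raising an error, which is one plausible way to handle dicts with missing values.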